Normalize two values drawn from the same dataset?
Background: I have the following data for A and B:

A:
Total Followers:
False    7158961   # out of 9374155 total
True     2215194
Total Tweets:
False    1113      # out of 1559 total
True     446
Followers per tweet (False):
False_A = 7158961 / 1113
Output: 6432
Followers per tweet (True):
True_A = 2215194 / 446
Output: 4966
B:
Total Followers:
False    8481276   # out of 9374155 total
True     892879
Total Tweets:
False    1368      # out of 1559 total
True     191
Followers per tweet (False):
False_B = 8481276 / 1368
Output: 6199
Followers per tweet (True):
True_B = 892879 / 191
Output: 4674
Questions:
1) Given that these values are drawn from the same dataset (total followers = 9374155; total tweets = 1559), is it valid to compare True_A (4966) directly to True_B (4674), and thus to state that True_A has more followers per tweet than True_B?
2) Or do I need to normalize first?
3) If I do need to normalize, how would I do so?
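For concreteness, the arithmetic above can be reproduced in a few lines (a minimal sketch; the dictionary layout is mine, and integer division is used to match the quoted outputs):

```python
# Totals drawn from the same dataset: 9374155 followers and 1559 tweets overall.
followers = {"A": {"False": 7158961, "True": 2215194},
             "B": {"False": 8481276, "True": 892879}}
tweets = {"A": {"False": 1113, "True": 446},
          "B": {"False": 1368, "True": 191}}

# Followers per tweet for each user/label pair (integer division,
# matching the outputs quoted above).
per_tweet = {user: {label: followers[user][label] // tweets[user][label]
                    for label in ("False", "True")}
             for user in ("A", "B")}

print(per_tweet)
# {'A': {'False': 6432, 'True': 4966}, 'B': {'False': 6199, 'True': 4674}}
```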
See also: questions close to this topic

Python Statsmodels: is there any way to program exogenous variables into the VAR framework the package provides?
Can't post my code as I'm typing on my phone (my current employer doesn't allow posting on Stack Exchange from work computers). I'm trying to estimate and forecast with a VAR using statsmodels in Python; is there any way I can introduce exogenous variables into the mix? If not, would anyone be able to provide a framework to do this manually?
Cheers!
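If a recent statsmodels version is available, `VARMAX` (in `statsmodels.tsa.statespace`) accepts an `exog` argument and covers the VAR-with-exogenous-regressors case. For the manual route, a VARX(p) can be estimated equation by equation with ordinary least squares; a minimal numpy sketch (all names and the simulated data are illustrative assumptions, not the asker's model):

```python
import numpy as np

def fit_varx(y, x, p=1):
    """Estimate y_t = c + sum_i A_i y_{t-i} + B x_t + e_t by OLS.

    y: (T, k) endogenous observations; x: (T, m) exogenous observations.
    Returns the stacked coefficient matrix of shape (1 + k*p + m, k).
    """
    T, k = y.shape
    rows = []
    for t in range(p, T):
        lags = np.concatenate([y[t - i] for i in range(1, p + 1)])
        rows.append(np.concatenate([[1.0], lags, x[t]]))
    Z = np.array(rows)   # regressors: intercept, lagged y's, exog
    Y = y[p:]            # targets
    coefs, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coefs

# Tiny usage example on simulated data with known coefficients.
rng = np.random.default_rng(0)
T, k, m = 200, 2, 1
x = rng.normal(size=(T, m))
A = np.array([[0.5, 0.1], [0.0, 0.4]])   # lag-1 coefficient matrix
B = np.array([[1.0, -1.0]])              # exogenous coefficient matrix
y = np.zeros((T, k))
for t in range(1, T):
    y[t] = A @ y[t - 1] + x[t] @ B + rng.normal(scale=0.1, size=k)

coefs = fit_varx(y, x, p=1)  # rows: intercept, A (by lag), B
```

Forecasting then amounts to iterating the fitted equation forward, which requires supplying future values of the exogenous series.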

Using EWMA in pandas for large dataset
I am dealing with data of the following format:
Groups  Terms
A       Dress, Cap, Skirts, ...
B       Cricket, Basketball, Baseball, ...
for each of the terms in a group, the following data is available for range of dates :
Date, frequency (for Dress)
20170101, 2
20170105, 3
20170107, 1
...
20170110, 6
I am loading a dataframe for each term, upsampling it to fill in frequency values for the missing days, and then applying an EWMA to the data.
df = pd.read_csv('A_dress.csv', parse_dates=['date'], header=0, index_col=0, squeeze=True)
up_df = df.resample('D').ffill()
ewma_df = up_df.ewm(span=3).mean()  # pd.ewma(...) is deprecated; use .ewm().mean()
Further, I am using Facebook Prophet lib to forecast the frequency for a term for next 7 days.
I want to know if there is a way to do the above process for all the data in one go (either for all the terms in a group, or for all the terms in all the groups) in order to reduce the processing time. In short, what is the best way to optimise this code? I am dealing with data that has millions of terms across multiple groups.
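One possible single-pass shape (a sketch only; the column names `term`, `date`, and `freq` and the toy data are assumptions, since the real inputs are per-term CSVs) is to load everything into one long dataframe and apply the resample/EWMA pipeline per group:

```python
import pandas as pd

# Hypothetical long-format frame: one row per (term, date) observation.
df = pd.DataFrame({
    "term": ["Dress", "Dress", "Dress", "Cap", "Cap"],
    "date": pd.to_datetime(["20170101", "20170105", "20170107",
                            "20170102", "20170104"]),
    "freq": [2, 3, 1, 5, 4],
})

# Upsample each term to daily frequency and smooth, in one pass.
smoothed = (
    df.set_index("date")
      .groupby("term", group_keys=True)["freq"]
      .apply(lambda s: s.resample("D").ffill().ewm(span=3).mean())
)
```

The result is a Series indexed by (term, date), so each term's smoothed history can be pulled out with `smoothed.loc["Dress"]` and handed to Prophet separately.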

How to find list of classes/objects that Caffe net was trained with (and also their indices)?
The Caffe Model Zoo (http://caffe.berkeleyvision.org/model_zoo.html) provides various pretrained DNN models that I would like to use. However, I'm struggling to find the list of classes and objects that these nets were trained on and their corresponding indices.
Does anyone know where to find this? I've looked into the readme files and the actual protobuf but found nothing.
Thanks!

Vastly different input scale neural network
To be straightforward, I am designing a neural network for a multivariate regression.
My main qualm is the input scale of my variables. Obviously, standardizing the data or normalizing it to [-1, 1] is usually preferable; however, this situation is different.
I am dealing with state vectors [x, y, z, xdot, ydot, zdot, xdd, ydd, zdd] in Earth coordinates. Therefore, if I normalize by Earth's radius, the positional components will lie in [-1, 1] while the other columns will be at least one order of magnitude smaller: the velocities will be centered around 0.5, and the acceleration terms will be nearly 0.
My first thought was to normalize each column independently of the others, but that is problematic: each column is the time derivative of the one before it, and normalizing them individually destroys that relationship (which is highly relevant to the problem).
I have tried using float64 in my operations to prevent clipping near these small values, but that did not seem to help. What tactics should I attempt next? I currently have embedding layers, followed by recurrent operations, followed by the output-layer projection.
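One tactic worth trying (a sketch, not a definitive fix) is nondimensionalization with a single consistent time scale: divide positions by a length scale R, velocities by R/T, and accelerations by R/T^2. Because all three blocks share the same R and T, the derivative relationships between columns are preserved, unlike independent per-column min-max scaling. The particular R and T values below are illustrative assumptions:

```python
import numpy as np

R = 6.371e6   # length scale, e.g. Earth's radius in metres (assumption)
T = 1000.0    # time scale in seconds, chosen so velocities land near O(1)

def nondimensionalize(states):
    """states: (N, 9) array of [x, y, z, xdot, ydot, zdot, xdd, ydd, zdd]."""
    scale = np.concatenate([np.full(3, R),          # positions  -> x / R
                            np.full(3, R / T),      # velocities -> v / (R/T)
                            np.full(3, R / T**2)])  # accels     -> a / (R/T^2)
    return states / scale

# Toy example: a roughly orbital state; all nine entries come out O(1).
s = np.array([[7.0e6, 0.0, 0.0, 7.5e3, 0.0, 0.0, 9.0, 0.0, 0.0]])
s_nd = nondimensionalize(s)
```

Tuning T so that typical velocities and accelerations both land near unit magnitude keeps every column in a trainable range while remaining physically consistent.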

Normalize gradient magnitude to unit length in tensorflow
How can we normalize the gradient magnitude to a unit length in tensorflow?
I am trying to do something like
gradients = tf.gradients(self.loss, _params)
# tf.gradients returns a flat list of tensors, so compute one global norm
# and divide every gradient by it (tf.norm on a Python list will not work,
# and the original loop unpacked (grad, var) pairs from a list of plain tensors).
global_norm = tf.global_norm(gradients, name='norm')
final_gradients = [(grad / global_norm, var) for grad, var in zip(gradients, _params)]
Any clue? Thank you
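Independent of TensorFlow, the underlying operation is dividing every gradient tensor by the single global norm sqrt(sum_i ||g_i||^2), so that the concatenation of all gradients has unit length. A numpy sketch of that arithmetic (illustrative values only):

```python
import numpy as np

def normalize_global(grads):
    """Scale a list of gradient arrays so their concatenation has unit L2 norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    return [g / global_norm for g in grads]

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
unit = normalize_global(grads)
# global norm is sqrt(9 + 16) = 5, so every entry is divided by 5
```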

R ggplot2 histogram overlays with normalized values for each histogram
I would like to create a histogram plot comparing three groups. However, I'd like to normalize each histogram by the total number of counts within its own group, not by the total number of counts across all groups. Here is the code that I have.
library(ggplot2)
library(reshape2)

# Creates dataset
set.seed(9)
df <- data.frame(values = c(runif(400, 20, 50), runif(300, 40, 80), runif(600, 0, 30)),
                 labels = c(rep("med", 400), rep("high", 300), rep("low", 600)))
levs <- c("low", "med", "high")
df$labels <- factor(df$labels, levels = levs)

ggplot(df, aes(x = values, fill = labels)) +
  geom_histogram(aes(y = ..density..), breaks = seq(0, 80, by = 2),
                 alpha = 0.2, position = "identity")
Which generates a histogram which appears to be normalized by density.
However, I decided to cross check this density plot against my manual validation of that density. To do that I used the below code:
# Separates the low, medium and high groups
df1 <- df[df$labels == "low", ]
df2 <- df[df$labels == "med", ]
df3 <- df[df$labels == "high", ]

# Creates a histogram for each group, normalized by that group's total count
hist_temp <- hist(df1$values, breaks = seq(0, 80, by = 2))
tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)], hist_temp$counts)
colnames(tdf) <- c("bins", "counts")
tdf$norm <- tdf$counts / sum(tdf$counts)
low1 <- tdf

hist_temp <- hist(df2$values, breaks = seq(0, 80, by = 2))
tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)], hist_temp$counts)
colnames(tdf) <- c("bins", "counts")
tdf$norm <- tdf$counts / sum(tdf$counts)
med1 <- tdf

hist_temp <- hist(df3$values, breaks = seq(0, 80, by = 2))
tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)], hist_temp$counts)
colnames(tdf) <- c("bins", "counts")
tdf$norm <- tdf$counts / sum(tdf$counts)
high1 <- tdf

# Combines the normalized histograms and melts them into long form for plotting
Tdata <- data.frame(low1$bins, low1$norm, med1$norm, high1$norm)
colnames(Tdata) <- c("bin", "low", "med", "high")
Tdata <- melt(Tdata, id = "bin")
levs <- c("low", "med", "high")
Tdata$variable <- factor(Tdata$variable, levels = levs)

# Plot the data
ggplot(Tdata, aes(group = variable, colour = variable)) +
  geom_line(aes(x = bin, y = value))
As you can see, the two plots are quite different from each other, and I can't figure out why. The y-axis should be the same for both, but it's not. So, assuming I didn't make some silly math error, I believe I want the histogram to look like the line plot, and I can't figure out a way to make that happen. Any help is appreciated; thank you in advance.
Edited to add further examples of what doesn't work:
I have also tried using the ..count../(sum(..count..)) approach with this code:
# Histogram where each histogram is divided by the total count of all groups
ggplot(df, aes(x = values, fill = labels, group = labels)) +
  geom_histogram(aes(y = ..count.. / sum(..count..)), breaks = seq(0, 80, by = 2),
                 alpha = 0.2, position = "identity")
Which just normalizes by the total count of all histograms combined. This also does not reflect what I see in the line plot. I've also tried substituting ..ncount.. for ..count.. (in the numerator, in the denominator, and in both), and that does not recreate the results shown in the line graph either.
Additionally, I've tried using position = "stack" rather than position = "identity", using the code below:
ggplot(df, aes(x = values, fill = labels, group = labels)) +
  geom_histogram(aes(y = ..density..), breaks = seq(0, 80, by = 2),
                 alpha = 0.2, position = "stack")
Which also does not reflect the values shown in the line graph.
Progress made! Using the approach outlined in this post by Joran, I can now generate a histogram that matches the line graph. Below is the code:
# Plot where each histogram is normalized by its own counts
ggplot(df, aes(x = values, fill = labels, group = labels)) +
  geom_histogram(data = subset(df, labels == 'high'),
                 aes(y = ..count.. / sum(..count..)),
                 breaks = seq(0, 80, by = 2), alpha = 0.2) +
  geom_histogram(data = subset(df, labels == 'med'),
                 aes(y = ..count.. / sum(..count..)),
                 breaks = seq(0, 80, by = 2), alpha = 0.2) +
  geom_histogram(data = subset(df, labels == 'low'),
                 aes(y = ..count.. / sum(..count..)),
                 breaks = seq(0, 80, by = 2), alpha = 0.2) +
  scale_fill_manual(values = c("blue", "red", "green"))
However, I am STILL having trouble reordering the legend so that it reads "low", then "med", then "high", instead of alphabetical order, even though I've already set the factor levels (see the first block of code). Any thoughts?