Can the Kruskal-Wallis test be used to test the significance of multiple groups within multiple factors?
I have tried to read what I can on the Kruskal-Wallis test, and while I have found some useful information, I still cannot find the answer to my question. I am trying to use the Kruskal-Wallis test to determine the significance of multiple groups, within multiple factors, in predicting a set of dependent variables.
Here is an example of my data:
ID Date Point Season Grazing Cattle_Type AvgVOR PNatGr NatGrHt
181 7/21/2015 B22 late pre Large 0.8 2 20
182 7/21/2016 B32 early post Small 1.0 4 24
In this example, my dependent variables are "AvgVOR", "PNatGr" and "NatGrHt", while the independent variables (factors) are "Season", "Grazing", and "Cattle_Type". As you can see, each of my factors has two group levels.
What I am trying to accomplish is to run a nonparametric test that looks at the separate and combined importance of my factor groups for each of my dependent variables. I chose Kruskal-Wallis, and it seems to work for testing one of my grouping factors at a time.
Here is the result for AvgVOR ~ Grazing:
kruskal.test(AvgVOR ~ Grazing, data = Veg)
Kruskal-Wallis rank sum test
data: AvgVOR by Grazing
Kruskal-Wallis chi-squared = 94.078, df = 1, p-value < 2.2e-16
This tells me that AvgVOR is significantly different depending on whether it was recorded pre- or post-grazing.
Is there a way to build a similar model using Kruskal-Wallis that includes all of my grouping factors, even if I have to run a separate one for each dependent variable?
I attempted the following code, but it does not accomplish this:
lapply(Veg[, c("Grazing", "Cattle_Type", "Season")],
       function(g) kruskal.test(Veg$AvgVOR ~ g))
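For what it's worth, the per-factor loop generalizes to any language; here is a minimal sketch in Python using scipy.stats.kruskal (scipy is assumed to be available, and the measurement values below are invented for illustration — only the column and level names come from the question):

```python
from scipy.stats import kruskal

# One Kruskal-Wallis test per grouping factor, each factor splitting the
# AvgVOR measurements into its two levels. Values are illustrative only.
data = {
    "Grazing":     {"pre":   [0.8, 0.6, 0.9], "post":  [1.0, 1.2, 1.1]},
    "Cattle_Type": {"Large": [0.8, 0.6, 0.9], "Small": [1.0, 1.2, 1.1]},
    "Season":      {"late":  [0.8, 0.6, 0.9], "early": [1.0, 1.2, 1.1]},
}

for factor, levels in data.items():
    stat, p = kruskal(*levels.values())   # H statistic and p-value
    print(f"{factor}: H = {stat:.3f}, p = {p:.4f}")
```

Note that this still tests one factor at a time; the Kruskal-Wallis test itself has no notion of combined factors or interactions, so "combined importance" needs a different tool.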
See also questions close to this topic

R Package Dependencies in Conflict
I currently have Mac 10.13.1 High Sierra, and when I go to install the ff package for R on Anaconda using the Terminal command conda install -c conda-forge r-ff I get this:
Solving environment: failed
UnsatisfiableError: The following specifications were found to be in conflict:
  - r-drr
  - r-ff
Use "conda info <package>" to see the dependencies for each package.
To determine the dependencies I typed in conda info, but I am new to R, so I am not sure what to do or look for when I get this:
     active environment : None
        user config file : /Users/johnchristospanagiotopoulos/.condarc
  populated config files : /Users/johnchristospanagiotopoulos/.condarc
           conda version : 4.4.7
     conda-build version : 3.0.27
          python version : 2.7.14.final.0
        base environment : /Users/johnchristospanagiotopoulos/anaconda2 (writable)
            channel URLs : https://repo.continuum.io/pkgs/main/osx-64
                           https://repo.continuum.io/pkgs/main/noarch
                           https://repo.continuum.io/pkgs/free/osx-64
                           https://repo.continuum.io/pkgs/free/noarch
                           https://repo.continuum.io/pkgs/r/osx-64
                           https://repo.continuum.io/pkgs/r/noarch
                           https://repo.continuum.io/pkgs/pro/osx-64
                           https://repo.continuum.io/pkgs/pro/noarch
           package cache : /Users/johnchristospanagiotopoulos/anaconda2/pkgs
                           /Users/johnchristospanagiotopoulos/.conda/pkgs
        envs directories : /Users/johnchristospanagiotopoulos/anaconda2/envs
                           /Users/johnchristospanagiotopoulos/.conda/envs
                platform : osx-64
              user-agent : conda/4.4.7 requests/2.18.4 CPython/2.7.14 Darwin/17.2.0 OSX/10.13.1
                 UID:GID : 501:20
              netrc file : None
            offline mode : False
I have looked elsewhere to resolve this issue, but I am new to R. If anyone has advice on how to get rid of errors like these, it would be much appreciated.

How do I know whether an eval function call results in an error in R?
The eval function in R can evaluate an R expression. My question is: how can I know whether it will return an error when the R expression is invalid? Thanks.

How do I call a list from the shiny server in the shiny UI?
Situation: in the server environment I define a list. I want to use this list in the UI environment.
Here is the code:
library(shiny)

ui = fluidPage(
  selectizeInput(
    'chooser', 'Choose an Item',
    choices = mylist, multiple = TRUE
  )
)

server = function(input, output) {
  mylist = c("Fork", "Tree", "Truck", "Spoon", "Rocket")
}

shinyApp(ui, server)
Unfortunately, this does not work. I get:
Error in lapply(obj, function(val) { : object 'mylist' not found
Question: what do I need to change in the code to make it work?

How to accurately calculate a running average in JavaScript without summing the entire set
Say I am tracking the time it takes for a function to execute, and am showing the average, updating each time a function completes.
The array of times would be like:
var completionTimes = [123, 1234, 128, 1000, ...]
But it could get very large, into the millions or billions of runs. Averaging that every frame would be expensive.
var sum = completionTimes.reduce(function(a, b) { return a + b })
var avg = sum / completionTimes.length
I am wondering if there is a trick of some sort to perform this running average without having to sum all the values each time, ideally without any loss of accuracy/precision; but if that is not possible, a method with a small loss of accuracy works too.
Maybe there is a way to sort them, group them into chunks, average the chunks, and then do it that way. Not sure what best practices are here.
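There is such a trick, and it needs no chunking: keep only the count and the current mean, and fold each new sample in with the incremental-mean recurrence avg += (x - avg) / n. A sketch (in Python rather than JavaScript, but the one-line update ports directly):

```python
class RunningAverage:
    """O(1) per-sample running mean; no array of past samples needed."""

    def __init__(self):
        self.n = 0       # samples seen so far
        self.avg = 0.0   # current mean

    def add(self, x):
        self.n += 1
        # Incremental mean update: mathematically equal to sum/n, but avoids
        # both re-summing the whole history and overflowing a huge raw sum.
        self.avg += (x - self.avg) / self.n
        return self.avg

ra = RunningAverage()
for t in [123, 1234, 128, 1000]:
    ra.add(t)
print(ra.avg)  # 621.25, same as sum([123, 1234, 128, 1000]) / 4
```

The same idea extends to a running variance via Welford's algorithm if a spread estimate is ever needed alongside the mean.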

Is there a C++ library for Truncated gamma distribution?
I have been looking for C++ libraries for the normal distribution, gamma distribution, truncated normal distribution, and truncated gamma distribution. I found libraries for the first three here, but couldn't find any for the truncated gamma distribution. Is there a truncated gamma distribution library for C++?
Thanks.

How do I implement multiple linear regression in Python?
I am trying to write a multiple linear regression model from scratch to predict the key factors contributing to the number of views of a song on Facebook. For each song we collect this information, i.e. the variables I'm using:
df.dtypes
clicked                  int64
listened_5s              int64
listened_20s             int64
views                    int64
percentage_listened    float64
reactions_total          int64
shared_songs             int64
comments                 int64
avg_time_listened        int64
song_length              int64
likes                    int64
listened_later           int64
I'm using the number of views as my dependent variable and all other variables in the dataset as independent ones. The model is posted below:
import numpy as np
import scipy.stats as stats
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from sklearn import linear_model
from sklearn.cross_validation import train_test_split

# df_x -> new dataframe of independent variables
df_x = df.drop(['views'], 1)
# df_y -> new dataframe of dependent variable views
df_y = df.ix[:, ['views']]
names = [i for i in list(df_x)]

regr = linear_model.LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size = 0.2)

# Fitting the model to the training dataset
regr.fit(x_train, y_train)
regr.intercept_
print('Coefficients: \n', regr.coef_)
print("Mean Squared Error (MSE): %.2f" % np.mean((regr.predict(x_test) - y_test) ** 2))
print('Variance Score: %.2f' % regr.score(x_test, y_test))
regr.coef_[0].tolist()
Output here:
regr.intercept_
array([1173904.20950487])

MSE: 19722838329246.82
Variance Score: 0.99
Looks like something went miserably wrong.
Trying the OLS model:
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

model = sm.OLS(y_train, x_train)
result = model.fit()
print(result.summary())
Output:
R-squared: 0.992
F-statistic: 6121.

                          coef    std err          t      P>|t|   [95.0% Conf. Int.]
clicked                 0.3333      0.012     28.257      0.000      0.310     0.356
listened_5s            -0.4516      0.115     -3.944      0.000     -0.677    -0.227
listened_20s            1.9015      0.138     13.819      0.000      1.631     2.172
percentage_listened  7693.2520   1.44e+04      0.534      0.594  -2.06e+04   3.6e+04
reactions_total         8.6680      3.561      2.434      0.015      1.672    15.664
shared_songs          -36.6376      3.688     -9.934      0.000    -43.884   -29.392
comments               34.9031      5.921      5.895      0.000     23.270    46.536
avg_time_listened    1.702e+05   4.22e+04      4.032      0.000   8.72e+04  2.53e+05
song_length         -6309.8021   5425.543     -1.163      0.245   -1.7e+04  4349.413
likes                   4.8448      4.194      1.155      0.249     -3.395    13.085
listened_later         -2.3761      0.160    -14.831      0.000     -2.691    -2.061

Omnibus:        233.399   Durbin-Watson:        1.983
Prob(Omnibus):    0.000   Jarque-Bera (JB):  2859.005
Skew:             1.621   Prob(JB):              0.00
Kurtosis:        14.020   Cond. No.          2.73e+07

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+07. This might indicate that there are strong multicollinearity or other numerical problems.
It looks like something went seriously wrong, just by looking at this output.
I believe that something went wrong with the training/testing sets and with creating the two different data frames x and y, but I can't figure out what. This problem must be solvable by multiple regression. Should it not be linear? Could you please help me figure out what went wrong?

In R, how do you label the x-axis for T2 plots using mqcc?
I am using the qcr package to draw the Hotelling's T2 plot. I do not see any method/options in the documentation to update/modify the x-labels. I would like my Hotelling plot to be labeled. Does anyone have a workaround for this?
For example,
mqcc(mtcars)
gives the following plot

Data Science: analyzing questionnaires with multiple responses
Let's say we have a questionnaire with several possible responses for certain questions. Example:
What are your reasons for going to work early and leaving early? (3 choices possible)
a. Avoid traffic in public transport
b. Have time with family in the evening
c. Sport in the evening
d. You are a morning person
e. Job nature
etc.
What would be your strategy to normalize and analyze this questionnaire, taking into consideration that people may give from 0-3 responses? Should we convert this to nominal data?
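One common strategy is to treat each possible choice as its own binary variable (multi-hot encoding), so a respondent's 0-3 selections become a row of 0/1 indicators. A minimal sketch in Python; the choice labels follow the example question, and the sample responses are invented:

```python
CHOICES = ["a", "b", "c", "d", "e"]

def encode(selected):
    """Map a set of selected choices (0 to 3 of them) to 0/1 indicators."""
    return {c: int(c in selected) for c in CHOICES}

# Three hypothetical respondents: two choices, one choice, and none.
rows = [encode(s) for s in [{"a", "c"}, {"d"}, set()]]
for row in rows:
    print(row)
```

Each indicator column can then be analyzed on its own (proportions, chi-squared tests against other variables), which sidesteps the problem of forcing a multi-response item into a single nominal variable.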

How to generate a multivariate normal distribution in J
Can anyone tell me how to generate a multivariate normal distribution in J, given the mean vector and the covariance matrix? For example, in Python, np.random.multivariate_normal([0,0], [[1,.75],[.75,1]], 1000) generates a multivariate normal distribution with [0,0] as the mean vector and [[1,.75],[.75,1]] as the variance-covariance matrix. Thanks.
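In case it helps for porting to J: the standard construction behind such samplers is to draw independent standard normals z and transform them with a Cholesky factor of the covariance, x = mean + L z. A sketch of that recipe in Python/NumPy (NumPy assumed available), using the same mean and covariance as the question:

```python
import numpy as np

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.75],
                [0.75, 1.0]])

rng = np.random.default_rng(0)
L = np.linalg.cholesky(cov)          # lower-triangular, cov == L @ L.T
z = rng.standard_normal((1000, 2))   # independent N(0, 1) draws
samples = mean + z @ L.T             # each row ~ N(mean, cov)

print(np.cov(samples, rowvar=False)) # should be close to cov
```

In J the same three steps apply: generate standard normals, compute a Cholesky factor of the covariance matrix, and matrix-multiply.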

Maximum Likelihood Hypothesis Definition?
I need the definition of the ML hypothesis. I know that maximum likelihood estimation is a statistical method for fitting the parameters of different models. So can I say that the ML hypothesis is the hypothesis with parameters found using ML estimation? Thanks for the help.

How to compare two Probability Density Functions?
I have two probability density functions and I want to know whether their distributions are similar or not. I know that the KS test in R can do this, but when I run the code, an error occurs. Thanks for any help.
set.seed(100)
a = density(sample(x = 1:30, size = 30, replace = T))
b = density(sample(x = 1:40, size = 35, replace = T))
plot(a)
lines(b)
ks.test(a, b)

Error in ks.test(a, b) : 'y' must be numeric or a function or a string naming a valid function
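The error arises because ks.test is being handed density objects rather than the raw samples it expects; the two-sample KS test compares the samples themselves. Here is the same comparison sketched in Python with scipy's ks_2samp (scipy assumed available, samples made up like the question's):

```python
import random
from scipy.stats import ks_2samp

random.seed(100)
a = [random.randint(1, 30) for _ in range(30)]  # raw samples, not densities
b = [random.randint(1, 40) for _ in range(35)]

# Two-sample Kolmogorov-Smirnov test on the raw samples.
stat, p = ks_2samp(a, b)
print(f"D = {stat:.3f}, p = {p:.3f}")
```

The analogous R fix is to call ks.test on the two numeric vectors directly instead of on the density() results.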

Testing the null hypothesis of zero skew
I need to test the null hypothesis that my steady returns have zero skewness, with a confidence level of 95%. Do you have any ideas which formula I can use for this kind of test? I tried the D'Agostino test for skewness, but I think it's not the best way, because I can't set a confidence level.
library(moments)
?agostino.test
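The D'Agostino test does effectively support a confidence level: it returns a p-value, and rejecting the null of zero skewness whenever p < 0.05 is exactly a test at the 95% level. A sketch with scipy's equivalent, scipy.stats.skewtest (scipy assumed available; the returns below are simulated stand-ins):

```python
import numpy as np
from scipy.stats import skewtest

rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.01, size=500)  # toy symmetric "returns"

# D'Agostino-style skewness test: z-statistic and two-sided p-value.
stat, p = skewtest(returns)
alpha = 0.05                               # i.e. a 95% confidence level
print(f"z = {stat:.3f}, p = {p:.3f}, reject zero skew: {p < alpha}")
```

In R, the p.value component of the agostino.test result can be compared to alpha in exactly the same way.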

How can I implement the Kruskal-Wallis test in Spark/Scala?
I am trying to implement the Kruskal-Wallis test in Spark/Scala. I know it is possible with Python pandas, and I found this code,
but can anyone suggest some sources or some code, please?
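Since the test itself is simple rank arithmetic, one fallback is to implement the H statistic directly and port only the ranking step to Spark. A minimal reference implementation in Python (no tie correction; values assumed distinct):

```python
def kruskal_h(groups):
    """Kruskal-Wallis H over a list of sample groups (no tie correction)."""
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # assumes distinct values
    n = len(pooled)
    # H = 12 / (n (n + 1)) * sum_i R_i^2 / n_i  -  3 (n + 1)
    h = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

# Two small groups; H is then compared to a chi-squared(df = k - 1) quantile.
print(kruskal_h([[0.8, 0.6, 0.9], [1.0, 1.2, 1.1]]))
```

In Spark, the pooled ranking can be done with a rank window function over a global ordering of the values, after which the per-group rank sums are an ordinary groupBy aggregation.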
If I ignore the repeated-measures structure in a Kruskal-Wallis test, is an assumption violated?
This data is repeated-measures data and groups are included (mixed design).
I want to run a nonparametric test for the between-group effect (Kruskal-Wallis test) and for the within effect (Friedman test) separately. (I don't want a parametric test, and the Friedman test comes later.)
It is like testing the 'difference between grand mean effects' and the 'difference between column mean effects'.
The data is stacked by repeated measures.
If I ignore the repeated-measures structure in the Kruskal-Wallis test, is an assumption violated? (Assumption: random sample or independent samples.)
My data is below here.
# From Table 1 of Deal et al (1979): Role of respiratory heat exchange in
# production of exercise-induced asthma. J Appl Physiol 46:467-475
dat <- data.frame(ID = c(1,2,3,4,5,6,7,8),
                  temp1 = c(74.5,75.5,68.9,57,78.3,54,72.5,80.8),
                  temp2 = c(81.5,84.6,71.6,61.3,84.9,62.8,68.3,89.9),
                  temp3 = c(83.6,70.6,55.9,54.1,64,63,67.8,83.2),
                  temp4 = c(68.6,87.3,61.9,59.2,62.2,58,71.5,83),
                  temp5 = c(73.1,73,60.5,56.6,60.1,56,65,85.7),
                  temp6 = c(79.4,75,61.8,58.8,78.7,51.5,67.7,79.6))

# stacked data
dat.s <- stack(dat[,-1])
SUBJNO <- rep(seq(1,8), 6)
GROUP <- rep(rep(seq(1,4), 2), 6)
dat.s2 <- cbind(dat.s, SUBJNO)
dat.s2 <- cbind(dat.s, GROUP)
head(dat.s2)

# Kruskal-Wallis test
# I think this is like a difference between grand means.
# Does it violate the assumption 'random sample (or independent sample)'?
kruskal.test(values ~ GROUP, data = dat.s2)

# I think this is like a difference between column means.
# Maybe this does not violate the assumption (random sample or independent sample).
kruskal.test(values ~ GROUP, data = dat.s2[dat.s2$ind=='temp1',])
kruskal.test(values ~ GROUP, data = dat.s2[dat.s2$ind=='temp2',])
kruskal.test(values ~ GROUP, data = dat.s2[dat.s2$ind=='temp3',])
kruskal.test(values ~ GROUP, data = dat.s2[dat.s2$ind=='temp4',])
kruskal.test(values ~ GROUP, data = dat.s2[dat.s2$ind=='temp5',])
kruskal.test(values ~ GROUP, data = dat.s2[dat.s2$ind=='temp6',])
Thanks a lot!

Kruskal-Wallis in R with a large dataset (n = 106000) – any alternatives?
I am using the Kruskal-Wallis test on my dataset, but it has now been running for about 20 minutes. My dataset contains 160000 observations.
Does anyone have experience with running this test on this kind of large dataset? Will the test complete eventually, and how long does it take? Are there any alternatives?