Statistical model to test a list of proportions
I am trying to compare the accuracy of two methods for determining the makeup of a standard sample. The standard contains 17 species in these proportions: 15 species at 5% each, one at 10%, and one at 15%, for a total of 100%. Each method gives me a list of 17 percentages that also sum to 100%. I want to use a statistical model to test which method gives the more accurate results. I am doing the analysis in R and have been using a chi-square goodness-of-fit (GOF) test against a list of expected percentages, and also calculating the residual sum of squares (RSS). However, neither seems to be the best or most accurate approach. Here is the code I have been using for the chi-square GOF test:
pvalNA <- c(0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1, 0.15)
chip1NA <- chisq.test(x = propCountNA$pCn1, p = pvalNA, rescale.p = TRUE)
where propCountNA$pCn1 is a vector of counts that is tested against the proportions in the pvalNA vector. When I kept propCountNA$pCn1 as a list of proportions rather than counts, I wasn't sure I was getting the correct answers, so I multiplied the proportions by the number of rows in the dataset.
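A minimal sketch of why the counts-versus-proportions choice matters, assuming made-up observed counts: the numbers in counts_m1 are illustrative only, not real data, and the names p_expected, counts_m1, gof_m1, and rss_m1 are hypothetical. The chi-square statistic grows with the total count, so the same proportions give a different p-value depending on the multiplier used to turn them into counts, while RSS on the proportion scale does not change.

```r
# Expected composition: 15 species at 5%, one at 10%, one at 15%.
p_expected <- c(rep(0.05, 15), 0.10, 0.15)

# Hypothetical observed counts for one method (illustrative only; they sum to 1000).
counts_m1 <- c(52, 47, 55, 49, 50, 44, 58, 51, 46, 53, 48, 50, 54, 45, 49, 102, 147)

# Goodness-of-fit test on counts: 17 categories give 16 degrees of freedom.
gof_m1 <- chisq.test(x = counts_m1, p = p_expected)

# Same proportions, ten times the counts: the statistic is ten times larger.
gof_m1_x10 <- chisq.test(x = counts_m1 * 10, p = p_expected)

# Residual sum of squares on the proportion scale is unaffected by the total.
rss_m1 <- sum((counts_m1 / sum(counts_m1) - p_expected)^2)
rss_m1_x10 <- sum((counts_m1 * 10 / sum(counts_m1 * 10) - p_expected)^2)

print(gof_m1)
print(c(stat = unname(gof_m1$statistic), stat_x10 = unname(gof_m1_x10$statistic)))
print(c(rss = rss_m1, rss_x10 = rss_m1_x10))
```

Because of this scaling, the p-value from chisq.test is only meaningful when the counts are the actual number of items each method classified. If the number of rows in the dataset is not that number, the multiplier changes the result.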
See also questions close to this topic

Unable to load recipes package
I install the recipes library just fine using:
install.packages("recipes", dependencies = c("Depends", "Suggests"))
This gives me the following error:
Error in library(recipes) : there is no package called ‘recipes’
I am having a bigger issue: I am unable to load caret, but it seems the reason is recipes, so I assume that if I solve the recipes issue I will be able to load caret.
Design Matrix using model.matrix function for gene expression
Need some help. My data looks like this:
Identifier  Sample1  Sample2  Sample3  ...  Sample10
Gene1       10.85    9.33     11.04    ...  10.093
Gene2       5.94     7.95     6.46     ...  6.33
...
Gene99      3.93     4.12     7.86     ...  9.45
Samples 1 to 4 are normal, 5 to 10 are abnormal.
The data is stored in a data frame called DF. Need to create a design matrix using a model.matrix function, the idea is to use this information to fit a linear model to be able to identify the differential genes.
I have no clue how to create the design matrix. I have read the documentation, but it leads me nowhere. The function's syntax doesn't seem to be tailored towards the format that I have.
Any tips are appreciated.
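One possible way to set this up, as a sketch under stated assumptions: the DF built below is a randomly generated stand-in with the layout described above (an Identifier column plus Sample1 to Sample10), and the names group, design, and expr are hypothetical. model.matrix works on a description of the samples, not on the expression table itself, so the first step is a factor with one entry per sample column.

```r
# Stand-in for DF: 99 genes by 10 samples, random values for illustration only.
set.seed(1)
DF <- data.frame(
  Identifier = paste0("Gene", 1:99),
  matrix(rnorm(99 * 10, mean = 8), nrow = 99,
         dimnames = list(NULL, paste0("Sample", 1:10)))
)

# One label per sample column: samples 1-4 are normal, 5-10 are abnormal.
group <- factor(c(rep("normal", 4), rep("abnormal", 6)),
                levels = c("normal", "abnormal"))

# Design matrix: an intercept (normal baseline) and an abnormal-vs-normal column.
design <- model.matrix(~ group)
rownames(design) <- paste0("Sample", 1:10)

# Expression matrix with genes in rows, as linear-model tools usually expect.
expr <- as.matrix(DF[, -1])
rownames(expr) <- DF$Identifier

# Base-R fit for a single gene; the second coefficient is the group difference.
fit_gene1 <- lm.fit(design, expr["Gene1", ])
print(design)
print(fit_gene1$coefficients)
```

The same design matrix and expression matrix can then be passed to a gene-wise linear modelling function, for example lmFit from the limma package, if that is the tool being used.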

Check for each unique value we have same unique id
I have an Excel sheet which looks like:
Col1   Col2
IJ123  A2B1
IJ123  A2B1
IJ456  C2C2
IJ456  c2c2
IJ456  D1e2
IJ789  LJ87
IJ456
IJ789  LJ98

x = data.frame(
  Col1 = c("IJ123", "IJ123", "IJ456", "IJ456", "IJ456", "IJ789", "IJ456", "IJ789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", NA, "LJ98")
)
I want to add one more column and check (for each unique Col2 value) whether the assigned values in Col1 are TRUE or FALSE.
Output:
Col1   Col2  Result
IJ123  A2B1  TRUE
IJ123  A2B1  TRUE
IJ456  C2C2  TRUE
IJ456  c2c2  TRUE
IJ456  D1e2  FALSE
IJ789  LJ87  TRUE (Because Col2 count=1 for this value)
IJ456        C2C2
IJ789  LJ98  TRUE (Because Col2 count=1 for this value)
Logic:
 If a Col2 value occurs more than once, check that the corresponding Col1 values for just those rows are all the same.
 If a Col2 value occurs only once, check that its Col1 value is unique, but only against the Col1 values of Col2 values that occur more than once.
 Some fields are blank in Col2; for those, if there is a duplicate Col1 value, show the Col2 value mapped to that Col1 in Result (see row 7).
For this I have an Excel formula:
=IF(COUNTIF($B$2:$B$8,B2)=1,SUMPRODUCT((($A$2:$A$8=A2)*(COUNTIF($B$2:$B$8,$B$2:$B$8))>1))=0,COUNTIFS($B$2:$B$8,B2,$A$2:$A$8,"<>"&A2)=0)
but it runs very slowly: after waiting ~4 hours it had completed only 28% of the processing on ~0.2 million rows. I have loaded the file into R in csv format and want to carry out the same exercise in R for faster processing.
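A vectorized sketch of the two TRUE/FALSE rules in base R, under stated assumptions: Col2 is compared case-insensitively (as Excel's COUNTIF does, and as the C2C2/c2c2 rows in the expected output suggest), and rows with a blank Col2 are left as NA rather than filled with a mapped Col2 value, because that rule is not fully specified above. The helper names key, idx, n_key, distinct_col1, and col1_has_multi are hypothetical.

```r
x <- data.frame(
  Col1 = c("IJ123", "IJ123", "IJ456", "IJ456", "IJ456", "IJ789", "IJ456", "IJ789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", NA, "LJ98"),
  stringsAsFactors = FALSE
)

# Case-insensitive key for Col2 (assumption: "C2C2" and "c2c2" are the same value).
key <- toupper(x$Col2)

# Group index per row and the number of rows sharing each Col2 value.
idx <- match(key, unique(key))
n_key <- tabulate(idx)[idx]

# Number of distinct Col1 values within each Col2 group.
distinct_col1 <- as.vector(tapply(x$Col1, idx, function(v) length(unique(v))))[idx]

# Does this row's Col1 also appear with a Col2 value that occurs more than once?
col1_has_multi <- x$Col1 %in% x$Col1[!is.na(key) & n_key > 1]

# Rule 1 (repeated Col2): all Col1 values in the group must agree.
# Rule 2 (single Col2): Col1 must not appear among the repeated-Col2 rows.
x$Result <- ifelse(n_key == 1, !col1_has_multi, distinct_col1 == 1)

# Blank Col2 rows are left undecided in this sketch.
x$Result[is.na(key)] <- NA

print(x)
```

Every step is a single pass over the column (match, tabulate, tapply, %in%), which avoids the per-row rescans that the COUNTIF and SUMPRODUCT formula performs.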
ODR residual variance and reduced chi^2: do the beta uncertainties represent confidence intervals?
I'm looking for the relationship between the residual variance of an Orthogonal Distance Regression (ODR) fit (as implemented in scipy.odr) and reduced chi^2.
I'm getting similar values for some of my test cases, which I attribute to the uncertainty in the x values, which I do not take into account. I saw one answer validating that this is indeed reduced chi^2, so I just want to confirm this.
Finally, how is the standard deviation of the beta parameters (sd_beta) related to their confidence intervals, in terms of percentiles? That is, by how much is the residual variance (chi^2?) allowed to vary when computing this value, and what is the associated interval? Can it be changed?
My data is of the form x,dx,y,dy, with a simple linear fit being done. The answers I find are either too simple (people just looking for what to quote as an uncertainty) or too complex (the mathematics involved in the general procedure). I'm looking for something in between, relating to the specific algorithm.
SciPy statistical analysis: dataframe of ordinal and paired discrete variables
I have a dataframe that has around 100,000 rows and I am trying to observe the significance of the ratio between two variables: occurrence (x>thresh) and an additional engagement event following each occurrence (x>thresh&engage), and I use a rolling window to find the number of occurrences throughout the dataframe. After each pass, the window increases by 1, and I do this to see if different window sizes have an effect on observing an engagement event following the occurrence(s) of a threshold event within a given window.
I do not have a background in statistics, but so far I've identified chi-squared and Wilcoxon as possible tests of significance offered by SciPy that could help me reach a conclusion. Unfortunately, these do not tell me how significant each ratio is, only how significant the distribution of values in the dataframe is.
sample dataframes:
count  x>thresh  x>thresh&engage  ratio
0      0         0                NaN
1      1871      841              0.44949225
2      3928      1908             0.48574338
3      2991      1502             0.50217319
4      1126      560              0.49733570
WilcoxonResult(statistic=0.0, pvalue=0.043114446783075355)

count  x>thresh  x>thresh&engage  ratio
0      0         0                NaN
1      1797      805              0.44796884
2      4476      2136             0.47721180
3      4183      2105             0.50322735
4      1876      922              0.49147122
5      449       217              0.48329621
WilcoxonResult(statistic=0.0, pvalue=0.027707849358079864)

count  x>thresh  x>thresh&engage  ratio
0      0         0                NaN
1      1733      774              0.44662435
2      4954      2346             0.47355672
3      5272      2623             0.49753414
4      2745      1359             0.49508197
5      876       439              0.50114155
6      172       71               0.41279070
WilcoxonResult(statistic=0.0, pvalue=0.017960477526078766)
ideal dataframes:
count  x>thresh  x>thresh&engage  ratio       p_value
0      0         0                NaN         NaN
1      1871      841              0.44949225  0.20130
2      3928      1908             0.48574338  0.13021
3      2991      1502             0.50217319  0.10130
4      1126      560              0.49733570  0.12135

count  x>thresh  x>thresh&engage  ratio       p_value
0      0         0                NaN         NaN
1      1797      805              0.44796884  0.10024
2      4476      2136             0.47721180  0.07254
3      4183      2105             0.50322735  0.02332
4      1876      922              0.49147122  0.13277
5      449       217              0.48329621  0.14730

count  x>thresh  x>thresh&engage  ratio       p_value
0      0         0                NaN         NaN
1      1733      774              0.44662435  0.10100
2      4954      2346             0.47355672  0.04100
3      5272      2623             0.49753414  0.00123
4      2745      1359             0.49508197  0.10171
5      876       439              0.50114155  0.02114
6      172       71               0.41279070  0.12181
Any help is appreciated. Thank you!

Param not changing for std::chi_squared_distribution
As per the answer to this question I have attempted to change the parameter of a distribution in <random> by using .param(). Below is a toy example where I'm trying to do this.
For both a chi-squared and a normal distribution I have a function that generates two values, the second where the parameter has been changed by .param(). I run both functions multiple times and print out the mean outcome for both. As expected the normal function produces mean outcomes of 0 and 10. Unexpectedly the chi-squared function produces mean outcomes of 4 and 4, instead of my expectation of 4 and 3. Why are my expectations off for the chi-squared distribution?

#include <iostream>
#include <random>
#include <vector>
using namespace std;

vector<double> chisqtest(mt19937_64 &gen)
{
    vector<double> res(2);
    chi_squared_distribution<double> chisq_dist(4);
    res[0] = chisq_dist(gen);
    chisq_dist.param(std::chi_squared_distribution<double>::param_type (3));
    res[1] = chisq_dist(gen);
    return res;
}

vector<double> normtest(mt19937_64 &gen)
{
    vector<double> res(2);
    normal_distribution<double> norm_dist(0,1);
    res[0] = norm_dist(gen);
    norm_dist.param(std::normal_distribution<double>::param_type (10,1));
    res[1] = norm_dist(gen);
    return res;
}

int main()
{
    unsigned int n = 100000;
    mt19937_64 gen(1);

    vector<double> totals = {0,0}, res(2);
    for(unsigned int i = 0; i < n; i++){
        res = chisqtest(gen);
        totals[0] += res[0];
        totals[1] += res[1];
    }
    cout << totals[0]/n << " " << totals[1]/n << "\n";

    vector<double> totals2 = {0,0}, res2;
    for(unsigned int i = 0; i < n; i++){
        res2 = normtest(gen);
        totals2[0] += res2[0];
        totals2[1] += res2[1];
    }
    cout << totals2[0]/n << " " << totals2[1]/n << "\n";
}