Capping outliers for all numeric variables in SAS
I am working in SAS with a dataset containing many numeric variables, which I have standardised as follows:
proc standard data=df mean=0 std=1
out=df;
run;
Is there an easy way to deal with outliers (beyond +/- 3 standard deviations) across all numeric variables? Ideally I would want to cap them at +3 or -3 standard deviations, or in the worst case remove them.
1 answer

You have to run through the data twice. There are many ways you can adjust your output; here's a simple one using a DATA step.
Assuming your dataset has a standardized variable called 'test':
data adjusted;
  set df;
  if test > 3 then test = 3;
  if test < -3 then test = -3;
run;
Just remember that your new dataset will no longer have a mean of 0 and a standard deviation of 1.
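For comparison, the same capping ("winsorizing") logic can be sketched outside SAS; here is a minimal Python version, assuming the values have already been standardized to z-scores:

```python
def clip_z_scores(values, limit=3.0):
    """Cap standardized values at +/- `limit` standard deviations.

    Mirrors the DATA step above: anything beyond the limit is set
    to the limit itself rather than dropped.
    """
    return [max(-limit, min(limit, v)) for v in values]
```

Applying this to every numeric column reproduces the "change them to +/- 3 SD" option; replacing out-of-range values with a missing marker instead would be the removal option.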
See also questions close to this topic

Left Join collapses data
I am working with some bond data and I'm looking to left join the interest-rate projections onto it. My dataset for the bonds looks like:
data have;
  input ID Vintage Reference_Rate :$10. Base2017;
  datalines;
1 2017 LIBOR_001M 0.01
1 2018 LIBOR_001M 0.01
1 2019 LIBOR_001M 0.01
1 2020 LIBOR_001M 0.01
2 2017 LIBOR_003M 0.012
2 2018 LIBOR_003M 0.012
2 2019 LIBOR_003M 0.012
2 2020 LIBOR_003M 0.012
3 2017 LIBOR_006M 0.014
3 2018 LIBOR_006M 0.014
3 2019 LIBOR_006M 0.014
3 2020 LIBOR_006M 0.014
;
run;
the second dataset, which I am looking to left join (or even full join), looks like
data have2;
  input Reference_rate :$10. Base2018 Base2019 Base2020;
  datalines;
LIBOR_001M 0.011 0.012 0.013
LIBOR_003M 0.013 0.014 0.015
LIBOR_006M 0.015 0.017 0.019
;
run;
the dataset I've been getting collapses Vintage down to a single value and breaks the rest of the analysis I've been running, so that it looks like
data dontwant;
  input ID Vintage Reference_rate :$10. Base2017 Base2018 Base2019 Base2020;
  datalines;
1 2017 LIBOR_001M 0.01 0.011 0.012 0.013
2 2017 LIBOR_003M 0.012 0.013 0.014 0.015
3 2017 LIBOR_006M 0.014 0.015 0.017 0.019
;
run;
the dataset I would like looks like this
data want;
  input ID Vintage Reference_rate :$10. Base2017 Base2018 Base2019 Base2020;
  datalines;
1 2017 LIBOR_001M 0.01 0.011 0.012 0.013
1 2018 LIBOR_001M 0.01 0.011 0.012 0.013
1 2019 LIBOR_001M 0.01 0.011 0.012 0.013
1 2020 LIBOR_001M 0.01 0.011 0.012 0.013
2 2017 LIBOR_003M 0.012 0.013 0.014 0.015
2 2018 LIBOR_003M 0.012 0.013 0.014 0.015
2 2019 LIBOR_003M 0.012 0.013 0.014 0.015
2 2020 LIBOR_003M 0.012 0.013 0.014 0.015
3 2017 LIBOR_006M 0.014 0.015 0.017 0.019
3 2018 LIBOR_006M 0.014 0.015 0.017 0.019
3 2019 LIBOR_006M 0.014 0.015 0.017 0.019
3 2020 LIBOR_006M 0.014 0.015 0.017 0.019
;
run;
the code I have been using is a pretty standard proc sql
proc sql;
  create table want as
  select a.*, b.*
  from have as a
  left join have2 as b
    on a.reference_rate = b.reference_rate
  order by a.reference_rate;
quit;
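A left join on reference_rate by itself should not collapse rows: with one projection row per reference_rate, each of the four vintage rows keeps its own output row. The lookup logic can be sketched in Python, with miniature made-up data mirroring the example:

```python
# rate projections keyed by reference_rate (one row per rate, as in have2)
rates = {
    "LIBOR_001M": (0.011, 0.012, 0.013),
    "LIBOR_003M": (0.013, 0.014, 0.015),
}

# bond rows: (ID, Vintage, Reference_Rate, Base2017), as in have
bonds = [
    (1, 2017, "LIBOR_001M", 0.01),
    (1, 2018, "LIBOR_001M", 0.01),
    (2, 2017, "LIBOR_003M", 0.012),
]

# left join: every bond row survives; an unmatched rate yields missing values
joined = [row + rates.get(row[2], (None, None, None)) for row in bonds]
```

Since len(joined) always equals len(bonds) here, a collapse to one vintage suggests the problem is upstream of the join (for example a DATA step MERGE or an aggregation), not the left join itself.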

Recreating SAS mixed model output (including F tests) in R
I recently took an ANOVA class in SAS, and am rewriting my code in R. So far, translating random-effect (and mixed-effect) models from SAS to R has eluded me. The output I get from R is very different from SAS: the SS and F values are different, and I can't get F tests for the random effects. The closest I've been able to get is a chi-square test, using rand(). So perhaps I'm doing it all wrong in R.
The following is the SAS code and output, and then the attempt I made in R.
*Two-way ANOVA, with one random effect and interaction term;
*import dataset as "pesticide";
proc glm data=pesticide;
  class locations chemicals;
  model numberkilled = locations chemicals locations*chemicals / solution;
  random locations locations*chemicals / test;
run;
quit;
The following is the attempted R code.
#data step
pesticide <- read.csv("ex1710.txt")
colnames(pesticide) <- c("location", "chemical", "number_killed")
pesticide$location <- as.factor(pesticide$location)
pesticide$chemical <- as.factor(pesticide$chemical)

#ANOVA
library(lmerTest)
library(car)
model <- lmer(number_killed ~ chemical + (1 | location) + (1 | chemical:location),
              data = pesticide)
Anova(model, type = 3, test = "F")
Output is next. There are no F tests for the random effect and interaction term (which is also random), and the SS and F value are different from SAS.
Analysis of Deviance Table (Type III Wald F tests with Kenward-Roger df)

Response: number_killed
                  F Df Df.res    Pr(>F)
(Intercept) 587.069  1     16 4.879e-14 ***
chemical     48.108  3     12 5.800e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In summary, I don't know how to properly fit mixed-effect models in R. Fixed-effect models are all OK.

SAS Proc Transpose data
In SAS I have something like this...
ID survey Q1 q1_2 Q2 q2_2 Q3 q3_2
1  1      1  0    1  1    2  0
1  1      2  2    1  1    0
I’m not sure if transposing is the right way to go but I’d like to get something like this.
ID survey Q Response
1  1      1 1
1  1      2 0
1  1      3 1
1  2      1 0
1  2      2 1
1  2      3 1
2  2      1 1
2  2      2 1
2  2      3 0
Where Q1 and Q1_2 are the same question presented in two different surveys given over time
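Transposing (wide to long) is the right idea; in SAS this would typically be PROC TRANSPOSE or an array-based DATA step. The reshaping logic can be sketched in Python, under the assumption that Qk holds the answer from the first survey and qk_2 the answer from the second:

```python
def melt_row(row, n_questions=3):
    """Turn one wide record into long (ID, survey, Q, Response) rows.

    Column naming is an assumption: 'Qk' = survey 1, 'qk_2' = survey 2.
    """
    long_rows = []
    for q in range(1, n_questions + 1):
        long_rows.append((row["ID"], 1, q, row[f"Q{q}"]))
        long_rows.append((row["ID"], 2, q, row[f"q{q}_2"]))
    return long_rows
```

Each wide row becomes 2 * n_questions long rows, one per (survey, question) pair.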

How to mutate a subset of a group in r
I'm having trouble mutating my df in R. My df looks like this
df <-
    I     class part  datetime            value indicator
    <int> <chr> <chr> <S3: POSIXct>       <dbl> <dbl>
  1 1     A     part1 2016-12-15 10:43:08  0.12 0
  2 1     A     part2 2015-11-16 13:52:07  0.15 0
  3 1     A     part3 2015-11-16 15:37:27  1.20 0
  4 2     A     part1 2015-11-16 15:43:03  0.78 1
  5 2     A     part2 2015-11-16 16:01:03  0.14 1
  6 2     A     part3 2015-11-05 07:10:02  1.40 1
  ... ...   ...   ...   ...                 ...  ...
I am trying to remove the extreme outliers for part1 within each indicator group (0 or 1).
I tried this
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 3.0 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

dfNew <- df %>%
  group_by(indicator, part) %>%
  mutate(value = remove_outliers(value[part="part1"])) %>%
  ungroup()
this removes all of the values. How can I remove the extreme outliers within each indicator group for part1 only?
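One likely culprit: value[part="part1"] passes a named argument rather than testing part == "part1", and subsetting inside mutate() returns a vector shorter than the group, which mutate() cannot align with the group's rows; conditioning with something like ifelse(part == "part1", remove_outliers(value), value) would keep lengths matched. The intended fence logic itself (values outside quartiles +/- 3 * IQR become missing) can be sketched in Python; note that the quartile method may differ slightly from R's quantile() defaults:

```python
import statistics

def remove_extreme_outliers(values, k=3.0):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with None.

    Mirrors the remove_outliers() helper above; k = 3 marks
    'extreme' outliers, as in the R function.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v if lo <= v <= hi else None for v in values]
```

Applied per (indicator, part) group, and only when part is "part1", this leaves all other rows untouched.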

R: linear fit on "binned" data and after having filtered out outliers
I have two vectors of 1096 numbers (one is the daily average concentration of NOx, the other of O3, both measured over 3 years at the same measurement station). The NOx distribution is lognormal and the O3 distribution is normal.
I have done the lm fit on the scatter plot, but my professor told me that scatter plots are not very meaningful for this purpose. She said the correct procedure would be to "bin" the data, filter out any outliers, and then use the linear model.
I tried to do that, but the results were really ugly, and I don't have enough R experience to find the solution on my own.
If anyone could explain to me, even with an example, how to bin the data and filter the outliers, it would be very helpful.
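The procedure described — bin one variable, average within bins, drop outliers, then fit — can be sketched in plain Python (in R the analogues would be cut(), aggregate(), and lm(); the function names below are made up for illustration):

```python
def binned_means(x, y, n_bins=10):
    """Average y within equal-width bins of x; returns (x_mean, y_mean) pairs."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for xi, yi in zip(x, y):
        idx = min(int((xi - lo) / width), n_bins - 1)  # clamp max into last bin
        bins[idx].append((xi, yi))
    return [
        (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b))
        for b in bins if b
    ]

def ols_fit(points):
    """Least-squares slope and intercept through (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    slope = (sum((p[0] - mx) * (p[1] - my) for p in points)
             / sum((p[0] - mx) ** 2 for p in points))
    return slope, my - slope * mx
```

An outlier filter (for example an IQR fence, applied per bin) would slot between the two steps; fitting on the bin means then matches the suggested procedure.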

What is the best way to identify deviation in series of numbers?
I’ve a series of numbers like 23,45,36,54,67,32,45,78,6,54,44,23,65,32… for say 30 days. Now i would like to identity the deviation or anomaly in the series? All i want is to identify whether the number for the current day is in line or an outlier compared to previous 30 days data. one way of doing is this if the value is say 1 or 2 standard deviations from the mean based on the past 30 days, then i can safely assume that there is some deviation in the series. But i’m not comfortable using this method as the standard deviation and mean are prone to large numbers. I feel the technique based on the interquartile rage is safer than the std. Can any one suggest me the appropriate method / statistical technique to accomplish this?