Boosted Regression Trees prediction errors
I am trying to run a Boosted Regression Tree because I want to predict future scenarios for my dependent variable (G), but after I build the model and try to predict G, I get an error.
loyn <- read.csv("FullData.csv", strip.white=T, header=T, sep=',')
library(gbm)
head(loyn)
#           G   Site         P        O      T Coral.Cover OCC  PPC
# 1 23.023128 Lizard 27.494517 3.702791 25.470         4.2 8.2 20.9
# 2 11.547282 Lizard 20.413183 3.664939 25.430         4.2 8.2 20.9
loyn.gbm <- gbm(G ~ Site + P + O + T + loyn$Coral.Cover + OCC + PPC, data=loyn, distribution="gaussian", train=0.75, interaction.depth=3, shrinkage=0.001, bag.fraction=0.5, cv.folds=3, n.minobsinnode=2, n.trees=10000)
par(mfrow=c(1,2))
(best.iter <- gbm.perf(loyn.gbm, method="cv", overlay=T, oobag.curve=T))
summary(loyn.gbm, n.trees=best.iter)
# here I am asking what value G would have if Site=5, P=20, O=2.7, T=32, loyn$Coral.Cover=4, OCC=2, PPC=20
new <- predict(loyn.gbm, newdata=data.frame(Site=5, P=20, O=2.7, T=32, loyn$Coral.Cover=4, OCC=2, PPC=20), n.trees=best.iter)
But it keeps saying:
Error: unexpected '=' in "new <- predict(loyn.gbm, newdata=data.frame(Site=5, P=20, O=2.7, T=32, loyn$Coral.Cover="
I have checked and it is not a typo. Any help would be much appreciated. Thanks!
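For what it's worth, the rule illustrated below applies across toolkits: predict needs a new data frame whose columns carry the same plain names as the training predictors, which is why a $-qualified name like loyn$Coral.Cover cannot be used as an argument name inside data.frame(). A sketch of the same fit-then-predict workflow in Python/scikit-learn; the column names mirror the question, but all the numbers are invented stand-ins for FullData.csv:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Invented training data with the question's (plain, unqualified) column names
train = pd.DataFrame({
    "P": [27.5, 20.4, 18.0, 25.1, 22.3, 19.8],
    "O": [3.70, 3.66, 3.10, 3.40, 3.55, 3.20],
    "T": [25.47, 25.43, 26.00, 25.90, 25.60, 25.80],
    "Coral.Cover": [4.2, 4.2, 3.8, 5.0, 4.6, 3.9],
    "G": [23.0, 11.5, 15.2, 18.7, 14.1, 16.9],
})
X, y = train.drop(columns="G"), train["G"]
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.01).fit(X, y)

# The new row reuses the plain training column names -- the analogue of writing
# Coral.Cover=4 (not loyn$Coral.Cover=4) inside R's data.frame()
new = pd.DataFrame({"P": [20.0], "O": [2.7], "T": [32.0], "Coral.Cover": [4.0]})
pred = model.predict(new)   # one prediction for the one new row
```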
See also questions close to this topic

Unable to load recipes package
I install the recipes package just fine using:
install.packages("recipes", dependencies = c("Depends", "Suggests"))
But library(recipes) then gives me the following error:
Error in library(recipes) : there is no package called ‘recipes’
I'm having a bigger issue: I am unable to load caret, but it seems the reason I cannot is recipes, so I assume that if I solve the recipes issue I will be able to load caret.
Design Matrix using model.matrix function for gene expression
Need some help. My data looks like this:
Identifier  Sample1  Sample2  Sample3  ...  Sample10
Gene1         10.85     9.33    11.04  ...    10.093
Gene2          5.94     7.95     6.46  ...      6.33
...
Gene99         3.93     4.12     7.86  ...      9.45
Samples 1 to 4 are normal, 5 to 10 are abnormal.
The data is stored in a data frame called DF. I need to create a design matrix using the model.matrix function; the idea is to use this matrix to fit a linear model in order to identify the differentially expressed genes.
I have no clue how to create the design matrix. I have read the documentation, but it leads me nowhere; the function's syntax doesn't seem to be tailored to the format that I have.
Any tips are appreciated.
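In R, this design matrix typically comes from a factor of group labels, e.g. design <- model.matrix(~ group), where group marks samples 1-4 as normal and 5-10 as abnormal. To make the structure concrete, here is a sketch of the same intercept-plus-indicator matrix in Python/NumPy; the expression values are random stand-ins, not real data:

```python
import numpy as np

# 0 = normal (samples 1-4), 1 = abnormal (samples 5-10), as in the question
groups = np.array([0] * 4 + [1] * 6)
# Column 1: intercept; column 2: abnormal indicator -- like R's model.matrix(~ group)
design = np.column_stack([np.ones_like(groups), groups])

# Fit each gene against the design by least squares (fake 99 x 10 expression matrix)
expr = np.random.default_rng(0).normal(size=(99, 10))
coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
# coef[0] is the per-gene baseline, coef[1] the normal-vs-abnormal effect
```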

Check that for each unique value we have the same unique id
I have an Excel sheet which looks like:
Col1   Col2
IJ123  A2B1
IJ123  A2B1
IJ456  C2C2
IJ456  c2c2
IJ456  D1e2
IJ789  LJ87
IJ456
IJ789  LJ98

x = data.frame(
  Col1 = c("IJ123", "IJ123", "IJ456", "IJ456", "IJ456", "IJ789", "IJ456", "IJ789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", NA, "LJ98")
)
I want to add one more column that checks, for each unique Col2 value, whether the Col1 values assigned to it are consistent (TRUE or FALSE).
Output:
Col1   Col2  Result
IJ123  A2B1  TRUE
IJ123  A2B1  TRUE
IJ456  C2C2  TRUE
IJ456  c2c2  TRUE
IJ456  D1e2  FALSE
IJ789  LJ87  TRUE (because Col2 count = 1 for this value)
IJ456        C2C2
IJ789  LJ98  TRUE (because Col2 count = 1 for this value)
Logic:
 If a Col2 value occurs more than once, check that the corresponding Col1 values for just those rows are all the same.
 If a Col2 value occurs only once, check that its Col1 value is unique, but only against the Col1 values of the multiply-occurring Col2 values.
 Some fields in Col2 are blank; for those, if we have a duplicate Col1 value, then show the Col2 value mapped to that Col1 in Result (see row 7).
For this I have an Excel formula:
=IF(COUNTIF($B$2:$B$8,B2)=1,SUMPRODUCT((($A$2:$A$8=A2)*(COUNTIF($B$2:$B$8,$B$2:$B$8))>1))=0,COUNTIFS($B$2:$B$8,B2,$A$2:$A$8,"<>"&A2)=0)
but it is working very slowly: after waiting ~4 hours it has only completed 28% of ~0.2 million rows. I have loaded the file in csv format into R and want to carry out the same exercise in R for faster processing.
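The rules above translate fairly directly into vectorized data-frame operations, which should be far faster than the array formula. A sketch in Python/pandas covering rules 1 and 2 only (the blank-Col2 rule is omitted for brevity); upper-casing Col2 mirrors Excel's case-insensitive COUNTIF, which is why C2C2 and c2c2 count as the same value:

```python
import pandas as pd

x = pd.DataFrame({
    "Col1": ["IJ123", "IJ123", "IJ456", "IJ456", "IJ456", "IJ789", "IJ456", "IJ789"],
    "Col2": ["A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", None, "LJ98"],
})

key = x["Col2"].str.upper()                      # Excel's COUNTIF is case-insensitive
counts = key.map(key.value_counts())             # how often each Col2 value occurs
multi_col1 = set(x.loc[counts > 1, "Col1"])      # Col1 values tied to repeated Col2 values

# Rule 1: a repeated Col2 value must always map to a single Col1
rule1 = key.map(x.groupby(key)["Col1"].nunique().eq(1))
# Rule 2: a once-occurring Col2 passes unless its Col1 collides with a rule-1 Col1
rule2 = ~x["Col1"].isin(multi_col1)

x["Result"] = rule1.where(counts > 1, rule2)
```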
convert cost function to statsmodels formula
I want to fit some data to a curve, using this as a cost function:
def cost_func(x):
    return ((unknown_conc - x[1]*(x[0]*conc_A + (1 - x[0])*conc_B))**2).sum()
It works when using scipy.optimize, but I want to use statsmodels instead. However I'm struggling with defining a statsmodels formula. Do you have any ideas how to do this?
I tried something like this, but it does not work for the x*A + (1 - x)*B term:
result = sm.ols(formula="A ~ I(B + C) - 1", data=df).fit()
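One way to bring this into a linear-model framework: x1*(x0*conc_A + (1 - x0)*conc_B) equals a*conc_A + b*conc_B with a = x1*x0 and b = x1*(1 - x0), so a plain no-intercept regression on the two concentration columns recovers both parameters (in statsmodels formula terms this would be roughly "unknown_conc ~ conc_A + conc_B - 1"). A sketch with synthetic, noise-free data (all names and values invented):

```python
import numpy as np

rng = np.random.default_rng(0)
conc_A = rng.uniform(0.0, 10.0, 50)
conc_B = rng.uniform(0.0, 10.0, 50)
x0_true, x1_true = 0.3, 2.0
unknown_conc = x1_true * (x0_true * conc_A + (1 - x0_true) * conc_B)

# Reparameterize: x1*(x0*A + (1-x0)*B) == a*A + b*B with a = x1*x0, b = x1*(1-x0)
X = np.column_stack([conc_A, conc_B])
(a, b), *_ = np.linalg.lstsq(X, unknown_conc, rcond=None)
x1_hat = a + b          # recovers x1
x0_hat = a / (a + b)    # recovers x0
```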

How to give input to 1D convolution of CNN in keras?
I'm solving a regression problem with a Convolutional Neural Network (CNN) using the Keras library. I have gone through many examples but failed to understand the concept of the input shape for a 1D convolution.
My dataset consists of a stream of sensor-generated values with 4 columns (3 sensor values & 1 target variable) and 1 million rows (18,000 segments).
Here are 5 segments of the sensor signal for visualization; each segment has its own meaning. I want to give segment-wise sensor values as input to the 1D convolution layer, but the problem is that the segments are of variable length. This is my convolutional neural network architecture.
I tried to build my CNN model but am confused:
model = Sequential()
model.add(Conv1D(5, 7, activation='relu', input_shape=input_shape))
model.add(MaxPooling1D(pool_length=4))
model.add(Conv1D(4, 7, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
So, how can I give input to a 1D convolution in Keras? Or should I set a fixed-size input for the 1D convolution? If so, how?
If anyone has a working example of a Keras 1D CNN (like my case), please share.
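A common workaround for variable-length segments (not the only one) is to zero-pad or truncate every segment to one fixed length, so that input_shape=(max_len, n_channels) is well defined; keras.preprocessing.sequence.pad_sequences does essentially this. A NumPy sketch of just the padding step, with three invented segments of the 3 sensor channels:

```python
import numpy as np

# Hypothetical variable-length segments, each an (length_i, 3) array of sensor values
segments = [np.ones((120, 3)), np.ones((80, 3)), np.ones((150, 3))]

max_len = max(len(s) for s in segments)
batch = np.zeros((len(segments), max_len, 3))   # zeros act as padding
for i, s in enumerate(segments):
    batch[i, :len(s), :] = s                    # copy each segment to the front

# batch now has a fixed shape, suitable for Conv1D(..., input_shape=(max_len, 3))
```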

Using ‘SelectKBest’ with ‘chi2’ method to determine best variables in Logistic Regression
I have tried to determine the best variables for a logistic regression. I tried the following code, but the output doesn't help:
for i in k_range:
    select = SelectKBest(k=i, score_func=chi2).fit(XTrain_all, YTrain)
    x_new = select.transform(XTrain_all)
    k_scores = select.scores_
    print(k_scores)
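One reason the printout doesn't help is that scores_ is an unlabeled array in input-column order. Pairing it with the column names and sorting makes the ranking readable (note also that chi2 assumes non-negative features). A sketch with invented feature names and scores standing in for select.scores_:

```python
import numpy as np

# Invented stand-ins for the training columns and for select.scores_
feature_names = np.array(["age", "income", "clicks", "visits"])
k_scores = np.array([12.4, 3.1, 45.9, 0.7])

order = np.argsort(k_scores)[::-1]              # indices of scores, highest first
ranked = list(zip(feature_names[order], k_scores[order]))
# ranked pairs each feature with its chi2 score, best first
```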

How to store a h2o function as an h2o object for deployment?
I used h2o.predict_leaf_node_assignment(model, frame) to get the leaf node assignments of my gbm model. Is it possible to store the output of this function as an h2o object and then use it for deployment on new data entries?
h2o GBM: leaf predictions
I'm performing a grid search for GBM in h2o for a continuous outcome with continuous predictors. I'm using cross-validation for training and then predicting on a test set.
I'm using the function .predict_leaf_node_assignment:
best_gbm.predict_leaf_node_assignment(test_frame_h2o)
(where best_gbm is the best gbm model I got from the grid search)
and get the following table where we can see the leaf node assignments per tree T1, T2, T3 etc.
Question 1:
How can I get the values of T1, T2, T3, etc. per leaf in the table, and not just the location of the leaf?
Question 2:
If there is a way to get the values for T1, T2, T3, etc., what do they actually reflect? Is T1 the first prediction and then T2, T3, T4 the corrections? Or is T1 the prediction and then T2 is T1 corrected, etc.?
Thanks.
Edit: I tried to download the MOJO in Python as explained on this page so that I can look into the different trees: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html?highlight=mojo
In "Step 2: Compile and run the MOJO" the 2nd part of this step is given only in R: "Create your main program in the experiment folder by creating a new file called main.java (for example, using “vim main.java”). Include the following contents. Note that this file references the GBM model created above using R."
Can I do this in Python? I have tried to copy, for example, the command "import java.io.*" into the Jupyter notebook, but it throws an error (ModuleNotFoundError: No module named 'java').

Multiclass text classification using R
I am working on a multiclass text classification problem and have built a gradient boosting model for it.
About the dataset:
The dataset has two columns: "Test_name" and "Description"
There are six labels in the Test_Name column, with their corresponding descriptions in the "Description" column.
My approach towards the problem
DATA PREPARATION
Create a word vector for the description.
Build a corpus using the word vector.
Preprocessing tasks such as removing numbers, whitespace and stopwords, and converting to lower case.
Build a document term matrix (dtm).
Remove sparse words from the above dtm.
The above steps lead to a count frequency matrix showing the frequency of each word in its corresponding column.
Transform the count frequency matrix into a binary instance matrix, which records the occurrence of a word in a document as either 0 or 1 (1 for present, 0 for absent).
Append the label column from the original notes dataset to the transformed dtm. The label column has 6 labels.
Model Building
Using the H2O package, build a gbm model.
Results obtained
Four of the class labels are classified well, but the other two are poorly classified.
Below is the output:
Extract training frame with `h2o.getFrame("train")`
MSE: (Extract with `h2o.mse`) 0.1197392
RMSE: (Extract with `h2o.rmse`) 0.3460335
Logloss: (Extract with `h2o.logloss`) 0.3245868
Mean Per-Class Error: 0.3791268
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, train = TRUE)`
  Body Fluid Analysis =   401 / 2,759
  Cytology Test       =   182 / 1,087
  Diagnostic Imaging  =   117 / 3,907
  Doctors Advice      =    32 / 752
  Organ Function Test =   461 / 463
  Patient Related     =   101 / 113
  Totals              = 1,294 / 9,081
The misclassification errors for Organ Function Test and Patient Related are much higher than for the other classes. How can I fix this?
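Computing per-class error rates from the counts quoted above shows that the two badly classified labels are also by far the smallest classes, which suggests class imbalance; possible remedies (untested here) include h2o.gbm's balance_classes option, class weights, or up-sampling the rare classes. A quick sketch of the rates:

```python
# Errors / totals per class, taken from the confusion-matrix summary above
counts = {
    "Body Fluid Analysis": (401, 2759),
    "Cytology Test":       (182, 1087),
    "Diagnostic Imaging":  (117, 3907),
    "Doctors Advice":      (32, 752),
    "Organ Function Test": (461, 463),
    "Patient Related":     (101, 113),
}
rates = {label: err / total for label, (err, total) in counts.items()}
# Organ Function Test and Patient Related are tiny classes with ~90-100% error
```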