Boosted Regression Trees prediction errors
I am trying to run a Boosted Regression Tree because I want to predict future scenarios for my dependent variable (G), but after I build the model and try to predict G, I get an error.
loyn <- read.csv("FullData.csv", strip.white=T, header=T, sep=',')
library(gbm)
head(loyn)
#           G   Site         P        O      T Coral.Cover OCC  PPC
# 1 23.023128 Lizard 27.494517 3.702791 25.470         4.2 8.2 20.9
# 2 11.547282 Lizard 20.413183 3.664939 25.430         4.2 8.2 20.9

loyn.gbm <- gbm(G ~ Site + P + O + T + loyn$Coral.Cover + OCC + PPC,
                data=loyn, distribution="gaussian", train=0.75,
                interaction.depth=3, shrinkage=0.001, bag.fraction=0.5,
                cv.folds=3, n.minobsinnode=2, n.trees=10000)

par(mfrow=c(1,2))
(best.iter <- gbm.perf(loyn.gbm, method="cv", overlay=T, oobag.curve=T))
summary(loyn.gbm, n.trees=best.iter)

# Here I am asking what value G would have if Site=5, P=20, O=2.7, T=32,
# loyn$Coral.Cover=4, OCC=2, PPC=20
new <- predict(loyn.gbm,
               newdata=data.frame(Site=5, P=20, O=2.7, T=32,
                                  loyn$Coral.Cover=4, OCC=2, PPC=20),
               n.trees=best.iter)
But it keeps saying
Error: unexpected '=' in "new<-predict(loyn.gbm, newdata=data.frame(Site=5, P=20, O=2.7, T=32, loyn$Coral.Cover="
I have checked and it is not a typo. Any help would be much appreciated. Thanks!
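For what it's worth, the error is a parsing problem rather than a typo: an argument name in `data.frame()` (like any R call) must be a plain name, so `loyn$Coral.Cover =` is rejected before anything runs, and the same `$` term does not belong in the model formula either. A minimal base-R sketch of the apparent fix, using the bare column name on both sides (the illustrative values are made up):

```r
# Using a `$` expression as an argument name fails at parse time,
# reproducing the reported "unexpected '='" message:
bad <- tryCatch(parse(text = "data.frame(Site = 5, loyn$Coral.Cover = 4)"),
                error = function(e) conditionMessage(e))

# Referring to the column by its bare name parses and evaluates fine:
newdata <- data.frame(Site = 5, P = 20, O = 2.7, T = 32,
                      Coral.Cover = 4, OCC = 2, PPC = 20)

# The same change would be needed in the model formula, e.g.
# gbm(G ~ Site + P + O + T + Coral.Cover + OCC + PPC, data = loyn, ...)
```

Separately, note that the data preview shows Site holding labels like "Lizard", so `Site = 5` in newdata may also need to be a valid site label for predict() to work.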
See also questions close to this topic
Unable to load recipes package
I install the recipes package just fine using:

install.packages("recipes", dependencies = c("Depends", "Suggests"))

But library(recipes) then gives me the following error:

Error in library(recipes) : there is no package called ‘recipes’
I'm having a bigger issue: I am unable to load caret, but it seems the reason is recipes, so I assume that if I solve the recipes issue I will be able to load caret.
Design Matrix using model.matrix function for gene expression
Need some help. My data looks like this:
Identifier Sample1 Sample2 Sample3 ... Sample10
Gene1        10.85    9.33   11.04 ...   10.093
Gene2         5.94    7.95    6.46 ...    6.33
...
Gene99        3.93    4.12    7.86 ...    9.45
Samples 1 to 4 are normal, 5 to 10 are abnormal.
The data is stored in a data frame called DF. I need to create a design matrix using the model.matrix function; the idea is to use this matrix to fit a linear model so I can identify the differentially expressed genes.
I have no clue how to create the design matrix. I have read the documentation, but it leads me nowhere. The function's syntax doesn't seem tailored to the format I have.
Any tips are appreciated.
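Not knowing the downstream package, here is a minimal base-R sketch of what such a design matrix could look like, assuming samples 1–4 are normal and 5–10 abnormal (the group labels and the treatment-contrast layout are my assumptions, not from the question):

```r
# One row per sample; the expression data frame DF has samples as columns.
group <- factor(c(rep("normal", 4), rep("abnormal", 6)),
                levels = c("normal", "abnormal"))
design <- model.matrix(~ group)
colnames(design)  # "(Intercept)" "groupabnormal"
# The second column is 1 for abnormal samples, so the second coefficient of a
# per-gene linear fit estimates the abnormal-vs-normal difference.
```

If the goal is differential expression with limma, a design matrix of this shape is what lmFit would expect alongside the expression matrix.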
Check that for each unique value we have the same unique id
I have an Excel sheet which looks like:
Col1    Col2
IJ-123  A2B1
IJ-123  A2B1
IJ-456  C2C2
IJ-456  c2c2
IJ-456  D1e2
IJ-789  LJ87
IJ-456
IJ-789  LJ98

x = data.frame(
  Col1 = c("IJ-123", "IJ-123", "IJ-456", "IJ-456", "IJ-456", "IJ-789", "IJ-456", "IJ-789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", NA, "LJ98")
)
I want to add one more column and check (for each unique Col2 value) whether the assigned Col1 values are consistent. The expected result:

Col1    Col2   Result
IJ-123  A2B1   TRUE
IJ-123  A2B1   TRUE
IJ-456  C2C2   TRUE
IJ-456  c2c2   TRUE
IJ-456  D1e2   FALSE
IJ-789  LJ87   TRUE  (because Col2 count = 1 for this value)
IJ-456         C2C2
IJ-789  LJ98   TRUE  (because Col2 count = 1 for this value)
- If a value occurs more than once in Col2, check that the corresponding Col1 values for just those Col2 rows are all the same.
- If a Col2 value occurs only once, check that its Col1 value is unique, but only against the Col1 values of the multiple-occurring Col2 values.
- Some fields in Col2 are blank; for those, if the Col1 value is a duplicate, show the Col2 value mapped to that Col1 in the Result column (see row 7).
For this I have an Excel formula:

=IF(COUNTIF($B$2:$B$8,B2)=1,SUMPRODUCT(--(($A$2:$A$8=A2)*(COUNTIF($B$2:$B$8,$B$2:$B$8))>1))=0,COUNTIFS($B$2:$B$8,B2,$A$2:$A$8,"<>"&A2)=0)

but it is very slow: after ~4 hours it had only processed 28% of ~0.2 million rows.
I have uploaded the file in csv format to R and want to carry out the same exercise in R for faster processing.
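A base-R sketch of one way to express this logic (treating Col2 case-insensitively, as Excel's COUNTIF does; the blank-Col2 rule from the last bullet is left as NA here and would need an extra lookup step):

```r
x <- data.frame(
  Col1 = c("IJ-123", "IJ-123", "IJ-456", "IJ-456", "IJ-456", "IJ-789", "IJ-456", "IJ-789"),
  Col2 = c("A2B1", "A2B1", "C2C2", "c2c2", "D1e2", "LJ87", NA, "LJ98"),
  stringsAsFactors = FALSE
)

key <- toupper(x$Col2)                # COUNTIF in Excel ignores case
cnt <- as.integer(table(key)[key])    # occurrences of each Col2 value (NA stays NA)
same <- vapply(split(x$Col1, key),    # per Col2 value: is Col1 constant?
               function(v) length(unique(v)) == 1, logical(1))
# Col1 values already tied to a multiple-occurring Col2 value
multi_col1 <- unique(x$Col1[!is.na(key) & cnt > 1])

x$Result <- ifelse(is.na(key), NA,
             ifelse(cnt > 1, same[key],          # repeated Col2: all Col1 must agree
                    !(x$Col1 %in% multi_col1)))  # singleton Col2: Col1 must be new
x$Result
# TRUE TRUE TRUE TRUE FALSE TRUE NA TRUE
```

Because the grouped computations are vectorised, this should handle ~0.2 million rows in seconds rather than hours.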
convert cost function to statsmodels formula
I want to fit some data to a curve, using this as a cost function:
def cost_func(x):
    return ((unknown_conc - x*(x*conc_A + (1-x)*conc_B))**2).sum()
It works with scipy.optimize, but I want to use statsmodels instead. However, I'm struggling to define a statsmodels formula for it. Do you have any ideas how to do this?
I tried something like this, but it does not work for the x*A + (1-x)*B form:
result = sm.ols(formula="A ~ I(B + C) -1", data=df).fit()
How to give input to 1D convolution of CNN in keras?
I'm solving a regression problem with a Convolutional Neural Network (CNN) using the Keras library. I have gone through many examples but failed to understand the concept of input shape for a 1D convolution.
My dataset consists of a stream of sensor-generated values with 4 columns (3 sensor values & 1 target variable) and 1 million rows (18,000 segments).
Here are 5 segments of the sensor signal for visualization; each segment has its own meaning.
I want to give segment-wise sensor values as input to the 1D convolution layer, but the problem is that the segments are of variable length. I tried to build my CNN model but I am confused:
model = Sequential()
model.add(Conv1D(5, 7, activation='relu', input_shape=input_shape))
model.add(MaxPooling1D(pool_length=4))
model.add(Conv1D(4, 7, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
So, how can I give input to a 1D convolution in Keras? Or should I set a fixed-size input for the 1D convolution? If so, how?
If anyone has a working example of a Keras 1D CNN like my case, please share it.
Using ‘SelectKBest’ with ‘chi2’ method to determine best variables in Logistic Regression
I have tried to determine the best variables for the logistic regression. I have tried the following code, but the output doesn't help.
for i in k_range:
    select = SelectKBest(k=i, score_func=chi2).fit(XTrain_all, YTrain)
    x_new = select.transform(XTrain_all)
    k_scores = select.scores_
    print(k_scores)
How to store a h2o function as an h2o object for deployment?
I use a function to get the leaf node assignments of my gbm model. Is it possible to store this function as an h2o object and then use it for deployment on new data entries?
h2o GBM: leaf predictions
I'm performing a gridsearch for GBM in h2o for a continuous outcome with continuous predictors. I'm using cross validation for training and then predict on a test set.
I'm using the function .predict_leaf_node_assignment:

best_gbm.predict_leaf_node_assignment(test_frame_h2o)

(where best_gbm is the best GBM model I got from the grid search)
and get the following table where we can see the leaf node assignments per tree T1, T2, T3 etc.
How can I get the values of T1, T2, T3 etc. per leaf in the below table and not the location of the leaf?
If there is a way to get the values for T1, T2, T3, etc., what do they actually reflect? Is T1 the first prediction and T2, T3, T4 the corrections? Or is T1 the prediction, T2 is T1 corrected, and so on?
Edit: I tried to download the MOJO in Python as explained on this page so that I can look into the different trees: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html?highlight=mojo
In "Step 2: Compile and run the MOJO" the 2nd part of this step is given only in R: "Create your main program in the experiment folder by creating a new file called main.java (for example, using “vim main.java”). Include the following contents. Note that this file references the GBM model created above using R."
Can I do this in Python? I have tried to copy, for example, the command "import java.io.*" into my Jupyter notebook, but it throws an error (ModuleNotFoundError: No module named 'java').
Multiclass text classification using R
I am working on a multiclass text classification problem. I have built a gradient boosting model for it.
About the dataset:
The dataset has two columns: "Test_name" and "Description"
There are six labels in the Test_Name column and their corresponding description in the "Description" column.
My approach towards the problem
Create a word vector for the description.
Build a corpus using the word vector.
Pre-processing tasks such as removing numbers, whitespace, and stopwords, and converting to lower case.
Build a document term matrix (dtm).
Remove sparse words from the above dtm.
The above step leads to a count-frequency matrix showing the frequency of each word in its corresponding column.
Transform the count-frequency matrix into a binary instance matrix, which records the occurrence of a word in a document as either 1 (present) or 0 (absent).
Append the label column from the original notes dataset to the transformed dtm. The label column has 6 labels.
Using the H2O package, build a GBM model.
Four of the class labels are classified well, but the remaining two are poorly classified.
Below is the output:
Extract training frame with `h2o.getFrame("train")`
MSE: (Extract with `h2o.mse`) 0.1197392
RMSE: (Extract with `h2o.rmse`) 0.3460335
Logloss: (Extract with `h2o.logloss`) 0.3245868
Mean Per-Class Error: 0.3791268
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, train = TRUE)`
Body Fluid Analysis  =   401 / 2,759
Cytology Test        =   182 / 1,087
Diagnostic Imaging   =   117 / 3,907
Doctors Advice       =    32 / 752
Organ Function Test  =   461 / 463
Patient Related      =   101 / 113
Totals               = 1,294 / 9,081
The misclassification errors for Organ Function Test and Patient Related are much higher than for the other classes. How can I fix this?
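One likely culprit, given the class counts above (463 and 113 rows versus 3,907 for Diagnostic Imaging), is class imbalance. A base-R sketch of upsampling minority classes before training (the toy labels below are made up for illustration; h2o.gbm also exposes a balance_classes argument that does something similar internally):

```r
set.seed(1)
# Toy labels standing in for the real Test_Name column
labels <- c(rep("Diagnostic Imaging", 100), rep("Organ Function Test", 10))
df <- data.frame(id = seq_along(labels), label = labels, stringsAsFactors = FALSE)

# Resample every class up to the size of the largest one
max_n <- max(table(df$label))
balanced <- do.call(rbind, lapply(split(df, df$label), function(d)
  d[sample(nrow(d), max_n, replace = TRUE), ]))
table(balanced$label)  # both classes now have 100 rows
```

After balancing, the per-class errors become more comparable, though it is worth re-checking overall accuracy, since upsampling trades some majority-class performance for minority-class recall.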