xgboost with tree_method = 'hist' in R
According to a benchmark of GBM vs. xgboost vs. LightGBM (https://www.kaggle.com/nschneider/gbmvsxgboostvslightgbm), it is possible to run xgboost with the argument
tree_method = 'hist'
in R.
However, doing so always gives me an error:
Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
Invalid Input: 'hist', valid values are: {'approx', 'auto', 'exact'}
What am I missing?
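For what it's worth, this error usually means the installed xgboost build simply predates the histogram updater: the CRAN release of the time (0.6-4) only knows 'approx', 'auto' and 'exact', while 'hist' arrived in later releases, so installing a current xgboost (e.g. from GitHub) makes it a valid value. A stdlib sketch of the kind of version check involved; treating 0.7.0 as the first release carrying 'hist' is an assumption:

```python
def version_tuple(v):
    """Parse a version string like '0.6-4' or '0.7.0' into a comparable tuple."""
    return tuple(int(p) for p in v.replace("-", ".").split(".")[:3])

# CRAN build of the time vs. an assumed first release with the 'hist' updater
print(version_tuple("0.6-4") < version_tuple("0.7.0"))  # True
```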
See also questions close to this topic

R loops - iterate a list of strings, expand as function input
I am a python person, and am having trouble working with a for loop. I have a list representing the names of particular columns in a dataframe containing columns (Sample_Name_Column, ComparisonColumn, MeasureA, MeasureB, MeasureC, MeasureD) which I want to use for a linear mixed-effects model (using the nlme library). So I wrote a simple loop to try and do that:
list <- c("MeasureA","MeasureB","MeasureC","MeasureD")
for (i in list) {
  model = lme(i ~ ComparisonColumn, random=~1|Sample_Name_Column, data=sampleDataSheet, method="REML")
}
but of course this fails.
Error in model.frame.default(formula = ~i + ComparisonColumn + Sample_Name_Column, : variable lengths differ (found for 'ComparisonColumn')
The function lme doesn't expand the variable; it looks for a column literally named i as the input. Yet other functions like print() or length() do expand it. Odd. Anyway, I've found some posts that use as.formula and reformulate here, but I'm having an awful lot of trouble getting it working.
for (i in groupList) {
  model = lme(as.formula(paste0(i, " ~ ComparisonColumn, random=~1|Sample_Name_Column")), data=sampleDataSheet, method="REML")
}
I get a little further (because the iterable has been correctly inserted):
Error in parse(text = x, keep.source = FALSE) : <text>:1:26: unexpected ',' 1: MeasureA ~ ComparisonColumn, ^
but something is wrong here too.
I should add that running the model directly works:
model = lme(MeasureA ~ ComparisonColumn, random=~1|Sample_Name_Column, data=sampleDataSheet, method="REML")
Linear mixed-effects model fit by REML
  Data: sampleDataSheet
  Log-restricted-likelihood: 462.6646
  Fixed: MeasureA ~ ComparisonColumn
                 (Intercept) ComparisonColumnTreatmentA
                  0.81377249                 0.08312908
Random effects:
  Formula: ~1 | Sample_Name_Column
          (Intercept)  Residual
  StdDev:   0.1800545 0.5348801
Number of Observations: 564
Number of Groups: 16
I've gotten a bit of the way, but can some kind soul please help me out to finish it off?
thanks, K
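The parse error in the second attempt comes from pasting ", random=~1|Sample_Name_Column" into the formula string: as.formula can only parse a formula, so only "MeasureA ~ ComparisonColumn" belongs inside the paste0(), while random=, data= and method= stay as separate arguments to lme. The string handling alone, sketched in Python for illustration (column names taken from the question):

```python
measures = ["MeasureA", "MeasureB", "MeasureC", "MeasureD"]

# Build only the formula text; everything else (random=, data=, method=)
# would remain ordinary function arguments outside the string.
formulas = [f"{m} ~ ComparisonColumn" for m in measures]
print(formulas[0])  # MeasureA ~ ComparisonColumn
```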

Calculate the sum of the counts of a factor variable, as a subset of a dataframe in R
I am trying to get a summary of how many people in my data have had surgery and then gone on to die, in order to calculate the mortality rate for surgery patients.
My data looks like this
df <- data.frame(
  y1988 = rep(c('Y', 'Y', 'Y', 'M', 'D', 'Y', 'Y', 'D', 'X', 'D'), 25),
  y1989 = rep(c('Y', 'M', 'D', 'Y', 'X', 'Y', 'X', 'Y', 'Y', 'Y'), 25),
  y1990 = rep(c('D', 'Y', 'D', 'X', 'Y', 'M', 'D', 'Y', 'Y', 'Y'), 25),
  y1991 = rep(c('D', 'Y', 'Y', 'M', 'D', 'Y', 'Y', 'X', 'D', 'Y'), 25),
  age = rep(20:69, 5),
  ID = (1:250)
)
What I want to do is count the number of 'D' and divide this by the number of 'Y', for each age, per year (y1988 to y1991).
If I were to do this manually, I would subset the dataframe for each age and then divide the sum of 'D' by the sum of 'Y', e.g.
a21 <- filter(df, age == 21)
a21$mort1988 <- sum(a21$y1988 == 'D') / sum(a21$y1988 == 'Y')
a21$mort1989 <- sum(a21$y1989 == 'D') / sum(a21$y1989 == 'Y')
etc
This seems absurd; is there an efficient way to do this?
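In R this is a group-by computation over age (e.g. dplyr's group_by() + summarise() across the year columns). The underlying calculation, a per-group ratio of 'D' counts to 'Y' counts, sketched in Python on made-up toy rows:

```python
from collections import Counter

# toy (age, status) rows for a single year; status codes follow the question
rows = [(21, "D"), (21, "Y"), (21, "Y"), (22, "D"), (22, "D"), (22, "Y")]

# count statuses per age group
counts = {}
for age, status in rows:
    counts.setdefault(age, Counter())[status] += 1

# mortality ratio = D count / Y count, per age (skipping groups with no 'Y')
mortality = {age: c["D"] / c["Y"] for age, c in counts.items() if c["Y"]}
print(mortality)  # {21: 0.5, 22: 2.0}
```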

How to change character time to numeric time in R
I have a character variable like this:
"1h 3m 6s 0h 13m 30s 0h 15m 12s"
The desired output is each time converted to numeric seconds. What should I do?
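The string holds repeated 'Xh Ym Zs' chunks, so one approach is to extract the three numbers from each chunk and compute hours*3600 + minutes*60 + seconds. A Python sketch of that parsing (in R, the same regular expression works with regmatches(), or lubridate's hms() plus period_to_seconds() can do the conversion):

```python
import re

def to_seconds(text):
    """Convert each 'Xh Ym Zs' chunk in the string to total seconds."""
    return [int(h) * 3600 + int(m) * 60 + int(s)
            for h, m, s in re.findall(r"(\d+)h (\d+)m (\d+)s", text)]

print(to_seconds("1h 3m 6s 0h 13m 30s 0h 15m 12s"))  # [3786, 810, 912]
```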

why does "xgbTree" work in caret in the R GUI but not in an R script in Azure ML?
I have been spending hours trying to figure out what the problem is, and so far no luck. I have been using the train model in caret with the "xgbTree" method in R and have no problem with the model. I had to install the "xgboost" package first, though. I wanted to run the exact same line of code in an R script in Azure ML, but there I keep getting the error "Model xgbTree is not in caret's built-in library". I installed the "xgboost" and "magrittr" packages in Azure ML and it did not give me any error on them. It could also retrieve those packages afterwards, but it keeps giving me the error when it gets to the train line of my code. I am desperately looking to see how I can fix it. Here is my code in Azure ML:
history <- maml.mapInputPort(1) # class: data.frame
install.packages("src/RPaZURE/magrittr.zip", lib = ".", repos = NULL, verbose = TRUE)
install.packages("src/RPaZURE/xgboost.zip", lib = ".", repos = NULL, verbose = TRUE)
library(magrittr, lib.loc = ".")
library(xgboost, lib.loc = ".")
library(dplyr)
library(caret)
trainIndex <- createDataPartition(history$x, p = .7, list = FALSE, times = 1)
Train <- history[ trainIndex,]
Test <- history[-trainIndex,]
Train[] <- lapply(Train, as.numeric)
Test[] <- lapply(Test, as.numeric)
set.seed(123)
fitControl <- trainControl(method = 'cv', number = 10, savePredictions = TRUE)
grid_xgboost <- expand.grid(nrounds = 1000, eta = c(0.01, 0.1), max_depth = c(2, 4, 6, 8, 10), gamma = 1, colsample_bytree = 1, min_child_weight = 10, subsample = 1)
fit <- train(x = as.matrix(Train %>% select(-x)), y = Train$x, method = 'xgbTree', tuneGrid = grid_xgboost, trControl = fitControl, metric = "RMSE")
Any help is appreciated.

XGboost model consistently obtaining 100% accuracy?
I'm working with Airbnb's data, available here on Kaggle, and predicting the countries users will book their first trips to with an XGBoost model and almost 600 features in R. Running the algorithm through 50 rounds of 5-fold cross-validation, I obtained 100% accuracy each time. After fitting the model to the training data and predicting on a held-out test set, I also obtained 100% accuracy. These results can't be real. There must be something wrong with my code, but so far I haven't been able to figure it out. I've included a section of my code below. It's based on this article. Following along with the article (using the article's data + copying the code), I receive similar results. However, applying it to Airbnb's data, I consistently obtain 100% accuracy. I have no clue what is going on. Am I using the xgboost package incorrectly? Your help and time are appreciated.
# set up the data
# train is the data frame of features with the target variable to predict
full_variables <- data.matrix(train[,-1]) # country_destination removed
full_label <- as.numeric(train$country_destination) - 1

# training data
train_index <- caret::createDataPartition(y = train$country_destination, p = 0.70, list = FALSE)
train_data <- full_variables[train_index, ]
train_label <- full_label[train_index[,1]]
train_matrix <- xgb.DMatrix(data = train_data, label = train_label)

# test data
test_data <- full_variables[-train_index, ]
test_label <- full_label[-train_index[,1]]
test_matrix <- xgb.DMatrix(data = test_data, label = test_label)

# 5-fold CV
params <- list("objective" = "multi:softprob", "num_class" = classes, eta = 0.3, max_depth = 6)
cv_model <- xgb.cv(params = params, data = train_matrix, nrounds = 50, nfold = 5, early_stop_round = 1, verbose = F, maximize = T, prediction = T)

# out-of-fold predictions
out_of_fold_p <- data.frame(cv_model$pred) %>% mutate(max_prob = max.col(., ties.method = "last"), label = train_label + 1)
head(out_of_fold_p)

# confusion matrix
confusionMatrix(factor(out_of_fold_p$label), factor(out_of_fold_p$max_prob), mode = "everything")
Sample of the data I used for this can be found here by running this code:
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/loshita/Senior_project/master/train.csv")
y <- read.csv(text = x)

cross_val_score for xgboost with "early_stopping_rounds" returns "IndexError"
I am working on a regression model in python (v3.6) using sklearn and xgboost. I want to calculate sklearn.cross_val_score with early_stopping_rounds. The following code returns an error:
xgb_model = xgb.XGBRegressor(n_estimators=600, learning_rate=0.06)
xgb_cv = cross_val_score(xgb_model, train_x, train_y, cv=5, scoring='neg_mean_absolute_error', fit_params={'early_stopping_rounds': 3})

IndexError: list index out of range
Also, if I try to pass the parameter as 'xgbregressor__early_stopping_rounds' (as found online in some related topics), the following error shows up:
TypeError: fit() got an unexpected keyword argument 'xgbregressor__early_stopping_rounds'
If I run the same model without "fit_params", everything works fine. Is there any way I can avoid this error while using cross_val_score?

Difference between : and = in Python?
I was reading a Kaggle code this morning and I found the following expression:
merge: pd.DataFrame = pd.concat([train, test])
submission: pd.DataFrame = test[['test_id']]
My question is, why is he using : instead of =? Is it an error or does it have some meaning?
Thanks for your help!
EDIT: Reference code link : https://www.kaggle.com/tunguz/moreeffectiveridgelgbmscriptlb044823/code
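It is not an error: the colon introduces a variable annotation (PEP 526). merge: pd.DataFrame = pd.concat([train, test]) declares that merge is intended to be a pd.DataFrame and then assigns it; the annotation is a hint for readers and tools and is not enforced at runtime. A minimal stdlib demonstration:

```python
# 'name: Type = value' annotates and assigns in one statement (PEP 526);
# the value behaves exactly as in a plain assignment.
threshold: float = 0.5
print(threshold * 2)  # 1.0

class Config:
    rounds: int = 50  # same syntax in a class body; recorded in __annotations__

print(Config.__annotations__["rounds"])  # <class 'int'>
```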

filling missing of a column with different values based on specific conditions
I am new to machine learning. I am solving the Titanic problem using Python, and I want to fill the missing values of the age column with different values depending on conditions. How do I do that? For example, I want to fill the missing age values for the conditions Female, Class 1, Embarked = "C" with 36. How can I do this in a short way?
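One short way, sketched with pandas and assuming the usual Titanic column names (Sex, Pclass, Embarked, Age; adjust to the actual dataframe): build a boolean mask for the condition, combine it with isna() on Age, and assign through .loc. The toy rows below are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex":      ["female", "female", "male"],
    "Pclass":   [1, 1, 3],
    "Embarked": ["C", "C", "S"],
    "Age":      [None, 29.0, None],
})

# rows matching the condition AND having a missing Age get filled with 36
mask = (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Embarked"] == "C")
df.loc[mask & df["Age"].isna(), "Age"] = 36
```

Repeat with a different mask and fill value for each condition group.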

Cannot feed value of shape (64, 7) for Tensor 'targets/Y:0', which has shape '(?,)'
I'm working on Kaggle's fer2013 dataset. Here's a link to the dataset.
I'm using the TFLearn framework. I convert the labels (7 class labels) to one-hot and everything works fine until I run it through the network, and then I get the error: Cannot feed value of shape (64, 7) for Tensor 'targets/Y:0', which has shape '(?,)'
I read previous similar questions, and I understand that I'm trying to feed the network a tensor of a shape different from what it expects. My problem here is that I don't know how to reshape to what it expects, or even find out the shape it expects so I can reshape my tensor to match.
Here's my code.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Read csv file
data = pd.read_csv('fer2013.csv')

# Number of samples
n_samples = len(data)
n_samples_train = 28709
n_samples_test = 3589
n_samples_validation = 3589

IMG_SIZE = 48

# Pixel width and height
w = 48
h = 48

# Separating labels and features respectively
y = data['emotion']
X = np.zeros((n_samples, w, h, 1))
for i in range(n_samples):
    X[i] = np.fromstring(data['pixels'][i], dtype=int, sep=' ').reshape(w, h, 1)

# Training set
X_train = X[:n_samples_train]
y_train = y[:n_samples_train]
X_val = X[n_samples_train : (n_samples_train + n_samples_test)]
y_val = y[n_samples_train : (n_samples_train + n_samples_test)]

n_values = np.max(y_train) + 1
y_hot_shot_train = np.eye(n_values)[y_train]
n_values_val = np.max(y_val) + 1
y_hot_shot_val = np.eye(n_values_val)[y_val]

import tflearn
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from tflearn.data_augmentation import ImageAugmentation

LR = 0.001
imgaug = ImageAugmentation()
imgaug.add_random_flip_leftright()
imgaug.add_random_rotation(max_angle=25.)
convnet = input_data(shape=[None, IMG_SIZE, IMG_SIZE, 1], data_augmentation=imgaug, name='input')
convnet = conv_2d(convnet, 32, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = conv_2d(convnet, 64, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = conv_2d(convnet, 128, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = conv_2d(convnet, 64, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = conv_2d(convnet, 64, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = conv_2d(convnet, 32, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = fully_connected(convnet, 1024, activation='relu')
convnet = dropout(convnet, 0.8)
convnet = fully_connected(convnet, 7, activation='softmax')
convnet = regression(convnet, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets', to_one_hot=True, n_classes=7)

model = tflearn.DNN(convnet, tensorboard_dir='log')
MODEL_NAME = 'SentimentAnalysis{}{}.model'.format(LR, '6convbasic')
model.fit({'input': X_train}, {'targets': y_hot_shot_train}, n_epoch=6, batch_size=64, validation_set=({'input': X_val}, {'targets': y_hot_shot_val}), snapshot_step=500, show_metric=True, run_id=MODEL_NAME)
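One reading of the error: the regression layer is created with to_one_hot=True and n_classes=7, which tells TFLearn to expect integer class labels of shape (?,) and to perform the one-hot encoding itself, while the fit() call feeds the already-encoded (64, 7) y_hot_shot arrays. Either drop to_one_hot=True and keep the one-hot targets, or feed integer labels instead. Recovering integer labels from a one-hot array, sketched with NumPy (the batch below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical batch of 64 one-hot labels over 7 classes
one_hot = np.eye(7)[rng.integers(0, 7, size=64)]
print(one_hot.shape)  # (64, 7) -- what the network was being fed

# argmax along the class axis recovers integer labels of shape (64,),
# which a '(?,)' placeholder accepts
labels = one_hot.argmax(axis=1)
print(labels.shape)   # (64,)
```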

Cross-validation predictions for lightGBM
Is there a simple way to recover cross-validation predictions from the model built using
lgb.cv
from lightGBM?
I am doing a grid search combined with cross-validation. Ultimately I would like to obtain the predictions for each of the defined hold-out folds so I can also stack a few models.

Lightgbm with Tweedie
I'm trying to run lightgbm with a Tweedie distribution. I believe this code should be sufficient to see the problem:
lgb_train = lgb.Dataset(X_train, y_train, weight=W_train, categorical_feature=cat_features)
lgb_test = lgb.Dataset(X_test, y_test, weight=W_test, reference=lgb_train, categorical_feature=cat_features)
params = {
    'boosting': 'gbdt',
    'application': 'tweedie',
    'metric': 'tweedie',
    'tweedie_variance_power': 1.5,
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 1,
    'bagging_fraction': 1,
    'bagging_freq': 0,
    'verbose': 1,
    'early_stopping_round': 5,
    'num_iterations': 1000
}
mod1 = lgb.train(params, lgb_train, valid_sets=[lgb_test])
This runs fine when application and metric are set to poisson; however, with tweedie I get this traceback:
LightGBMError: b'No object function provided'

LightGBMError                             Traceback (most recent call last)
<ipython-input-34-884061fea80fd> in <module>()
     18 }
     19
---> 20 mod1=lgb.train(params,lgb_train,valid_sets=[lgb_test])

C:\Python\anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    197                 evaluation_result_list=None))
    198
--> 199     booster.update(fobj=fobj)
    200
    201     evaluation_result_list = []

C:\Python\anaconda3\lib\site-packages\lightgbm\basic.py in update(self, train_set, fobj)
   1437         _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
   1438             self.handle,
-> 1439             ctypes.byref(is_finished)))
   1440         self.__is_predicted_cur_iter = [False for _ in range_(self.__num_dataset)]
   1441         return is_finished.value == 1

C:\Python\anaconda3\lib\site-packages\lightgbm\basic.py in _safe_call(ret)
     46     """
     47     if ret != 0:
---> 48         raise LightGBMError(_LIB.LGBM_GetLastError())
     49
     50 LightGBMError: b'No object function provided'
I assume I'm missing a parameter, but I'm pretty sure I've put in place everything referencing a Tweedie in the docs.
Please could you help?
Cheers

How are decision trees grown in Catboost?
XGBoost grows decision trees with a depth-wise (level-wise) algorithm, so the trees are symmetric, while LightGBM grows decision trees with a leaf-wise algorithm, so the trees are not symmetric.
The question is: "What algorithm does CatBoost use to grow decision trees? Is it depth-wise or leaf-wise?"