Custom estimator for GridSearchCV
I have X_train with shape (100, 6) and y_train with shape (100, 3), and I want to use SVR to predict y_test. Unfortunately, SVR does not support multi-dimensional targets, so I implemented my own class:
```python
import numpy as np
from sklearn.base import BaseEstimator, clone

class VectorRegression(BaseEstimator):
    def __init__(self, estimator, **parameters):
        self.estimator = estimator(**parameters)

    def fit(self, X, y):
        n, m = y.shape
        # Fit a separate regressor for each column of y
        self.estimators_ = [clone(self.estimator).fit(X, y[:, i]) for i in range(m)]
        return self

    def predict(self, X):
        # Join the regressors' predictions
        res = [est.predict(X)[:, np.newaxis] for est in self.estimators_]
        return np.hstack(res)
```
Now I want to find the best parameters for SVR using GridSearchCV, but I do not know how to pass these parameters through my VectorRegression estimator, because the inner SVR is already constructed by the time the wrapper is plugged into the grid search.
Please tell me how to change the interface so that VectorRegression can be plugged into GridSearchCV.
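One common fix (a sketch of one possible design, not the only one) is to store the estimator *instance* as the constructor argument, untouched, so that the `get_params`/`set_params` machinery inherited from `BaseEstimator` works; GridSearchCV can then address the inner SVR's parameters with the nested `estimator__<param>` syntax, and `clone()` builds fresh copies inside `fit`:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

class VectorRegression(BaseEstimator, RegressorMixin):
    def __init__(self, estimator):
        # Store the estimator instance as-is; BaseEstimator.get_params(deep=True)
        # then exposes its parameters as 'estimator__C', 'estimator__epsilon', etc.
        self.estimator = estimator

    def fit(self, X, y):
        n, m = y.shape
        # Fit one clone of the base estimator per target column
        self.estimators_ = [clone(self.estimator).fit(X, y[:, i]) for i in range(m)]
        return self

    def predict(self, X):
        # Stack the per-column predictions back into an (n, m) array
        return np.hstack([est.predict(X)[:, np.newaxis]
                          for est in self.estimators_])

# Toy data standing in for the (100, 6) / (100, 3) arrays in the question
rng = np.random.RandomState(0)
X, y = rng.rand(30, 6), rng.rand(30, 3)

grid = GridSearchCV(VectorRegression(SVR()),
                    param_grid={'estimator__C': [0.1, 1.0]},
                    cv=3)
preds = grid.fit(X, y).predict(X)
```

Deriving from `RegressorMixin` gives the wrapper a default R² `score` method, which GridSearchCV uses when no `scoring` is passed. Note that scikit-learn ships `sklearn.multioutput.MultiOutputRegressor`, which implements essentially this same per-column strategy.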
See also questions close to this topic

Keras SimpleRNN  Shape MFCC vectors
I'm currently trying to implement a Recurrent Neural Network in Keras. The data consists of a collection of 45,000 entries, where each entry is a variable-length sequence of MFCC vectors with 13 coefficients each:
```python
spoken = numpy.load('spoken.npy')
print(spoken[0])
# Gives: example_row = [[5.67170000e01 1.79430000e01 7.27360000e+00 9.59300000e02
#                        9.30140000e02 1.62960000e01 4.11620000e01 3.00590000e01
#                        6.86360000e02 1.07130000e+00 1.07090000e01 5.00890000e01
#                        7.51750000e01], [.....]]
print(spoken.shape)     # Gives: (45000,)
print(spoken[0].shape)  # Gives: (N, 13) -> N = number of MFCC vectors
```
I'm struggling to understand how I need to reshape this Numpy array in order to feed it to the SimpleRNN of Keras:
```python
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=?))
.....
```
Therefore, my question is how do I need to reshape a collection of variable length MFCC vectors so that I can feed it to the SimpleRNN object of Keras?
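One standard approach (a sketch only, with random toy data standing in for `spoken.npy`) is to zero-pad every sequence to the length of the longest one, producing a dense 3-D array of shape `(n_samples, max_len, 13)`; the RNN's `input_shape` is then `(max_len, 13)`. Keras also offers `keras.preprocessing.sequence.pad_sequences` for the same job:

```python
import numpy as np

# Toy stand-in: 100 variable-length sequences of 13-dim MFCC vectors
rng = np.random.RandomState(0)
spoken = [rng.rand(rng.randint(5, 20), 13) for _ in range(100)]

max_len = max(seq.shape[0] for seq in spoken)

# Zero-pad every sequence up to max_len -> (n_samples, max_len, 13)
padded = np.zeros((len(spoken), max_len, 13))
for i, seq in enumerate(spoken):
    padded[i, :seq.shape[0], :] = seq

# input_shape for SimpleRNN would then be (max_len, 13);
# (None, 13) also works if each batch has a uniform length.
```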

Print elements of string multiple times based on position
I'm trying to print each element individually, which is fine, but also repeat each element based on its position, e.g. "abcd" = ABbCccDddd, etc.
So my problem is making print statements print x times based on the character's position in the string. I've tried a few combinations using len and range, but I often encounter errors because I'm using strings, not ints.
Should I be using len and range here? I'd prefer if you guys didn't post finished code, just basically how to go about that specific problem (if possible), so I can still go about figuring it out myself.
```python
user_string = input()

def accum(s):
    for letter in s:
        pos = s[0]
        print(letter.title())
        pos = s[0 + 1]

accum(user_string)
```
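For reference (one idiomatic sketch, since the trick being hinted at is `enumerate`, which yields each character together with its position): repeat each character position + 1 times and capitalize the first copy.

```python
def accum(s):
    # enumerate gives (index, char); repeat char (index+1) times,
    # then capitalize() upper-cases only the first copy
    return ''.join((c * (i + 1)).capitalize() for i, c in enumerate(s))

print(accum("abcd"))  # ABbCccDddd
```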

slicing strings with variables
I am making a python program where I need to inspect individual four-letter parts of a variable. If the variable is, for example, help_me_code_please, it should output ease, then e_pl, etc. I attempted it with
```python
a = 0
b = 3
repetitions = 5
word = "10011100110000111010"
for x in range(1, repetitions):
    print(word[a:b])
    a = a + 4
    b = b + 4
```
however, it just outputs empty lines. Thanks so much for any help in advance.
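A minimal sketch of the stepped-slicing idea, assuming (from the ease/e_pl example) that the four-character windows are taken from the end of the string working backwards; a stepped `range` drives the slice bounds directly, so no counters need updating by hand:

```python
word = "help_me_code_please"

# Walk the end index from len(word) down in steps of 4;
# each slice is the 4 characters ending at that index
chunks = [word[max(i - 4, 0):i] for i in range(len(word), 0, -4)]

print(chunks)  # ['ease', 'e_pl', '_cod', 'p_me', 'hel']
```

For forward, non-overlapping chunks, `range(0, len(word), 4)` with slices `word[i:i+4]` works the same way.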

Value Error: ConvLSTM2D
I tried to use the ConvLSTM2D architecture, but got a ValueError!
```python
import numpy as np, scipy.ndimage, matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, ConvLSTM2D, MaxPooling2D, UpSampling2D
from sklearn.metrics import accuracy_score, confusion_matrix, cohen_kappa_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

np.random.seed(123)

raw = np.arange(96).reshape(8, 3, 4)
data1 = scipy.ndimage.zoom(raw, zoom=(1, 100, 100), order=1, mode='nearest')  # low res
print(data1.shape)  # (8, 300, 400)
data2 = scipy.ndimage.zoom(raw, zoom=(1, 100, 100), order=3, mode='nearest')  # high res
print(data2.shape)  # (8, 300, 400)

X_train = data1.reshape(data1.shape[0], 1, data1.shape[1], data1.shape[2], 1)
Y_train = data2.reshape(data2.shape[0], 1, data2.shape[1], data2.shape[2], 1)
# (samples, time, rows, cols, channels)

model = Sequential()
input_shape = (data1.shape[0], data1.shape[1], data1.shape[2], 1)  # samples, time, rows, cols, channels
model.add(ConvLSTM2D(16, kernel_size=(3, 3), activation='sigmoid', padding='same', input_shape=input_shape))
model.add(ConvLSTM2D(8, kernel_size=(3, 3), activation='sigmoid', padding='same'))
```
The output is also an image, rather than a classification.
```python
print(model.summary())
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=1, epochs=10, verbose=1)
x, y = model.evaluate(X_train, Y_train, verbose=0)
print(x, y)
```
ValueError: Input 0 is incompatible with layer conv_lst_m2d_2: expected ndim=5, found ndim=4
How can I correct this ValueError? I think the problem is with the input shapes, but I could not figure out what exactly is wrong.
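Two shape issues are likely at play here (a sketch of the bookkeeping only, using numpy so the shapes can be checked without Keras). First, `input_shape` must omit the samples axis, i.e. be `(time, rows, cols, channels)`, whereas the code above includes `data1.shape[0]`. Second, a ConvLSTM2D layer by default returns only the last timestep (a 4-D tensor), so stacking a second ConvLSTM2D, which expects 5-D input, requires `return_sequences=True` on the first layer; that mismatch is exactly what "expected ndim=5, found ndim=4" on `conv_lst_m2d_2` reports.

```python
import numpy as np

raw = np.arange(96).reshape(8, 3, 4)
# nearest-neighbour upsampling standing in for scipy.ndimage.zoom
data1 = np.repeat(np.repeat(raw, 100, axis=1), 100, axis=2)  # (8, 300, 400)

# 5-D tensor expected by ConvLSTM2D: (samples, time, rows, cols, channels)
X_train = data1.reshape(data1.shape[0], 1, data1.shape[1], data1.shape[2], 1)

# input_shape must EXCLUDE the samples axis:
input_shape = X_train.shape[1:]   # (time, rows, cols, channels)

# and the first layer would need return_sequences=True, e.g.
# model.add(ConvLSTM2D(16, kernel_size=(3, 3), return_sequences=True,
#                      padding='same', input_shape=input_shape))
```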

How to compute Relative Error function in Python
As a beginner in Machine Learning trying to practice, I am struggling to write a function that computes a Relative Error Reduction with numpy and/or scikit-learn.
So, which labels are predicted wrong?
I have three inputs:
Numpy array with pred_labels A
pred_A = np.array(['apple', 'apple', 'apple', 'apple'])
Numpy array with pred_labels B
pred_B = np.array(['pear', 'apple', 'apple', 'apple'])
Numpy array with true_labels T
true_T = np.array(['orange', 'apple', 'pear', 'pear'])
I feel pretty silly that I cannot come up with a solution. Does someone know how to write such a function?
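A minimal sketch, assuming the common definition of relative error reduction, (error_A − error_B) / error_A, where each error rate is the fraction of predictions that disagree with the true labels; elementwise `!=` on the arrays also directly answers "which labels are predicted wrong":

```python
import numpy as np

pred_A = np.array(['apple', 'apple', 'apple', 'apple'])
pred_B = np.array(['pear', 'apple', 'apple', 'apple'])
true_T = np.array(['orange', 'apple', 'pear', 'pear'])

wrong_A = pred_A != true_T        # boolean mask of misclassified positions
wrong_B = pred_B != true_T

err_A = wrong_A.mean()            # error rate of system A
err_B = wrong_B.mean()            # error rate of system B

# Relative error reduction of B over A
rel_reduction = (err_A - err_B) / err_A
```

For these toy arrays both systems misclassify 3 of 4 labels, so the reduction is 0.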

Sklearn use train_test_split while controlling ratios between classes
When using scikit-learn's train_test_split tool, I would like to split the data while controlling the ratio between classes. Here's the problem:
```python
from sklearn.model_selection import train_test_split
from collections import Counter

x_coords = range(100)
labels = ['a'] * 90 + ['b'] * 10
x_train, x_test, label_train, label_test = train_test_split(x_coords, labels, test_size=20)
print(Counter(label_test))
```
which gives 'a': 18, 'b': 2. Another random split gives 'a': 20. So 20 samples were chosen at random, but the labeling was ignored. I would like to have control over the ratios between the classes. I'm currently doing this manually, by using train_test_split to split each class separately into a train and test set, and then recombining all of the test and training sets. Does anyone know a more elegant, sklearn-based way to do this? I see here that the split can be done so that the ratios between the classes in the original data are maintained, but I would like to specify the ratios myself.

How many combinations will GridSearchCV run for this?
I am using sklearn to run a grid search on a random forest classifier. This has been running for longer than I thought, and I am trying to estimate how much time is left for this process. I thought the total number of fits it would do would be 3*3*3*3*5 = 405.
```python
clf = RandomForestClassifier(n_jobs=1, oob_score=True, verbose=1)
param_grid = {'n_estimators': [50, 200, 500],
              'max_depth': [2, 3, 5],
              'min_samples_leaf': [1, 2, 5],
              'max_features': ['auto', 'log2', 'sqrt']}
gscv = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
gscv.fit(X.values, y.values.reshape(-1,))
```
From the output, I see it cycling through the tasks, where each set is the number of estimators:
```
[Parallel(n_jobs=1)]: Done  34 tasks      elapsed: 1.2min
[Parallel(n_jobs=1)]: Done 184 tasks      elapsed: 5.3min
[Parallel(n_jobs=1)]: Done 200 out of 200 tasks      elapsed: 6.2min finished
[Parallel(n_jobs=8)]: Done  34 tasks      elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 184 tasks      elapsed: 3.0s
[Parallel(n_jobs=8)]: Done 200 out of 200 tasks      elapsed: 3.2s finished
[Parallel(n_jobs=1)]: Done  34 tasks      elapsed: 1.1min
[Parallel(n_jobs=1)]: Done  50 out of  50 tasks      elapsed: 1.5min finished
[Parallel(n_jobs=8)]: Done  34 tasks      elapsed: 0.5s
[Parallel(n_jobs=8)]: Done  50 out of  50 tasks      elapsed: 0.8s finished
```
I counted up the number of "finished" and it is at 680 currently. I thought it would be done at 405. Is my calculation wrong?
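A likely explanation (a sketch, not a diagnosis of this exact run): the `[Parallel(...)]` lines come from the RandomForestClassifier's *own* `verbose=1` logging, one batch per tree-building or prediction pass within a single fit, so counting "finished" lines overcounts the CV fits. The candidate count itself can be checked directly:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [50, 200, 500],
              'max_depth': [2, 3, 5],
              'min_samples_leaf': [1, 2, 5],
              'max_features': ['auto', 'log2', 'sqrt']}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 = 81
n_fits = n_candidates * 5                      # * cv folds = 405
# plus one final refit on the full data with the best parameters
```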

Scikit-learn classifier with custom scorer dependent on a training feature
I am trying to train a RandomForestClassifier with a custom scorer whose output needs to be dependent on one of the features.
The X dataset contains 18 features:
The y is the usual array of 0s and 1s:
The RandomForestClassifier with custom scorer is used within a GridSearchCV instance: GridSearchCV(classifier, param_grid=[...], scoring=custom_scorer).
The custom scorer is defined via the Scikit-learn function make_scorer: custom_scorer = make_scorer(custom_scorer_function, greater_is_better=True).
This framework is very straightforward if the custom_scorer_function is dependent only on y_true and y_pred. However in my case I need to define a scorer which makes use of one of the 18 features contained in the X dataset, i.e. depending on the values of y_pred and y_true the custom score will be a combination of them and the feature.
My question is how can I pass the feature into the custom_scorer_function given that its standard signature accepts y_true and y_pred?
I am aware it accepts extra **kwargs, but passing the entire feature array in this way doesn't solve the problem, as this function is invoked for each pair of y_true and y_pred arrays (I would need to extract the individual feature values corresponding to them to make this work, which I am not sure can be done).
I have tried to augment the y_true array packing that feature into it and unpacking it within the custom_scorer_function (1st column are the actual labels, 2nd columns are the feature values I need to calculate the custom scores):
However doing so violates the requirements of the classifier of having a 1D labels array and triggers the following error.
ValueError: Unknown label type: 'continuous-multioutput'
Any help is much appreciated.
Thank you.
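One workaround worth noting (a sketch under the assumption that the relevant feature is a column of X; `FEATURE_IDX` and the weighting scheme are hypothetical): instead of wrapping the function with make_scorer, GridSearchCV's `scoring` parameter also accepts a bare callable with signature `(estimator, X, y)`, which receives the test fold's feature matrix directly, so no packing into y_true is needed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

FEATURE_IDX = 0  # hypothetical index of the feature the score depends on

def feature_aware_scorer(estimator, X, y):
    # Called per CV fold with the fold's own X and y, so the feature
    # values aligned with y_true/y_pred are simply a column of X
    y_pred = estimator.predict(X)
    feature = np.asarray(X)[:, FEATURE_IDX]
    # Illustrative combination: correctness weighted by |feature|
    return float(np.average(y == y_pred, weights=np.abs(feature) + 1e-9))

X, y = make_classification(n_samples=60, n_features=18, random_state=0)
gs = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=0),
                  param_grid={'max_depth': [2, 3]},
                  scoring=feature_aware_scorer, cv=3)
gs.fit(X, y)
```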