Implementation of Isolation Forest in Python
I am new to machine learning and am trying to learn and implement the Isolation Forest algorithm in Python.
My input has 40 features; the training set contains 4000 records and the test set contains 1000. Can someone help with sample code and with how to plot the output?
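A minimal sketch using scikit-learn's IsolationForest (an assumption: the question names no library, and the data here is random stand-in data with the shapes from the question):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.randn(4000, 40)   # stand-in for the real 4000 x 40 training data
X_test = rng.randn(1000, 40)    # stand-in for the real 1000 x 40 test data

# contamination is the assumed fraction of outliers; tune it for your data
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
clf.fit(X_train)

labels = clf.predict(X_test)            # +1 = inlier, -1 = outlier
scores = clf.decision_function(X_test)  # higher score = more normal
```

With 40 features you cannot plot the data directly; a common workaround is to project to 2-D (e.g. with PCA) and colour the points by `labels`, or simply plot a histogram of `scores`.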
See also questions close to this topic

Empty factor levels were dropped for columns when using MLR package
I have a question: when I try to use "makeClassifTask" from the mlr package to set up an SVM, I get a warning saying "Empty factor levels were dropped for columns". My code is:
install.packages("mlr")
library(mlr)
set.seed(1)
sample=sample(2,nrow(cleaned_caravan_train),replace=T)
train=cleaned_caravan_train[sample==1,]
test=cleaned_caravan_train[sample==2,]
makeClassifTask(data=train,target = "CARAVAN")
An example from the MLR package works very well:
install.packages("mlbench")
library(mlbench)
data("BostonHousing")
data("Ionosphere")
makeClassifTask(data=iris,target="Species")
I don't understand the difference between the two.
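A likely cause: subsetting a data frame in R keeps all of a factor's original levels, including ones that now have zero rows, and makeClassifTask drops those empty levels with that warning (running `droplevels(train)` first should silence it). The iris example never subsets, so no empty levels exist. The same situation can be reproduced with a pandas categorical, shown here purely as a language-neutral illustration:

```python
import pandas as pd

# A categorical with three levels, like an R factor
s = pd.Series(pd.Categorical(["a", "a", "b", "c"], categories=["a", "b", "c"]))

# Subsetting keeps the now-empty category around, just like an R factor
subset = s[s != "c"]
print(list(subset.cat.categories))   # still ['a', 'b', 'c']

# Dropping unused levels is the analogue of R's droplevels()
cleaned = subset.cat.remove_unused_categories()
print(list(cleaned.cat.categories))  # now ['a', 'b']
```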

Dialogflow: selecting specific value for the action while training
Is it possible to select a specific value for the action in the training or intent tab?
For example: I have an entity PLACES with a lot of places in the city, and I try to keep tons of synonyms for each.
Let's say there is a place called City Museum and synonyms are "museum, city museum, cit mus, meseum" and so on, with mistakes or other aliases.
Currently I have to add them manually, because while training there is no way to select a specific value for the entity. I can select the proper intent and then the entity, but Dialogflow creates a new value for words it doesn't yet know rather than adding them to the existing list.
Is there any way to do it this way?

What is wrong with my cosine similarity? (TensorFlow)
I want to use cosine similarity in my Neural network, instead of the standard dot product.
I've had a look at the dot product and at cosine similarity.
In the example I looked at, they use:
a = tf.placeholder(tf.float32, shape=[None], name="input_placeholder_a")
b = tf.placeholder(tf.float32, shape=[None], name="input_placeholder_b")
normalize_a = tf.nn.l2_normalize(a,0)
normalize_b = tf.nn.l2_normalize(b,0)
cos_similarity=tf.reduce_sum(tf.multiply(normalize_a,normalize_b))
sess=tf.Session()
cos_sim=sess.run(cos_similarity,feed_dict={a:[1,2,3],b:[2,4,6]})
However, I tried doing it my own way
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3], name='x')   # input has 3 features
w1 = tf.placeholder(tf.float32, [10, 3], name='w1')   # 10 nodes in the first hidden layer
# transpose_b is needed so the (n, 3) x (3, 10) shapes line up
cos_sim = tf.divide(tf.matmul(x, w1, transpose_b=True),
                    tf.multiply(tf.norm(x), tf.norm(w1)))
with tf.Session() as sess:
    sess.run(cos_sim, feed_dict={x: np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                                 w1: np.random.uniform(0, 1, size=(10, 3))})
Is my way wrong? Also, what is going on in the matrix multiplication? Are we actually multiplying the weights of one node for the inputs of different samples (within one feature)?
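For comparison, here is a plain NumPy sketch of a per-sample cosine similarity between inputs and weight vectors. The key difference from the code above is that each row is normalized separately, rather than dividing by the norm of the whole matrix:

```python
import numpy as np

x = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])                  # 3 samples, 3 features
rng = np.random.RandomState(0)
w1 = rng.uniform(0, 1, size=(10, 3))          # 10 hidden units, 3 weights each

# Normalize every sample and every weight vector to unit length
x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
w_n = w1 / np.linalg.norm(w1, axis=1, keepdims=True)

# (3, 10) matrix: entry [i, j] is the cosine of the angle between
# sample i and hidden unit j's weight vector
cos_sim = x_n @ w_n.T
```

Because each row is a unit vector, every entry of `cos_sim` is guaranteed to lie in [-1, 1], which the global-norm version does not guarantee.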

How to visualize output of intermediate layers of convolutional neural network in keras?
Recently I created a basic CNN model for cats vs. dogs classification (very basic). How can I visualize the output of its layers using Keras? I use the TensorFlow backend for Keras.
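One common approach, sketched here with tf.keras and a toy model (the layer sizes and input shape are placeholders, not the actual cats/dogs network), is to build a second Model that shares the inputs but outputs an intermediate layer's activations:

```python
import numpy as np
import tensorflow as tf

# Toy CNN built with the functional API; stands in for the real model
inputs = tf.keras.Input(shape=(16, 16, 1))
conv = tf.keras.layers.Conv2D(4, 3, activation="relu", name="conv1")(inputs)
flat = tf.keras.layers.Flatten()(conv)
outputs = tf.keras.layers.Dense(2, activation="softmax")(flat)
model = tf.keras.Model(inputs, outputs)

# A second model that exposes the intermediate conv feature maps
activation_model = tf.keras.Model(inputs, conv)
feature_maps = activation_model.predict(np.random.rand(1, 16, 16, 1))
print(feature_maps.shape)  # one 14x14 map per filter
```

Each `feature_maps[0, :, :, i]` is a 2-D array that can be displayed with matplotlib's `imshow` to see what filter `i` responds to.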

Grouping similar data to maximize intragroup correlation and minimize intergroup correlation
So this is my problem: I have daily return data for 2000 stocks, and below is a small sample of it (s1 to s8, day 1 to day 15).
I'll call my data "df".
> df[1:15,1:8]
         s1       s2        s3        s4       s5       s6       s7       s8
1  0.026410 0.001030 0.0027660 0.0126500 0.030110 0.001476 0.008271 0.005299
2  0.018990 0.013680 0.0092050 0.0008402 0.002739 0.014170 0.006091 0.011920
3  0.004874 0.024140 0.0002107 0.0084770 0.006825 0.001448 0.002724 0.003132
4  0.019300 0.004649 0.0223400 0.0080200 0.008197 0.015270 0.004064 0.008149
5  0.010350 0.010650 0.0087780 0.0059960 0.001390 0.006454 0.018990 0.002822
6  0.028650 0.010490 0.0157200 0.0004123 0.019750 0.005902 0.004261 0.019110
7  0.004203 0.002682 0.0099840 0.0070060 0.025670 0.014550 0.016700 0.011580
8  0.042170 0.019490 0.0023140 0.0083030 0.018170 0.021160 0.006864 0.009438
9  0.017250 0.026600 0.0031630 0.0069090 0.035990 0.008429 0.001500 0.011830
10 0.037400 0.022370 0.0088460 0.0012690 0.050820 0.025300 0.028040 0.023790
11 0.091140 0.018830 0.0052160 0.0403000 0.001410 0.007050 0.024340 0.013110
12 0.051620 0.004791 0.0336000 0.0094320 0.018320 0.019490 0.044080 0.024020
13 0.007711 0.002158 0.0177400 0.0090470 0.004346 0.001562 0.096030 0.015840
14 0.041440 0.001072 0.0168400 0.0180300 0.012980 0.015280 0.059780 0.014730
15 0.042620 0.025560 0.0180200 0.0115200 0.033320 0.015150 0.014580 0.012710
I need a way to group them so that the intragroup correlation is maximized and intergroup correlation is minimized.
So for example, I can group them into two groups randomly as following: (s1, s2, s3, s4) and (s5, s6, s7, s8) The problem is, some of the stocks might be correlated with each other, and some might not.
So my solution was to:
get a correlation matrix (assuming Pearson's method works fine)
cor_df <- cor(df)
melt (flatten) the correlation matrix, sort it in descending order, and remove duplicates and rows with correlation coefficient = 1 (I used the reshape library)
cor_df_melt <- melt(cor_df)
names(cor_df_melt)[1] <- "x1"
names(cor_df_melt)[2] <- "x2"
names(cor_df_melt)[3] <- "corr"
cor_df_ordered <- cor_df_melt[order(-cor_df_melt$corr),]
Then I numbered the rows of the flattened matrix and removed the duplicates (the even-numbered rows) and the rows with correlation coefficient = 1:
cor_df_numbered <- cbind(row=c(1:nrow(cor_df_ordered)),cor_df_ordered)
cor_df_ready <- cor_df_numbered[cor_df_numbered$row%%2==0 & cor_df_numbered$corr!=1,2:4]
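The melt-and-dedupe steps above can be sketched in pandas (shown purely as an illustration in another language; random data stands in for the returns, and taking the upper triangle of the correlation matrix replaces the even-row trick, since it removes both the duplicate pairs and the diagonal in one go):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(15, 8), columns=[f"s{i}" for i in range(1, 9)])

corr = df.corr()  # Pearson by default, like R's cor()

# Keep only the strict upper triangle: each pair appears once and the
# diagonal (corr == 1) is excluded; then sort descending
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
pairs = pairs.rename_axis(["x1", "x2"]).reset_index(name="corr")

print(pairs.head())  # strongest pairs first, like cor_df_ready
```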
After this, my data frame with nicely ordered correlation coefficients for each pair in descending order was ready as follows:
> cor_df_ready
   x1 x2       corr
63 s7 s8 0.49223783
57 s1 s8 0.42518667
50 s2 s7 0.42369762
49 s1 s7 0.40824283
58 s2 s8 0.40395569
42 s2 s6 0.40394894
54 s6 s7 0.39408677
62 s6 s8 0.38536734
34 s2 s5 0.36882709
53 s5 s7 0.36066870
45 s5 s6 0.35734278
59 s3 s8 0.34295713
51 s3 s7 0.34163733
61 s5 s8 0.33264868
9  s1 s2 0.32812763
41 s1 s6 0.31221715
18 s2 s3 0.30692909
43 s3 s6 0.29390325
33 s1 s5 0.28845243
35 s3 s5 0.27859972
17 s1 s3 0.25039209
52 s4 s7 0.12989487
60 s4 s8 0.12095196
25 s1 s4 0.10902471
26 s2 s4 0.09471694
44 s4 s6 0.08039435
36 s4 s5 0.06957264
27 s3 s4 0.06027389
(By the way, I have no idea why the row numbers are out of order like that... can anyone explain?)
From here, my intuition was that the top pair with the highest correlation coefficient, (s7, s8) at 0.49223783, had to be in the same group.
So from my cor_df_ready data frame, I chose all pairs with "s7" included and extracted the 4 stocks that appear at the top of the list (s7, s8, s2, s1) and named them group 1.
I then excluded all rows including (s7, s8, s2, s1) from my cor_df_ready, and repeated the process to come up with the second group (s3, s4, s5, s6).
(Well, in this example I didn't have to repeat the process, as only one set remained.)
Then, I got the correlation matrix for each group and added the sum of every correlation coefficient:
group1_cor <- cor(group1)
group2_cor <- cor(group2)
cor_sum <- sum(group1_cor) + sum(group2_cor)
Then I took the mean of each row (day) within each group, computed the correlation matrix of the two group means, and summed it; I named this cor_sum_mean.
Lastly, I calculated cor_sum_mean/cor_sum.
The intuition was that maximizing the correlation within each group would maximize cor_sum, while minimizing the correlation between groups would minimize cor_sum_mean.
I want cor_sum (intragroup correlation) to be as big as possible and cor_sum_mean (intergroup correlation) to be as small as possible.
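Under these definitions, the two quantities can be sketched in pandas/NumPy (random stand-in data and a fixed split into two groups; the names mirror the question):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
df = pd.DataFrame(rng.randn(15, 8), columns=[f"s{i}" for i in range(1, 9)])

groups = [["s1", "s2", "s7", "s8"], ["s3", "s4", "s5", "s6"]]

# Intragroup objective: sum of every entry of each group's correlation matrix
cor_sum = sum(df[g].corr().values.sum() for g in groups)

# Intergroup objective: per-day mean return of each group, then the sum of
# the correlation matrix of those group-mean series
group_means = pd.DataFrame({i: df[g].mean(axis=1) for i, g in enumerate(groups)})
cor_sum_mean = group_means.corr().values.sum()

ratio = cor_sum_mean / cor_sum
print(cor_sum, cor_sum_mean, ratio)
```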
Using my method for the whole data, I divided 2000 stocks into 10 groups and what I got was
# cor_sum = 131923.1
# cor_sum_mean = 83.1731
# cor_sum_mean/cor_sum = 0.0006305
I KNOW I can get the cor_sum_mean/cor_sum down to 0.000542 (or even smaller), but I am simply stuck.
I searched Google, Stack Overflow, and Cross Validated, and I got the idea that machine learning / time-series clustering / classification could be the answer I'm looking for.
The following two previously posted questions seemed helpful, but I'm only starting to learn data science, so I'm having a hard time understanding them:
https://stats.stackexchange.com/questions/9475/time-series-clustering/19042#19042
https://stats.stackexchange.com/questions/3238/time-series-clustering-in-r
Can anyone please explain these, or point me toward what specifically I should look into?
This was a long question... Thanks for reading!
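For what it's worth, the direction the linked threads point toward can be sketched with SciPy's hierarchical clustering, using 1 − correlation as the distance between stocks (the group count and linkage method here are arbitrary choices for illustration, not the "right" answer):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.RandomState(2)
df = pd.DataFrame(rng.randn(60, 8), columns=[f"s{i}" for i in range(1, 9)])

# Distance between two stocks: 1 - correlation, so similar series are "close"
dist = 1.0 - df.corr().values
np.fill_diagonal(dist, 0.0)  # remove tiny floating-point residue on the diagonal
condensed = squareform(dist, checks=False)

Z = linkage(condensed, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 groups
print(labels)  # one group id per stock
```

With the real data, `t` would be 10 (the number of groups in the question), and the resulting labels give a grouping that can be scored with the cor_sum_mean/cor_sum ratio and compared against the greedy pair-picking approach.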