Using a DBSCAN model in ELKI: assigning new objects to clusters
I'm using the DBSCAN algorithm in ELKI. Once I have created my model on the training data, how can I predict the cluster of a new observation from this model, Clustering<Model> c?
My initial data is a double[][] array, while every new observation is a double[] array. I used the example given in this answer and it works perfectly, but how can I reuse the Clustering<Model> c on new test data? Do I have to write the prediction method myself, or is there an existing method like "predict"?
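As far as I know, ELKI's Clustering<Model> is a static result object and does not expose a predict method, so new observations are usually assigned by hand. One common heuristic for DBSCAN is to give a new point the label of its nearest core point if that core point lies within eps, and to treat it as noise otherwise. Below is a minimal sketch of that idea in plain Java, not the ELKI API; it assumes you have already extracted the core points and their cluster labels from the clustering:

```java
public class DbscanAssign {
    // Euclidean distance between two points of equal dimension.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    /**
     * Assigns a new observation to the cluster of its nearest core point,
     * provided that core point lies within eps; otherwise returns -1 (noise).
     * corePoints[i] belongs to the cluster with label clusterLabels[i].
     */
    static int assign(double[] obs, double[][] corePoints, int[] clusterLabels, double eps) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < corePoints.length; i++) {
            double d = dist(obs, corePoints[i]);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return (best >= 0 && bestDist <= eps) ? clusterLabels[best] : -1;
    }

    public static void main(String[] args) {
        double[][] core = { {0.0, 0.0}, {0.1, 0.1}, {5.0, 5.0} };
        int[] labels = { 0, 0, 1 };
        System.out.println(assign(new double[]{0.05, 0.0}, core, labels, 0.5)); // -> 0
        System.out.println(assign(new double[]{5.1, 4.9}, core, labels, 0.5));  // -> 1
        System.out.println(assign(new double[]{2.5, 2.5}, core, labels, 0.5));  // -> -1 (noise)
    }
}
```

Note this mirrors how DBSCAN itself labels border points, but it is only an approximation: a point that would have become a new core point during a full re-run can change the clustering, which no post-hoc assignment can capture.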
See also questions close to this topic

What is the difference between a Flink time window and a sliding time window?
I'm investigating how Apache Flink works and trying to understand time windows in Flink.

getting the binary data represented by the hexadecimal string back in java vs python
I know that in Python, binascii.unhexlify(initValue) returns the binary data represented by the hexadecimal string.
I am trying to convert binascii.unhexlify(initValue) to Java.
I tried the following line in Java, but I am getting different results than the Python code:
DatatypeConverter.parseHexBinary(value);
I run the following example:
my input  hexadecimal string:
value = '270000f31d32d1051400000000000000000000000006000000000000000000000000000000000000'
when running in python:
result = binascii.unhexlify(value)
I am getting:
result = "'\x00\x00\xf3\x1d2\xd1\x05\x14\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
when running in java:
byte[] bytes = DatatypeConverter.parseHexBinary(value);
I am getting:
bytes = [39, 0, 0, -13, 29, 50, -47, 5, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1. Why am I getting different results?
2. Why does the Python output contain '\' escape characters?
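The two results are in fact the same bytes: Java's byte type is signed, so the byte 0xf3 prints as -13 and 0xd1 as -47, while Python shows the unsigned values (243, 209) as \xNN escape sequences in the string repr (printable bytes such as 0x32 appear as the character '2'). Masking a Java byte with & 0xFF recovers the unsigned view. A small self-contained sketch; a manual hex parser is used here only so the example does not depend on javax.xml.bind:

```java
public class HexBytes {
    // Parses a hex string into bytes (what DatatypeConverter.parseHexBinary does).
    static byte[] parseHex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] b = parseHex("f31d32d1");
        // Java bytes are signed: 0xf3 prints as -13, 0xd1 as -47.
        System.out.println(java.util.Arrays.toString(b)); // [-13, 29, 50, -47]
        // Mask with 0xFF to get Python's unsigned view: 243 29 50 209
        for (byte x : b) System.out.print((x & 0xFF) + " ");
    }
}
```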

Hibernate soft delete sets foreign key to null
I have a parent entity like this:
@SQLDelete(sql = "UPDATE parent_table SET deleted = true WHERE id = ?")
public class Parent {
    private boolean deleted;
    @OneToMany(cascade = CascadeType.ALL)
    @JoinColumn(name = "parent_id")
    private List<Child> children;
    // other stuff
}

@SQLDelete(sql = "UPDATE child_table SET deleted = true WHERE id = ?")
public class Child {
    private boolean deleted;
    // stuff
}
As you can see, it's a unidirectional @OneToMany mapping and both entities use soft delete with the @SQLDelete annotation. I'm trying to soft delete the parent and in turn want the children to be soft deleted as well. When I try to soft delete, it sets the deleted flag to true in both tables, and that's what I want.
However, the parent_id in the child_table is set to null when I perform the delete. Why is this happening and how can I stop it? The delete operation:
Parent parent = entityManager.find(Parent.class, id);
entityManager.remove(parent);

Grouping similar data to maximize intragroup correlation and minimize intergroup correlation
So this is my problem. I have daily return data for 2000 stocks; below is a small sample of it (s1 to s8, day 1 to day 15).
I'll call my data "df".
> df[1:15,1:8]
         s1       s2        s3        s4       s5       s6       s7       s8
1  0.026410 0.001030 0.0027660 0.0126500 0.030110 0.001476 0.008271 0.005299
2  0.018990 0.013680 0.0092050 0.0008402 0.002739 0.014170 0.006091 0.011920
3  0.004874 0.024140 0.0002107 0.0084770 0.006825 0.001448 0.002724 0.003132
4  0.019300 0.004649 0.0223400 0.0080200 0.008197 0.015270 0.004064 0.008149
5  0.010350 0.010650 0.0087780 0.0059960 0.001390 0.006454 0.018990 0.002822
6  0.028650 0.010490 0.0157200 0.0004123 0.019750 0.005902 0.004261 0.019110
7  0.004203 0.002682 0.0099840 0.0070060 0.025670 0.014550 0.016700 0.011580
8  0.042170 0.019490 0.0023140 0.0083030 0.018170 0.021160 0.006864 0.009438
9  0.017250 0.026600 0.0031630 0.0069090 0.035990 0.008429 0.001500 0.011830
10 0.037400 0.022370 0.0088460 0.0012690 0.050820 0.025300 0.028040 0.023790
11 0.091140 0.018830 0.0052160 0.0403000 0.001410 0.007050 0.024340 0.013110
12 0.051620 0.004791 0.0336000 0.0094320 0.018320 0.019490 0.044080 0.024020
13 0.007711 0.002158 0.0177400 0.0090470 0.004346 0.001562 0.096030 0.015840
14 0.041440 0.001072 0.0168400 0.0180300 0.012980 0.015280 0.059780 0.014730
15 0.042620 0.025560 0.0180200 0.0115200 0.033320 0.015150 0.014580 0.012710
I need a way to group them so that the intragroup correlation is maximized and intergroup correlation is minimized.
So for example, I can group them into two groups randomly as following: (s1, s2, s3, s4) and (s5, s6, s7, s8) The problem is, some of the stocks might be correlated with each other, and some might not.
So my solution was to:
get a correlation matrix (assuming Pearson's method works fine)
cor_df <- cor(df)
melt(flatten) the correlation list in descending order and remove duplicates and rows with correlation coefficient = 1 (used reshape library)
cor_df_melt <- melt(cor_df)
names(cor_df_melt)[1] <- "x1"
names(cor_df_melt)[2] <- "x2"
names(cor_df_melt)[3] <- "corr"
cor_df_ordered <- cor_df_melt[order(-cor_df_melt$corr),]
Then I numbered the flattened matrix and removed the duplicates (even-numbered rows) and the rows with correlation coefficient = 1:
cor_df_numbered <- cbind(row = c(1:nrow(cor_df_ordered)), cor_df_ordered)
cor_df_ready <- cor_df_numbered[cor_df_numbered$row %% 2 == 0 & cor_df_numbered$corr %% 2 != 1, 2:4]
After this, my data frame with nicely ordered correlation coefficients for each pair in descending order was ready as follows:
> cor_df_ready
   x1 x2       corr
63 s7 s8 0.49223783
57 s1 s8 0.42518667
50 s2 s7 0.42369762
49 s1 s7 0.40824283
58 s2 s8 0.40395569
42 s2 s6 0.40394894
54 s6 s7 0.39408677
62 s6 s8 0.38536734
34 s2 s5 0.36882709
53 s5 s7 0.36066870
45 s5 s6 0.35734278
59 s3 s8 0.34295713
51 s3 s7 0.34163733
61 s5 s8 0.33264868
9  s1 s2 0.32812763
41 s1 s6 0.31221715
18 s2 s3 0.30692909
43 s3 s6 0.29390325
33 s1 s5 0.28845243
35 s3 s5 0.27859972
17 s1 s3 0.25039209
52 s4 s7 0.12989487
60 s4 s8 0.12095196
25 s1 s4 0.10902471
26 s2 s4 0.09471694
44 s4 s6 0.08039435
36 s4 s5 0.06957264
27 s3 s4 0.06027389
(BTW, I have no idea why the row numbers are disordered like that... can anyone explain?)
From here, my intuition was for the top pair with the highest correlation coefficient 0.49223783 (s7, s8), they had to be in the same group.
So from my cor_df_ready data frame, I chose all pairs with "s7" included and extracted the 4 stocks that appear at the top of the list (s7, s8, s2, s1) and named them group 1.
I then excluded all rows including (s7, s8, s2, s1) from my cor_df_ready, and repeated the process to come up with the second group (s3, s4, s5, s6).
Well, in this example I didn't have to repeat the process, as there was only one set remaining.
Then, I got the correlation matrix for each group and added the sum of every correlation coefficient:
group1_cor <- cor(group1)
group2_cor <- cor(group2)
cor_sum <- sum(group1_cor) + sum(group2_cor)
Then I got the mean of each row in each group, calculated the sum of the correlation matrix for the two group means, and named it cor_sum_mean.
Lastly, I calculated cor_sum_mean/cor_sum.
The intuition was that maximized correlation within groups would maximize cor_sum, while minimized correlation between groups would minimize cor_sum_mean.
I want cor_sum to be as large as possible (intragroup correlation) and cor_sum_mean to be as small as possible (intergroup correlation).
Using my method for the whole data, I divided 2000 stocks into 10 groups and what I got was
# cor_sum = 131923.1
# cor_sum_mean = 83.1731
# cor_sum_mean/cor_sum = 0.0006305
I KNOW I can get cor_sum_mean/cor_sum down to 0.000542 (or even smaller), but I am simply stuck.
I searched Google, Stack Overflow, and Cross Validated, and got the idea that machine learning / time-series clustering / classification could be the answer I'm looking for.
The following two previously posted questions seemed helpful, but I'm only starting to learn data science, so I'm having a hard time understanding them:
https://stats.stackexchange.com/questions/9475/time-series-clustering/19042#19042
https://stats.stackexchange.com/questions/3238/time-series-clustering-in-r
Can anyone please explain, or point me to what specifically to look for?
This was a long question... Thanks for reading!
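The greedy procedure described above (seed a group with the strongest remaining pair, then pull in the stocks most correlated with the seed) can be sketched in plain Java, assuming the pairwise correlation matrix has already been computed. All names and the toy matrix below are illustrative; this is only a sketch of the heuristic, not a method guaranteed to reach the optimum:

```java
import java.util.ArrayList;
import java.util.List;

public class GreedyGrouping {
    /**
     * Greedily partitions n items into groups of size k using a precomputed
     * correlation matrix: seed each group with the highest remaining pair,
     * then add the unassigned items most correlated with the first seed.
     */
    static List<List<Integer>> group(double[][] corr, int k) {
        int n = corr.length;
        boolean[] used = new boolean[n];
        List<List<Integer>> groups = new ArrayList<>();
        for (int g = 0; g < n / k; g++) {
            // find the highest-correlation unassigned pair as the group's seed
            int bi = -1, bj = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++)
                    if (!used[i] && !used[j] && corr[i][j] > best) { best = corr[i][j]; bi = i; bj = j; }
            List<Integer> grp = new ArrayList<>(List.of(bi, bj));
            used[bi] = used[bj] = true;
            // fill the group with items most correlated to the first seed
            while (grp.size() < k) {
                int add = -1;
                double b2 = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < n; j++)
                    if (!used[j] && corr[bi][j] > b2) { b2 = corr[bi][j]; add = j; }
                used[add] = true;
                grp.add(add);
            }
            groups.add(grp);
        }
        return groups;
    }

    public static void main(String[] args) {
        // toy correlation matrix for 4 items: (0,1) and (2,3) highly correlated
        double[][] c = {
            {1.0, 0.9, 0.1, 0.2},
            {0.9, 1.0, 0.2, 0.1},
            {0.1, 0.2, 1.0, 0.8},
            {0.2, 0.1, 0.8, 1.0},
        };
        System.out.println(group(c, 2)); // [[0, 1], [2, 3]]
    }
}
```

Because this is greedy, it can get stuck in exactly the kind of local optimum you describe; the clustering approaches in the linked questions (e.g. hierarchical clustering on a 1 - correlation distance matrix) search the grouping space more systematically.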

I seek an existing routine to partition the nodes of a graph database based on a property like distance
The assumptions are that every node is connected by a relationship R to every other node (like a distance), and there is a property of the relationship that measures the "distance" or whatever between any two nodes.
I am looking for three outputs:
The set of nodes in each natural cluster that the data falls into, with no restriction on the clusters.
The set of nodes in each cluster, where the number of clusters is given in advance, i.e., divide the nation into n states.
The set of nodes in each cluster, where the number of clusters is given in advance and the clusters must contain equal numbers of nodes, i.e., be the same size. Divide the nation into equal-sized regional voting districts.
The requirement throughout is that the clusters contain nodes that are "close" to each other based on the property of R. A geographic analogy is to divide a population into regions.
Any existing routines would help here. Thank you

How to determine in R the maximum number of cluster centers while staying under the number of distinct points?
I have got a data frame with 99 rows, 18 of which are unique (per the unique function). When I want a scree plot with the following code, I can't go further than 6 iterations or I get:
" Error in kmeans(errors.for.dim.normalized, centers = i) : more cluster centers than distinct data points. "
Code :
wss <- (nrow(errors.for.dim.normalized) - 1) * sum(apply(errors.for.dim.normalized, 2, var))
for (i in 2:6) wss[i] <- sum(kmeans(errors.for.dim.normalized, centers = i)$withinss)
plot(1:9, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
Picture of the cluster plot by the k-means algorithm:
I understand from the picture why it's limited, but I don't know how to get the maximal value before getting this message, without visualizing the plot. I thought unique would do it, but it didn't. (I am an R beginner and this is my first time working with data.)
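kmeans refuses to run when centers exceeds the number of distinct data points, so the safe upper bound for the loop is the number of distinct rows; in R that would be nrow(unique(errors.for.dim.normalized)). The same check can be sketched in Java with plain arrays (the data below is illustrative):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctRows {
    // k-means requires centers <= number of distinct rows; count them first
    // and use the count as the upper bound of the scree-plot loop.
    static int distinctRows(double[][] data) {
        Set<List<Double>> seen = new HashSet<>();
        for (double[] row : data) {
            List<Double> r = new ArrayList<>();
            for (double v : row) r.add(v);
            seen.add(r);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        double[][] data = { {1, 2}, {1, 2}, {3, 4}, {5, 6}, {3, 4} };
        int maxK = distinctRows(data);
        System.out.println(maxK); // 3 -> loop k over 2..maxK, not a fixed 2..6
    }
}
```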

Many predicted values missing after a mixed-effects logistic regression model in R
After I ran a mixed effects logistic regression and predicted the outcome using this code:
died_ed <- glmer(died_ed ~ age_1 + gender + race + insurance + injury + ais + blunt_pen + comorbid + iss + min_dist + pop_dens_new + age_mdn + male_pct + pop_wht_pct + pop_blk_pct + unemp_pct + pov_100x_npct + urban_pct + (1 | zip/state), data = trauma, family = binomial, control = glmerControl(optimizer = 'bobyqa'), na.action = na.exclude)
trauma_model$pr_died_ed <- predict(died_ed, type = c('response'))
99% of the predicted values are missing (i.e., I only have 80 predicted values; the total n = 57846), and I'm not sure why. R gave some warning messages after I ran the model; I'm not sure if this is related:
Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : unable to evaluate scaled gradient
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Hessian is numerically singular: parameters are not uniquely determined
Any ideas?
Thanks!!

How to select the right matrix for chatbot prediction using python?
I'm working on a chatbot prediction model in Python. I have a matrix of phrases and intents. The phrases' words are tokenized and located at the header of the matrix. Now I need to select the row that scores the highest, to predict the right intent for the user's input.
Say the input is "How can you help me?". The intent my model should predict is the 'help' intent, because it scores best among the other intents and contains the words in the phrase (highlighted intent).
I used TfidfVectorizer to create my matrix from my JSON file.
Click the URL to see the sample matrix.
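This is not TfidfVectorizer itself, but the row-selection step can be sketched generically: score each intent's weight row by summing the weights of the tokens that appear in the input, then take the argmax. A hedged Java sketch with an invented toy vocabulary and weight matrix (all names and numbers are illustrative):

```java
import java.util.Map;

public class IntentScorer {
    /**
     * Picks the intent whose weight row scores highest against the tokenized
     * input. vocab maps token -> column index; weights[i][j] is the (e.g.
     * tf-idf) weight of token j for intent i.
     */
    static int predict(String input, Map<String, Integer> vocab,
                       double[][] weights, String[] intents) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < weights.length; i++) {
            double score = 0;
            // sum the weights of the input tokens present in the vocabulary
            for (String tok : input.toLowerCase().split("\\W+")) {
                Integer col = vocab.get(tok);
                if (col != null) score += weights[i][col];
            }
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocab = Map.of("help", 0, "hello", 1, "bye", 2);
        double[][] weights = { {0.9, 0.1, 0.0},   // 'help' intent
                               {0.0, 0.8, 0.1},   // 'greeting' intent
                               {0.0, 0.0, 0.9} }; // 'goodbye' intent
        String[] intents = { "help", "greeting", "goodbye" };
        System.out.println(intents[predict("How can you help me?", vocab, weights, intents)]);
    }
}
```

With a real tf-idf matrix you would more commonly compute cosine similarity between the vectorized input and each row, which sklearn exposes directly; the argmax idea is the same.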

Decision Tree Prediction
I'm trying to do a task on practical machine learning using a decision tree, and I'm running into an issue.
Input
CTG$NSPF = factor(CTG$NSP)
str(CTG)
table(CTG$NSPF)
set.seed(9850)
g = runif(nrow(CTG))
CTGr = CTG[order(g),]
str(CTGr)
head(CTGr)
DT = rpart(NSPF ~ ., data = CTGr[1:1800,], method = "class")
rpart.plot::rpart.plot(DT)
predDT = predict(DT, CTGr[1801:2130,], type = "class")
table(CTGr[1801:2130,3], predicted = predDT)
Output Problem
Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?
The error is generated on this line
table(CTGr[1801:2130,3], predicted = predDT)

How to start ELKI using mouse.csv?

ELKI: perform minmax normalization before running kmeans
I am using one of the kmeans implementations provided by ELKI in my Java project.
I would like to run a min-max normalization before actually running the kmeans, but I cannot work out the right way of doing it using the library API. Can someone point me in the right direction?
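ELKI ships normalization filters that can be attached to the data source (look for the attribute-wise / min-max normalization classes in the datasource filter packages), but if your data is already a double[][] in your own code, it is also easy to normalize each column to [0, 1] yourself before handing the array to the k-means implementation. A minimal sketch in plain Java, not the ELKI API:

```java
public class MinMaxNormalize {
    /** Scales each column of data to [0, 1] in place: (x - min) / (max - min). */
    static void minMaxNormalize(double[][] data) {
        int cols = data[0].length;
        for (int j = 0; j < cols; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            // constant columns map to 0 to avoid division by zero
            for (double[] row : data) row[j] = (range == 0) ? 0.0 : (row[j] - min) / range;
        }
    }

    public static void main(String[] args) {
        double[][] data = { {2, 100}, {4, 200}, {6, 300} };
        minMaxNormalize(data);
        System.out.println(java.util.Arrays.deepToString(data));
        // [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
    }
}
```

If you want the normalization inside the ELKI pipeline instead (so that, e.g., the visualizers see the normalized data), the filter-based route via the database connection is the one to look up in the ELKI documentation.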

How to obtain cluster details in ELKI command line tool
I am new to ELKI and am using the ELKI command-line tool, with FastOPTICS.
More than visualization, I want to further analyze the clusters. Hence, as mentioned in the ELKI tutorials, I am exporting my results (a piece of which is shown below):
ID=501 0.45660137634625386 0.43280640922410835 Head reachdist=1000000.0 predecessor=null
ID=579 0.45550648740195465 0.41171202752594527 Head reachdist=0.021122777303852494 predecessor=501
ID=616 0.4679994292151898 0.4128731663431996 Head reachdist=0.012546785982944529 predecessor=579
ID=590 0.47710199684411575 0.40228069309626585 Head reachdist=0.013966288946107984 predecessor=616
ID=586 0.48022262821967227 0.39871803725628563 Head reachdist=0.004736122550805979 predecessor=590
ID=649 0.46780169318472314 0.3947836158292342 Head reachdist=0.011945786525162505 predecessor=590
ID=530 0.46558804569679024 0.3808260349477101 Head reachdist=0.014132030967455411 predecessor=649
ID=699 0.44458360726584134 0.40254557139333424 Head reachdist=0.014259496081523392 predecessor=579
ID=653 0.43639037701069394 0.4112832888808655 Head reachdist=0.011978177194622415 predecessor=699
However, I do not understand what it means. What I want is to get the details of how my sparse vectors are clustered.
For example, I want to know the vectors [0.1, 0.5, 0.7], [0.2, 0.9, 0.3], ... etc. are in one cluster and so on.
Please let me know how to get those details in ELKI.