Deciding on cluster size setting in Carrot2
I am using Carrot2's STC (Suffix Tree Clustering) algorithm to cluster a bunch of documents. By default, the maximum number of clusters the algorithm forms is 16. Is there a way to control the number of clusters generated?
Below is the code for invoking STC clusters.
ProcessingResult byDomainClusters = controller.process(documents, null, STCClusteringAlgorithm.class);
List<Cluster> clustersByDomain = byDomainClusters.getClusters();
ConsoleFormatter.displayClusters(clustersByDomain);
Note that a low number of clusters may also be caused by the characteristics of your input data (too few documents?). To verify this, try clustering the same data with the Lingo algorithm.
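In the Carrot2 3.x Java API, STC exposes a maxClusters attribute (default 15; the synthetic "Other Topics" group accounts for the 16th cluster you see). A sketch of raising it via an attribute map, assuming the generated STCClusteringAlgorithmDescriptor builder — verify the attribute and builder names against the descriptor class shipped with your Carrot2 version:

```java
import java.util.HashMap;
import java.util.Map;

import org.carrot2.clustering.stc.STCClusteringAlgorithm;
import org.carrot2.clustering.stc.STCClusteringAlgorithmDescriptor;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.ProcessingResult;
import org.carrot2.core.attribute.CommonAttributesDescriptor;

// ... inside your clustering method, with `documents` already built:
final Controller controller = ControllerFactory.createSimple();
final Map<String, Object> attributes = new HashMap<String, Object>();

// feed the documents and raise the cluster cap (default is 15)
CommonAttributesDescriptor.attributeBuilder(attributes)
    .documents(documents);
STCClusteringAlgorithmDescriptor.attributeBuilder(attributes)
    .maxClusters(30);

final ProcessingResult byDomainClusters =
    controller.process(attributes, STCClusteringAlgorithm.class);
```

This fragment needs the carrot2-core jar on the classpath; the overload of process that takes documents and a query directly does not accept extra attributes, which is why the attribute-map form is used here.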
See also questions close to this topic
Grouping similar data to maximize intra-group correlation and minimize inter-group correlation
So this is my problem: I have daily return data for 2000 stocks, and below is a small sample of it (s1 to s8, days 1 to 15).
I'll call my data "df".
> df[1:15,1:8]
          s1        s2         s3         s4        s5        s6        s7        s8
1  -0.026410 -0.001030 -0.0027660  0.0126500 -0.030110  0.001476 -0.008271 -0.005299
2  -0.018990 -0.013680 -0.0092050 -0.0008402 -0.002739 -0.014170 -0.006091 -0.011920
3   0.004874  0.024140 -0.0002107 -0.0084770 -0.006825 -0.001448 -0.002724 -0.003132
4   0.019300 -0.004649  0.0223400  0.0080200 -0.008197 -0.015270  0.004064 -0.008149
5   0.010350 -0.010650  0.0087780  0.0059960 -0.001390 -0.006454  0.018990  0.002822
6   0.028650  0.010490  0.0157200 -0.0004123  0.019750 -0.005902  0.004261  0.019110
7   0.004203 -0.002682 -0.0099840 -0.0070060 -0.025670 -0.014550 -0.016700 -0.011580
8  -0.042170 -0.019490 -0.0023140 -0.0083030 -0.018170  0.021160 -0.006864 -0.009438
9   0.017250  0.026600  0.0031630 -0.0069090  0.035990  0.008429  0.001500 -0.011830
10 -0.037400 -0.022370  0.0088460  0.0012690 -0.050820 -0.025300 -0.028040 -0.023790
11 -0.091140 -0.018830  0.0052160 -0.0403000  0.001410 -0.007050 -0.024340 -0.013110
12 -0.051620  0.004791  0.0336000 -0.0094320 -0.018320 -0.019490 -0.044080 -0.024020
13  0.007711  0.002158 -0.0177400  0.0090470 -0.004346 -0.001562 -0.096030  0.015840
14  0.041440 -0.001072 -0.0168400  0.0180300 -0.012980 -0.015280  0.059780  0.014730
15 -0.042620 -0.025560 -0.0180200 -0.0115200  0.033320 -0.015150 -0.014580 -0.012710
I need a way to group them so that the intra-group correlation is maximized and inter-group correlation is minimized.
So, for example, I could randomly split them into two groups: (s1, s2, s3, s4) and (s5, s6, s7, s8). The problem is that some of the stocks might be correlated with each other, and some might not.
So my solution was to:
get a correlation matrix (assuming Pearson's method works fine)
cor_df <- cor(df)
melt (flatten) the correlation matrix, order it in descending order, and remove duplicates and rows with a correlation coefficient of 1 (I used the reshape library)
cor_df_melt <- melt(cor_df)
names(cor_df_melt) <- c("x1", "x2", "corr")
cor_df_ordered <- cor_df_melt[order(-cor_df_melt$corr), ]
Then I numbered the flattened matrix and removed the duplicates (the even-numbered rows) and the rows with a correlation coefficient of 1:
cor_df_numbered <- cbind(row = 1:nrow(cor_df_ordered), cor_df_ordered)
cor_df_ready <- cor_df_numbered[cor_df_numbered$row %% 2 == 0 & cor_df_numbered$corr != 1, 2:4]
After this, my data frame with nicely ordered correlation coefficients for each pair in descending order was ready as follows:
> cor_df_ready
   x1 x2       corr
63 s7 s8 0.49223783
57 s1 s8 0.42518667
50 s2 s7 0.42369762
49 s1 s7 0.40824283
58 s2 s8 0.40395569
42 s2 s6 0.40394894
54 s6 s7 0.39408677
62 s6 s8 0.38536734
34 s2 s5 0.36882709
53 s5 s7 0.36066870
45 s5 s6 0.35734278
59 s3 s8 0.34295713
51 s3 s7 0.34163733
61 s5 s8 0.33264868
9  s1 s2 0.32812763
41 s1 s6 0.31221715
18 s2 s3 0.30692909
43 s3 s6 0.29390325
33 s1 s5 0.28845243
35 s3 s5 0.27859972
17 s1 s3 0.25039209
52 s4 s7 0.12989487
60 s4 s8 0.12095196
25 s1 s4 0.10902471
26 s2 s4 0.09471694
44 s4 s6 0.08039435
36 s4 s5 0.06957264
27 s3 s4 0.06027389
(By the way, I have no idea why the row numbers are disordered like that... can anyone explain?)
From here, my intuition was that the top pair with the highest correlation coefficient, 0.49223783 (s7, s8), had to be in the same group.
So from my cor_df_ready data frame, I chose all pairs with "s7" included and extracted the 4 stocks that appear at the top of the list (s7, s8, s2, s1) and named them group 1.
I then excluded all rows including (s7, s8, s2, s1) from my cor_df_ready, and repeated the process to come up with the second group (s3, s4, s5, s6).
Well, in this example I didn't have to repeat the process, as there was only one set remaining.
Then, I got the correlation matrix for each group and added the sum of every correlation coefficient:
group1_cor <- cor(group1)
group2_cor <- cor(group2)
cor_sum <- sum(group1_cor) + sum(group2_cor)
then I got the mean of each row in each group, and calculated the sum of the correlation matrix for the two group means, and named it cor_sum_mean.
Lastly, I calculated cor_sum_mean/cor_sum.
The intuition was that maximizing the correlation within groups would maximize cor_sum, while minimizing the correlation between groups would minimize cor_sum_mean.
I want cor_sum (intra-group correlation) to be as big as possible and cor_sum_mean (inter-group correlation) to be as small as possible.
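The greedy procedure described above (seed a group with a member of the best remaining pair, fill it with that seed's most-correlated partners, repeat, then sum each group's correlation sub-matrix) can be sketched as follows. This is only a heuristic illustration: the class, method names, and the toy 4x4 correlation matrix are all made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class GreedyCorrGroups {

    // Seed each group with the first member of the currently highest-correlated
    // pair, fill it with the seed's most-correlated remaining stocks, remove
    // them from the pool, and repeat until every stock is assigned.
    public static List<List<Integer>> group(double[][] corr, int groupSize) {
        TreeSet<Integer> remaining = new TreeSet<>();
        for (int i = 0; i < corr.length; i++) remaining.add(i);
        List<List<Integer>> groups = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // seed = first member of the best remaining pair
            int seed = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int i : remaining)
                for (int j : remaining)
                    if (i < j && corr[i][j] > best) { best = corr[i][j]; seed = i; }
            final int s = (seed >= 0) ? seed : remaining.first(); // lone leftover
            List<Integer> g = new ArrayList<>();
            g.add(s);
            remaining.remove(s);
            // fill the group with the seed's most-correlated partners
            remaining.stream()
                .sorted((a, b) -> Double.compare(corr[s][b], corr[s][a]))
                .limit(groupSize - 1)
                .forEach(g::add);
            for (int k = 1; k < g.size(); k++) remaining.remove(g.get(k));
            groups.add(g);
        }
        return groups;
    }

    // cor_sum analogue: sum of every entry of each group's correlation sub-matrix
    public static double corSum(double[][] corr, List<List<Integer>> groups) {
        double total = 0;
        for (List<Integer> g : groups)
            for (int i : g)
                for (int j : g)
                    total += corr[i][j];
        return total;
    }

    public static void main(String[] args) {
        double[][] corr = {            // toy matrix: {0,1} and {2,3} move together
            {1.0, 0.8, 0.1, 0.2},
            {0.8, 1.0, 0.2, 0.1},
            {0.1, 0.2, 1.0, 0.7},
            {0.2, 0.1, 0.7, 1.0}
        };
        List<List<Integer>> groups = group(corr, 2);
        System.out.println(groups);                               // [[0, 1], [2, 3]]
        System.out.printf("cor_sum = %.1f%n", corSum(corr, groups)); // cor_sum = 7.0
    }
}
```

Note that a greedy pass like this can get stuck in a local optimum, which may be why the ratio plateaus at 0.0006305; exchange-style refinement (swapping pairs of stocks between groups when a swap improves the objective) is one way to push it lower.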
Using my method on the whole data set, I divided the 2000 stocks into 10 groups, and got:
# cor_sum = 131923.1
# cor_sum_mean = 83.1731
# cor_sum_mean/cor_sum = 0.0006305
I KNOW I can get the cor_sum_mean/cor_sum down to 0.000542 (or even smaller), but I am simply stuck.
I searched Google, Stack Overflow, and Cross Validated, and got the idea that machine learning / time-series clustering / classification could be the answer I'm looking for.
The following two previously posted questions seemed helpful, but I'm only starting to learn data science, so I'm having a hard time understanding them.
Can anyone explain, or point me to what specifically I should look into?
This was a long question... Thanks for reading!
I seek an existing routine to partition the nodes of a graph database based on a property like distance
The assumptions are that every node is connected by a relationship R to every other node, and that a property of the relationship measures the "distance" (or similar) between any two nodes.
I am looking for three outputs:
The set of nodes in each natural cluster that the data falls into, with no restriction on the clusters.
The set of nodes in each cluster when the number of clusters is given in advance, i.e. divide the nation into n states.
The set of nodes in each cluster when the number of clusters is given in advance and the clusters must contain equal numbers of nodes, i.e. be the same size. Divide the nation into equal-sized regional voting districts.
The requirement throughout is that the clusters contain nodes that are "close" to each other based on the property of R. A geographic analogy is to divide a population into regions.
Any existing routines would help here. Thank you
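For the third output, a simple heuristic given the full pairwise distance matrix is farthest-point seeding followed by capacity-limited nearest-seed assignment. The sketch below uses made-up names and a toy 4-node matrix, and it is a greedy approximation rather than an optimal equal-size partition:

```java
import java.util.Arrays;

public class BalancedClusters {

    // Assign n nodes to k roughly equal-size clusters from a symmetric distance
    // matrix: pick k spread-out seeds (farthest-point heuristic), then send each
    // node to the nearest seed whose cluster still has capacity.
    public static int[] assign(double[][] dist, int k) {
        int n = dist.length;
        int cap = (n + k - 1) / k;              // max nodes per cluster
        int[] seeds = new int[k];
        seeds[0] = 0;
        for (int c = 1; c < k; c++) {           // farthest-point seeding
            int best = -1;
            double bestD = -1;
            for (int i = 0; i < n; i++) {
                double dMin = Double.MAX_VALUE;
                for (int s = 0; s < c; s++) dMin = Math.min(dMin, dist[i][seeds[s]]);
                if (dMin > bestD) { bestD = dMin; best = i; }
            }
            seeds[c] = best;
        }
        int[] label = new int[n];
        int[] size = new int[k];
        for (int i = 0; i < n; i++) {           // capacity-limited assignment
            int best = -1;
            double bestD = Double.MAX_VALUE;
            for (int c = 0; c < k; c++)
                if (size[c] < cap && dist[i][seeds[c]] < bestD) {
                    bestD = dist[i][seeds[c]];
                    best = c;
                }
            label[i] = best;
            size[best]++;
        }
        return label;
    }

    public static void main(String[] args) {
        // toy distances: nodes 0,1 are close together, as are nodes 2,3
        double[][] dist = {
            {0, 1, 9, 8},
            {1, 0, 8, 9},
            {9, 8, 0, 1},
            {8, 9, 1, 0}
        };
        System.out.println(Arrays.toString(assign(dist, 2))); // [0, 0, 1, 1]
    }
}
```

For the first two outputs (unconstrained and fixed-k clustering on a distance matrix), off-the-shelf routines such as hierarchical clustering or k-medoids are the usual starting points; the equal-size constraint is the part that typically needs a custom heuristic like the one above.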
How to determine in R the maximum number of cluster centers while staying under the number of distinct points?
I have a data frame with 99 rows, of which 18 are unique (per the unique() function). When I draw a scree plot with the following code, I can't go further than 6 iterations or I get:
Error in kmeans(errors.for.dim.normalized, centers = i) : more cluster centers than distinct data points.
wss <- (nrow(errors.for.dim.normalized) - 1) * sum(apply(errors.for.dim.normalized, 2, var))
for (i in 2:6) wss[i] <- sum(kmeans(errors.for.dim.normalized, centers = i)$withinss)
# centers cannot exceed the number of distinct rows: nrow(unique(errors.for.dim.normalized))
plot(1:6, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
I understand from the picture why it is limited, but I don't know how to get the maximal value before triggering this message without visualizing the plot. I thought unique() would do it, but it didn't. (I am an R beginner, and this is my first time working on data.)
Key Value Store in Cassandra
I want to know how to implement a key-value store using Cassandra. If I have a stream of data and I don't know its architecture and structure, how do I save it using a Cassandra key-value store (with Java code), and how does that actually work?
- Live stock quotes from PSX (Pakistan Stock Exchange) for a stock trading simulator
Solr with Carrot2 Clustering
I'm trying to integrate Solr with the Carrot2 clustering engine. I successfully managed to do so via Solr by following this link: Result Clustering. I'm getting the same output as in the techproducts example. But I want to integrate it with my own web application; for that, I came across the Document Clustering Server provided by Carrot2. However, when I try to set Solr as the source there and process it, it throws the error:
Problem accessing /dcs/rest. Reason:
Could not perform processing: org.apache.http.client.HttpResponseException: Not Found
Text document clustering in the Carrot2 Benchmark gives only "Other Topics"
I am trying to cluster more than 100 documents in the Carrot2 Benchmark by indexing those docs in Solr. But all I am getting in the Carrot2 Benchmark is "Other Topics" and nothing else.
Can someone please suggest something on this? Please find a screenshot of the Carrot2 screen below.
Here are the parameters and their values:
Source: solr
Algorithm: Lingo
Query: :
Service URL: http://localhost:8983/solr/******/select
Note: ******* = indexed directory name (I can't write actual name)
Clustering with Apache Solr and Carrot2
I am very new to both Apache Solr and Carrot2. I am trying to index a lot of input files using Solr. The end goal is to cluster the documents.
I am not clear on whether the clustering is done by Solr or by the Carrot2 Workbench.
Can anyone guide me in this?