Deciding on cluster size setting in Carrot2
I am using Carrot2's STC (Suffix Tree Clustering) algorithm to cluster a set of documents. By default, the maximum number of clusters the algorithm forms is 16. Is there a way to control the number of clusters generated?
Below is the code that invokes STC clustering:
ProcessingResult byDomainClusters = controller.process(documents, null, STCClusteringAlgorithm.class);
List<Cluster> clustersByDomain = byDomainClusters.getClusters();
ConsoleFormatter.displayClusters(clustersByDomain);
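If the cap is what limits you: STC exposes a maxClusters attribute that bounds the number of clusters it creates, and you can raise it through the attribute map passed to the controller. Below is a minimal sketch against the Carrot2 3.x attribute-builder API; the attribute name maxClusters and the value 30 are illustrative, so verify them against STCClusteringAlgorithmDescriptor in your Carrot2 version.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.carrot2.clustering.stc.STCClusteringAlgorithm;
import org.carrot2.clustering.stc.STCClusteringAlgorithmDescriptor;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.ProcessingResult;
import org.carrot2.core.attribute.CommonAttributesDescriptor;

// ...
Controller controller = ControllerFactory.createSimple();

Map<String, Object> attributes = new HashMap<>();
// Pass the documents through the attribute map instead of the
// (documents, query, algorithm) overload used above.
CommonAttributesDescriptor.attributeBuilder(attributes)
    .documents(documents);
// Raise the cluster cap (illustrative value).
STCClusteringAlgorithmDescriptor.attributeBuilder(attributes)
    .maxClusters(30);

ProcessingResult result = controller.process(attributes, STCClusteringAlgorithm.class);
List<Cluster> clusters = result.getClusters();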
However, the low number of clusters may also be caused by the characteristics of your input data (too few documents?). To verify this, try clustering your data with the Lingo algorithm.
See also questions close to this topic
Why "swapping the argument does not change the score" in normalized_mutual_info_score?
I am trying to evaluate cluster quality with Normalized Mutual Information (NMI), using scikit-learn's normalized_mutual_info_score() function. I understand the mathematical theory behind NMI but am a bit confused about how this function works.
The arguments are two arrays containing the labels of a clustering (labels_pred) and of a classification (labels_true). What I understand about these two arrays is that the labels are ordered by document: for example, if labels_pred = [0, 0, 1, 1], then documents one and two are labeled 0 and documents three and four are labeled 1. Now if labels_true = [0, 0, 0, 1], the ground-truth classification puts documents one, two, and three in class 0 and document four in class 1. So the clustering misclassified the third document. Is my understanding correct?
Now, look at the documentation, where labels_true = [0, 0, 0, 1, 1, 1] and labels_pred = [0, 0, 1, 1, 2, 2]. According to my understanding, the clustering algorithm predicted three documents (the first, second, and fourth) correctly. However, the documentation says:

    One can permute 0 and 1 in the predicted labels

    normalized_mutual_info_score is symmetric: swapping the arguments does not change the score

So with labels_pred = [1, 1, 0, 0, 2, 2], only one document is correctly labeled, and yet according to them this swapping will not change the NMI. Why is that? What is wrong in my understanding?
Thanks for your precious time reading my problem; I will highly appreciate any kind of help. Thanks.
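The permutation invariance becomes visible if you compute NMI by hand: mutual information only looks at the co-occurrence counts of (true label, predicted label) pairs and at the marginal counts, so renaming cluster 0 to 1 and 1 to 0 merely reorders rows of the contingency table without changing any count. Here is a from-scratch sketch in Java (not scikit-learn; it uses the geometric-mean normalization, which may differ from scikit-learn's default in your version) that prints the same score for both labelings:

import java.util.HashMap;
import java.util.Map;

public class NmiDemo {

    // Shannon entropy of a labeling (natural log).
    static double entropy(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0, n = labels.length;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * Math.log(p);
        }
        return h;
    }

    // Mutual information computed from joint and marginal label counts.
    static double mutualInformation(int[] a, int[] b) {
        int n = a.length;
        Map<Integer, Integer> ca = new HashMap<>(), cb = new HashMap<>();
        Map<Long, Integer> joint = new HashMap<>();
        for (int i = 0; i < n; i++) {
            ca.merge(a[i], 1, Integer::sum);
            cb.merge(b[i], 1, Integer::sum);
            joint.merge(((long) a[i] << 32) | (b[i] & 0xffffffffL), 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : joint.entrySet()) {
            int ai = (int) (e.getKey() >> 32);  // true label (high 32 bits)
            int bi = (int) (long) e.getKey();   // predicted label (low 32 bits)
            double pxy = e.getValue() / (double) n;
            double px = ca.get(ai) / (double) n;
            double py = cb.get(bi) / (double) n;
            mi += pxy * Math.log(pxy / (px * py));
        }
        return mi;
    }

    static double nmi(int[] labelsTrue, int[] labelsPred) {
        return mutualInformation(labelsTrue, labelsPred)
            / Math.sqrt(entropy(labelsTrue) * entropy(labelsPred));
    }

    public static void main(String[] args) {
        int[] labelsTrue = {0, 0, 0, 1, 1, 1};
        int[] pred       = {0, 0, 1, 1, 2, 2};
        int[] permuted   = {1, 1, 0, 0, 2, 2}; // 0 and 1 swapped
        System.out.println(nmi(labelsTrue, pred));     // same value...
        System.out.println(nmi(labelsTrue, permuted)); // ...after the permutation
    }
}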
How to: KMeans array outputs as tables with decimal numbers rounded to 2 places
I'm a student working on a cluster-analysis problem in a Python notebook. I've solved the analysis, but I'd like to present the output in a more intelligible fashion.
How do I output, or print, multiple arrays as a table, with values rounded to 2 decimal places?
Create the data set to use in the KMeans train function:

<input filename> = <data set csv file>.rdd.map(lambda line: array([line, line, line]))

Execute the KMeans function:

<output file name> = KMeans.train(<input filename>, 3, maxIterations=10, runs=10, initializationMode="random")
print(<output file name>.centers)

[array([ 32.25641026, 2412.25641026, 41.8974359 ]),
 array([ 36.30337079, 961.19662921, 45.78651685]),
 array([ 25.24539877, 364.33128834, 35.53067485])]
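For the display side, here is the formatting idea sketched in Java (the question uses PySpark, so treat this purely as an illustration of rounding for output, not Spark code; the feature column names are made up, and the numbers are the centers printed above): format each value with %.2f and print row by row.

import java.util.Locale;

public class CentersTable {
    public static void main(String[] args) {
        double[][] centers = {
            {32.25641026, 2412.25641026, 41.8974359},
            {36.30337079,  961.19662921, 45.78651685},
            {25.24539877,  364.33128834, 35.53067485}
        };
        // Header row; the feature names are placeholders.
        System.out.printf(Locale.US, "%-8s %12s %12s %12s%n",
            "cluster", "feature_1", "feature_2", "feature_3");
        for (int i = 0; i < centers.length; i++) {
            System.out.printf(Locale.US, "%-8d", i);
            for (double v : centers[i]) {
                System.out.printf(Locale.US, " %12.2f", v); // %.2f rounds to 2 places
            }
            System.out.println();
        }
    }
}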
What is the t-SNE initial PCA step doing?
Looking at the parameters to the Rtsne function:
There is a parameter called "pca" defined as "logical; Whether an initial PCA step should be performed (default: TRUE)"
Let's say you have a 10-dimensional feature set and you run t-SNE. I was thinking you would scale the 10-D matrix and then pass it to Rtsne().
What does the PCA indicated by the pca parameter do?
Would it take the 10-D matrix and run PCA on it? If so, would it pass all 10 dimensions in the PCA space to Rtsne?
Is there any info anywhere else about what this initial PCA step is?
Java conversion of Hindi string literals to English using Google API version 2?
I have a string sample in Java:

String sample = "बेड";

which I want converted into its English equivalent. I am trying to implement this using the Google API (dependency …). Can anyone give sample code that uses this API to translate my string literals?

I am not converting HTML page text. I just want to convert the parameters coming in Hindi from my Solr URL, so that I can look up the results stored in English for the same word.

I tried this code:
Translate translate = createTranslateService();
String target = "en";
Optional tgtLangOpt = Optional.of(target);
LanguageListOption targetOptions = LanguageListOption.targetLanguage(tgtLangOpt.or("en"));
List languages = translate.listSupportedLanguages(targetOptions);

I also tried:

List detections = translate.detect(ImmutableList.of("en"));
In both cases I am getting an exception:

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
    at com.google.api.gax.retrying.BasicRetryingFuture.<init>(BasicRetryingFuture.java:77)
    at com.google.api.gax.retrying.DirectRetryingExecutor.createFuture(DirectRetryingExecutor.java:73)
    at com.google.cloud.RetryHelper.run(RetryHelper.java:73)
    at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:51)
    at com.google.cloud.translate.TranslateImpl.listSupportedLanguages(TranslateImpl.java:60)
    at com.mysearch.search.RelatedInfoTesting.main(RelatedInfoTesting.java:56)
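For what it's worth, a NoSuchMethodError on MoreExecutors.directExecutor() usually means an older Guava version is on the classpath than the one the google-cloud-translate client expects, so the fix is typically a dependency cleanup rather than a code change. Assuming the classpath is consistent and default credentials are configured, a minimal translation sketch with the google-cloud-translate client (replacing the question's createTranslateService() helper, whose body is not shown) could look like this:

import com.google.cloud.translate.Translate;
import com.google.cloud.translate.Translate.TranslateOption;
import com.google.cloud.translate.TranslateOptions;
import com.google.cloud.translate.Translation;

public class TranslateParam {
    public static void main(String[] args) {
        // Builds a client from the default credentials (GOOGLE_APPLICATION_CREDENTIALS).
        Translate translate = TranslateOptions.getDefaultInstance().getService();

        // Translate the Hindi parameter into English.
        Translation translation = translate.translate(
            "बेड",
            TranslateOption.sourceLanguage("hi"),
            TranslateOption.targetLanguage("en"));

        System.out.println(translation.getTranslatedText());
    }
}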
Java ArrayDeque instantiation from existing Collection vs Iteration
I am trying to evaluate two ways of creating an ArrayDeque from an existing collection. I see two options:
- using ArrayDeque's constructor, which accepts an existing Collection;
- iterating over the Collection and calling deque.offer(element) for each element.
From my benchmarks, I see the first option running faster than the second. Is there any reason for the first option to be better than the second?
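A plausible explanation, worth checking against your JDK's sources: ArrayDeque(Collection) sizes its backing array up front from the collection's size, while the offer() loop starts at the default capacity and has to grow and copy the array repeatedly as elements arrive, on top of the per-call overhead. A minimal sketch of the two options (data is a stand-in for your collection):

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;

public class DequeOptions {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

        // Option 1: the constructor pre-sizes the backing array for the collection.
        ArrayDeque<Integer> fromConstructor = new ArrayDeque<>(data);

        // Option 2: starts at the default capacity and grows (copying) as needed.
        ArrayDeque<Integer> fromOffer = new ArrayDeque<>();
        for (Integer e : data) {
            fromOffer.offer(e);
        }

        System.out.println(fromConstructor.size() + " == " + fromOffer.size());
    }
}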
Read any type of file in Java using a Single API or convert any type of file to text file
I have a requirement to read a document which might be plain text, a Word doc, a PDF, etc. In Java, I know we can recognize the type of file based on its MIME type and proceed accordingly, but that gets lengthy. Is there any API in Java that converts any document to a text file, so that I can maintain a single format and keep the code easy to maintain? I have searched for one online but can't find it. Any help is appreciated.
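One library commonly used for exactly this, offered here as a suggestion since the question does not name it, is Apache Tika: it detects the file type and extracts plain text behind a single call. A minimal sketch, assuming the tika-parsers dependency is on the classpath:

import java.io.File;

import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Detects the type (plain text, Word, PDF, ...) and extracts the text.
        String text = tika.parseToString(new File(args[0]));
        System.out.println(text);
    }
}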
Trend finding and hierarchical text clustering
I have a lot of titles of articles and news items, so basically a lot of short texts (5-6 words each). For example:

- Hurricane Irma causes devastation in the Caribbean
- The secret of the world's rarest silk
- Pupils wearing 'wrong trousers' sent home from school
What I want to do is cluster them hierarchically into different groups and then label the clusters appropriately based on the text in each group, rather than ending up with cluster_1, cluster_2.
Finally, I want to give each cluster a rating, from top story/trend down to "not very important".
An ideal result would be top-level clusters like sport, politics, etc., with sub-clusters of trending titles about, say, Trump or Roger Federer inside them (if there are a lot of titles about Trump or Federer in the input).
I have tried k-means from scikit-learn in Python, Carrot2, and the KNIME tool, but no brilliant results yet, maybe because the texts are mainly in German.
Any help and enlightenment is appreciated.
Maven internal properties of dependency
Tech: Maven 3 + IntelliJ + ElasticSearch 5.5.0 + Carrot2 3.15.1
I have a project with Carrot2 and ElasticSearch, which raises some conflicts: Carrot2 uses Lucene 5.3.1 and ElasticSearch uses Lucene 6.3.1. I want to force Carrot2 to use the 6.3.1 Lucene version to fix this.
I have tried adding a property to my project's main pom file:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <java.version>1.8</java.version>
    <elasticsearch.version>5.4.2</elasticsearch.version>
    <org.apache.lucene.version>6.5.1</org.apache.lucene.version>
</properties>
Unfortunately, this still raises an error caused by the Lucene version conflict. Finally, I found a solution for my local machine by changing the internal value in Carrot2's dependency pom file:
<parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>5</version>
</parent>
<groupId>org.carrot2</groupId>
<artifactId>carrot2</artifactId>
<version>3.15.1</version>
<name>Carrot2</name>
(...)
<properties>
    (...)
    <org.apache.lucene.version>6.5.1</org.apache.lucene.version>
    <org.simpleframework.version>2.7.1</org.simpleframework.version>
    <org.carrot2.attributes>1.3.1</org.carrot2.attributes>
</properties>
This works fine, but only on my local machine. The Lucene version changed in Carrot2's pom file does not seem to propagate, and the version has to be changed manually on every instance of the project. Is there any way to force Maven to use my project's property value in an external dependency?
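For context: properties defined in your own pom do not override the properties already evaluated inside a dependency's pom, which is why the change does not propagate. Maven's supported way to pin a transitive dependency's version is a <dependencyManagement> section in your own project (or a shared parent pom). A sketch; the artifact list below is illustrative, so pin every Lucene artifact Carrot2 actually pulls in (check with mvn dependency:tree):

<dependencyManagement>
    <dependencies>
        <!-- Forces this version of the transitive Lucene dependencies. -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>${org.apache.lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>${org.apache.lucene.version}</version>
        </dependency>
        <!-- ...repeat for the remaining Lucene artifacts... -->
    </dependencies>
</dependencyManagement>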
Solr with Carrot2 Clustering
I'm trying to integrate Solr with the Carrot2 clustering engine. I successfully managed to do so in Solr by following this link: Result Clustering. I'm getting the same output as in the techproducts example. But I want to integrate it with my own web application; for that I came across the Document Clustering Server (DCS) provided by Carrot2. However, when I give Solr as the source there and try to process it, it throws the error:
Problem accessing /dcs/rest. Reason:
Could not perform processing: org.apache.http.client.HttpResponseException: Not Found