Trend finding and hierarchical text clustering
I have a lot of article and news titles, i.e. many short texts (5-6 words). For example:
- Hurricane Irma causes devastation in the Caribbean
- The secret of the world's rarest silk
- Pupils wearing 'wrong trousers' sent home from school
What I want to do is cluster them hierarchically into different groups and then label the clusters appropriately based on the text in each group, rather than ending up with cluster_1, cluster_2.
And finally give each cluster a rating, from top story/trend down to "not very important".
An ideal result would be top-level clusters like sport, politics, etc., and within them sub-clusters of trending titles about Trump or Roger Federer (if there are a lot of titles about Trump or Federer in the input).
I have tried k-means from scikit-learn in Python, Carrot2, and the KNIME tool, but no brilliant results yet. Maybe that is because the texts are mainly in German.
Any help and enlightenment is appreciated.
See also questions close to this topic
Data Preparation for training
I am trying to prepare a data file by creating a one-hot encoding of the text's characters, which I can later use to train my model for classification. I have a training data file that consists of lines of characters; I first do an integer encoding of them and then the one-hot encoding.
e.g. this is how the data file looks:
This is how I am approaching it:
import pandas as pd
from sklearn import preprocessing

categorical_data = pd.read_csv('abc.txt', sep="\n", header=None)
labelEncoder = preprocessing.LabelEncoder()
X = categorical_data.apply(labelEncoder.fit_transform)
print("After label encoder")
print(X.head())

oneHotEncoder = preprocessing.OneHotEncoder()
oneHotEncoder.fit(X)
onehotlabels = oneHotEncoder.transform(X).toarray()
print("Shape after one hot encoding:", onehotlabels.shape)
print(onehotlabels)
I am getting the integer encoding for each line (0,1,2 in my case) and then the subsequent one hot encoded vector.
My question is: how do I do this for each character in an individual line? For prediction, the model should learn from the characters within one line (which corresponds to a certain label). Can someone give me some insight on how to proceed from here?
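One way to proceed (a sketch with a toy alphabet, not the asker's actual data file) is to encode each line as a matrix with one one-hot row per character:

```python
# Hypothetical sketch: one-hot encode each character of each line, so a line
# becomes a (line_length, alphabet_size) matrix the model can learn from.
import numpy as np

lines = ["abca", "bccb"]
alphabet = sorted(set("".join(lines)))          # build the character vocabulary
char_to_idx = {ch: i for i, ch in enumerate(alphabet)}

def one_hot_line(line):
    mat = np.zeros((len(line), len(alphabet)))
    for pos, ch in enumerate(line):
        mat[pos, char_to_idx[ch]] = 1.0         # one-hot row for this character
    return mat

encoded = [one_hot_line(l) for l in lines]
print(encoded[0].shape)  # (4, 3): 4 characters, 3-symbol alphabet
```

Each line's matrix (padded to a common length if needed) is then paired with that line's label for training.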
Information Gain calculation with Scikit-learn
I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy.
Using Weka, this can be accomplished with the InfoGainAttribute evaluator, but I haven't found this measure in scikit-learn.
Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this task?
How to use address data in binary classification along with other numerical data?
I have to classify an address as commercial or business based on a few other columns; some columns are numeric and some contain text data. I am not sure how to input both of these into a classification algorithm. Any help is appreciated.
Sample data is below; flag is the column to be classified.
flag,Lat,Lng,Qty,Weight,Address,Product,Phone_Type,Store_Ratio N,22.281654,114.155019,2,20.9,ZETTER PICTURE FRAMER LG/F. HOSEINEE HOUSE WYNDHAM ST CENTRAL,VTF,Landline,0 N,22.298614,114.168342,1,34,ALDI SOURCING ASIA LIMITED SUITE 807-813 SOUTH TOWER WORLD FINANCE CENTRE HARBOUR CITY KLN,VWF,Undefined,0.011363636 N,22.299237,114.168019,3,48,TCHIBO MERCHANDISING HK HARDWARE 1 7F TOWER 1 THE GATEWA 25 CARTON RD TSIM SHA TSUI KLN,VWF,Undefined,0.011363636 N,22.373234,114.106912,1,62,JOHNSON GROUP PEST SSELECT OR FLAT 4-6 24/FLOOR MILLIOM FORT 34-36 CHAI WAN KOK ST TSUEN WAN,VTF,Undefined,0.162105263 N,22.373704,114.106212,1,50,PO WING GLOBAL TRADING LIMITED UNIT C3 18/F TERMINAL TOWER 3 HOI SHING RD TSUEN WAN NT,VWF,Landline,0.162105263 N,22.337504,114.149922,31,360.5,GLAMOROUS APPAREL LIMITED 7B PRECIOUS INDUSTRIAL CENTRE 18 CHEUNG YUE ST CHEUNG SHA WAN,VWF,Landline,0 N,22.389454,114.206072,3,21,C/O MOL LOGISTICS (HK) COMPANY LT 16/F EVER GAIN 3 BUILDING 22 ON SUM ST SHA TIN,VWF,Landline,0 N,22.398024,114.191519,2,109,CLARK KINCAID ENGINEERING (HK) UNIT 359FBLOCK DWAH LOK INDUSTRIAL CTR31-41 SHAN MEI ST FO TA SHA TIN,VWF,Landline,0.038095238 N,22.280394,114.155802,1,13.68,201 WILSON HOUSE 19-27 WYNDHA ST HONG KONG,VWF,Mobile,0.67646005 N,22.336164,114.196782,10,77,TAK FAT FASHION LTD (WAREH 6/F BLOCK B WAH HING INDUSTRIAL MANSION 36 TAI YAU ST SAN PO KONG KOWLOON,VWF,Landline,0 N,22.355534,114.126002,1,25.9,LKW PARTS - SERVICES LTD WORKSHOP 6 3/F FOOK YVT B NOS.53-57 KWAI FUNG CRESCENT NEW TERRITORVWS,VTF,Undefined,0 N,22.343104,114.123419,20,112.4,DATIAN W. 
GROUP (H.K.)LIMITED GRIDLINES 4024W-4026W -4022W A GISTICS CENTRE A BERTH3 KWAI KWAI CHUNG,VWF,Landline,0.038095238 N,22.351237,114.110432,3,26.8,NNR GLOBAL LOGISTICS (HK) LTD UNIT A-C 13/F GATEWAY TS 8 CHEUNG FAI RD TSING YI N.T N.T.(NEW TERRITORVWS),VTF,Landline,0 Y,22.281737,114.156819,1,0.14,ZAABA CAPITAL LTD UNIT 201B 2/F CHINA BUILDING 29 QUEENS RD CENTRAL HONG KONG,VT,Undefined,0.636363636 Y,22.284444,114.139492,1,0.68,MISS AMY KONG 27/F FLAT A HILARY COURT 63G BONHAM RD SAI YING PUN,VTD,Mobile,0.791754757 Y,22.284264,114.142842,1,1,MAN SUET YUEN FLAT A 15/F WILTON PLACE 18 PARK RD SAI YING PUN,VW,Mobile,1 Y,22.283444,114.156912,1,0.05,PHONE 85293398997 10/F. HVTSHING HONG CENTRE 55 DES VEUX RD CENTRAL CENTRAL,VT,Undefined,0.5 Y,22.283737,114.156219,1,2.09,WU LI 68 DES VOEUX RD 11TH FLOOR MAN YEE BUILDING HONG KONG,VW,Undefined,0.5 Y,22.283237,114.156172,2,12.05,7/F CHE SAN BUILDING 10 POTT CENTRAL,VW,Mobile,1 Y,22.286337,114.156919,1,0.86,8 FINANCE ST 51F 2 INTERNATIONAL FINANCE CENTRE CENTRAL - WESTERN,VT,Landline,0.971291866 Y,22.286337,114.156919,1,1.23,49/F TWO INTERNATIONAL FINANCE CENTRE 8 FINANCE ST CENTRAL,VT,Mobile,1 Y,22.285237,114.159252,1,0.45,KRC55@CORNELL.EDU 8 FINANCE ST FOUR SEASONS PLACE APARTMENT 1 HONG KONG,VW,Mobile,1 Y,22.283634,114.158332,1,3.82,ORIX AVIATION HONG KONG 25F 8 CONNAUGHT PLACE CENTRAL HONG KONG,VW,Mobile,1 Y,22.283634,114.158332,1,0.68,31/F ONE EXCHANGE SQUARE 8 CONNAUGHT PLACE CENTRAL,VT,Mobile,0.971291866 Y,22.286337,114.156919,1,0.36,ANTHONY CHOW - WELLINGTON MANA 8 FINANCE ST 8 FINANCE ST CENTRAL,VW,Undefined,1 Y,22.285237,114.159252,1,1.32,28/F TWO INTERNATIONAL FINANCE CENTRE 8 FINANCE ST HONG KONG,VW,Mobile,1 Y,22.286337,114.156919,1,0.41,8 FINANCE ST CENTRAL HK 21F TWO INTERNATIONAL FINANCE CENTRE CENTRAL - WESTERN,VT,Mobile,0.971291866 Y,22.282964,114.159262,1,1.27,ROOM 2305 JARDINE HOUSE 1 CONNAUGHT PLACE CENTRAL HONG KONG,VT,Mobile,0.971291866 Y,22.281244,114.142102,3,24.55,55 CONDUIT RD FLAT 9B MID 
LEVELS,VT,Mobile,0.785714286 Y,22.282437,114.151132,3,16,99 CAINE RD ALBRON COURT UNIT 27A MID LEVEL HONG KONG,VW,Mobile,1 Y,22.282637,114.155022,4,31,WINCY CHOW BASEMENT 55 WELLINGTON ST HONG KONG,VW,Landline,1 Y,22.282637,114.155022,4,31,WINCY CHOW BASEMENT 55 WELLINGTON ST HONG KONG,VW,Landline,1 Y,22.265324,114.128712,3,22.6,FLAT 5 47 SASSOON RD TELEGRAPH BAY POK FU LAM HONG KONG,VW,Mobile,0.4 Y,22.276084,114.152962,4,54.27,NICOLA ROBB FLAT 2 11/F BLOCK A QUEENS GARDENS 9 OLD PEAK RD HONG KONG,VT,Mobile,0.757575758 Y,22.264254,114.128672,3,22.6,FLAT 5 47 SASSOON RD TELEGRAPH BAY POK FU LAM HONG KONG,VW,Mobile,0.4 Y,22.282624,114.154682,3,33,ROOM 1904 1 LYNDHURST TOWER 1 LYNDHURST TERRACE CENTRAL,VT,Mobile,0.555555556 Y,22.239854,114.159882,6,80.55,ALYSCIA MAK LARVOTTO TOWER 7 32A 8 AP LEI CHAU PRAYA RD HONG KONG,VW,Undefined,0.757575758 Y,22.249214,114.154172,3,32.86,FLAT 4 FLOOR 5 KONG FU COURT 23 NAM NING ST ABERDEEN CENTRE,VW,Mobile,1 N,22.285084,114.155772,4,38,GALERVW PERROTIN HK 50 CONNAUGHT RD CENTRAL HONG KONG,VW,Landline,0.416666667 N,22.286664,114.150872,4,6.3,3/F WAH KIT COMMERCIAL CENTE 300 - DEX VOEUX RD . 
CENTRE XKT,VT,Landline,0 N,22.286337,114.150619,6,128,MEGA BIRDS NEST ENTERPRISE 3F WING HING COMMERCIAL BUILDING 139 WING LOK ST SHEUNG WAN,VT,Landline,0.1 N,22.286524,114.151419,4,91.2,MOUNTAIN - SEA BIRDS NEST COMPANY L UNIT 04-05A 13/F WING TRUCK COMMERCIAL CENTRE SHEUNG WAN,VT,Landline,0.080357143 N,22.287274,114.148692,7,83,MTG MINT CARD LIMITED FLAT B 9/F 205-211 WING LOK S SHEUNG WAN,VT,Landline,0.5 N,22.285974,114.13187,5,78,BABY SATAY SHOP D LG FLOOR KWOK GA BUILDING 6-12 WOO HOP ST SAI WAN,VW,Landline,0 N,22.282984,114.157619,5,73,CATHOLIC CENTRE 16 FLAT 15 - 18 CONNAUGHT RD HKG,VT,Landline,0 N,22.282214,114.155052,3,45,SEV VA 6F V PLUS 68-70 WELLINGTON ST VISIONS RESTAURANT HOLDINGS COMPANY CENTRAL,VT,Landline,0.162105263 N,22.282914,114.154452,4,39.5,SALLY COCO 98 WELLINGTON ST FLAT 9A JADE CENTRE CENTRAL HONGKONG,VW,Landline,0 N,22.282074,114.153202,5,46,STEELCASE HONG KONG LIMITED 32 HOLLYWOOD RD 15TH FLOOR KINWICK CENTRE HONG KONG,VW,Mobile,0 N,22.288104,114.146742,4,51.8,TRITON PRECISION ENGINEERING C 10/F GOLD UNION COMMERCIAL BUILDING 70-72 CONNAUGHT RD WEST HONG.KONG SHEUNG WAN,VW,Landline,0 N,22.288104,114.146742,4,28,CHINCS WORKSHOP LIMITED 5/F GOLD UNION COMMERCIAL BUILDING 70-72 COMMAUGHT RD SHEUNG WAN,VT,Undefined,0.153153153
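One common pattern for mixing the free-text Address column with the numeric and categorical columns is a ColumnTransformer that vectorizes the text and passes the numbers through, feeding everything to a single classifier. A sketch using a few rows shaped like the sample above (the choice of LogisticRegression is an assumption):

```python
# Sketch: combine a text column, a categorical column, and numeric columns
# in one pipeline. Rows are abbreviated from the sample data above.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "flag": ["N", "Y", "N", "Y"],
    "Qty": [2, 1, 3, 1],
    "Weight": [20.9, 0.14, 48.0, 0.68],
    "Address": [
        "ZETTER PICTURE FRAMER WYNDHAM ST CENTRAL",
        "ZAABA CAPITAL LTD CHINA BUILDING QUEENS RD CENTRAL",
        "TCHIBO MERCHANDISING HK TSIM SHA TSUI KLN",
        "MISS AMY KONG HILARY COURT BONHAM RD SAI YING PUN",
    ],
    "Phone_Type": ["Landline", "Undefined", "Undefined", "Mobile"],
})

pre = ColumnTransformer([
    ("text", TfidfVectorizer(), "Address"),                    # text -> TF-IDF
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Phone_Type"]),
    ("num", "passthrough", ["Qty", "Weight"]),                 # numeric as-is
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(df, df["flag"])
print(clf.predict(df))
```

Scaling the numeric columns (e.g. with StandardScaler in the "num" slot) usually helps once the real data is used.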
Hierarchical kmeans clustering in python
I am trying to implement a hierarchical clustering algorithm in Python, based on the KMeans implementation from scikit-learn.
First I want to cluster my data by a specific subset of features (l1_features_subset), and then by the rest of them (l2_features_subset). I wrote the following code:
km = KMeans_fit(df[l1_features_subset], nclust=10)
df['l1_cluster'] = km.predict(df[l1_features_subset].values)

for i in df['l1_cluster'].unique():
    print('handling cluster', i)
    km = KMeans_fit(df[df['l1_cluster'] == i][l2_features_subset], nclust=15)
    print('fitted')
    df.loc[df['l1_cluster'] == i, 'l2_cluster'] = km.predict(
        df[df['l1_cluster'] == i][l2_features_subset].values).astype(int)
l1_features_subset and l2_features_subset are lists of features, and KMeans_fit() is a wrapper function that creates a KMeans clusterer and fits it (written here just to make the code clearer for you):
def KMeans_fit(sparse_data, nclust):
    kmeans = k_means_.KMeans(n_clusters=nclust, n_jobs=-1, algorithm='full')
    _ = kmeans.fit(sparse_data)
    return kmeans
I'm running the code over a DataFrame with 350,000 rows. The first-layer clustering runs very fast (less than a minute). But the first iteration of the loop takes 30 seconds (for the smallest cluster, of size 3,000, so the total run time will be very long), and it actually gets stuck for too long on one of the clusters.
Does anyone have an idea how to optimize/improve my code?
Any suggestion/note will be appreciated.
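Not the asker's code, but one plausible speed-up to try: MiniBatchKMeans in place of full KMeans for the many second-level fits, and reusing labels_ from fit() instead of a separate predict() pass. A self-contained sketch on synthetic data:

```python
# Sketch of the same two-level scheme with MiniBatchKMeans, which is usually
# much faster than full KMeans on large second-level fits. Data is synthetic;
# the feature names and cluster counts are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(5000, 4)), columns=["a", "b", "c", "d"])
l1_features_subset, l2_features_subset = ["a", "b"], ["c", "d"]

km1 = MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0).fit(df[l1_features_subset])
df["l1_cluster"] = km1.labels_          # reuse labels_ instead of predicting again

df["l2_cluster"] = -1
for i in df["l1_cluster"].unique():
    mask = df["l1_cluster"] == i
    km2 = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=0).fit(
        df.loc[mask, l2_features_subset])
    df.loc[mask, "l2_cluster"] = km2.labels_
print(df[["l1_cluster", "l2_cluster"]].head())
```

Scaling n_clusters for the second level with the size of each first-level cluster would also avoid fitting 15 centroids into a cluster of 3,000 similar rows.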
Obtain a scipy dendrogram for a geological database
I have a geological database as follows:
Id  SampleNo  Cu   Mo   Si
1   alis1     0.6  0.9  12
2   .         0.4  0.9  14
n   data      0.3  0.4  13
As may be clear, the features are Id, SampleNo, Cu, Mo, and Si, and the samples are the rows. I set the data frame df as the input data for clustering, Z=linkage(z), and the dendrogram output has its x axis labeled by sample values (too many of them). But I want to know the relation between the features, and when I transpose df the x axis is labeled by the features instead. The idea is a matrix [n_samples, n_features], but to get suitable results I have to transpose the data set to [n_features, n_samples] using scipy in Python. What is the reason?
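The behaviour follows from linkage's input convention: it always clusters the rows of the matrix it is given, so with the usual [n_samples, n_features] layout the dendrogram leaves are samples, and the transpose must be passed to get leaves that are features. A small sketch with made-up values shaped like the table above:

```python
# Sketch: scipy's linkage clusters rows, so clustering features means
# passing the transpose. The numbers stand in for the Cu/Mo/Si columns.
import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.array([          # rows = samples, columns = Cu, Mo, Si
    [0.6, 0.9, 12.0],
    [0.4, 0.9, 14.0],
    [0.3, 0.4, 13.0],
    [0.5, 0.8, 12.5],
])

Z_samples = linkage(data)      # dendrogram leaves = the 4 samples
Z_features = linkage(data.T)   # dendrogram leaves = the 3 features
print(Z_features.shape)        # (2, 4): n_features - 1 merge steps
```

Passing either linkage to scipy.cluster.hierarchy.dendrogram then labels the x axis with rows of whichever matrix was clustered.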
Reconstruct tree based on sum relation
Given a set of nodes with a name and a value
name  value
(A)   10
(B)   15
(C)   5
how can one find a tree, assuming that the value of a parent is the sum of all its children's values?
    (B)      // 15 = 10 + 5
   /   \
 (A)   (C)
Real world example:
Imagine a file system. Each node is either a directory or a file, and each node has a value. For a file, the value equals its file size; for a directory, the value is the sum of all children's values.
Now we delete all relations between the nodes, as well as the information about whether each node is a directory (a node with child relations) or a file.
What possible approaches exist to reconstruct that tree?
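As one illustration of the core step, a brute-force subset-sum search (exponential, sketch only) can propose a parent and its children; note that in general several subsets may sum to the same value, so the reconstruction need not be unique.

```python
# Brute-force sketch: try each node as a parent and search for a subset of
# the remaining nodes whose values sum to it. Exponential; illustration only.
from itertools import combinations

def find_children(nodes):
    """nodes: dict name -> value. Returns (parent, children) or None."""
    for parent, pv in sorted(nodes.items(), key=lambda kv: -kv[1]):
        rest = [(n, v) for n, v in nodes.items() if n != parent]
        for r in range(1, len(rest) + 1):
            for combo in combinations(rest, r):
                if sum(v for _, v in combo) == pv:
                    return parent, [n for n, _ in combo]
    return None

print(find_children({"A": 10, "B": 15, "C": 5}))  # ('B', ['A', 'C'])
```

A full reconstruction would apply this repeatedly, replacing each matched parent-plus-children group with a single node, and would need backtracking when a match leads to a dead end.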
Automatic Trend Detection for Time Series / Signal Processing
What are good algorithms to automatically detect a trend or draw a trend line (up trend, down trend, no trend) for time series data? I'd appreciate pointers to any good research paper or a good library in Python, R, or Matlab.
Ideally, the output from this algorithm will have 4 columns:
- trend (up/down/no trend/unknown)
- probability_of_trend or degree_of_trend
Thank you so much for your time.
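As a simple baseline (not from any particular paper), an ordinary least-squares fit already yields both requested columns: the sign of the slope gives the trend direction and the regression p-value gives a degree of confidence. The Mann-Kendall test is a common non-parametric alternative for the same question.

```python
# Baseline sketch: classify a series' trend from the sign and significance
# of a least-squares slope. The alpha threshold is an illustrative choice.
import numpy as np
from scipy import stats

def detect_trend(y, alpha=0.05):
    x = np.arange(len(y))
    res = stats.linregress(x, y)
    if res.pvalue > alpha:                 # slope not significant
        return "no trend", res.pvalue
    return ("up" if res.slope > 0 else "down"), res.pvalue

print(detect_trend([1, 2, 3, 4, 5, 6]))        # clearly increasing
print(detect_trend([3, 1, 4, 1, 5, 9, 2, 6]))  # noisy
```

Applying this in a sliding window gives a per-segment trend label rather than one label for the whole series.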
Flatten or detrend a seasonal time series
I have a repeating time series with a seasonal (weekly) pattern, and I'd like to return the same time series with no week-over-week trend, taking the first value as a starting point.
To be specific, the 1st value will still be 39.8, but the 8th value will also be 39.8 rather than 17.1. If the first seven values were just repeated then there would be a week-long negative trend repeated and I'd like to have no trend at all (so the 7th value of 6.2 would also be higher).
Is there an elegant way to do this, especially one that is robust to zero-valued entries in a time-series (I have a lot of them)?
We can assume the time series trend is linear and constant (i.e. not just piecewise linear).
demand <- ts(
  c(39.8, 33.5, 40.6, 23.6, 11.9, 12.3, 6.2,
    17.1, 10.8, 18, 1, -10.7, -10.4, -16.5,
    -5.6, -11.9, -4.7, -21.7, -33.4, -33.1, -39.2,
    -28.2, -34.6, -27.4, -44.4, -56.1, -55.7,
    -61.8, -50.9, -57.2, -50.1),
  frequency = 7
)
plot(demand)
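For comparison, the same idea in Python (using the series above, and assuming the trend is the global least-squares line): estimate the constant linear trend, subtract it, and re-anchor at the first value so it stays 39.8. The 8th value will match the 1st exactly only if the underlying trend is exactly linear.

```python
# Sketch: remove a constant linear trend while keeping the weekly pattern.
import numpy as np

demand = np.array([39.8, 33.5, 40.6, 23.6, 11.9, 12.3, 6.2, 17.1, 10.8, 18,
                   1, -10.7, -10.4, -16.5, -5.6, -11.9, -4.7, -21.7, -33.4,
                   -33.1, -39.2, -28.2, -34.6, -27.4, -44.4, -56.1, -55.7,
                   -61.8, -50.9, -57.2, -50.1])
t = np.arange(len(demand))
slope, intercept = np.polyfit(t, demand, 1)      # fit the constant linear trend
detrended = demand - slope * t                   # remove trend, keep seasonality
detrended += demand[0] - detrended[0]            # re-anchor at the first value
print(round(detrended[0], 1), round(detrended[7], 1))
```

Because zero-valued entries are just ordinary observations to the least-squares fit, this approach is robust to them.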
SSRS colour expression for upward downward trend line chart
I need to make a line chart with the month on horizontal axis and a value on the vertical axis.
I can't find a way to compare the values grouped by month, and I would prefer not to do this by adding another query against the database.
Solr with Carrot2 Clustering
I'm trying to integrate Solr with the Carrot2 clustering engine. I successfully managed to do so in Solr following this link: Result Clustering. I'm getting the same output as in the techproducts example. But I want to integrate it with my own web application. For that, I came across the Document Clustering Server provided by Carrot2, but when I try to set Solr as the source there and process it, it throws this error:
Problem accessing /dcs/rest. Reason:
Could not perform processing: org.apache.http.client.HttpResponseException: Not Found
Text Documents clustering in Carrot2 Banchmark gives only "Other Topics"
I am trying to cluster more than 100 documents in the Carrot2 benchmark by indexing those docs in Solr. But all I am getting in the Carrot2 benchmark is "Other Topics" and nothing else.
Can someone please suggest something on this? Please find a screenshot of the Carrot2 screen attached.
Here are the parameters and their values:
- Source: solr
- Algorithm: Lingo
- Query: :
- Service URL: http://localhost:8983/solr/******/select
Note: ******* = indexed directory name (I can't write actual name)
Clustering with Apache Solr and Carrot2
I am very new to both Apache Solr and Carrot2. I am trying to index a lot of input files using Solr. The end goal is to cluster the documents.
I am not clear whether the clustering is done by Solr or by the Carrot2 Workbench.
Can anyone guide me in this?
R boxplot; center the axis labels under the tick marks
I plotted a dataframe (16700 obs. of 6 variables) using the following code:
labels <- c("X2137_Fe20", "X2137_FeXS", "vtc1_Fe20",
            "vtc1_FeXS", "VTC1c_Fe20", "VTC1c_FeXS")  # labels x axis
col <- c("chartreuse3", "chocolate2", "chartreuse3",
         "chocolate2", "chartreuse3", "chocolate2")   # colors

# Plot
boxplot(CVtable, outline = FALSE, ylim = c(-0.5, 70), main = "CV Replicas",
        ylab = "RSD(%)", range = 1.5, width = c(9, 9, 9, 9, 9, 9),
        plot = TRUE, col = col,
        par(mar = c(5, 4.5, 5, 0.5) + 0.1), par(cex.lab = 2),
        par(cex.axis = 1.7), notch = TRUE, labels = labels)
dev.off()
I like this box plot, but there are a couple of things I would like to adjust. I need to keep this font size for the x-axis labels, but as you can see the labels are too big and part of them is cut off. The solution is to rotate them 45 degrees, but I haven't managed to find an easy snippet to insert into my script.
I tried to delete the original axes (axes=FALSE), then set new ones with:
boxplot(CVtable, outline = FALSE, ylim = c(0.5, 70), ylab = "RSD(%)",
        range = 1.5, width = c(9, 9, 9, 9, 9, 9), plot = TRUE, col = col,
        par(mar = c(5, 4.5, 5, 0.5) + 0.1), notch = TRUE,
        par(cex.lab = 1.7), axes = FALSE)
axis(1, at = c(1, 2, 3, 4, 5, 6), labels = FALSE, tick = 2, line = NA,
     pos = -1, outer = FALSE, font = 3, lty = "solid", lwd = 2,
     lwd.ticks = 3, col = NULL, col.ticks = NULL, hadj = NA, padj = 0)
axis(2, at = c(0, 10, 20, 30, 40, 50, 60, 70),
     labels = c(0, 10, 20, 30, 40, 50, 60, 70), tick = 2, line = NA,
     pos = 0.5, outer = FALSE, font = 1, lty = "solid", lwd = 2,
     lwd.ticks = 3, col = NULL, col.ticks = NULL, hadj = NA, padj = 0,
     par(cex.lab = 1.5))
text(x = c(1, 2, 3, 4, 5, 6),
     y = par()$usr[3] - 0.1 * (par()$usr[4] - par()$usr[3]),
     labels = labels, srt = 45, adj = 1, xpd = TRUE, par(cex.lab = 2))
and this is the output: img2. Well, I do not know how to center my labels under the tick marks, or how to extend the x axis to the origin of the graph (left) and to the end of the last box (right). Moreover, the argument par(cex.lab=2) to fix the x-axis label font size no longer seems to work in that call.
Any good suggestion?
PS: this is my 1st post, if any needed info is missed, please leave a comment and I will reply as soon as I can. Thank you!
Scalerank/Labelrank Implementation Methods
I'm trying to figure out how others have implemented scalerank/labelrank in their maps, specifically with regard to GeoNames city data, as no preexisting field provides this type of information.
To be clear, I'm talking about a numeric field that specifies which cities display on each zoom level. I'm trying to determine the best way to assign the numeric values.
The city data provided by Natural Earth includes a scalerank that can be joined via geonameid, but there are only ~7000 cities and that doesn't cover the entire GeoNames dataset, which therefore makes it insufficient.
I devised my own method that is based on population and proximity, and may soon incorporate political capital status, but I would really like to know what others have done before I continue.
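For what it's worth, the population part of such a method can be sketched as a simple threshold table (the thresholds and ranks below are hypothetical, not Natural Earth's actual methodology):

```python
# Hypothetical sketch: assign a scalerank from population thresholds, where a
# smaller rank means the city appears at lower (more zoomed-out) zoom levels.
def scalerank(population):
    thresholds = [(10_000_000, 1), (5_000_000, 2), (1_000_000, 3),
                  (500_000, 4), (100_000, 6), (10_000, 8)]
    for pop, rank in thresholds:
        if population >= pop:
            return rank
    return 10  # everything smaller shows only at the highest zoom levels

print(scalerank(12_000_000))  # 1
print(scalerank(250_000))     # 6
```

A proximity term could then demote a city's rank when a larger neighbour sits within some distance, which is the other half of the method described above.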
Any and all input is much appreciated. Thank you in advance!