What is the t-SNE initial PCA step doing?
Looking at the parameters to the Rtsne function:
https://cran.r-project.org/web/packages/Rtsne/Rtsne.pdf
There is a parameter called "pca" defined as "logical; Whether an initial PCA step should be performed (default: TRUE)"
Let's say you have a 10-dimensional feature set and you run t-SNE. I was thinking you would scale the 10D matrix and then pass it to Rtsne().
What does the pca indicated by the pca parameter do?
Would it take the 10D matrix and run PCA on that? If so, would it pass all 10 dimensions in the PCA space to Rtsne()?
Is there any info anywhere else about what this initial PCA step is?
Thank you.
1 answer

The original t-SNE paper used PCA to reduce the dimensionality of the MNIST data prior to running t-SNE.
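To make the effect concrete, here is a minimal sketch of what the `pca = TRUE` preprocessing roughly corresponds to, done by hand. It assumes the documented `initial_dims` default of 50 and uses the built-in `iris` data purely for illustration:

```r
library(Rtsne)

# Scale and deduplicate the feature matrix (Rtsne rejects duplicate rows)
X <- scale(as.matrix(unique(iris[, 1:4])))

# What pca = TRUE does, approximately: project the data onto its first
# min(ncol(X), initial_dims) principal components before running t-SNE.
pcs <- prcomp(X)$x[, 1:min(ncol(X), 50)]

# Running t-SNE on the PCA scores with pca = FALSE should be comparable
# to calling Rtsne(X, pca = TRUE) directly.
fit <- Rtsne(pcs, pca = FALSE, perplexity = 30)
```

For a 10-dimensional input the PCA step changes little (10 < 50, so all components are kept); it matters mainly for very wide data, where it cuts the cost of the pairwise-distance computation.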
See also questions close to this topic

ggplot2 axis for time durations on a log-like scale
I have data that is the duration in seconds (an integer) for a specific type of event. A few durations are days long, some are hours long, more are minutes long, but most are seconds long. I want to histogram this data, and given that the majority of values are small, I want the x-axis to be on the log10 scale.
I know how to format the x-axis labels so that they are expressed as powers of ten, for example: 10^1, 10^2, 10^3, and so on. This is not what I want! For example, 10^5 seconds is not easily understandable, and it would make more sense to express this value in units of hours or days.
What I want is an axis with labelling that looks something like this:
1 second, 10 seconds, 1 minute, 10 minutes, 1 hour, 10 hours, 1 day, 1 week, 1 month, 1 year... etc.
(Bonus points if durations less than 1 second are expressed in ms, μs, ns and very long durations in ka, Gy, etc.)
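One way to get labelling like that is to keep `scale_x_log10` but supply explicit breaks at human-friendly durations. A sketch, with hypothetical log-normal data standing in for the real durations:

```r
library(ggplot2)

# Hypothetical duration data, in seconds
d <- data.frame(secs = rlnorm(1000, meanlog = 4, sdlog = 3))

human_breaks <- c(1, 10, 60, 600, 3600, 36000, 86400, 604800)
human_labels <- c("1 sec", "10 sec", "1 min", "10 min",
                  "1 hour", "10 hours", "1 day", "1 week")

ggplot(d, aes(x = secs)) +
  geom_histogram(bins = 40) +
  scale_x_log10(breaks = human_breaks, labels = human_labels)
```

Extending to ms/μs or months/years is just a matter of adding entries to the two vectors; a labeller function could generate them programmatically.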

Defining polygon object by identifying lines in R
I have a dataset containing lines and I have imported them into R. I want to take a close look at their coordinates and determine, for each line, whether the first and last coordinates are identical (I am looking for polygons). To do this I am using slot(), which lets me inspect the details of the desired object. My final goal is to count the lines whose first and last coordinates match, in order to discover the number of potential polygons in my data. Recapping, my difficulty is the following question: how many of the line objects have the potential to be a polygon? I have done several steps: first, I read my data into R; second, I used slot() to look at the coordinates of each point (a sequence of points represents a line object); third, I tried to count the identical points, but I ran into an error saying object 'crds' not found.
Below you can take a look at the code.
library(maptools)
# Read data directly from National Geophysical Data Center (NGDC) coastline
# extractor.
shorelinedat = "http://www.asdar-book.org/RC1/datasets/auckland_mapgen.dat"
# Assign CRS
llCRS <- CRS("+proj=longlat +ellps=WGS84")
# Read data from mapgen into a SpatialLines object.
auck_shore <- MapGen2SL("auckland_mapgen.dat", llCRS)
# Required code to identify the lines.
lns <- slot(auck_shore, "lines")
table(sapply(lns, function(x) length(slot(x, "Lines"))))
Here is the code in which I faced with the error
# identifying the number of identical coordinates
islands_auck <- sapply(lns, function(x) {
+ crds <- slot(slot(x, "Lines")[[1]], "coords")
+ identical(crds[1, ], crds[nrow(crds), ])
+ })
This is the error
Error in +crds <- slot(slot(x, "Lines")[[1]], "coords") : object 'crds' not found
I would appreciate it if anyone could give a hint.
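One hint, judging from the error text: the leading `+` characters (R's console continuation prompts) appear to have been pasted into the script along with the code, and R then tries to evaluate `+crds` before `crds` exists. A sketch of the same code with the prompts stripped:

```r
# Count lines whose first and last coordinates are identical
# (i.e. closed rings that could be polygons)
islands_auck <- sapply(lns, function(x) {
  crds <- slot(slot(x, "Lines")[[1]], "coords")
  identical(crds[1, ], crds[nrow(crds), ])
})
table(islands_auck)  # TRUE = closed line, a potential polygon
```

This assumes `lns` has been built as in the earlier block; the `table()` call at the end is just one way to summarize the counts.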

Using stat_smooth with geom_line
I'm looking to produce a line graph with 37 lines, using stat_smooth to produce a smooth curve rather than one which changes at every data point.
This is my script for the plot currently:
Data_Arr %>%
  ggplot(aes(x = Period, y = M_Min, colour = Day_Week)) +
  geom_line(aes(group = `Day Name`)) +
  geom_hline(aes(yintercept = 120, linetype = "120MPM"), colour = "lawngreen") +
  ylim(0, 200) +
  stat_smooth(aes(x = Period, y = M_Min, colour = Day_Week)) +
  theme_readable()
stat_smooth appears to be working for some of the lines but not all. I haven't used stat_smooth before, so this is probably a simple fix I'm missing.
This is the dataset
Data_Arr = structure(list(Season = c(1718, 1718, 1718, 1718, 1718, 1718, 1718, 1718, 1718, 1718), Week = c("Wk 40 Newcastle_H", "Wk 40 Newcastle_H", "Wk 40 Newcastle_H", "Wk 41 Wasps_A", "Wk 41 Wasps_A", "Wk 41 Wasps_A", "Wk 42 No_Game", "Wk 40 Newcastle_H", "Wk 40 Newcastle_H", "Wk 40 Newcastle_H" ), Day_Week = c("Wk 40 Newcastle_H Attack_3G", "Wk 40 Newcastle_H Defence_3G", "Wk 40 Newcastle_H Newcastle_H", "Wk 41 Wasps_A Defence_3G", "Wk 41 Wasps_A Attack_3G", "Wk 41 Wasps_A Wasps_A", "Wk 42 No_Game Rugby_Games", "Wk 40 Newcastle_H Attack_3G", "Wk 40 Newcastle_H Defence_3G", "Wk 40 Newcastle_H Newcastle_H"), Date = structure(c(1522886400, 1522713600, 1523059200, 1523318400, 1523491200, 1523664000, 1523923200, 1522886400, 1522713600, 1523059200), class = c("POSIXct", "POSIXt" ), tzone = "UTC"), Day = c("Thu", "Tue", "Sat", "Tue", "Thu", "Sat", "Tue", "Thu", "Tue", "Sat"), `Training/Match` = c("Training", "Training", "Match_P", "Training", "Training", "Match_P", "Training", "Training", "Training", "Match_P"), `Day Name` = c("Attack_3G", "Defence_3G", "Newcastle_H", "Defence_3G", "Attack_3G", "Wasps_A", "Rugby_Games", "Attack_3G", "Defence_3G", "Newcastle_H"), `Full Match/Training/Quarters` = c("Training", "Training", "Match_P", "Training", "Training", "Match_P", "Training", "Training", "Training", "Match_P"), `Squad Classification` = c("Senior", "Senior", "Senior", "Senior", "Senior", "Senior", "Senior", "Senior", "Senior", "Senior"), `Forward/Back` = c("Backs", "Backs", "Backs", "Backs", "Backs", "Backs", "Backs", "Backs", "Backs", "Backs" ), Position = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Prop", "Hooker", "Second Row", "Back Row", "Scrum Half", "Fly Half", "Centre", "Wing/FullBack"), class = "factor"), Player.Name = c("Arr Jonny", "Arr Jonny", "Arr Jonny", "Arr Jonny", "Arr Jonny", "Arr Jonny", "Arr Jonny", "Arr Jonny", "Arr Jonny", "Arr Jonny"), Period = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Min_1", "Min_2", 
"Min_3", "Min_4", "Min_5", "Min_6", "Min_7", "Min_8", "Min_9", "Min_10"), class = "factor"), M_Min = c(154.3, 188, 156.2687833, 175.9911, 159.422783333333, 137.872366666667, 153.349133333333, 150.6, 166, 139.1597833)), .Names = c("Season", "Week", "Day_Week", "Date", "Day", "Training/Match", "Day Name", "Full Match/Training/Quarters", "Squad Classification", "Forward/Back", "Position", "Player.Name", "Period", "M_Min"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"))
Thanks
EDIT: This is the type of plot I'm looking to produce, but this one was produced in Excel and I'm looking to do it in R.
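One possible cause, and this is a guess since only 10 rows of the data are shown: `geom_line` is given `group = `Day Name``, but `stat_smooth` is not, so the smoother falls back to grouping by colour (`Day_Week`) and some series get smoothed together or have too few points. A sketch that puts the grouping in the shared aesthetics so both layers see it (intended for the full 37-series dataset, where each group has enough points for a smoother):

```r
library(ggplot2)
library(dplyr)

Data_Arr %>%
  ggplot(aes(x = Period, y = M_Min, colour = Day_Week, group = `Day Name`)) +
  geom_line() +
  stat_smooth(method = "loess", se = FALSE) +   # one smooth curve per group
  geom_hline(aes(yintercept = 120, linetype = "120MPM"), colour = "lawngreen") +
  ylim(0, 200)
```

`method` and `se` are shown explicitly only to make the smoother's behaviour visible; the defaults may also be fine.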

How to remove noise from a set of data points assuming the data is normal but the noise is uniform?
I have a load of points inside some bounded rectangle of the plane. Most of them follow one of n bivariate normal distributions (the number n is unknown), but a fairly small number of the remaining points instead follow a single uniform distribution across the entire rectangle. I'm willing to consider the case where I have an estimate of how many of the points are noise, but I'd prefer a solution that is agnostic of this.
In this image there are two gaussians and the red points are the uniform noise I want to filter out. Note that I’ve drawn this by hand so the good points might not look properly Gaussian. But in a real instance they will be!
I want to filter out that uniform noise so that I only have a mixture of gaussians left. With my assumption of normality, is there a fairly robust solution?
I have been thinking of using DBSCAN as a cleanup step to remove all of the noise but obviously have that problem of picking parameters.
I currently use GMMs to cluster my data and then some of the uniform noise ends up in its own clusters with massive, crazy covariance matrices that seem to go way outside of the rectangle. But I don’t know a robust way of choosing which clusters are the noisy ones and which are the true gaussians.
It seems I want a measure of density of the detected clusters. Or to relate the number of points with the area of the confidence region, as this ratio will be more exaggerated in the uniform cases.
Are there any papers on similar problems?
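For what it's worth, the mclust package models exactly this setup: a Gaussian mixture plus a uniform ("Poisson process") noise component, fitted jointly, so noisy points are not forced into their own high-covariance clusters. A hedged sketch on simulated data; the initial noise guess here is a crude random placeholder, and in practice it would come from a density heuristic:

```r
library(mclust)

set.seed(1)
# Hypothetical data: two Gaussian clusters plus uniform noise in a rectangle
good  <- rbind(matrix(rnorm(400, mean = 0, sd = 0.5), ncol = 2),
               matrix(rnorm(400, mean = 4, sd = 0.5), ncol = 2))
noise <- cbind(runif(40, -2, 6), runif(40, -2, 6))
X <- rbind(good, noise)

# Mclust needs an initial guess of which points might be noise;
# this random placeholder is purely illustrative.
noise_init <- sample(c(TRUE, FALSE), nrow(X), replace = TRUE,
                     prob = c(0.1, 0.9))

fit <- Mclust(X, initialization = list(noise = noise_init))

# Points classified as 0 were assigned to the uniform noise component
clean <- X[fit$classification != 0, ]
```

The number of Gaussian components is still chosen by BIC inside Mclust, which matches the requirement that n be unknown.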

Weka Expectation Maximization clustering result explanation
I currently have a very large dataset with 2 attributes which contain only strings. The first attribute has search queries (single words) and the second attribute has their corresponding categories. So the data is set up like this (a search query can have multiple categories):
Search Query  Category
X  Y
X  Z
A  B
C  G
C  H
Now I'm trying to use clustering algorithms to get an idea of the different groups my data is comprised of. I read somewhere that when clustering with just strings it is recommended to first use the Expectation Maximization (EM) clustering algorithm to get a sense of how many clusters I need, and then use that with k-means.
Unfortunately, I'm still very new to machine learning and Weka, so I'm constantly reading up on everything to teach myself. I might be making some very simple mistakes here so bear with me, please :)
So I imported a sample (100,000 lines out of 2.7 million) of my dataset in Weka and used the EM clustering algorithm, which gives me the following results:

=== Run information ===
Scheme:       weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation:     testrunawk1_weka_sample.txt
Instances:    100000
Attributes:   2
              att1
              att2
Test mode:    split 66% train, remainder test

=== Clustering model (full training set) ===

EM
==

Number of clusters selected by cross validation: 2
Number of iterations performed: 14

[135,000-line table with strings, 2 clusters and their values]

Time taken to build model (percentage split): 28.42 seconds

Clustered Instances

0      34000 (100%)

Log likelihood: -20.2942
So should I infer from this that I should be using 2 or 34,000 clusters with k-means?
Unfortunately, both seem unusable for me. What I was hoping for is to get, for example, 20 clusters which I can then look at individually to figure out what kind of groups can be found in my data. 2 clusters seems too low given the wide range of categories in my data, and 34,000 clusters would be far too many to inspect manually.
I am unsure if I'm doing something wrong in either the Weka EM algorithm settings (set to standard now) or if my data is just a mess, and if so how would I go about making this work?
I am still very much learning how this all works, so any advice is much appreciated! If there is a need for more examples of my settings or anything else, just tell me and I'll get it for you. I could also send you this dataset if that is easier, but it's too large to paste in here. :)

Agglomerative vs Divisive Hierarchical clustering
I want to cluster my dataset into 4 segments using hierarchical clustering. Knowing the logic behind both approaches, what is the best approach to select when you already know the number of clusters you want?

Dimensionality Reduction to 2D With Custom Distance
I have N vectors of length M (N,M~500).
I'd like to plot them in 2D where the distances between the points are proportional to dot(A,B) or to log(dot(A,B)+1).
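One caveat worth noting: a dot product is a similarity, not a distance, so treating it directly as a dissimilarity will place similar vectors far apart. Setting that aside and taking the request literally, classical MDS (`cmdscale`) accepts an arbitrary dissimilarity matrix. A sketch with random stand-in data:

```r
set.seed(1)
N <- 100; M <- 100                     # smaller than ~500, for illustration
X <- matrix(abs(rnorm(N * M)), N, M)   # non-negative, so dot products are >= 0

S <- tcrossprod(X)                     # S[i, j] = dot(X[i, ], X[j, ])
D <- log(S + 1)                        # the requested "distance"
diag(D) <- 0                           # a dissimilarity needs zero self-distance

# Classical MDS: 2D coordinates whose Euclidean distances approximate D
coords <- cmdscale(as.dist(D), k = 2)
plot(coords)
```

If a rank-based fit is acceptable, `MASS::isoMDS` on the same `D` is a drop-in alternative that only tries to preserve the ordering of the distances.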

Sparse binary encoding of category variables
I'm building a logistic regression model with several categorical features, some of which have very high cardinality (many hundreds of unique values per feature). From a bit of research it seems that a sparse binary encoding performs well. By "binary" I mean that each feature value is encoded as a number which is then represented by its bitwise equivalent; thus just 7 columns can encode 100+ values. I've seen some custom implementations of this encoding. Is there a library that can do this?
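I don't know of a canonical library for this, but the encoding itself is small enough to write in a few lines of base R. A sketch on a hypothetical low-cardinality factor (the same code scales to hundreds of levels):

```r
# Hypothetical categorical feature
x <- factor(c("red", "green", "blue", "green", "cyan"))

codes  <- as.integer(x) - 1L                  # 0-based integer code per value
n_bits <- max(1L, ceiling(log2(nlevels(x))))  # columns needed: ceil(log2(K))

# One column per bit: column k holds bit k of each code
enc <- sapply(seq_len(n_bits) - 1L,
              function(k) bitwAnd(bitwShiftR(codes, k), 1L))
colnames(enc) <- paste0("bit", seq_len(n_bits) - 1L)
```

One design note: unlike one-hot encoding, binary encoding makes unrelated levels share columns, so the model can learn spurious structure from the arbitrary bit patterns; it trades statistical cleanliness for width.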

Dimensionality reduction for 8 linked variables to cluster data into profiles
I'm working on data that involves sorting entries from journals into 1 of 8 categories. Once a whole journal is categorized, I end up with a distribution of the percentage of entries allocated to each category. I've then repeated this for another 99 journals.
What I would like to do now is determine if there are distinct profiles across these journals (for example, to find if there are 25 journals that all have a split within a few %'s of [10%,10%,30%,10%,5%,5%,15%,10%]).
I'm not sure how to properly reduce the dimensions on this data though to move on to clustering, at least not without losing information. Any recommendations would be great. Please let me know if further clarification of what I'm attempting is needed.
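One observation worth considering: 8 dimensions is already quite low, so dimensionality reduction may not be needed at all before clustering, and skipping it avoids any information loss. A sketch clustering the raw percentage profiles directly, on simulated stand-in data:

```r
set.seed(1)
# Hypothetical data: 100 journals x 8 category percentages, rows summing to 100
raw <- matrix(runif(100 * 8), nrow = 100)
profiles <- raw / rowSums(raw) * 100

# Cluster the 8-dimensional profiles directly; scan candidate k values
# by within-cluster sum of squares to look for an elbow
wss <- sapply(2:10, function(k)
  kmeans(profiles, centers = k, nstart = 25)$tot.withinss)

fit <- kmeans(profiles, centers = 4, nstart = 25)  # 4 is an arbitrary example
table(fit$cluster)                                  # journals per profile
```

Because the profiles are compositional (they sum to 100%), a distance-aware alternative is hierarchical clustering on a suitable dissimilarity instead of plain Euclidean k-means; either way, no reduction step is required first.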

MDS - how to ensure that all fields contain a value?
We're using MDS to insert some records, but sometimes users insert empty values by mistake, which causes issues. Since we want to verify and validate all fields to ensure that none of them are null, we have looked at the Business Rules that can be created within MDS. However, it seems these will only validate the data when we apply the business rules while checking the data, or via a stored procedure.
Is there any method or business rule in MDS that prevents users from inserting blank values and discards any rows that do not meet this criterion?

Errors with NMDS, Vegan, R
I've run numerous iterations of NMDS in vegan and have encountered a few hiccups analyzing my data. Initially, the ordination was flat, with most of the data plotted on the left side along the y-axis and one point on the far right side, perhaps suggestive of an absolute outlier. I tested this by running the analysis again with the argument noshare=TRUE.
Here is my code:
Data_sp04 <- metaMDS(Data_sp03[, 3:46], distance = "bray", strata = Data_sp03$Site,
                     k = 3, noshare = TRUE, trymax = 500)
Square root transformation
Wisconsin double standardization
Using step-across dissimilarities:
Too long or NA distances: 49 out of 153 (32.0%)
Stepping across 153 dissimilarities...
Connectivity of distance matrix with threshold dissimilarity 1
Data are disconnected: 2 groups
Groups sizes
 1  2
17  1
and the resultant errors:
Error in cmdscale(dist, k = k) : NA values not allowed in 'd'
In addition: Warning messages:
1: In stepacross(dis, trace = trace, toolong = 0, ...) :
  Disconnected data: Result will contain NAs
2: In metaMDSdist(comm, distance = distance, autotransform = autotransform, :
  Data are disconnected, results may be meaningless
Perhaps the errors are indicative of an absolute outlier? I'm uncertain. More importantly, is there anything else I can do to extract a meaningful ordination?
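The output itself suggests one path: the distance matrix is disconnected into a group of 17 sites and a singleton, and that lone site is what breaks cmdscale. A hedged sketch (untested against your data) using vegan's distconnected to find and drop the disconnected site before re-running the ordination:

```r
library(vegan)

# Bray-Curtis distances on the community matrix
dis    <- vegdist(Data_sp03[, 3:46], method = "bray")

# Group membership per site; sites not in the largest group are
# unreachable through shared species
groups <- distconnected(dis, toolong = 1)
table(groups)

# Keep only the largest connected group and re-run metaMDS
keep <- groups == as.integer(names(which.max(table(groups))))
Data_sp04 <- metaMDS(Data_sp03[keep, 3:46], distance = "bray",
                     k = 3, trymax = 500)
```

Whether dropping the site is defensible is an ecological question; the alternative is a distance that stays finite for sites with no shared species.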

Master Data Services Business Rule Deploy
I am trying to deploy an entity that has a business rule with a notification set to alert a specific group. I am deploying from one environment to another, and both have the same group specified in the MDS portal. Does anyone know how to keep the notification in the destination environment? The rule is there, but the notification just vanished.
Thanks, Fil