Using a DBSCAN model in ELKI: determining the cluster of new objects
I'm using the DBSCAN algorithm in ELKI. Once I have built my model on the training data, how can I predict the cluster of a new observation based on this model (Clustering<Model> c)?
My initial data is a double[][] array, while every new observation is a double[] array. I used the example given in this answer and it works perfectly, but how can I reuse the Clustering<Model> c on new test data? Do I have to write the prediction method myself, or is there an existing method like "predict"?
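DBSCAN itself does not define a prediction step, and as far as I know ELKI's Clustering<Model> offers no predict method. A common workaround is to assign a new observation to the cluster of its nearest training point within epsilon, and call it noise otherwise. A minimal sketch in plain Java (not the ELKI API; the clusterOf array, holding cluster ids exported from the Clustering<Model> result, is an assumption):

```java
public class DBSCANPredict {
    /**
     * Assigns newPoint to the cluster of the nearest clustered training point
     * within eps. train[i] is a training observation, clusterOf[i] its DBSCAN
     * cluster id (negative for noise). Returns -1 if no clustered point is
     * within eps, i.e. the new point would be noise.
     */
    static int predict(double[][] train, int[] clusterOf, double[] newPoint, double eps) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < train.length; i++) {
            if (clusterOf[i] < 0) continue; // skip noise points
            double d = 0;
            for (int j = 0; j < newPoint.length; j++) {
                double diff = train[i][j] - newPoint[j];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            if (d <= eps && d < bestDist) {
                bestDist = d;
                best = clusterOf[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {0.5, 0}, {10, 10}};
        int[] clusterOf = {0, 0, 1};
        System.out.println(predict(train, clusterOf, new double[]{0.2, 0.1}, 1.0)); // 0
        System.out.println(predict(train, clusterOf, new double[]{5, 5}, 1.0));     // -1 (noise)
    }
}
```

Note this is a simplification: for strict DBSCAN semantics one would compare only against core points, not all clustered points.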
See also questions close to this topic

How to create a Spring web app on a server
Firstly, please excuse my horrendously general question, as my understanding of Spring is very limited, but I will explain what I want to achieve, and hopefully someone can point me in the right direction.
I have an application that retrieves some information from some source and updates a database. I'd like to put this program on a Tomcat server, so that the application is run every day.
I'm very new to Spring, and have spent the last few days completing some basic tutorials to display Hello World! in a browser.
However, all of the tutorials I have found relate to controllers for URLs, which, as far as I understand, is not what I want: my application will not have a URL and there will be nothing to display. I just want the application to be "hidden" somewhere on the server and to execute daily.
I know this is a very general question, and as I said my knowledge of Spring is next to nonexistent, so I'd appreciate it if someone could point me in the right direction, I'll happily do research if I just knew what to look for.
Thanks in advance!
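What is described here doesn't need a controller at all; the usual direction to research is Spring's task scheduling (@Scheduled with @EnableScheduling). A sketch of the idea — class names and the cron expression are illustrative, and spring-context must be on the classpath:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Enables processing of @Scheduled annotations in the application context.
@Configuration
@EnableScheduling
class SchedulingConfig {
}

// A plain bean with no URL mapping; Spring invokes the method on schedule.
@Component
class DailyUpdateJob {

    // Cron fields: second minute hour day-of-month month day-of-week.
    // This runs every day at 03:00.
    @Scheduled(cron = "0 0 3 * * *")
    public void updateDatabase() {
        // retrieve the information from the source and update the database here
    }
}
```

Note that Tomcat isn't strictly required for this: a standalone Spring (Boot) application works, and so does OS-level cron running a plain jar.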

Difference between List.stream().forEach() and list.forEach()
Is there any difference between
List.stream().forEach()
and
List.forEach()
e.g. in time complexity or looping technique? For example, given
List<A> myList = new LinkedList<>();
then
myList.forEach(a -> { /* code */ });
and
myList.stream().forEach(a -> { /* code */ });
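For a sequential stream over an ordered collection, the two produce the same elements in the same order, and both are O(n). list.forEach() iterates the collection directly, while stream().forEach() routes through a stream pipeline, which adds a little overhead but composes with operations like filter and map. A small check that both give the same result:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ForEachDemo {
    public static void main(String[] args) {
        List<Integer> myList = new ArrayList<>(Arrays.asList(1, 2, 3));

        // Iterable.forEach: iterates the collection directly
        List<Integer> a = new ArrayList<>();
        myList.forEach(x -> a.add(x * 2));

        // Stream.forEach: same result here, but goes through a stream pipeline
        List<Integer> b = new ArrayList<>();
        myList.stream().forEach(x -> b.add(x * 2));

        System.out.println(a.equals(b)); // true
        System.out.println(a);           // [2, 4, 6]
    }
}
```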

Wrong 2nd argument type. Found: 'java.lang.Integer', required: '? extends java.lang.Number'
This drives me mad. java.lang.Integer definitely extends java.lang.Number. So why does IntelliJ IDEA report a wrong argument type?

Function<Map<Timestamp, ? extends Number>, Map<Timestamp, ? extends Number>> calculateVATFromTotalFunction = result -> {
    Map<Timestamp, ? extends Number> map = new TreeMap<>();
    try {
        for (Map.Entry<Timestamp, ? extends Number> r : result.entrySet()) {
            map.put(r.getKey(), new Integer(r.getValue().intValue() / 6));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return new TreeMap<>(map);
};
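The compiler is right here: Map<Timestamp, ? extends Number> means "a map of some unknown subtype of Number", so put() cannot accept an Integer (the unknown type might turn out to be Double). Declaring the local map with a concrete value type makes it compile. A minimal sketch of the fix (names are illustrative, not from the original code):

```java
import java.sql.Timestamp;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class WildcardDemo {
    // The input may hold any subtype of Number (read-only view), but the
    // local map uses a concrete value type Number, so put() is legal.
    static Function<Map<Timestamp, ? extends Number>, Map<Timestamp, Number>> divideBySix =
        result -> {
            Map<Timestamp, Number> map = new TreeMap<>();
            for (Map.Entry<Timestamp, ? extends Number> r : result.entrySet()) {
                map.put(r.getKey(), r.getValue().intValue() / 6); // autoboxes to Integer
            }
            return map;
        };

    public static void main(String[] args) {
        Map<Timestamp, Integer> totals = new TreeMap<>();
        totals.put(new Timestamp(0L), 60);
        Map<Timestamp, Number> vat = divideBySix.apply(totals);
        System.out.println(vat.get(new Timestamp(0L))); // 10
    }
}
```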

Observation weights with Mclust and mclustBIC functions
I'm working with complex survey data that contains observation weights. Is there a way to include weights with the Mclust and mclustBIC functions? I have come across me.weighted in the mclust package, but it doesn't appear to offer the same functionality as the aforementioned functions. Any guidance would be much appreciated.

How to exclude attributes from clustering in PHPML?
I have student data and I want to cluster the students according to their attributes. The problem is that the student_id shouldn't be used in the clustering process, because it has nothing to do with the clustering; but I cannot just remove the student_id, because then I won't be able to tell which cluster each student belongs to. My array has the following structure:
Student_id  movies  chess  football  ....
19324857    1       0      1         ...
Code
$studentsInfo = [[1,1,0,0,1,1], [1,1,1,1,0,0], [0,1,1,0,0,1], ....];
$kmeans = new KMeans(6);
$kmeans->cluster($studentsInfo);
One workaround is to search, after the clustering has finished, for each student's parameter values and thereby find his cluster, but that is impractical and time-consuming, and I'm working with a lot of entries.
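Independent of the PHP-ML API, the usual pattern for this is language-agnostic: split the id column off before clustering, cluster only the feature columns, and map cluster assignments back to ids by row index, since the row order is unchanged. A sketch of the idea in Java (the cluster() helper is a placeholder standing in for the real k-means call, not a real library function):

```java
import java.util.HashMap;
import java.util.Map;

public class IdMappingDemo {
    // Toy stand-in for a clustering call: returns a cluster id per row.
    // In a real pipeline this would be the k-means assignment of featureRows[i].
    static int[] cluster(double[][] featureRows, int k) {
        int[] assign = new int[featureRows.length];
        for (int i = 0; i < featureRows.length; i++) {
            assign[i] = (int) featureRows[i][0] % k; // placeholder logic
        }
        return assign;
    }

    public static void main(String[] args) {
        // Column 0 is student_id: split it off before clustering.
        double[][] data = {
            {19324857, 1, 0, 1},
            {19324858, 0, 1, 0},
        };
        long[] ids = new long[data.length];
        double[][] features = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            ids[i] = (long) data[i][0];
            features[i] = java.util.Arrays.copyOfRange(data[i], 1, data[i].length);
        }

        int[] assignment = cluster(features, 2);

        // Row order is unchanged, so index i links each id to its cluster.
        Map<Long, Integer> clusterOfStudent = new HashMap<>();
        for (int i = 0; i < ids.length; i++) {
            clusterOfStudent.put(ids[i], assignment[i]);
        }
        System.out.println(clusterOfStudent.get(19324857L)); // 1
    }
}
```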

Deep clustering vs. traditional clustering methods
I've been wondering how modern unsupervised deep clustering methods such as Deep Embedded Clustering (DEC), clustering CNNs, or other recent methods compare to traditional unsupervised clustering algorithms such as hierarchical clustering, k-means, Gaussian mixtures, SOMs, etc.
What has also been done a lot in the past is performing feature extraction first, using autoencoders, and then running conventional clustering algorithms on the extracted features. The new deep clustering approaches seem to combine the two steps of feature extraction and clustering.
How do they compare in terms of computational complexity and performance? Is there a publication that investigates this?

ggplot2 geom_ribbon from mgcv::gamm
I'm trying to add a ribbon based on predictions from a gamm model. This seems a little harder than expected, as gamm is somewhat different from gam.
I first tried directly with stat_smooth, but that will not work (and would not use my entire model, which also includes several other covariates):

library(tidyverse); library(mgcv)

dt = cbind(V1 = scale(sample(1000)),
           Age = rnorm(n = 1000, mean = 40, sd = 10),
           ID = rep(seq(1:500), each = 2)) %>%
  as.data.frame()

# Works fine
dt %>% ggplot(aes(x = Age, y = V1)) +
  stat_smooth(method = "gam", formula = y ~ s(x, bs = "cr"))

# Fails horribly :P
dt %>% ggplot(aes(x = Age, y = V1)) +
  stat_smooth(method = "gamm", formula = y ~ s(x, bs = "cr"))

Maximum number of PQL iterations: 20
iteration 1
Warning message:
Computation failed in `stat_smooth()`:
no applicable method for 'predict' applied to an object of class "c('gamm', 'list')"
I've tried using the predict function on dt.model$gam, but I'm not sure how to use its output to build the CI ribbon:

dt.model = gamm(V1 ~ s(Age, bs = "cr") + s(ID, bs = "re"),
                data = dt, family = "gaussian", discrete = T)
dt$pred = predict(dt.model$gam)

dt %>% ggplot(aes(x = Age, y = V1)) +
  geom_line(aes(group = ID), alpha = .3) +
  geom_point(alpha = .2) +
  geom_smooth(aes(y = pred))

I recognise this example data is poor and gives an odd shape, but I'd like to be able to add a ribbon with the CI along the line as predicted by the model fit. And I'd prefer to do this in ggplot, particularly as I want a spaghetti plot in the background.

Tensorflow Extracting Classification Predictions
I have a TensorFlow NN model for classification of one-hot-encoded group labels (groups are exclusive), which ends with (layerActivs[1] are the activations of the final layer):

probs   = sess.run(tf.nn.softmax(layerActivs[1]), ...)
classes = sess.run(tf.round(probs))
preds   = sess.run(tf.argmax(classes))
The tf.round is included to force any low probabilities to 0. If all probabilities are below 50% for an observation, this means that no class will be predicted. E.g., if there are 4 classes, we could have probs[0,:] = [0.2, 0, 0, 0.4], so classes[0,:] = [0, 0, 0, 0]; preds[0] = 0 follows.

Obviously this is ambiguous, as it is the same result that would occur if we had probs[1,:] = [0.9, 0, 0.1, 0] -> classes[1,:] = [1, 0, 0, 0] -> preds[1] = 0. This is a problem when using the TensorFlow built-in metrics classes, as the functions can't distinguish between no prediction and a prediction in class 0. The following code demonstrates this:

import numpy as np
import tensorflow as tf
import pandas as pd

''' prepare '''
classes = 6
n = 100

# simulate data
np.random.seed(42)
simY = np.random.randint(0, classes, n)      # pretend actual data
simYhat = np.random.randint(0, classes, n)   # pretend pred data
truth = np.sum(simY == simYhat)/n
tabulate = pd.Series(simY).value_counts()

# create placeholders
lab = tf.placeholder(shape=simY.shape, dtype=tf.int32)
prd = tf.placeholder(shape=simY.shape, dtype=tf.int32)
AM_lab = tf.placeholder(shape=simY.shape, dtype=tf.int32)
AM_prd = tf.placeholder(shape=simY.shape, dtype=tf.int32)

# create onehot encoding objects
simYOH = tf.one_hot(lab, classes)

# create accuracy objects
acc = tf.metrics.accuracy(lab, prd)            # real accuracy with tf.metrics
accOHAM = tf.metrics.accuracy(AM_lab, AM_prd)  # OHE argmaxed to labels - expected to be correct

# now set up to pretend we ran a model & generated OHE predictions, all unclassed
z = np.zeros(shape=(n, classes), dtype=float)
testPred = tf.constant(z)

''' run it all '''
# setup
sess = tf.Session()
sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
# real accuracy with tf.metrics
ACC = sess.run(acc, feed_dict={lab: simY, prd: simYhat})
# OHE argmaxed to labels - expected to be correct, but is it?
l, p = sess.run([simYOH, testPred], feed_dict={lab: simY})
p = np.argmax(p, axis=1)
ACCOHAM = sess.run(accOHAM, feed_dict={AM_lab: simY, AM_prd: p})
sess.close()

''' print stuff '''
print('Accuracy')
print('known truth: %0.4f' % truth)
print('on unprocessed data: %0.4f' % ACC[1])
print('on faked unclassed labels data (s.b. 0%%): %0.4f' % ACCOHAM[1])
print('\nTrue Class Freqs:\n%r' % (tabulate.sort_index()/n))
which has the output:

Accuracy
known truth: 0.1500
on unprocessed data: 0.1500
on faked unclassed labels data (s.b. 0%): 0.1100

True Class Freqs:
0    0.11
1    0.19
2    0.11
3    0.25
4    0.17
5    0.17
dtype: float64

Note the frequency for class 0 is the same as the faked accuracy...
I experimented with setting preds to np.nan for observations with no predictions, but tf.metrics.accuracy throws ValueError: cannot convert float NaN to integer; I also tried np.inf, but got OverflowError: cannot convert float infinity to integer.

How can I convert the rounded probabilities to class predictions, while appropriately handling unpredicted observations?

Statistics: Prediction of multiple outputs vs. one input
I have a problem and would like to hear your expert opinion. Let's say you have one input value (call it TotalValue, with a range of 0-10), and there are 5 output values (call them Week1, Week2, Week3, Week4, Week5).
Now, for example, if the input value is 3, it will be distributed over the output values.
So one possible data point would be: TotalValue: 3 -> Week1: 0, Week2: 1, Week3: 0, Week4: 1, Week5: 1
Another possible data point would be: TotalValue: 4 -> Week1: 2, Week2: 0, Week3: 2, Week4: 0, Week5: 0
The condition is that the sum of the output values is always equal to the input value.
Now, given that sufficient data is available, what would be the best way to approach this problem? I'm guessing multi-output regression will not fare well, since we have only one input value.
Thanks all for the help.

ELKI Kmeans clustering Task failed error for high dimensional data
I have 60000 documents which I processed in gensim, obtaining a 60000*300 matrix. I exported this as a csv file. When I import it into the ELKI environment and run KMeans clustering, I get the error below:

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying:
  NumberVector,field AND NumberVector,variable
Available types:
  DBID
  DoubleVector,variable,mindim=266,maxdim=300
  LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
    at [...]

Parallel DBSCAN in ELKI
Here I can see that the class clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN exists, but when I try to invoke it, I get an error:

java -cp elki.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication \
  -algorithm clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN \
  -algorithm.distancefunction EuclideanDistanceFunction \
  -dbc.in infile.txt \
  -dbscan.epsilon 1.0 \
  -dbscan.minpts 1 \
  -verbose \
  -out OUTFOLDER

Class 'clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN' not found for given value. Must be a subclass / implementation of de.lmu.ifi.dbs.elki.algorithm.Algorithm
And this class is indeed absent in the list of available classes which was printed out with error message:
> clustering.CanopyPreClustering
> clustering.DBSCAN
> clustering.affinitypropagation.AffinityPropagationClusteringAlgorithm
> clustering.em.EM
> clustering.gdbscan.GeneralizedDBSCAN
> clustering.gdbscan.LSDBC
> clustering.GriDBSCAN
> clustering.hierarchical.extraction.HDBSCANHierarchyExtraction
> clustering.hierarchical.extraction.SimplifiedHierarchyExtraction
> clustering.hierarchical.extraction.ExtractFlatClusteringFromHierarchy
> clustering.hierarchical.SLINK
> clustering.hierarchical.AnderbergHierarchicalClustering
> clustering.hierarchical.AGNES
> clustering.hierarchical.CLINK
> clustering.hierarchical.SLINKHDBSCANLinearMemory
> clustering.hierarchical.HDBSCANLinearMemory
> clustering.kmeans.KMeansSort
> clustering.kmeans.KMeansCompare
> clustering.kmeans.KMeansHamerly
> clustering.kmeans.KMeansElkan
> clustering.kmeans.KMeansLloyd
> clustering.kmeans.parallel.ParallelLloydKMeans
> clustering.kmeans.KMeansMacQueen
> clustering.kmeans.KMediansLloyd
> clustering.kmeans.KMedoidsPAM
> clustering.kmeans.KMedoidsEM
> clustering.kmeans.CLARA
> clustering.kmeans.BestOfMultipleKMeans
> clustering.kmeans.KMeansBisecting
> clustering.kmeans.KMeansBatchedLloyd
> clustering.kmeans.KMeansHybridLloydMacQueen
> clustering.kmeans.SingleAssignmentKMeans
> clustering.kmeans.XMeans
> clustering.NaiveMeanShiftClustering
> clustering.optics.DeLiClu
> clustering.optics.OPTICSXi
> clustering.optics.OPTICSHeap
> clustering.optics.OPTICSList
> clustering.optics.FastOPTICS
> clustering.SNNClustering
> clustering.biclustering.ChengAndChurch
> clustering.correlation.CASH
> clustering.correlation.COPAC
> clustering.correlation.ERiC
> clustering.correlation.FourC
> clustering.correlation.HiCO
> clustering.correlation.LMCLUS
> clustering.correlation.ORCLUS
> clustering.onedimensional.KNNKernelDensityMinimaClustering
> clustering.subspace.CLIQUE
> clustering.subspace.DiSH
> clustering.subspace.DOC
> clustering.subspace.HiSC
> clustering.subspace.P3C
> clustering.subspace.PreDeCon
> clustering.subspace.PROCLUS
> clustering.subspace.SUBCLU
> clustering.meta.ExternalClustering
> clustering.trivial.ByLabelClustering
> clustering.trivial.ByLabelHierarchicalClustering
> clustering.trivial.ByModelClustering
> clustering.trivial.TrivialAllInOne
> clustering.trivial.TrivialAllNoise
> clustering.trivial.ByLabelOrAllInOneClustering
> clustering.uncertain.FDBSCAN
> clustering.uncertain.CKMeans
> clustering.uncertain.UKMeans
> clustering.uncertain.RepresentativeUncertainClustering
> clustering.uncertain.CenterOfMassMetaClustering
I thought that perhaps this class is internal and is invoked by clustering.gdbscan.GeneralizedDBSCAN, but that runs single-core for me. Maybe I need to add some command-line parameter to enable multiprocessing?
ELKI: clustering objects with Gaussian uncertainty
I am very new to Java and to ELKI. I have three-dimensional objects that carry information about their uncertainty (a multivariate Gaussian). I would like to use FDBSCAN to cluster my data. I am wondering if it is possible to do this in ELKI using the UncertainObject class; however, I am not sure how. Any help or pointers to examples would be very useful.