Correlation (pandas) between a date and an integer time series?
Suppose my data is in the form of
```
date      price
20170909  13000
20170908  20000
20170907  15000
20170906  13000
20170905  15000
```
How do I find the correlation between price and time? df.corr() ignores the date column.
1 answer

Convert the date column to a numeric type, then you can use `corr`:

```
df.date = pd.to_datetime(df.date)
df.date = pd.to_numeric(df.date)
df.corr()
Out[306]:
           date     price
date   1.000000  0.165647
price  0.165647  1.000000
```
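As a self-contained sketch, the same approach on the question's sample data reproduces the answer's numbers (on recent pandas releases, `astype('int64')` is a safer spelling of the datetime-to-numeric step than `pd.to_numeric`):

```python
import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({'date': ['20170909', '20170908', '20170907',
                            '20170906', '20170905'],
                   'price': [13000, 20000, 15000, 13000, 15000]})

# Parse the yyyymmdd strings, then view the datetimes as int64
# nanoseconds so corr() can treat time as an ordinary numeric column.
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d').astype('int64')
print(df.corr())  # date/price correlation ~ 0.165647, as in Out[306]
```

Note that `corr()` here measures only the linear association between time and price; it says nothing about other kinds of trend.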
See also questions close to this topic

how to solve 'range() integer end argument expected, got float'?
I am getting the following error from the cnn_utils.py file while working through a Coursera CNN example. I did everything I could to convert the float to an int, but it did not work. Please help me.
```
TypeError                                 Traceback (most recent call last)
<ipython-input-22-ddfc1f084c11> in <module>()
----> 1 _, _, parameters = model(X_train, Y_train, X_test, Y_test)

<ipython-input-12-a206d32574f3> in model(X_train, Y_train, X_test, Y_test, learning_rate, num_epochs, minibatch_size, print_cost)
     36         num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set
     37         seed = seed + 1
---> 38         minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)
     39
     40         for minibatch in minibatches:

/root/sharedfolder/cnn_utils.py in random_mini_batches(X, Y, mini_batch_size, seed)
     49     # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
     50     num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
---> 51     for k in range(0, num_complete_minibatches):
     52         mini_batch_X = shuffled_X[k * mini_batch_size : k * mini_batch_size + mini_batch_size,:,:,:]
     53         mini_batch_Y = shuffled_Y[k * mini_batch_size : k * mini_batch_size + mini_batch_size,:]

TypeError: range() integer end argument expected, got float.
```
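For reference, a minimal sketch of the usual fix: on Python 2, `math.floor` returns a float, which `range()` rejects, so wrap the bound in `int()`; the cast is a no-op on Python 3, where `math.floor` already returns an int. (The sizes `m = 1080` and `mini_batch_size = 64` below are made up for illustration.)

```python
import math

m, mini_batch_size = 1080, 64  # hypothetical sizes

# int() makes the bound an integer on Python 2; harmless on Python 3.
num_complete_minibatches = int(math.floor(m / mini_batch_size))

for k in range(0, num_complete_minibatches):
    pass  # slice out mini-batch k here

print(num_complete_minibatches)  # 16
```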

Python how to loop multiple variables and different steps parallel?
How can I loop over multiple variables with different steps in parallel?
Like, in C++:

```
for (int i = 0, j = n-1; i < n && j >= 0; i++, j--)
```
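One way to mirror that C++ loop in Python is to `zip` two ranges, each with its own start, stop, and step (a sketch with `n = 5`):

```python
n = 5

# i counts up 0..n-1 while j counts down n-1..0, in lockstep.
pairs = list(zip(range(n), range(n - 1, -1, -1)))
print(pairs)  # [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]

# Different step sizes work the same way; zip stops at the shorter range.
for i, j in zip(range(0, 10, 2), range(100, 0, -25)):
    print(i, j)
```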
How to split a big dataset into train, validation and testing sets
I have a dataset with 30 classes, each class with a different idx. I want to split this dataset into 70%, 20% and 10% train, validation and test sets respectively in Python. Can you please suggest an idea of how to write the code? I am new to coding.
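A minimal sketch of one way to do it with only the standard library: shuffle the row indices once, then cut them 70/20/10. (The dataset size of 1000 is made up; with 30 classes you may prefer a stratified split, i.e. applying the same cut within each class so all splits keep the class balance.)

```python
import random

def split_indices(n, train=0.7, val=0.2, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/val/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 200 100
```

scikit-learn users often get the same effect by calling `train_test_split` twice.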

convert alphanumeric value to date
I get
a1523245800
value in the date field of my incoming data feed. I wish to know how to convert this value into the date dtype. I have tried `pandas.to_datetime`,
but that does not seem to work. Thank you. Here is my code:
pd.to_datetime(['a1523245800'], errors='coerce')
and here is the output of the above:
DatetimeIndex(['NaT'], dtype='datetime64[ns]', freq=None)
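The value looks like a Unix timestamp (seconds since the epoch) with a stray 'a' prefix, which is why plain `to_datetime` coerces it to NaT. A sketch of one way to handle it, assuming the prefix is always the letter 'a':

```python
import pandas as pd

raw = 'a1523245800'

# Strip the non-numeric prefix, then tell to_datetime the number is in
# seconds since the Unix epoch.
ts = pd.to_datetime(int(raw.lstrip('a')), unit='s')
print(ts)  # 2018-04-09 03:50:00
```

For a whole column, `pd.to_datetime(pd.to_numeric(s.str.lstrip('a')), unit='s')` applies the same idea.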

pandas dataframe: loc vs query performance
I have 2 dataframes in python that I would like to query for data.
DF1: 4M records x 3 columns. The query function seems more efficient than the loc function.
DF2: 2K records x 6 columns. The loc function seems much more efficient than the query function.
Both queries return a single record. The simulation was done by running the same operation in a loop 10K times.
Running python 2.7 and pandas 0.16.0
Any recommendations to improve the query speed?
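Hard to say more without the code, but one common pattern: both `loc` with a boolean mask and `query` scan the whole column on every call, so for repeated single-record lookups an index usually beats both. A hypothetical benchmark sketch (column names and sizes made up; `query` delegates to numexpr only above a row-count threshold, which is one plausible reason it wins on the 4M-row frame and loses on the 2K-row one):

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': np.arange(1_000_000),
                   'val': np.random.rand(1_000_000)})

# Both of these scan all rows on every call.
t_loc = timeit.timeit(lambda: df.loc[df['key'] == 123_456], number=20)
t_query = timeit.timeit(lambda: df.query('key == 123456'), number=20)

# Indexing by the lookup column turns each lookup into an index probe.
indexed = df.set_index('key')
t_index = timeit.timeit(lambda: indexed.loc[123_456], number=20)

print(t_loc, t_query, t_index)
```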

Pandas - rolling codes with dates
I have a Pandas DataFrame which contains an ID, Code and Date. For certain codes I would like to fill subsequent appearances of the ID, based on the date, with a determined set of missing codes. I would also like to know the first appearance of the code against the respective ID.
Example as follows, NB: missing codes are A and B (only codes A and B carry over):
```
import pandas as pd
d = {'ID': [1, 2, 1, 2, 3, 1],
     'date': ['20170322', '20170321', '20170323',
              '20170324', '20170328', '20170328'],
     'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C']}
df = pd.DataFrame(data=d)  # only A and B codes carry over
df
```
The target dataframe would ideally look as follows:
```
import pandas as pd
d = {'ID': [1, 2, 1, 2, 3, 1],
     'date': ['20170322', '20170321', '20170324',
              '20170322', '20170328', '20170328'],
     'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C'],
     'Missing_code': ['', '', 'A', 'A', '', 'A, B'],
     'First_code_date': ['', '', '20170322', '20170321', '', '20170323, 20170324']}
df = pd.DataFrame(data=d)
df
```
Note I am not fussy about how 'First_code_date' looks, provided it is dynamic, as the number of codes may increase or decrease.
If the example is not clear please let me know and I will adjust.
Thank you for your help.
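If I read the rules right, here is a sketch of one approach: walk each ID's rows in date order, remember when each carry-over code (A or B) first appeared for that ID, and flag it on later rows where it is absent. (The First_code_date pairing below follows the rule as stated; the dates in the question's target example look slightly shifted relative to the input, so adjust if my reading is off.)

```python
import pandas as pd

CARRY = {'A', 'B'}  # only these codes carry over

d = {'ID': [1, 2, 1, 2, 3, 1],
     'date': ['20170322', '20170321', '20170323',
              '20170324', '20170328', '20170328'],
     'Code': ['A, C', 'A', 'B, C', 'E, D', 'A', 'C']}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

missing, first_dates = {}, {}
seen = {}  # (ID, code) -> date of that code's first appearance
for idx, row in df.sort_values(['ID', 'date']).iterrows():
    codes = {c.strip() for c in row['Code'].split(',')}
    gone = sorted(c for c in CARRY
                  if (row['ID'], c) in seen and c not in codes)
    missing[idx] = ', '.join(gone)
    first_dates[idx] = ', '.join(seen[(row['ID'], c)].strftime('%Y%m%d')
                                 for c in gone)
    for c in codes & CARRY:
        seen.setdefault((row['ID'], c), row['date'])

df['Missing_code'] = pd.Series(missing)       # aligns on the original index
df['First_code_date'] = pd.Series(first_dates)
print(df['Missing_code'].tolist())  # ['', '', 'A', 'A', '', 'A, B']
```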

How can I compute the Chi-Squared test of independence between two variables X and Y while conditioning on a third variable Z in R?
I wish to compute the Chi-Squared test of independence between X and Y given an observed variable Z. X, Y, and Z are all binary random variables. How can I achieve this in R?
Thank you.

Unbiased variance estimate (n-1) simulation in Python fails
I was trying to write a little program that simulates sampling from random numbers in Python3. But it seems to show the opposite of what I intended. What am I doing wrong? It must be extremely easy, but I don't get it.
```
import random
import statistics
import math

pcounter = 0
counter = 0
for loop in range(1000):
    l = []
    for x in range(500):
        l.append(random.randint(1, 1000))
    m = statistics.mean(l)
    v = list(l)
    v[:] = [(x - m)**2 for x in v]
    realvariance = sum(v) / len(v)
    #print("Real Variance: " + str(sum(v) / len(v)))
    #print("Real Mean: " + str(m))
    sample = random.sample(l, 10)
    v = list(sample)
    #print(v)
    v[:] = [(x - m)**2 for x in v]
    samplem = statistics.mean(sample)
    samplebiasedvariance = sum(v) / len(v)
    samplevariance = sum(v) / (len(v) - 1)
    print(samplebiasedvariance)
    print(samplevariance)
    print(realvariance)
    print((samplebiasedvariance - realvariance)**2 < (samplevariance - realvariance)**2)
    if (samplebiasedvariance - realvariance)**2 < (samplevariance - realvariance)**2:
        pcounter = pcounter + 1
        print("biased Variance wins: " + str(pcounter))
    else:
        counter = counter + 1
        print("Variance wins: " + str(counter))
print("biased Variance wins: " + str(pcounter))
print("Variance wins: " + str(counter))
```
This results in:
```
biased Variance wins: 563
Variance wins: 437
```
But it should be the other way around: I would expect the biased variance to be worse than the unbiased variance calculated using (n-1), so the unbiased one should more often be closer to the true population variance (realvariance) than the biased one.
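One detail worth a second look: the snippet measures the sample's squared deviations from the *population* mean m (the second `(x - m)**2` comprehension reuses m, not samplem). When deviations are taken from the true mean, dividing by n is already unbiased and dividing by n-1 overshoots, which is consistent with the observed result. A sketch with the sample mean instead, where the textbook picture reappears:

```python
import random
import statistics

random.seed(0)
population = [random.randint(1, 1000) for _ in range(500)]
mu = statistics.mean(population)
real_variance = sum((x - mu) ** 2 for x in population) / len(population)

n, trials = 10, 20000
avg_biased = avg_unbiased = 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    sm = statistics.mean(sample)            # sample mean, not mu
    ss = sum((x - sm) ** 2 for x in sample)
    avg_biased += ss / n / trials           # divide by n
    avg_unbiased += ss / (n - 1) / trials   # divide by n - 1

# The divide-by-n estimator underestimates the population variance by
# roughly a factor (n-1)/n; the n-1 estimator averages out close to it.
print(avg_biased / real_variance, avg_unbiased / real_variance)
```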

count number of unique elements in each columns with dplyr in sparklyr
I'm trying to count the number of unique elements in each column of the Spark dataset s.
However, it seems that Spark doesn't recognize tally():
```
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY
```
It seems that Spark doesn't recognize simple R functions either, like unique or length. I can run the code on local data, but when I try to run the exact same code on a Spark table it doesn't work.
```
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d %>% group_by(group) %>% summarise_each(funs(length(unique(.))))
# A tibble: 2 × 3
#   group    X1    X2
#   <chr> <int> <int>
# 1     a     5     1
# 2     b     5     1
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(length(unique(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```

Pandas: datetime conversion from dtype object
I am working on a timeseries dataset which looks like this:
```
         DateTime  SomeVariable
0  01/01 01:00:00       0.24244
1  01/01 02:00:00       0.84141
2  01/01 03:00:00       0.14144
3  01/01 04:00:00       0.74443
4  01/01 05:00:00       0.99999
```
The date is without year. Initially, the dtype of the DateTime is object and I am trying to change it to pandas datetime format. Since the date in my data is without year, on using:
df['DateTime'] = pd.to_datetime(df.DateTime)
I am getting the error
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1-01-01 01:00:00
I understand why I am getting the error (as it's not according to the pandas acceptable format), but what I want to know is how I can change the dtype from object to pandas datetime format without having year in my date. I would appreciate the hints.
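A sketch of one way around it: pass the format explicitly so pandas doesn't guess. `%m/%d %H:%M:%S` has no year, and strptime-style parsing then falls back to the default year 1900, which sits safely inside the datetime64[ns] range (years 1677-2262):

```python
import pandas as pd

s = pd.Series(['01/01 01:00:00', '01/01 02:00:00', '01/01 03:00:00'])

# Explicit format without %Y: the missing year defaults to 1900.
dt = pd.to_datetime(s, format='%m/%d %H:%M:%S')
print(dt.iloc[0])  # 1900-01-01 01:00:00

# If the real year is known, it can be stamped on afterwards.
dt_2017 = dt.apply(lambda t: t.replace(year=2017))
```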

Adjusting scale for time series using facet_wrap
I have 15-minute interval water temperature data from 27 environmental stations within a Bay. I have calculated and plotted the daily average temperature for the whole Bay, using facet_wrap to plot by year: 2015, 2016 and 2017. But there are outlier points after December for both 2015 and 2016 (see below). Is there a way to show only data from January until December, and cut out those December outliers?
My data
```
> head(tempdat)
  station    sampletime_15min temp_c X station1 type     lat       lon depth_ft active..starting.20170524. Notes ID2  Easting Northing
1     d01 2016-01-26 13:15:00 27.605 1      d01   O2 18.3355 -64.98416       45                         no         1 290318.6  2028447
2     d01 2016-01-26 13:30:00 27.487 1      d01   O2 18.3355 -64.98416       45                         no         1 290318.6  2028447
3     d01 2016-01-26 13:45:00 27.479 1      d01   O2 18.3355 -64.98416       45                         no         1 290318.6  2028447
4     d01 2016-01-26 14:00:00 27.471 1      d01   O2 18.3355 -64.98416       45                         no         1 290318.6  2028447
5     d01 2016-01-26 14:15:00 27.445 1      d01   O2 18.3355 -64.98416       45                         no         1 290318.6  2028447
6     d01 2016-01-26 14:30:00 27.462 1      d01   O2 18.3355 -64.98416       45                         no         1 290318.6  2028447
            Area       Date
1 Airport runway 2016-01-26
2 Airport runway 2016-01-26
3 Airport runway 2016-01-26
4 Airport runway 2016-01-26
5 Airport runway 2016-01-26
6 Airport runway 2016-01-26
```
My code
```
tempdat_pre <- readRDS("Data/mnb_temp_all_20171117_QC.rds")
#tempdat_pre <- tempdat_prep
station <- read.csv("Data/Enviro_stations_area.csv")
tempdat_p1 <- merge(tempdat_pre, station)
tempdat_p1$Date <- (as.Date(tempdat_p1$sampletime_15min))
tempdat_p1 = tempdat_p1[!tempdat_p1$Date < "2015-01-01", ]

# Calculate averages
tempdat <- tempdat_p1
daily <- tempdat %>%
  group_by(Date) %>%
  dplyr::summarise(avgtemp = mean(temp_c))
daily
daily$Date <- as.POSIXct(daily$Date, format = "%Y-%m-%d", tz = "UTC")
daily <- as.data.frame(daily)
daily$Date <- as.POSIXct(daily$Date, format = "%Y-%m-%d", tz = "UTC")
daily$Year <- format(as.Date(daily$Date), "%Y")
daily = daily[!daily$Date < "2015-01-01", ]  # to remove 2014 dates in case

# By grouped stations
daily_2 <- tempdat %>%
  group_by(Area, Date) %>%
  dplyr::summarise(avgtemp = mean(temp_c))
daily_2
daily_2$Date <- as.POSIXct(daily_2$Date, format = "%Y-%m-%d", tz = "UTC")
daily_2 <- as.data.frame(daily_2)
daily_2$Date <- as.POSIXct(daily_2$Date, format = "%Y-%m-%d", tz = "UTC")
daily_2$Year <- format(as.Date(daily_2$Date), "%Y")
#daily_2 = daily_2[!daily_2$Date < "2015-01-01", ]  # to remove 2014 dates in case

# PLOTS
dtplot <- ggplot(daily, aes(x = Date, y = avgtemp)) +
  #geom_point() +
  geom_line(colour = "red", size = 1.5) +
  #scale_colour_gradient2(low = "blue", mid = "green", high = "red", midpoint = 16) +
  #geom_smooth(color = "red", size = 1) +
  scale_x_datetime(name = "Date", date_breaks = "1 months", labels = date_format(format = "%b")) +
  #scale_y_continuous(limits = c(5, 30), breaks = seq(5, 30, 5)) +
  #theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Daily average water temperature in Brewer's Bay, St. Thomas, USVI") +
  xlab("Date") + ylab("Average Temperature ( ºC )") +
  facet_wrap(~Year, scales = 'free_x', ncol = 1, nrow = 3) +
  theme(axis.text = element_text(size = 10), axis.title = element_text(size = 12, face = "bold"))
dtplot
```

Dendrograms with SciPy
I have a dataset that I shaped according to my needs, the dataframe is as follows:
```
Index          A    B     C     D  .....  Z
Date/Time      1    0     0  0,35   ...   1
Date/Time   0,75    1     1     1         1
```
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example: the whole A column compared with the whole B column over the whole time span).
I am expecting an output like this: Example Dendrogram http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2016/AY/c6ay01061j/c6ay01061jf4_hires.gif
I tried to construct the linkage matrix with
Z = hierarchy.linkage(X, 'ward')
However, when I print the dendrogram, it just shows an empty picture. There is no problem if I compare every time point with every other and plot, but that way the dendrogram becomes far too complicated to observe, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
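A sketch of the transpose approach, with a small random frame standing in for the real 8878-row data: passing `df.T` makes each *column* one observation for the linkage, so the dendrogram gets one leaf per column rather than one per time point:

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical stand-in for the real data: rows are time points,
# columns A..E are the series to compare.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 5)), columns=list('ABCDE'))

# Each row passed to linkage is one whole time series (one column of df).
Z = hierarchy.linkage(df.T.values, method='ward')

# One merge per row of Z: shape is (n_columns - 1, 4).
print(Z.shape)  # (4, 4)

# no_plot=True computes the layout without drawing; drop it (and add
# matplotlib) to actually render, with column names as leaf labels.
dn = hierarchy.dendrogram(Z, labels=list(df.columns), no_plot=True)
print(dn['ivl'])  # leaf labels in dendrogram order
```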