Python Data manipulation: Duplicate and Average row and column values using dates

Hi I have a dataset in the following format:

enter image description here

Code for replicating the data:

import pandas as pd
d1 = {'Year': 
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
  'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)

I input the numbers as a string to show blank cells

Where the first three columns denotes date (Year, Month and Day) and the following columns represent individuals (My actual data file consists of about 300 such rows and about 1000 subjects. I presented a subset of the data here).

Where the column value refers to expenditure on FMCG products. What I would like to do is the following:

Part 1 (Beginning and end points)

a) For each individual locate the first observation and duplicate the value of the first observation for atleast the previous six months. For example: Subject C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from June 10, 2008 to be equal to 65 for Subject C (Roughly 2/12/2008 is the cutoff date. SO we leave the 3rd cell from the top for Subject_C's column blank).

b) Locate last observation and repeat the last observation for the following 3 months. For example for Subject_A, we repeat 35 twice (till 6th November 2008).

Please refer to the following diagram for the highlighted cell with the solutions.

enter image description here

Part II - (Rows in between)

Next I would like to do two things (I would need to do the following three steps separately, not all at one time):

For individuals like Subject_A, locate two observations that come one after the other (30 and 35).

i) Use the average of the two observations. In this case we would have 32.5 in the four rows without caring about time.

for eg:

enter image description here

ii) Find the total time between two observations and take the mean of the time. For the 1st half of the time period assign the first value and for the 2nd half assign the second value. For example - for subject 1, the total days between 01/22/208 and 08/10/2008 is 201 days. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A and for the remaining value assign 35. In this case the columns for Subject_A and Subject_C will look like:

enter image description here

The final dataset will use (a), (b) & (i) or (a), (b) & (ii)

Final data I [using a,b and i]

enter image description here

Final data II [using a,b and ii]

enter image description here

I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.

Follow up question and Issues

Thanks @Juan for the initial answer. Here's my follow up question. Suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to incorporate more than 2 observations?

import pandas as pd
d1 = {'Year': 
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
  'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)

Issues For the current code, I found an issue for part II (ii). This is the output that I get:

enter image description here

This is actually on the right track. The two cells above 35 does not seem to get updated. Is there something wrong on my end? Also the same question as before, would we be able to extend it to the case of >2 observations?

1 answer

  • answered 2018-03-20 21:11 Juan Javier Santos Ochoa

    Here a code solution for subject A. Should work with the other subjects:

    d1 = {'Year': 
    ['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
      'Month':['1','1','2','6','7','8','8','11','12','12'],
    'Day':['6','22','6','18','3','10','14','6','16','24'],
    'Subject_A':['','30','','45','','35','','','',''],
    'Subject_B':['','','','','','','','40','',''],
    'Subject_C': ['','','','','','65','','50','','']}
    d1 = pd.DataFrame(d1)
    d1 = pd.DataFrame(d1)
    ## Create a variable named date
    d1['date']= pd.to_datetime(d1['Year']+'/'+d1['Month']+'/'+d1['Day'])
    # convert to float, to calculate mean
    d1['Subject_A'] = d1['Subject_A'].replace('',np.nan).astype(float)
    # index of the not null rows
    subja = d1['Subject_A'].notnull()
    
    ### max and min index row with notnull value
    max_id_subja = d1.loc[subja,'date'].idxmax()
    min_id_subja = d1.loc[subja,'date'].idxmin()
    ### max and min date for Sub A with notnull value
    max_date_subja = d1.loc[subja,'date'].max()
    min_date_subja = d1.loc[subja,'date'].min()
    ### value for max and min date
    max_val_subja = d1.loc[max_id_subja,'Subject_A']
    min_val_subja = d1.loc[min_id_subja,'Subject_A']
    #### Cutoffs
    min_cutoff = min_date_subja-pd.Timedelta(6, unit='M')
    max_cutoff = max_date_subja+pd.Timedelta(3, unit='M')
    
    ## PART I.a
    d1.loc[(d1['date']<min_date_subja) & (d1['date']>min_cutoff),'Subject_A'] = min_val_subja
    ## PART I.b
    d1.loc[(d1['date']>max_date_subja) & (d1['date']<max_cutoff),'Subject_A'] = max_val_subja
    ## PART II
    d1_2i = d1.copy()
    d1_2ii = d1.copy()
    
    lower_date = min_date_subja
    lower_val = min_val_subja.copy()
    next_dates_index = d1_2i.loc[(d1['date']>min_date_subja) & subja].index
    for N in next_dates_index:
        next_date = d1_2i.loc[N,'date']
        next_val =  d1_2i.loc[N,'Subject_A']
        #PART II.i
        d1_2i.loc[(d1['date']>lower_date) & (d1['date']<next_date),'Subject_A'] = np.mean([lower_val,next_val])
        #PART II.ii
        mean_time_a = pd.Timedelta((next_date-lower_date).days/2, unit='d')
        d1_2ii.loc[(d1['date']>lower_date) & (d1['date']<=lower_date+mean_time_a),'Subject_A'] = lower_val
        d1_2ii.loc[(d1['date']>lower_date+mean_time_a) & (d1['date']<=next_date),'Subject_A'] = next_val
        lower_date = next_date
        lower_val = next_val
    print(d1_2i)
    print(d1_2ii)