Python 3.6 panda merge by more than one key field

I have created two different datasets from my input to get two different measures. Now I will need to merge both inputs by more than a column. I need to add in the function merge in the parameter on the required columns.

My code:

import pandas as pn


df_csv = pn.read_csv('E:\\Sources\\BixiMontrealRentals2017\\OD_2017-06.csv',dtype={"user_id": int},low_memory= False,sep=',')


# data readiness for stations as starting 
df_csv['start_date_dt']= pn.to_datetime(df_csv['start_date'],infer_datetime_format=True)
df_csv['start_day'] = df_csv['start_date_dt'].dt.weekday_name
df_csv['start_hour'] = df_csv['start_date_dt'].dt.hour
df_start = df_csv.drop(df_csv.columns[[0,2,3,4,5,6]],axis=1)
df_start_summ = df_start.groupby(['start_station_code', 'start_day','start_hour']).size().reset_index(name='start_counts')
print(df_start_summ.head())

# data readiness for stations as ending

df_csv['end_date_dt']= pn.to_datetime(df_csv['end_date'],infer_datetime_format=True)
df_csv['end_day'] = df_csv['end_date_dt'].dt.weekday_name
df_csv['end_hour'] = df_csv['end_date_dt'].dt.hour
df_end = df_csv.drop(df_csv.columns[[0,1,2,4,5,6,7,8,9]],axis=1)
df_end_summ = df_end.groupby(['end_station_code', 'end_day','end_hour']).size().reset_index(name='end_counts')
print(df_end_summ.head())

Output for both datasets:

enter image description here

My ideal merge should be applied by station, day, hour. However, the columns in each dataset have different names and I don't know how to point the required join.

df_rowdata = pn.merge(df_start_summ,df_end_summ,
                      left_on= 'start_station_code', 'start_day','start_hour'                
                      ,how='inner')

I would need something like in T-SQL:

left join 
on start_station_code = end_station_code
and start_day = end_day
and start_hour = end_hour

I appreciate your help and comments guys.

1 answer

  • answered 2018-02-13 01:39 sacul

    The syntax you're using for a pandas dataframe merge is not quite right. Also, you're using how='inner', but the SQL join that you want to replicate is a left join, so you might be looking to use how='left' instead.

    Try something like:

    # Reproduce example dfs
    import pandas as pd
    
    df_start_summ = pd.DataFrame({'start_station_code':[5002]*5,
                                 'start_day':['Friday']*5,
                                 'start_hour':[6,8,9,12,14],
                                 'start_counts':[1,1,1,1,2]
                                 })[['start_station_code',
                                     'start_day', 'start_hour',
                                 'start_counts']]
    
    
    df_end_summ = pd.DataFrame({'end_station_code':[5002]*5,
                                 'end_day':['Friday']*5,
                                 'end_hour':[4,8,12,13,15],
                                 'end_counts':[1,1,1,1,1]
                                 })[['end_station_code',
                                     'end_day', 'end_hour',
                                     'end_counts''']]
    
    # inner merge (actually the default, you could omit the 'how='inner'')
    inner = df_start_summ.merge(df_end_summ, 
        left_on=['start_station_code', 'start_day', 'start_hour'], 
        right_on=['end_station_code', 'end_day', 'end_hour'], how = 'inner')
    
    # left merge:
    left = df_start_summ.merge(df_end_summ, 
        left_on=['start_station_code', 'start_day', 'start_hour'], 
        right_on=['end_station_code', 'end_day', 'end_hour'], how = 'left')
    

    This results in:

    >>> inner_merge
       start_station_code start_day  start_hour  start_counts  end_station_code  \
    0                5002    Friday           8             1              5002   
    1                5002    Friday          12             1              5002   
    
      end_day  end_hour  end_counts  
    0  Friday         8           1  
    1  Friday        12           1  
    
    
    >>> left_merge
       start_station_code start_day  start_hour  start_counts  end_station_code  \
    0                5002    Friday           6             1               NaN   
    1                5002    Friday           8             1            5002.0   
    2                5002    Friday           9             1               NaN   
    3                5002    Friday          12             1            5002.0   
    4                5002    Friday          14             2               NaN   
    
      end_day  end_hour  end_counts  
    0     NaN       NaN         NaN  
    1  Friday       8.0         1.0  
    2     NaN       NaN         NaN  
    3  Friday      12.0         1.0  
    4     NaN       NaN         NaN  
    

    Also take a look at the pandas documentation for merges.