Conditional filtering of pandas data frame
I have a pandas data frame about football results. Each row of the dataframe represents a football match. The information of each match are:
Day  WinningTeamID  LosingTeamID  WinningPoints  LosingPoints  WinningFouls  ... 
1 13 1 45 5 3
1 12 4 21 12 4
That is, the information are divided based on the game result: winning or losing. I would like to retrieve the data of each game for a specific team (e.g. 12).
Day  Points  Fouls  ... 
1 21 4 ...
2 32 6 ...
The simplest way is to scan the whole dataframe, check if a specific teamID is on WinningID or LosingID and then, based on that, retrieve the "Losingcolumns" or the "Winningcolumns". Is there a more "elegant" way of slicing the pandas dataframe? This will simply give me the subset of matches where the team 12 is involved.
df[df[WinningTeamID == 12]  [LosingTeamID == 12]]
How can I filter those data and create the desired dataframe?
2 answers

Suppose we could choose the format of the data. What would be ideal? Since we want to collect stats per
TeamID
, ideally we would have a column ofTeamID
s and separate columns for each stat including the outcome.So the data would look like this:
 Day  Outcome  TeamID  Points  Fouls   1  Winning  13  45  3   1  Losing  1  5  NaN   1  Winning  12  21  4   1  Losing  4  12  NaN 
Here is how we can manipulate the given data into the desired form:
import numpy as np import pandas as pd df = pd.DataFrame({'Day': [1, 1], 'LosingPoints': [5, 12], 'LosingTeamID': [1, 4], 'WinningFouls': [3, 4], 'WinningPoints': [45, 21], 'WinningTeamID': [13, 12]}) df = df.set_index(['Day']) columns = df.columns.to_series().str.extract(r'^(LosingWinning)?(.*)', expand=True) columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], names=['Outcome', None]) df.columns = columns df = df.stack(level='Outcome').reset_index() print(df)
yields
Day Outcome Fouls Points TeamID 0 1 Losing NaN 5 1 1 1 Winning 3.0 45 13 2 1 Losing NaN 12 4 3 1 Winning 4.0 21 12
Now we can obtain all the stats about
TeamID
12 usingprint(df.loc[df['TeamID']==12]) # Day Outcome Fouls Points TeamID # 3 1 Winning 4.0 21 12
df = df.set_index(['Day'])
moves theDay
column into the index.The purpose of placing
Day
in the index is to "protect" it from manipulations (primarily thestack
call) that are intended only for columns labeledLosing
orWinning
. If you had other columns, such asLocation
orOfficials
which, likeDay
, do not pertain toLosing
orWinning
, then you'd want to include them in theset_index
call too: e.g.df = df.set_index(['Day', 'Location', 'Officials'])
.Try commenting out
df = df.set_index(['Day'])
from the code above. Then step through the code linebyline. In particular, compare whatdf.stack(level='Outcome')
looks like with and without theset_index
call:With
df = df.set_index(['Day'])
:In [26]: df.stack(level='Outcome') Out[26]: Fouls Points TeamID Day Outcome 1 Losing NaN 5 1 Winning 3.0 45 13 Losing NaN 12 4 Winning 4.0 21 12
Without
df = df.set_index(['Day'])
:In [29]: df.stack(level='Outcome') Out[29]: Day Fouls Points TeamID Outcome 0 NaN 1.0 3.0 45 13 Losing NaN NaN 5 1 Winning 1.0 3.0 45 13 1 NaN 1.0 4.0 21 12 Losing NaN NaN 12 4 Winning 1.0 4.0 21 12
Without the
set_index
call you end up with rows that you do not want  the rows whereOutcome
equalsNaN
.
The purpose of
columns = df.columns.to_series().str.extract(r'^(LosingWinning)?(.*)', expand=True) columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], names=['Outcome', None])
is to create a multilevel column index (called a MultiIndex) which labels columns
Losing
orWinning
as appropriate. Notice that by separating out theLosing
orWinning
parts of the labels, the remaining parts of the labels become duplicated.We end up with a DataFrame,
df
, with two columns labeled "Points" for example. This allows Pandas to identify these columns as somehow similar.The big gain  the reason why we went through the trouble of setting up the MultiIndex is so that these "similar" columns can be "unified" by calling
df.stack
:In [47]: df Out[47]: Outcome Losing Winning Points TeamID Fouls Points TeamID Day 1 5 1 3 45 13 1 12 4 4 21 12 In [48]: df.stack(level="Outcome") Out[48]: Fouls Points TeamID Day Outcome 1 Losing NaN 5 1 Winning 3.0 45 13 Losing NaN 12 4 Winning 4.0 21 12
stack
,unstack
,set_index
andreset_index
are the 4 fundamental DataFrame reshaping operations.df.stack
moves a level (or levels) of the column MultiIndex into the row index.df.unstack
moves a level (or levels) of the row index into the column index.df.set_index
moves column values into the row indexdf.reset_index
moves a level (or levels) of the row index into a column of values
Together, these 4 methods allow you to move data in your DataFrame anywhere you want  in the columns, the row index or the column index.
The above code is an example of how to use these tools (well, two of the four) to reshape data into a desired form.

df.query['WinningTeamID == 12  LosingTeamID == 12']