Writing better Pandas code
Tips from my experience writing better pandas code that I have found useful from a readability and operational standpoint.
Problem Intro
Problem Set One
A significant number of data scientists / analysts write their prototyping code in pandas and, when required to ship it to production, encounter one or more of the following problems:
- Multiple re-assignments and hard-coded references.
- Pandas code that is hard to understand, making it difficult to decipher the author’s intention.
- Especially hard to make changes (partly due to the absence of testing).
For example,
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4]})
df["B"] = df.A * 2

def divide_by_ten(input_series):
    return input_series / 10

df["C"] = divide_by_ten(df.B)
Problem Set Two
It also happens that quite a number of data professionals love R’s dplyr or Spark’s DataFrame API.
In my opinion, this is largely because of “piping”. In R you could do something like this:
df %>%
  mutate(...) %>%
  group_by(...) %>%
  summarize(...) %>%
  left_join(...) %>%
  filter(...) %>%
  ...
Or if you are using PySpark, you would do something like this:
(df
    .withColumn("new_col", sf.col(...))
    .groupBy(...)
    .agg(...)
    .join(..., how="left")
    .where(...)
)
Hopefully, the problem is clear to you and the content here will help you in writing faster and better code!
Setup/Pre-req
Python Environment with the following libraries:
pandas==1.2.4
numpy==1.20.2
If you are familiar with Docker or have gone through my Docker post, this is my Dockerfile:
FROM continuumio/miniconda3:4.8.2
WORKDIR $HOME/src
COPY requirements.txt $HOME/src
RUN pip install -r requirements.txt
And these are the imports you will need:
import pandas as pd
import numpy as np
import random
Creating DataFrames
First, a quick refresher on creating dataframes.
DF from dict
The most common way to create a dataframe is from a dictionary, where each key is a column name and each value is a list containing that column’s data.
d = {"category": ["a", "b", "c", "d", "e"], "value": [1, 2, 3, 4, 5]}
df = pd.DataFrame(data=d)
# this is also the equivalent
df = pd.DataFrame.from_dict(data=d)
"""
category value
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
"""
DF row wise
If you happen to want each key to be a row in the dataframe:
df_rows = pd.DataFrame.from_dict(data=d, orient="index")
df_rows
"""
0 1 2 3 4
category a b c d e
value 1 2 3 4 5
"""
DF from records
If you are working with the cloud, chances are you are dealing with generators that send you one data point at a time.
records = [
    dict(category="a", value=1),
    dict(category="b", value=2),
    dict(category="c", value=3),
]
df = pd.DataFrame.from_records(data=records)
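Note that from_records also accepts an iterable directly, so you do not need to materialize a list first. A minimal sketch, where record_stream is a hypothetical generator (e.g. wrapping paginated API responses):

def record_stream():
    # hypothetical: yields one record at a time, e.g. from an API page
    for i, cat in enumerate(["a", "b", "c"], start=1):
        yield dict(category=cat, value=i)

df = pd.DataFrame.from_records(record_stream())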
Using Loc
Selecting with Loc
In most pandas code, this is how selection is usually done:
# These return a series
df.category
df["category"]
# This returns a dataframe
df[["category"]]
# This returns every second row of the first five (a slice with a step)
df[0:5:2]
Instead of specifying it this way, you can make use of the loc function. Let’s use the row-wise dataframe df_rows as an example:
df_rows.loc["category"] # this returns a series
df_rows.loc[["category"]] # this returns a data frame
You can also use iloc, which uses positional (integer) indexing instead:
df_rows.iloc[0, 0:4] # still returns a series
df_rows.iloc[[0], 2:5] # double square brackets returns dataframe
This brings forth interesting ways to make use of loc. Suppose you want the category row, restricted to the columns where value is greater than 2, i.e.:
df_rows.loc[[True, False], [False, False, True, True, True]]
"""
2 3 4
category c d e
"""
The following snippets are all equivalent:
df_rows.loc[[True, False], [False, False, True, True, True]]
df_rows.loc[["category"]].iloc[0:1, 2:5]
df_rows.loc[["category"]].loc[:, [False, False, True, True, True]]
(df_rows
    .loc[["category"]]
    .loc[:, lambda df: df.loc["value"] > 2]
)
Subsetting with loc
Typical Pandas code when it comes to filtering by some column values would look like this:
df[df["value"] > 3]
By making use of loc:
df.loc[lambda df: df["value"] > 3]
This, in general, allows you (and future contributors) to read the code better; after all, explicit is better than implicit.
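One concrete benefit: because loc accepts a callable, you can filter on columns created earlier in the same chain, without a temporary variable. A small sketch (double_value is a made-up column name):

(df
    .assign(double_value=lambda x: x["value"] * 2)  # exists only inside the chain
    .loc[lambda x: x["double_value"] > 4]           # filter on the fresh column
)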
There are many other reasons why using loc is better; more information can be found in the references.
Chaining w Groupby
As a data scientist, you regularly need to calculate aggregate statistics, and if you are not using pandas on a regular basis, the resulting indexes will usually catch you off guard.
In order to illustrate, we create a bigger dataframe for the following examples:
random.seed(10)
num_values = 1000
category_one = random.choices(["a", "b", "c", "d", "e"], k=num_values)
category_two = random.choices(["i", "j", "k"], k=num_values)
values_one = random.sample(range(0, 1000), num_values)
values_two = random.sample(range(2000, 3000), num_values)
# Another way of defining it, using a list of lists
# instead of a dict of arrays
df = pd.DataFrame(
    zip(category_one, category_two, values_one, values_two),
    columns=["c1", "c2", "v1", "v2"],
)
df.columns
"""
Index(['c1', 'c2', 'v1', 'v2'], dtype='object')
"""
This dataframe contains two categorical columns, and two numerical columns. Typical group-by transform code will look like this:
df_agg = df.groupby(["c1", "c2"]).agg(["count"])
df_agg.index
"""
MultiIndex([('a', 'i'),
            ('a', 'j'),
            ('a', 'k'),
            ('b', 'i'),
            ('b', 'j'),
            ('b', 'k'),
            ('c', 'i'),
            ('c', 'j'),
            ('c', 'k'),
            ('d', 'i'),
            ('d', 'j'),
            ('d', 'k'),
            ('e', 'i'),
            ('e', 'j'),
            ('e', 'k')],
           names=['c1', 'c2'])
"""
df_agg.columns
"""
MultiIndex([('v1', 'count'),
            ('v2', 'count')],
           )
"""
Dealing with MultiIndexes can be tricky, and is often unwelcome. One way to fix it:
df_agg = df_agg.reset_index()
df_agg.columns = ["group1", "group2", "count_v1", "count_v2"]
This poses a problem: what if you have multiple aggregations? This is how you might handle it:
df_agg = df.groupby(["c1", "c2"]).agg(["count", "min", "max", "mean"])
df_agg.index = df_agg.index.map("_".join)  # flattening the index
df_agg.loc["a_i", :]  # statistics that belong to group (a, i)
"""
v1  count      62.000000
    min        39.000000
    max       991.000000
    mean      513.129032
v2  count      62.000000
    min      2024.000000
    max      2968.000000
    mean     2519.016129
Name: a_i, dtype: float64
"""
# flatten the columns
df_agg = df_agg.reset_index()
df_agg.columns = df_agg.columns.map("_".join)  # flattening the columns
df_agg.loc[lambda df: df["index_"] == "a_i"]
"""
index_  v1_count  v1_min  v1_max     v1_mean  v2_count  v2_min  v2_max      v2_mean
   a_i        62      39     991  513.129032        62    2024    2968  2519.016129
"""
Aside: you can use numpy aggregation functions too.
df_agg = df.groupby(["c1", "c2"]).agg([np.size, np.min, np.max, np.mean])
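Equivalently, pandas accepts string aliases for the common aggregations, which saves the numpy import for this purpose:

df_agg = df.groupby(["c1", "c2"]).agg(["size", "min", "max", "mean"])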
NamedAgg
Instead of worrying about flattening the index(es), and since, as previously mentioned, explicit is better than implicit, this is where NamedAgg is very helpful.
df_agg = df.groupby(["c1", "c2"]).agg(
    v1_size=pd.NamedAgg(column="v1", aggfunc=np.size),
    v1_min=pd.NamedAgg(column="v1", aggfunc=np.min),
    v2_mean=pd.NamedAgg(column="v2", aggfunc=np.mean),
    v2_max=pd.NamedAgg(column="v2", aggfunc=np.max),
    v2_mean_max_diff=pd.NamedAgg(column="v2", aggfunc=lambda x: np.max(x) - np.mean(x)),
)
df_agg.reset_index().head(5)
"""
c1 c2 v1_size v1_min v2_mean v2_max v2_mean_max_diff
0 a i 62 39 2519.016129 2968 448.983871
1 a j 66 16 2500.984848 2984 483.015152
2 a k 72 5 2553.694444 2998 444.305556
3 b i 59 12 2486.033898 2997 510.966102
4 b j 61 7 2538.459016 2996 457.540984
"""
Simplifying NamedAgg
In fact, pandas.NamedAgg is just a namedtuple, and plain tuples are allowed as well. Because of this, you can store the aggregations in a dictionary and unpack it into the aggregation call.
dict_of_agg = {
    "v1_size": ("v1", np.size),
    "v1_min": ("v1", np.min),
    "v2_mean": ("v2", np.mean),
    "v2_max": ("v2", np.max),
    "v2_mean_max_diff": ("v2", lambda x: np.max(x) - np.mean(x)),
}
df_agg = df.groupby(["c1", "c2"]).agg(**dict_of_agg)
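To see that the two spellings are genuinely interchangeable, note that a NamedAgg compares equal to the corresponding plain tuple:

pd.NamedAgg(column="v1", aggfunc=np.size) == ("v1", np.size)  # True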
Transform
When preparing data for modeling, a common data manipulation / feature engineering step is to add group summary statistics to categorical features. You might approach this in the following way:
- Calculate summary statistics based on the categorical feature
- Perform a join after, like so:
df_eg1 = df.copy()
df_group_stats = (
    df_eg1.groupby(["c1", "c2"]).agg(c1_c2_v1_sum=("v1", np.sum)).reset_index()
)
df_eg1 = df_eg1.merge(df_group_stats, on=["c1", "c2"])
It turns out there is a better way to do this, with the transform method:
df_eg2 = df.copy()
df_eg2["c1_c2_v1_sum"] = df_eg2.groupby(["c1", "c2"])["v1"].transform(np.sum)
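The key property that makes this work: transform returns a result with the same length and index as its input, so it can be assigned straight back as a column. A quick check:

out = df.groupby(["c1", "c2"])["v1"].transform(np.sum)
len(out) == len(df)            # True: one value per original row
(out.index == df.index).all()  # True: index preserved, so assignment aligns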
To test that these two results are the same:
# Sort just for comparison
df_eg1 = df_eg1.sort_values(["c1", "c2"]).reset_index(drop=True)
df_eg2 = df_eg2.sort_values(["c1", "c2"]).reset_index(drop=True)
df_eg1.equals(df_eg2)
"""
True
"""
For users coming from SQL-land, think of transform as a window function. You can perform ranking on groups like so:
df.groupby(["c1", "c2"])["v1"].transform("rank")
Another common use case where transform is very useful is imputing missing values based on group membership:
df_eg3 = pd.DataFrame({"c1": ["a", "a", "b", "b"], "v1": [1, np.nan, 2, np.nan]})
df_eg3["v1"] = df_eg3["v1"].fillna(df_eg3.groupby("c1")["v1"].transform("mean"))
"""
c1 v1
0 a 1.0
1 a 1.0
2 b 2.0
3 b 2.0
"""
Similarly, more information can be found in the references section.
Filter
When doing exploratory data analysis, sometimes you only want to analyze a subset of the data fulfilling a certain set of conditions. One way of approaching the problem could be:
- Calculate summary statistics into a new dataframe.
- Filter out keys with summary statistics fulfilling a certain condition.
- Perform a semi join (or anti join, depending on direction) back onto the original data, as sketched below.
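A rough sketch of that three-step approach, assuming we want groups with more than 78 rows (the merge here acts as a semi join):

group_sizes = df.groupby(["c1", "c2"]).size().reset_index(name="n")
big_keys = group_sizes.loc[lambda x: x["n"] > 78, ["c1", "c2"]]
df_subset = df.merge(big_keys, on=["c1", "c2"])  # keep only qualifying groups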
Cutting to the chase instead,
df.groupby(["c1", "c2"]).filter(lambda x: len(x) > 78)
To illustrate:
df_d_j = df_agg.loc[[("d", "j")]].reset_index()
df_d_j
"""
  c1 c2  v1_size  v1_min      v2_mean  v2_max  v2_mean_max_diff
0  d  j       79       2  2505.759494    2981        475.240506
"""
# to verify the results
# Also note, we can reuse the `dict_of_agg`
df_d_j_2 = (
    df.groupby(["c1", "c2"])
    .filter(lambda x: len(x) > 78)
    .reset_index()
    .groupby(["c1", "c2"])
    .agg(**dict_of_agg)
    .reset_index()
)
df_d_j.equals(df_d_j_2)  # True
For users coming from SQL-land, think of filter as a HAVING clause.
Creating New Columns
In the problem intro, we mentioned the re-assignment / hard-coded referencing problem. To reiterate, suppose we want to calculate the ratio v1 / (v1 + v2):
df['sum'] = df['v1'] + df['v2']
df['fraction_v1'] = df['v1'] / df['sum']
If you need to change sum into something else, say total_v1_v2, you would need to change multiple lines of code.
Another problem with this approach: when you want to prototype functions, you will find yourself re-running multiple lines of Python code above, or repeatedly doing df.copy().
To overcome this problem, let’s first understand the assign function.
Assign
One huge advantage of assign is that it returns a new object containing a copy of the original data with the requested changes. The original dataframe remains unchanged!
df.assign(sum=lambda x: x["v1"] + x["v2"], fraction_v1=lambda x: x["v1"] / x["sum"])
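A quick way to convince yourself the original is untouched:

df_new = df.assign(sum=lambda x: x["v1"] + x["v2"])
"sum" in df.columns      # False: df is unchanged
"sum" in df_new.columns  # True: the change lives on the returned copy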
There is still some hard-coding involved with the sum column name; also, the feature transformations can be stored in a dictionary!
value1 = "v1"
value2 = "v2"
feat1_name = "sum"
feat2_name = "fraction_v1"
dict_of_features = {
feat1_name: lambda x: x[value1] + x[value2],
feat2_name: lambda x: x[value1] / x[feat1_name],
}
df.assign(**dict_of_features)
Assign w Apply
You can also use apply with custom functions; here is a trivial example to replicate the above:
def simple_ratio(x, y):
    return x / (x + y)

df.assign(
    **{
        feat2_name: lambda x: x.apply(
            lambda y: simple_ratio(y[value1], y[value2]), axis=1
        )
    }
)
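One caveat worth knowing: apply with axis=1 runs a Python-level loop over the rows and can be slow on large frames. Since this particular ratio is plain arithmetic, a vectorized sketch does the same work column-wise:

df.assign(**{feat2_name: lambda x: x[value1] / (x[value1] + x[value2])})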
Pipe
There still remain two problems unsolved:
- What if we want to make such steps unit-testable?
- What if you love the PySpark API or R’s dplyr API?
Introducing the pipe method.
To replicate the above example of calculating the ratio v1 / (v1 + v2), first create a function as follows:
def add_ratio(df_input, col1, col2):
    return df_input.assign(
        **{
            feat2_name: lambda x: x.apply(
                lambda y: simple_ratio(y[col1], y[col2]), axis=1
            )
        }
    )
This function is easily testable, for example with small fixture dataframes or mocks. To use it for data transformation in your main code / data processing:
# recall that df_input is the first input variable in the add_ratio function
df.pipe(add_ratio, col1=value1, col2=value2)
However, if the first parameter is not the dataframe, you can tell pipe which argument the dataframe should be passed as, like so:
df.pipe((add_ratio, "df_input"), col1=value1, col2=value2).head(5)
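Putting it all together, pipe lets you approximate the dplyr / PySpark style from the intro. A sketch chaining pieces from this post (the 0.2 threshold is arbitrary):

(df
    .pipe(add_ratio, col1=value1, col2=value2)
    .loc[lambda x: x[feat2_name] > 0.2]
    .groupby(["c1", "c2"])
    .agg(**dict_of_agg)
    .reset_index()
)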
Summary
To summarize,
- Using loc
  - for selecting columns
  - for slicing of dataframes
- GroupBy
  - Using NamedAgg
  - Transform
  - Filter
- Assign
  - Assign with dictionaries
  - Assign with apply
- Pipe
Hope this helps you in creating better & faster pandas code!
References
- Using Pandas loc
- NamedAgg
- Transform and Apply with groupby
- Assign