Pivot Tables in Python With Pandas

Advertisements

To introduce Python Pivot Tables, I will use open-source Adult data that can be easily accessed here. It has information about people and their characteristics such as race, gender, income, etc. You can download original data and the column names using this GitHub repository

Support Us

BTC:bc1qsg4s7gmxj88ztnux6en577anzuvm0rvjhne8cp

ETH:0xBB0f503e443F2b2646785B014A951Fc8AAd9561E

Donation

Firstly, we need to read the data (as it does not have internal column names, we need to use another CSV file to read the columns and put them into the data frame).

import pandas as pd
import csv

with open('col_name.csv', newline='') as f:
reader = csv.reader(f)
cols = list(reader)

def flatten(t):
return [item for sublist in t for item in sublist]

cols = flatten(cols)

data = pd.read_csv("adults.csv", names=cols)

For this article’s first Python pivot table, I want to determine the maximum age of each sex. To do so, I will write the pandas function for the pivot table.

It is stated as pd.Dataframe.pivot_table and has the following crucial parameters (there are some more of them, but they are not used in most of the cases):

  • index: single column or list of columns by which the data will be grouped.
  • values: single column or list of columns aggregated and shown in the resulting pivot table.
  • columns: column, by which we will (simultaneously with index columns) group the data.
  • aggfunc: aggregation function or list of functions for the values in the pivot table.

The resulting code for our question is:

data.pivot_table(index="sex", values="age", aggfunc="max")
Resulting Pivot Table

Adding complexity

The next pivot table will be a multi-index one. What about determining what age patterns exist in the data? We will use race and sex as the grouping columns and use pd.Dataframe.describe function to aggregate the age. Thus, we will see a lot of statistics related to age in almost no time and effort.

Now, we need to use a list of columns for the index parameter of the function. The code for this table is below.

data.pivot_table(index=["race", "sex"], values="age", aggfunc="describe")
Resulting Pivot Table

We already covered the columns parameter in this article but let’s see how it can be helpful in a specific situation.

Let’s say, we want to use the same grouping columns as before (race and sex) as well as add salary (it is a binary variable with values of less or equal to 50K dollars or more than 50 K dollars) to it. Now, I want to calculate the number of people in each category (based on gender, race, and income).

It’s worth noting that for only counting the number of people in each cell (and not some mathematical operation such as mean or median), I can use any column which has neither missing values nor has an index parameter as the value. In our case, all columns have no NaNs, so I can pick either one. For the sake of convenience, I’ve decided to stay with the age column.

Of course, I can write commands as easy as this one:

data.pivot_table(index=["race", "sex", "salary"], values="age", aggfunc="count")
Resulting Pivot Table

Whoops! The table is very long and sophisticated and it isn’t easy to see any patterns in it.

Gladly, there is a solution. We can use the columns parameter for one of the index columns. Even though you can use any column in the columns parameter, I recommend sticking to one such that has the least unique values not to make the table too large.

Here, I can either choose gender or income. But income seems like the better option to separate from race and gender as logically it is not a predetermined biological feature of a person.

The new code looks like that:

data.pivot_table(index=["race", "sex"], columns="salary", values="age", aggfunc="count")
Resulting Pivot Table

It seems much better. Now, you already know how to create good-looking pivot tables, so there is time to learn some advanced features of Pandas Pivot Tables.

Source

Leave a comment

Your email address will not be published. Required fields are marked *