Pandas Axis for Machine Learning

Python for Machine Learning: Pandas Axis Explained

Pandas is a powerful library in the toolbox of every Machine Learning engineer. It provides two main data structures: Series and DataFrame.

Many API calls on these types accept a cryptic “axis” parameter. The parameter is poorly described in the Pandas documentation, yet it is key to using the library efficiently. The goal of this article is to fill that gap and to provide a solid understanding of what the “axis” parameter is and how to use it in various use cases.

Axis in Series

A Series is a one-dimensional array of values. Under the hood, it uses a NumPy ndarray, which is where the term “axis” comes from. NumPy uses the term frequently because an ndarray can have many dimensions.
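As a minimal sketch of that relationship (the values are illustrative and not taken from the original article), the snippet below pulls the NumPy array out of a Series and compares the number of dimensions with a three-dimensional ndarray:

```python
import numpy as np
import pandas as pd

# Illustrative values; any numeric list would do.
s = pd.Series([10, 20, 30, 40, 50])

# to_numpy() exposes the values as a NumPy ndarray.
values = s.to_numpy()
print(type(values).__name__)

# A Series always has exactly one dimension, hence a single axis.
print(s.ndim)

# An ndarray can have more dimensions, so NumPy needs numbered axes.
cube = np.zeros((2, 3, 4))
print(cube.ndim)
```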

A Series object has only “axis 0” because it has only one dimension. The arrow in the image shows “axis 0” and its direction for the Series object.

Usually, in Python, one-dimensional structures are displayed as a row of values. By contrast, a Series is displayed as a column of values.

Each cell in a Series is accessible via its index value along “axis 0”. For our Series object the index values are 0, 1, 2, 3, 4. Here is an example of accessing different values:
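A minimal sketch of such access, assuming a Series with five illustrative values (the article's original values are not available) and the default integer index 0 to 4:

```python
import pandas as pd

# Illustrative values; pandas assigns the index 0, 1, 2, 3, 4 automatically.
s = pd.Series([10, 20, 30, 40, 50])

first = s[0]        # look up the value stored under index 0
last = s[4]         # look up the value stored under index 4
middle = s.loc[2]   # .loc makes the label-based lookup explicit

print(first, middle, last)
```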

Axes in DataFrame

A DataFrame is a two-dimensional data structure akin to an SQL table or an Excel spreadsheet. It has rows and columns, and each column is a separate Series object. Let’s see an example:
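The sketch below builds a small DataFrame from two Series. The column names 'a' and 'b' follow the article; the values are assumptions chosen for illustration.

```python
import pandas as pd

# Two illustrative Series that share the default index 0..4.
a = pd.Series([1, 2, 3, 4, 5])
b = pd.Series([10, 20, 30, 40, 50])

# Each dictionary key becomes a column name.
df = pd.DataFrame({'a': a, 'b': b})
print(df)

# Selecting a single column gives back a Series.
column_a = df['a']
print(type(column_a).__name__)

# shape is (number of rows, number of columns).
print(df.shape)
```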

A DataFrame object has two axes: “axis 0” and “axis 1”. “axis 0” represents rows and “axis 1” represents columns. Series and DataFrame therefore share the same direction for “axis 0”: it runs down the rows.

Our DataFrame object has the index values 0, 1, 2, 3, 4 along “axis 0”, and additionally it has “axis 1” index values, which are 'a' and 'b'.

To access an element within a DataFrame we need to provide two index values (one per axis). Also, instead of bare brackets, we need to use the .loc indexer:
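A minimal sketch of element access with .loc, using an illustrative DataFrame with columns 'a' and 'b' (the values are assumptions):

```python
import pandas as pd

# Illustrative data with row labels 0..4 and column labels 'a' and 'b'.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})

# First the "axis 0" label (row), then the "axis 1" label (column).
value = df.loc[2, 'b']
print(value)

# Giving only the row label returns the whole row as a Series.
row = df.loc[0]
print(row['a'], row['b'])
```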

Using “axis” parameter in API calls

Many API calls on Series and DataFrame objects accept the “axis” parameter. A Series object has only one axis, so this parameter always equals 0 for it. You can therefore omit it, because it does not affect the result:
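For instance, with an illustrative Series (the values are assumptions), calling .sum with and without the parameter gives the same number:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])  # illustrative values

total_default = s.sum()       # axis omitted
total_axis0 = s.sum(axis=0)   # axis given explicitly

print(total_default, total_axis0)
```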

By contrast, a DataFrame has two axes, and the “axis” parameter determines along which axis an operation is performed. For example, .sum can be applied along “axis 0”. That means .sum calculates a sum for each column:
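A sketch of the column-wise sum on an illustrative DataFrame (the values are assumptions, not the article's original data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})

# Summing along "axis 0" walks down the rows and yields one value per column.
column_sums = df.sum(axis=0)
print(column_sums)
```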

We see that sum with axis=0 collapsed all values along the direction of “axis 0” and left only the columns ('a' and 'b') with their sums.

With axis=1 it produces a sum for each row:
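The same illustrative DataFrame (values are assumptions) summed along “axis 1” might look like this:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})

# Summing along "axis 1" walks across the columns and yields one value per row.
row_sums = df.sum(axis=1)
print(row_sums)
```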

If you prefer regular names instead of numbers, each axis has a string alias. “axis 0” has two aliases: 'index' and 'rows'. “axis 1” has only one: 'columns'. You can use these aliases instead of numbers:
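As a short sketch on illustrative data, the aliases 'index' and 'columns' (the two that appear in the pandas documentation for the axis parameter) give the same results as the numbers:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})

# 'index' is an alias for axis 0: one sum per column.
by_index = df.sum(axis='index')

# 'columns' is an alias for axis 1: one sum per row.
by_columns = df.sum(axis='columns')

print(by_index.equals(df.sum(axis=0)))
print(by_columns.equals(df.sum(axis=1)))
```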

Dropping NaN values

Let’s build a simple DataFrame with NaN values and observe how axis affects the .dropna method:
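A minimal sketch, assuming a three-column DataFrame in which 'a' and 'b' each contain one NaN and 'c' contains none (the values themselves are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, 6.0],
    'c': [7.0, 8.0, 9.0],
})

# axis=0 is the default: drop every row that holds at least one NaN.
rows_kept = df.dropna(axis=0)
print(rows_kept)
```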

Here .dropna filters out every row (we are moving along “axis 0”) that contains a NaN value.

Let’s use “axis 1” direction:
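A sketch of the column-wise variant, on an illustrative DataFrame in which columns 'a' and 'b' contain NaN values and column 'c' does not (the values are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, 5.0, 6.0],
    'c': [7.0, 8.0, 9.0],
})

# axis=1: drop every column that holds at least one NaN.
columns_kept = df.dropna(axis=1)
print(columns_kept)
```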

Now .dropna worked along “axis 1” and removed all columns with NaN values. Columns 'a' and 'b' contained NaN values, so only column 'c' was left.

Concatenation

The concatenation function pd.concat with axis=0 stacks the first DataFrame on top of the second:
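A minimal sketch with two small illustrative DataFrames (the names df1 and df2 and their values are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# axis=0: the rows of df2 are appended below the rows of df1.
stacked = pd.concat([df1, df2], axis=0)
print(stacked)
```

Note that the original row labels are kept, so the result has the index 0, 1, 0, 1. Passing ignore_index=True to pd.concat renumbers the rows if that is not wanted.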

With axis=1 the two DataFrames are placed side by side:
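The same two illustrative DataFrames (names and values are assumptions) concatenated along “axis 1”:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# axis=1: the columns of df2 are placed to the right of the columns of df1,
# and rows are matched by their index labels.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```

The column labels are kept as they are, so the result has the columns 'a', 'b', 'a', 'b'.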

Summary

Pandas borrowed the “axis” concept from the NumPy library. The “axis” parameter has no effect on a Series object because a Series has only one axis. By contrast, the DataFrame API relies heavily on the parameter: a DataFrame is a two-dimensional data structure, and many operations can be performed along different axes, producing totally different results.