Pandas is a powerful library in a toolbox for every Machine Learning engineer. It provides two main data structures: Series and DataFrame.
Many API calls of these types accept cryptical “axis” parameter. This parameter is poorly described in Pandas’ documentation, though it has a key significance for using the library efficiently. The goal of the article is to fill in this gap and to provide a solid understanding of what the “axis” parameter is and how to use it in various use cases including leading-edge artificial intelligence applications.
Axis in Series
Series is a one-dimensional array of values. Under the hood, it uses NumPy ndarray. That is where the term “axis” came from. NumPy uses it quite frequently because ndarray can have a lot of dimensions.
Series object has only “axis 0” because it has only one dimension.
The arrow on the image displays “axis 0” and its direction for the Series object.
Usually, in Python, one-dimensional structures are displayed as a row of values. On the contrary, here we see that Series is displayed as a column of values.
Each cell in Series is accessible via index value along the “axis 0”. For our Series object indexes are: 0, 1, 2, 3, 4. Here is an example of accessing different values:
>>> import pandas as pd >>> srs = pd.Series(['red', 'green', 'blue', 'white', 'black']) >>> srs[0] 'red' >>> srs[3] 'white'
Axes in DataFrame
DataFrame is a two-dimensional data structure akin to SQL table or Excel spreadsheet. It has columns and rows. Its columns are made of separate Series objects. Let’s see an example:
A DataFrame object has two axes: “axis 0” and “axis 1”. “axis 0” represents rows and “axis 1” represents columns. Now it’s clear that Series and DataFrame share the same direction for “axis 0” – it goes along rows direction.
Our DataFrame object has 0, 1, 2, 3, 4 indexes along the “axis 0”, and additionally, it has “axis 1” indexes which are: ‘a’ and ‘b’.
To access an element within DataFrame we need to provide two indexes (one per each axis). Also, instead of bare brackets, we need to use .loc method:
>>> import pandas as pd >>> srs_a = pd.Series([1,3,6,8,9]) >>> srs_b = pd.Series(['red', 'green', 'blue', 'white', 'black']) >>> df = pd.DataFrame({'a': srs_a, 'b': srs_b}) >>> df.loc[2, 'b'] 'blue' >>> df.loc[3, 'a'] 8
Using “axis” parameter in API calls
There are a lot of different API calls for Series and DataFrame objects which accept “axis” parameter. Series object has only one axis, so this parameter always equals 0 for it. Thus, you can omit it, because it does not affect the result:
>>> import pandas as pd >>> srs = pd.Series([1, 3, pd.np.nan, 4, pd.np.nan]) >>> srs.dropna() 0 1.0 1 3.0 3 4.0 dtype: float64 >>> srs.dropna(axis=0) 0 1.0 1 3.0 3 4.0 dtype: float64
On the contrary, DataFrame has two axes, and “axis” parameter determines along which axis an operation should be performed. For example, .sum can be applied along “axis 0”. That means, .sum operation calculates a sum for each column:
>>> import pandas as pd >>> srs_a = pd.Series([10,30,60,80,90]) >>> srs_b = pd.Series([22, 44, 55, 77, 101]) >>> df = pd.DataFrame({'a': srs_a, 'b': srs_b}) >>> df a b 0 10 22 1 30 44 2 60 55 3 80 77 4 90 101 >>> df.sum(axis=0) a 270 b 299 dtype: int64
We see, that having sum with axis=0 smashed all values along the direction of the “axis 0” and left only columns(‘a’ and ‘b’) with appropriate sums.
With axis=1 it produces a sum for each row:
>>> df.sum(axis=1) 0 32 1 74 2 115 3 157 4 191 dtype: int64
If you prefer regular names instead of numbers, each axis has a string alias. “axis 0” has two aliases: ‘index’ and ‘rows’. “axis 1” has only one: ‘columns’. You can use these aliases instead of numbers:
>>> df.sum(axis='index') a 270 b 299 dtype: int64 >>> df.sum(axis='rows') a 270 b 299 dtype: int64 >>> df.sum(axis='columns') 0 32 1 74 2 115 3 157 4 191 dtype: int64
Dropping NaN values
Let’s build a simple DataFrame with NaN values and observe how axis affects .dropna method:
>>> import pandas as pd >>> import numpy as np >>> df = pd.DataFrame({'a': [2, np.nan, 8, 3], 'b': [np.nan, 32, 15, 7], 'c': [-3, 5, 22, 19]}) >>> df a b c 0 2.0 NaN -3 1 NaN 32.0 5 2 8.0 15.0 22 3 3.0 7.0 19 >>> df.dropna(axis=0) a b c 2 8.0 15.0 22 3 3.0 7.0 19
Here .dropna filters out any row(we are moving along “axis 0”) which contains NaN value.
Let’s use “axis 1” direction:
>>> df.dropna(axis=1) c 0 -3 1 5 2 22 3 19
Now .dropna collapsed “axis 1” and removed all columns with NaN values. Columns ‘a’ and ‘b’ contained NaN values, thus only ‘c’ column was left.
Concatenation
Concatenation function with axis=0 stacks the first DataFrame over the second:
>>> import pandas as pd >>> df1 = pd.DataFrame({'a': [1,3,6,8,9], 'b': ['red', 'green', 'blue', 'white', 'black']}) >>> df2 = pd.DataFrame({'a': [0,2,4,5,7], 'b': ['jun', 'jul', 'aug', 'sep', 'oct']}) >>> pd.concat([df1, df2], axis=0) a b 0 1 red 1 3 green 2 6 blue 3 8 white 4 9 black 0 0 jun 1 2 jul 2 4 aug 3 5 sep 4 7 oct
With axis=1 both DataFrames are put along each other:
>>> pd.concat([df1, df2], axis=1) a b a b 0 1 red 0 jun 1 3 green 2 jul 2 6 blue 4 aug 3 8 white 5 sep 4 9 black 7 oct
Summary
Pandas borrowed the “axis” concept from NumPy library. The “axis” parameter does not have any influence on a Series object because it has only one axis. On the contrary, DataFrame API heavily relies on the parameter, because it’s a two-dimensional data structure, and many operations can be performed along different axes producing totally different results.