Best way to Impute NAN within Groups — Mean & Mode

Source: Photo by Nate Bell on Unsplash

We know that we can replace NaN values with the mean or median using fillna(). But what if the missing data is correlated with another, categorical column? And what if the missing value is itself categorical? In this article we will learn why, and how, to impute NaN values within groups.

Below are some useful tips for handling NaN values.

The examples use pandas and NumPy.

import pandas as pd
import numpy as np

ngroup

cl = pd.DataFrame({
'team': ['A','A','A','A','A','B','B','B','B','B'],
'class': ['I','I','I','I','I','I','I','II','II','II'],
'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1]})
Data with NaN

Let's assume you have to fill missing values in data on liquor consumption rates. If no other column is relevant, you can simply call fillna on the whole column.

But if each person's age is also given, you may see a pattern between age and consumption rate, because liquor consumption is not at the same level for all people.

Another example: a missing salary value could be related to age, job title and/or education.

In such cases we can impute NaN values within groups, which usually gives better results than imputing from the whole column.
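To make the difference concrete, here is a minimal sketch comparing overall imputation with group-wise imputation. The `title` and `salary` columns and their values are hypothetical, made up only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: salaries differ strongly by job title.
df = pd.DataFrame({
    'title':  ['intern', 'intern', 'intern', 'manager', 'manager', 'manager'],
    'salary': [20.0, np.nan, 22.0, 90.0, 100.0, np.nan],
})

# Overall imputation: every NaN gets the mean of the whole column.
overall = df['salary'].fillna(df['salary'].mean())

# Group-wise imputation: every NaN gets the mean of its own job title.
by_group = df['salary'].fillna(df.groupby('title')['salary'].transform('mean'))

print(pd.DataFrame({'title': df['title'], 'overall': overall, 'by_group': by_group}))
```

With the overall mean, the missing intern salary is filled with 58.0, a value that fits neither group. Within groups, it is filled with 21.0 and the missing manager salary with 95.0.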

In the example above, let's assume that the columns team and class are related to value.

Using ngroup() you can label each group with an integer index.

cl['idx'] = cl.groupby(['team','class']).ngroup()
Data with NaN

Now the groups are easy to see, since each one is labeled with its index. This is helpful when you need to handle data in complex groups, for example a fillna over a group defined by 10 columns.
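As a quick check, the sketch below rebuilds the first example frame and prints the index that ngroup() assigns. The `sizes` summary is an extra step added here for illustration, to show how many rows and how many non-missing values each group has.

```python
import numpy as np
import pandas as pd

cl = pd.DataFrame({
    'team':  ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'class': ['I', 'I', 'I', 'I', 'I', 'I', 'I', 'II', 'II', 'II'],
    'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1],
})

# Each distinct (team, class) pair gets one integer, numbered in sorted key order.
cl['idx'] = cl.groupby(['team', 'class']).ngroup()
print(cl['idx'].tolist())  # [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

# 'size' counts all rows in a group, 'count' skips NaN values.
sizes = cl.groupby('idx')['value'].agg(['size', 'count'])
print(sizes)
```

Groups where size and count differ are the ones that contain NaN values to impute.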

Group by 2 columns and fillna with mean

Let's take the data below:

cl = pd.DataFrame({
'team': ['A','A','A','A','A','B','B','B','B','B'],
'class':['I','I','I','II','II','I','I','II','II','II'],
'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3,1]})
Data with NaN

As discussed earlier, we now want to fill the NaN values with the mean of each team and class group.

# transform returns a result aligned to the original rows, so it can be assigned straight back
cl['value'] = cl.groupby(['team','class'], sort=False)['value'].transform(lambda x: x.fillna(x.mean()))
After Imputation within Groups

For team A and class I, the mean of 1.0 and 2.0 is 1.5, so that group's NaN becomes 1.5. Similarly, for team B and class II the mean of 3.0 and 1.0 is 2.0. You can see that the two null values are imputed with different means (the yellow shaded values), i.e. each one gets the mean of its own group.
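The numbers above can be reproduced with the short sketch below. It uses the same data as this section and fills within groups via transform, which keeps the original row index. The `group_means` table is added here only to make the per-group means visible.

```python
import numpy as np
import pandas as pd

cl = pd.DataFrame({
    'team':  ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'class': ['I', 'I', 'I', 'II', 'II', 'I', 'I', 'II', 'II', 'II'],
    'value': [1, np.nan, 2, 2, 3, 1, 3, np.nan, 3, 1],
})

# Mean of each (team, class) group; NaN values are skipped by default.
group_means = cl.groupby(['team', 'class'])['value'].mean()
print(group_means)

# Fill each NaN with the mean of its own group.
cl['value'] = (
    cl.groupby(['team', 'class'], sort=False)['value']
      .transform(lambda x: x.fillna(x.mean()))
)
print(cl['value'].tolist())
# [1.0, 1.5, 2.0, 2.0, 3.0, 1.0, 3.0, 2.0, 3.0, 1.0]
```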

Group by 2 columns and fillna with mode

Mode does not plug into fillna in quite the same way as mean and median.

Called on a DataFrame, mean and median each return a Series with one value per column. Mode returns a DataFrame, because a column can have more than one mode.

To use mode with fillna we need to make a small change: pick the first mode using iloc[0].

df = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2, 3]})

a = df.mode()
print(a.iloc[0])
print(type(a))

A    1
Name: 0, dtype: int64
<class 'pandas.core.frame.DataFrame'>

df = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2, 3]})

a = df.mean()
print(a)
print(type(a))

A    1.714286
dtype: float64
<class 'pandas.core.series.Series'>
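Inside a groupby on a single column, each group is a Series rather than a DataFrame, so the return types shift by one level. The sketch below, using the same small frame, shows what mean and mode return in both cases and why iloc[0] is needed to get a single value from mode.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2, 3]})
s = df['A']

# On a DataFrame: mean gives a Series, mode gives a DataFrame.
frame_mean = df.mean()
frame_mode = df.mode()
print(type(frame_mean).__name__, type(frame_mode).__name__)  # Series DataFrame

# On a Series: mean gives a scalar, mode gives a Series (ties yield several rows).
series_mode = s.mode()
print(series_mode.tolist())  # [1, 2] because 1 and 2 both occur three times

# iloc[0] picks the first (smallest) mode as a single value usable by fillna.
first_mode = series_mode.iloc[0]
print(first_mode)  # 1
```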

Now let's impute the NaN values with the mode for the data below.

Data
# within each group x is a Series, so x.mode() is a Series and iloc[0] is its first mode
cl['value'] = cl.groupby(['team','class'], sort=False)['value'].transform(lambda x: x.fillna(x.mode().iloc[0]))
After Imputation within Groups

The mode of 1,2,2,3 is 2.
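Mode is the natural choice when the missing value is categorical. The sketch below is one possible way to do it, on hypothetical data: the `color` column and the helper `fill_with_mode` are invented for illustration. The helper also guards against a group with no observed values, where mode() is empty and iloc[0] would raise an IndexError.

```python
import pandas as pd

# Hypothetical data: 'color' is categorical; team C has no observed values at all.
df = pd.DataFrame({
    'team':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'color': ['red', 'red', None, 'blue',
              'green', None, 'green', 'blue',
              None, None],
})

def fill_with_mode(s):
    """Fill missing entries of one group with that group's first mode."""
    modes = s.mode()      # missing values are ignored by default (dropna=True)
    if modes.empty:       # every value is missing: nothing to impute from
        return s
    return s.fillna(modes.iloc[0])

df['color'] = df.groupby('team')['color'].transform(fill_with_mode)
print(df)
```

Teams A and B get their most frequent color, while team C stays missing and can be handled separately, for example with an overall mode.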

Group by 1 column and fillna

Data:

Data
cl1['value'] = cl1.groupby('sec').transform(lambda x: x.fillna(x.mean()))

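The statement below also works. It selects the value column explicitly, so it stays safe when the frame has more columns than sec and value.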

cl1['value'] = cl1.groupby('sec')['value'].transform(lambda x: x.fillna(x.mean()))

Result:

After Imputation within Groups
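A lambda is not strictly required for the mean. As a sketch of an alternative, transform('mean') broadcasts each group's mean onto its rows, and a plain fillna then picks the matching value by index. The frame below is a hypothetical stand-in for cl1, with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cl1: one grouping column 'sec' and one numeric column.
cl1 = pd.DataFrame({
    'sec':   ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'value': [1.0, np.nan, 3.0, 10.0, np.nan, 20.0],
})

# One value per row: the mean of the group that row belongs to.
group_mean = cl1.groupby('sec')['value'].transform('mean')

# fillna aligns on the index, so each NaN receives its own group's mean.
cl1['value'] = cl1['value'].fillna(group_mean)
print(cl1['value'].tolist())  # [1.0, 2.0, 3.0, 10.0, 15.0, 20.0]
```

Swapping 'mean' for 'median' gives the median-based version.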

Conclusion:

In this article we have learned a better way to fill NaN values: group the data first, then apply fillna within each group.

Hope you are excited to practice what we have learned.

We will meet again with a new Python tip. Thank you! 👍

Like to support? Just click the heart icon ❤️.

Happy Programming!🎈
