Pandas DataFrames: Search and Fill/Drop Empty Values

7 minutes read

When working on Dataset, it is common to find empty columns or even rows. This can happen for various reasons, but do you know how to search for them and replace them with Pandas DataFrame?

dataframeemptyrowsexample.png


Finding Empty Rows and Columns

To look for empty rows and columns, we have a few options:

  • Using isnull() method with .sum() or .any() method
  • Calculating the percentage of empty rows and columns
  • Using isnull() as a mask to filter empty rows

Let's take a look at each of them individually.

Example 1: Searching with isnull()

This is probably the most common way to look for empty columns, as it is simple and easy to use:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
print(df.isnull().sum())

This will return the number of empty rows for each column in the DataFrame. You can read it as "for each column, how many empty rows are there?".

But if you want to know if there are any empty rows at all, you can use the .any() method:

# ...
print(df.isnull().any())

This will return a boolean value for each column, indicating if there are any empty rows in it.

You can also see people using code that looks like this:

# ...
print(df.isna().sum())
print(df.isna().any())

This is because isnull() and isna() are the same method. To be exact, isnull() is an alias for isna().

You can take a look at an example of this online in Kaggle - Sleep Health Lifestyle Eda Visualization

Example 2: Calculating the Percentage of Empty Rows

Another way to look for empty rows is to calculate how many are empty. This can be done with a simple for loop:

  • Loop over each column
  • Calculate the percentage of empty rows
  • If the percentage is greater than 0, print the column name and the percentage
import pandas
 
df = pandas.read_csv('healthdata.csv')
 
for i in df.columns:
null_rate = df[i].isnull().sum() / len(df) * 100
if null_rate > 0:
print(f"{i} null rate: %{null_rate:.2f}")

This will return the name of the column and the percentage of empty rows in it:

This is an excellent way to determine how accurate your Dataset might be. If you see a column with a high percentage of empty rows, you should look closely at it and decide if you want to keep it.

You can take a look at an example of this online in Kaggle - Netflix Data Visualization

Example 3: Using isnull() as a Mask

Now that we know how to search for empty columns, we should also know how to filter empty rows. This can be done by using the isnull() method as a mask:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
null_mask = df.isnull().all(axis=1)
null_rows = df[null_mask]
print(null_rows)

Once we run this code, we will get a DataFrame with only the empty rows:

This is an excellent way to select only the empty rows and decide what to do with them. You might want to check the data source and see if you can fill them with some data or if you should drop them.


Replacing Empty Columns

Now that we know how to search for empty rows and columns, we should also know how to replace them. Here are a few ways to do it:

  • Using the fillna() method to fill empty rows with a value
  • Using the dropna() method to drop empty rows and columns

Example 1: Using fillna() - Filling All Empty Columns

Most of the time, we need to take empty rows and fill them with some value. This can be anything, from a simple 0 to a more complex value. Here's how to do it:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
df.fillna(0, inplace=True)

This will fill all empty rows with the value 0:

But this changes the original DataFrame because we used the inplace=True parameter. Suppose you don't want to change the original DataFrame. In that case, you can use the inplace=False parameter or assign the result to a new variable:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
new_df = df.fillna(0)

This will create a new DataFrame with the empty rows filled with 0, and the original DataFrame will remain unchanged.

Example 2: Using fillna() - Filling Specific Columns

If filling all empty rows with the same value is not what you want, you can also fill specific columns with specific values:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
df['Sleep Disorder'].fillna('No', inplace=True)

This will fill all empty rows in the Sleep Disorder column with the value No:

You can also use the inplace=False parameter or assign the result to a new variable:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
new_df = df['Sleep Disorder'].fillna('No')

Example 3: Using dropna() - Dropping Empty Rows and Columns

You can drop them if you don't want to fill empty rows. This can be done with the dropna() method:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
df.dropna(inplace=True)

This will drop all empty rows and columns from the DataFrame:

Just like the fillna() method, you can use the inplace=False parameter or assign the result to a new variable:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
new_df = df.dropna()

Wait, What is Empty? None vs NaN vs Empty String

When we talk about empty rows, we have a few options to consider:

  • None - This is the Python way to represent an empty value
  • NaN - This is the Pandas' way to represent an empty value
  • '' - This is an empty string, which is different from None and NaN

Let's look at how this looks in a CSV file:

"",,,,,,,,,
1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,''
2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,NaN

Can you spot all empty values? Let's take a look at them:

  • First line - Both "" and ,,,,, are empty values
  • Second line - The last value is an empty string - not an empty value!
  • Third line - The last value is None which is an empty value
  • Fourth line - The last value is NaN, which is an empty value

Let's take a look at how Pandas see them:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
print(df.head())

This will return the following DataFrame:

Now, if we try to replace them with a value, we can see how they are different:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
df.fillna(0, inplace=True)

This will fill all empty values with 0:

But wait, what is this? The last value in the second line is still an empty string ''! Pandas see it as a different value from None and NaN. This is why knowing what you are working with and how to handle it is essential.

Let's handle it:

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
df.fillna(0, inplace=True)
df['Sleep Disorder'].replace(to_replace="''", value=0, inplace=True)

This will replace '' with 0:



No comments or questions yet...