Pandas DataFrame Query Method: 11 Syntax Options

While using Pandas in Python, it's common to filter data with df.query() method. But there are so many options of the syntax and parameters! In this tutorial, we will try to show the most important ones.

Simple Check

One of the simplest ways to filter data is to use a comparison operator:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query("Gender == 'Male'")

This will return all rows where the Gender column is equal to Male:

   Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
0          1   Male  27.0  ...         77         4200             NaN
1          2   Male  28.0  ...         75        10000             NaN
2          3   Male  28.0  ...         75        10000             NaN
3          4   Male  28.0  ...         85         3000     Sleep Apnea
4          5   Male  28.0  ...         85         3000     Sleep Apnea
 
[5 rows x 13 columns]

Checking Against Variables

In case you have a dynamic value, you can use a variable in the query:

df = pandas.read_csv('healthdata.csv')
 
value = 'Sleep Apnea'
df2 = df.query("`Sleep Disorder` == @value")

This will return all rows where the Sleep Disorder column is equal to the value of the value variable:

    Person ID  Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
3           4    Male  28.0  ...         85         3000     Sleep Apnea
4           5    Male  28.0  ...         85         3000     Sleep Apnea
16         17  Female  29.0  ...         80         4000     Sleep Apnea
17         18    Male  29.0  ...         70         8000     Sleep Apnea
30         31  Female  30.0  ...         78         4100     Sleep Apnea
 
[5 rows x 13 columns]

Fields With Spaces

Now, if your field has a space in the name, like in our previous example - you can use backticks to escape the field name:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query("`Sleep Disorder` == 'Sleep Apnea'")

This is because without the backticks, the query would fail:

Multiple Conditions

Next, let's see how we can use multiple conditions:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query("Gender == 'Male' and `Age` >= 29")

This has allowed us to filter the data to only include rows where the Gender is Male and the Age is greater than or equal to 29:

    Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
6           7   Male  29.0  ...         82         3500        Insomnia
7           8   Male  29.0  ...         70         8000             NaN
8           9   Male  29.0  ...         70         8000             NaN
9          10   Male  29.0  ...         70         8000             NaN
10         11   Male  29.0  ...         70         8000             NaN
 
[5 rows x 13 columns]

Of course, you can use or as well:

df2 = df.query("`Heart Rate` >= 80 or `Age` <= 27")

This gives us all rows where the Heart Rate is greater than or equal to 80 or the Age is less than or equal to 27:

   Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
0          1   Male  27.0  ...         77         4200             NaN
3          4   Male  28.0  ...         85         3000     Sleep Apnea
4          5   Male  28.0  ...         85         3000     Sleep Apnea
5          6   Male  28.0  ...         85         3000        Insomnia
6          7   Male  29.0  ...         82         3500        Insomnia
 
[5 rows x 13 columns]

Multiple Conditions Faster

But what if we want to use a range condition on a column? We can write it like this:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query("30 <= Age and Age < 50")

It filters our entries:

    Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
19         20   Male  30.0  ...         70         8000             NaN
20         21   Male  30.0  ...         70         8000             NaN
21         22   Male  30.0  ...         70         8000             NaN
22         23   Male  30.0  ...         70         8000             NaN
23         24   Male  30.0  ...         70         8000             NaN
 
[5 rows x 13 columns]

But there's a better way to write this - combining the conditions into a single one:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query('30 <= Age < 50')

This will give us the same result but with more readable code.

Using the `in` Operator

Sometimes, you want to filter your data based on a list of values. You can do this using the in operator:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query("`Sleep Disorder` in ['Sleep Apnea', 'Insomnia']")

This will give us all rows where the Sleep Disorder column is either Sleep Apnea or Insomnia:

    Person ID  Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
3           4    Male  28.0  ...         85         3000     Sleep Apnea
4           5    Male  28.0  ...         85         3000     Sleep Apnea
5           6    Male  28.0  ...         85         3000        Insomnia
6           7    Male  29.0  ...         82         3500        Insomnia
16         17  Female  29.0  ...         80         4000     Sleep Apnea
 
[5 rows x 13 columns]

This also supports variables and not in operator:

list = ['Sleep Apnea', 'Insomnia']
df2 = df.query("`Sleep Disorder` not in @list")

It's a great way to build some logic into your queries.

Using F-Strings

Another way to use variables in queries is to use f-strings:

df = pandas.read_csv('healthdata.csv')
 
value = 30
df2 = df.query(f'Age < {value}')

This does the same thing as before, but you don't need to add the @ symbol:

   Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
0          1   Male  27.0  ...         77         4200             NaN
1          2   Male  28.0  ...         75        10000             NaN
2          3   Male  28.0  ...         75        10000             NaN
3          4   Male  28.0  ...         85         3000     Sleep Apnea
4          5   Male  28.0  ...         85         3000     Sleep Apnea
 
[5 rows x 13 columns]

Math Actions

Next, we want to show you how to use math in queries. Let's say we want to filter our data based on a math operation:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query('Age <= `Heart Rate` / 3')

This compares our Age column to the result of the Heart Rate column divided by 3:

   Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
3          4   Male  28.0  ...         85         3000     Sleep Apnea
4          5   Male  28.0  ...         85         3000     Sleep Apnea
5          6   Male  28.0  ...         85         3000        Insomnia
 
[3 rows x 13 columns]

We can also write this as multiplication:

df2 = df.query('Age <= `Heart Rate` * 0.33')

And it will still give us the same result.

String Method Matches

Our following example involves string methods. Let's say we want to filter our data based on a string method and only take rows where the Blood Pressure column ends with /80:

df = pandas.read_csv('healthdata.csv')
 
# String methods:
df2 = df.query('`Blood Pressure`.str.endswith("/80")')

This will give us all rows where the Blood Pressure column ends with /80:

   Person ID Gender   Age Occupation  Sleep Duration  Quality of Sleep  \
1          2   Male  28.0     Doctor             6.2                 6
2          3   Male  28.0     Doctor             6.2                 6
7          8   Male  29.0     Doctor             7.8                 7
8          9   Male  29.0     Doctor             7.8                 7
9         10   Male  29.0     Doctor             7.8                 7
 
   Physical Activity Level  Stress Level BMI Category Blood Pressure  \
1                       60             8       Normal         125/80
2                       60             8       Normal         125/80
7                       75             6       Normal         120/80
8                       75             6       Normal         120/80
9                       75             6       Normal         120/80
 
   Heart Rate  Daily Steps Sleep Disorder
1          75        10000            NaN
2          75        10000            NaN
7          70         8000            NaN
8          70         8000            NaN
9          70         8000            NaN

Note: We have used pandas.set_option('display.max_columns', None) to display all columns in the output.

This can be very useful when you filter your data based on some part of strings or apply any other string method.

Take Non-Null Values

Pandas also allows you to filter by not null values:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query('`Sleep Disorder`.notnull()')

This gives us all rows where the Sleep Disorder column is not null:

    Person ID  Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
3           4    Male  28.0  ...         85         3000     Sleep Apnea
4           5    Male  28.0  ...         85         3000     Sleep Apnea
5           6    Male  28.0  ...         85         3000        Insomnia
6           7    Male  29.0  ...         82         3500        Insomnia
16         17  Female  29.0  ...         80         4000     Sleep Apnea
 
[5 rows x 13 columns]

As you can see, our data starts at index 3 because our original source looks like this:

Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,None
2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,None
3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,None
4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
6,Male,28,Software Engineer,5.9,4,30,8,Obese,140/90,85,3000,Insomnia
7,Male,29,Teacher,6.3,6,40,7,Obese,140/90,82,3500,Insomnia
8,Male,29,Doctor,7.8,7,75,6,Normal,120/80,70,8000,None
9,Male,29,Doctor,7.8,7,75,6,Normal,120/80,70,8000,None
10,Male,29,Doctor,7.8,7,75,6,Normal,120/80,70,8000,None
...

This is a great way to filter out incomplete rows and only work with complete data.

Reversing the Condition

You can also reverse the condition:

df2 = df.query('`Sleep Disorder`.isnull()')

This will give us all rows where the Sleep Disorder column is null:

   Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
0          1   Male  27.0  ...         77         4200             NaN
1          2   Male  28.0  ...         75        10000             NaN
2          3   Male  28.0  ...         75        10000             NaN
7          8   Male  29.0  ...         70         8000             NaN
8          9   Male  29.0  ...         70         8000             NaN
 
[5 rows x 13 columns]

Which can be used to quickly create two datasets - one with complete data and one with incomplete data.

Targeting Index - Not Columns

When using the pandas query, you can also target the index of the rows, not just the columns:

df = pandas.read_csv('healthdata.csv')
 
df2 = df.query('index % 2 == 0')

This will give us every second row:

   Person ID Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
0          1   Male  27.0  ...         77         4200             NaN
2          3   Male  28.0  ...         75        10000             NaN
4          5   Male  28.0  ...         85         3000     Sleep Apnea
6          7   Male  29.0  ...         82         3500        Insomnia
8          9   Male  29.0  ...         70         8000             NaN

This can be achieved by using the index keyword in your query, as it's a special keyword that targets the index of the rows.

Bonus: Different Syntax

As a last example, we wanted to show you another way people do this in tutorials and documentation. This does not use the query method but still filters the data:

df = pandas.read_csv('healthdata.csv')
 
df2 = df[df['Age'] > 30]

This will filter the data to only include rows where the Age column is greater than 30:

    Person ID  Gender   Age  ... Heart Rate  Daily Steps  Sleep Disorder
32         33  Female  31.0  ...         69         6800             NaN
33         34    Male  31.0  ...         72         5000             NaN
34         35    Male  31.0  ...         70         8000             NaN
35         36    Male  31.0  ...         72         5000             NaN
36         37    Male  31.0  ...         72         5000             NaN
 
[5 rows x 13 columns]

Performance

One thing to keep in mind is performance. As with everything, method calls have different performance expectations, and you should always test your code to see which method is the fastest for your use case.

In our case, we have tested the performance of the query method and the [] method like this:

main.py

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
df2 = df[df['Age'] > 30]

main2.py

import pandas
 
df = pandas.read_csv('healthdata.csv')
 
df2 = df.query('Age > 30')

Then, we used the time console command to measure the performance. This is the result:

As you can see, the query method is faster in this case. But this might not be the case for your data, so always test your code to see which method/approach is the fastest.