Python Collections module: Practical Use-Cases


Python is an excellent language for working with datasets. But did you know that Python ships with a built-in module, collections, that provides specialized container datatypes? Take a look at the following example of how it can simplify your code:
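Here is a hypothetical stand-in for that example (the word list is made up): counting items by hand takes six lines, while collections.Counter needs just two:

```python
from collections import Counter

words = ["red", "blue", "red", "green", "blue", "red"]

# Manual counting: six lines of boilerplate
counts = {}
for word in words:
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

# The same result with Counter: two lines
counts = Counter(words)
print(counts)  # Counter({'red': 3, 'blue': 2, 'green': 1})
```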

We went from 6 lines of code to just 2! But that's not all. The collections module includes several other useful classes, and we can look at how to use each of them.

In this tutorial, we will cover the following classes from the collections module:

  • Counter
  • defaultdict
  • deque
  • namedtuple

Counter - Simplified Counting

It is common to count things in data projects, from counting how many times a word appears in a text (for ML projects) to counting the distinct values in a list. In our example, we work with a health dataset stored in a CSV file, and we load one of its columns into a simple list:

import csv

def dataset(column_index):
    data = []

    with open('data/healthdata.csv', 'r') as f:
        contents = csv.reader(f)
        next(contents)  # Skip the header
        for row in contents:
            data.append(row[column_index])

    return data

We want to count how many Males and Females we have in our dataset. We can do this manually:

# ...

def count_manually(data):
    count = {}
    for element in data:
        count[element] = count.get(element, 0) + 1
    return count

print(count_manually(dataset(1)))  # We want to count the "Gender" column from our CSV file

# Output:
# {'Male': 189, 'Female': 184}

Or we can use the Counter class from the collections module:

from collections import Counter
 
# ...
 
print(Counter(dataset(1)).most_common()) # We want to count "Gender" column from our CSV file
 
# Output:
# [('Male', 189), ('Female', 184)]

This helped us reduce the code complexity and made it easier to understand. But what if we want to count the number of different professions in our dataset? Here's how this looks with Counter:

# ...
 
print(Counter(dataset(3)).most_common()) # We want to count "Occupation" column from our CSV file

And we should get the following output:

[('Nurse', 72), ('Doctor', 71), ('Engineer', 63), ('Lawyer', 46), ('Teacher', 40), ('Accountant', 37), ('Salesperson', 32), ('Software Engineer', 4), ('Scientist', 4), ('Sales Representative', 2), ('', 1), ('Manager', 1)]

Notice the ('', 1) entry: one row has an empty Occupation value. Finally, we can take the top 5 professions in our dataset:

print(Counter(dataset(3)).most_common(5)) # We want to count "Occupation" column from our CSV file and limit it to top 5

And we should get the following output:

[('Nurse', 72), ('Doctor', 71), ('Engineer', 63), ('Lawyer', 46), ('Teacher', 40)]

It's that simple! We did not have to write a single loop ourselves; the Counter class was designed to do exactly this for us.
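Counter objects also support arithmetic, which is useful for merging counts from separate batches of data. This is not part of the dataset code above, just a small illustrative sketch:

```python
from collections import Counter

# Counters support + and -, which is handy for combining
# counts collected from separate batches of data.
morning = Counter({'Male': 10, 'Female': 12})
evening = Counter({'Male': 5, 'Female': 3})

combined = morning + evening
print(combined['Male'])     # 15

remaining = morning - evening
print(remaining['Female'])  # 9
```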


defaultdict - Dictionary with Default Values

Working with dict is common in Python. Unfortunately, so is running into the following error if you are not careful:

Traceback (most recent call last):
  File "/python/python-collections/defaultdict.py", line 27, in <module>
    print(dataset(use_default_dict=False)['1555112'])
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: '1555112'

This happens because we are trying to access a key that does not exist in our dictionary. We can solve this using the defaultdict class from the collections module. Here's how we can use it:

from collections import defaultdict
import csv


def dataset():
    def print_default():
        return 'N/A'  # Default value for missing data

    data = defaultdict(print_default)

    with open('data/healthdata.csv', 'r') as f:
        contents = csv.reader(f)
        next(contents)  # Skip the header
        for row in contents:
            data[row[0]] = row

    return data


print(dataset()['1'])  # Returns the entire row for the key 1
print(dataset()['1555112'])  # Returns the default value for missing data (N/A)

Specifically, in this example, we need to focus on this part of the code:

# ...
def print_default():
    return 'N/A'  # Default value for missing data

data = defaultdict(print_default)
# ...

It lets us specify the default value to return when a key is not found. Let's see the output of the code above:

['1', 'Male', '27', 'Software Engineer', '6.1', '6', '42', '6', 'Overweight', '126/83', '77', '4200', 'None']
N/A

As you can see, dataset()['1'] prints a full row from our dataset, while dataset()['1555112'] prints the default value. This is a great way to avoid the KeyError we saw before and make our code more robust. One caveat: when a missing key is accessed, defaultdict also inserts the default value under that key, so repeated lookups of unknown keys will grow the dictionary.
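A related pattern worth knowing (a hypothetical sketch, not part of the article's dataset code): defaultdict(list) gives every new key an empty list, which makes grouping rows by a key a one-liner per row:

```python
from collections import defaultdict

# Hypothetical rows in the same shape as the health dataset: (gender, occupation)
rows = [
    ('Male', 'Doctor'),
    ('Female', 'Nurse'),
    ('Female', 'Doctor'),
    ('Male', 'Engineer'),
]

# defaultdict(list) creates an empty list for every new key,
# so we can append without checking "if key in dict" first.
by_gender = defaultdict(list)
for gender, occupation in rows:
    by_gender[gender].append(occupation)

print(by_gender['Male'])    # ['Doctor', 'Engineer']
print(by_gender['Female'])  # ['Nurse', 'Doctor']
```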


deque - Double-Ended Queue

A deque is a double-ended queue: it supports adding and removing elements from both ends efficiently, which makes it a great fit whenever elements need to be added or removed at the beginning or the end of a sequence. Here's an example of how we can use it:

from collections import deque
import csv


def dataset(column_index):
    data = deque()

    with open('data/healthdata.csv', 'r') as f:
        contents = csv.reader(f)
        next(contents)  # Skip the header
        for row in contents:
            data.append(row[column_index])

    return data


deque_list = dataset(1)

print(deque_list)
print('-' * 50 + ' Adding more elements to the deque list ' + '-' * 50)
deque_list.append("new data")
deque_list.appendleft("first data")
print(deque_list)

Main code to focus on:

# ...
data = deque()
 
# ...
data.append(row[column_index])
 
# ...
 
deque_list.append("new data")
deque_list.appendleft("first data")

This simple example allows us to append or "prepend" (add at the start of the list) any new elements we want. Here's the output of our code:

deque(['Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male'])
-------------------------------------------------- Adding more elements to the deque list --------------------------------------------------
deque(['first data', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'new data'])

As you can see, we started with our original deque of gender values. Then we called append() and appendleft(), which added one element at the end and one at the start of the deque. A regular list has append() but no appendleft() method; the closest equivalent, list.insert(0, x), has to shift every existing element and runs in O(n) time, while deque's appendleft() runs in constant time.
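deque has one more trick that fits this theme (a hypothetical sketch, not from the dataset code above): passing maxlen creates a fixed-size sliding window that silently discards the oldest element once it is full:

```python
from collections import deque

# A deque with maxlen=3 keeps only the last three appended values,
# dropping the oldest automatically -- useful for "last N readings".
last_three = deque(maxlen=3)
for heart_rate in [77, 80, 75, 72, 90]:
    last_three.append(heart_rate)

print(list(last_three))  # [75, 72, 90]
```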


namedtuple - Tuple with Named Fields

Sometimes, we need to work with tuples, but accessing the fields by index can be confusing, especially when we have a lot of fields. This is where namedtuple comes in: it allows us to access the fields by name. Here's an example:

from collections import namedtuple
import csv

def collections_dataset():
    sleep_data = namedtuple('patient_data', [
        'gender',
        'age',
        'sleep_duration',
    ])

    data = []

    with open('data/healthdata.csv', 'r') as f:
        contents = csv.reader(f)
        next(contents)  # Skip the header
        for row in contents:
            data.append(
                sleep_data(
                    gender=row[1],
                    age=row[2],
                    sleep_duration=row[4]
                )
            )

    patients = namedtuple('patients', ['patients'])

    return patients(data)

collection_dataset = collections_dataset()
print(collection_dataset.patients[0].sleep_duration)

The central part of the code is this:

# ...

sleep_data = namedtuple('patient_data', [
    'gender',
    'age',
    'sleep_duration',
])

data = []

# ...

data.append(
    sleep_data(
        gender=row[1],
        age=row[2],
        sleep_duration=row[4]
    )
)

# ...

patients = namedtuple('patients', ['patients'])

return patients(data)

This lets us define a named tuple, sleep_data, with our fields. Then, on assignment, we can set each field by name instead of by index. Compared to a regular tuple, this has a few advantages. The biggest one is auto-completion: because fields are accessed as attributes, your editor can suggest them as you type.
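Named tuples also come with a few helper methods. Here is a small sketch with made-up field values:

```python
from collections import namedtuple

PatientData = namedtuple('PatientData', ['gender', 'age', 'sleep_duration'])

p = PatientData(gender='Male', age='27', sleep_duration='6.1')

# Fields read like attributes, so editors can auto-complete them.
print(p.sleep_duration)          # 6.1

# _asdict() converts the record to a regular dict when you need one.
print(p._asdict())               # {'gender': 'Male', 'age': '27', 'sleep_duration': '6.1'}

# _replace() returns a modified copy (tuples are immutable).
print(p._replace(age='28').age)  # 28
```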

But developer experience is not the only benefit: a named tuple can also take up less RAM than an equivalent dictionary. Here's how we tested it:

import sys
from collections import namedtuple
import csv


def collections_dataset():
    sleep_data = namedtuple('patient_data', [
        'gender',
        'age',
        'sleep_duration',
    ])

    data = []

    with open('data/healthdata.csv', 'r') as f:
        contents = csv.reader(f)
        next(contents)  # Skip the header
        for row in contents:
            data.append(
                sleep_data(
                    gender=row[1],
                    age=row[2],
                    sleep_duration=row[4]
                )
            )

    patients = namedtuple('patients', ['patients'])

    return patients(data)


def dataset():
    data = []

    with open('data/healthdata.csv', 'r') as f:
        contents = csv.reader(f)
        next(contents)  # Skip the header
        for row in contents:
            data.append(
                {
                    'gender': row[1],
                    'age': row[2],
                    'sleep_duration': row[4]
                }
            )

    return {
        'patients': data
    }


collection_dataset = collections_dataset()
print(collection_dataset.patients[0].sleep_duration)  # Has auto-completion
print('Size or space occupied by dictionary', sys.getsizeof(collection_dataset))

dict_dataset = dataset()
print(dict_dataset['patients'][0]['sleep_duration'])  # No auto-completion
print('Size or space occupied by dictionary', sys.getsizeof(dict_dataset))

In the end, you can see that we called sys.getsizeof() to retrieve the size of our datasets. This is what we got:

6.1
Size or space occupied by dictionary 48
6.1
Size or space occupied by dictionary 184

Our named tuple took up less space in memory than our regular dictionary. Keep in mind that sys.getsizeof() only measures the outer container, not the objects nested inside it, so this is a shallow comparison; still, it's a good illustration of how named tuples can be both more efficient and more developer-friendly.
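To compare apples to apples at the record level, you can measure one named tuple instance against an equivalent dict. This is a sketch, and the exact byte counts vary across Python versions, so none are hard-coded here:

```python
import sys
from collections import namedtuple

PatientData = namedtuple('PatientData', ['gender', 'age', 'sleep_duration'])

record_tuple = PatientData('Male', '27', '6.1')
record_dict = {'gender': 'Male', 'age': '27', 'sleep_duration': '6.1'}

# A namedtuple instance is laid out like a plain tuple, so its container
# is smaller than a dict holding the same three fields. Remember that
# sys.getsizeof() measures only the container, not the strings inside.
print(sys.getsizeof(record_tuple))
print(sys.getsizeof(record_dict))
```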

