Python Data Pre-Processing Example: Multiple CSVs into Single File

8 minutes read

Imagine that you need to analyze the data for multiple years, and the client gives the data in multiple CSV files, one per year. Even worse, some columns in some years are in a different format. The task for data pre-processing, huh? Let's write a Python script for it.

Here's the format for the file for the year 2016:

Take a closer look at three fields about the levels (junior/mid/senior):

And another closer look at the fields about the years of experience - from 0 to 10:

And here's the format for the year 2023:

See the difference and the problem?

We actually have four tasks here:

  1. The first file has columns like "level" and "years" split into multiple columns, so we need to combine them and assign the value of 1/2/3 for level and 1-10 for years of experience
  2. Also, while merging the files into one CSV, we need to add a survey_year column.
  3. Programming languages are provided comma-separated: we need to take only the first one
  4. Some cities are provided with typos: we need to replace them with the correct names

The final result should be this:


Step 1. Simple Loop.

Let's start by importing a library and writing a structure for our data:

import csv
 
# This is how our data will look at the export
datalist = [[
'survey_year',
'language',
'salary',
'city',
'years',
'level'
]]

Next, we need to create a list of files that we want to merge:

# ...
 
# Files that we want to merge
files = {
2016: 'data/rawSource/salary-2016.csv',
2017: 'data/rawSource/salary-2017.csv',
2018: 'data/rawSource/salary-2018.csv',
2019: 'data/rawSource/salary-2019.csv',
2020: 'data/rawSource/salary-2020.csv',
2021: 'data/rawSource/salary-2021.csv',
2022: 'data/rawSource/salary-2022.csv',
2023: 'data/rawSource/salary-2023.csv',
}

And, of course, we need to read the data from the files:

# ...
 
# Loop through the files and read the data
for year, file in files.items():
# Open the file
with open(file, newline='') as csvfile:
# Read the file
reader = csv.DictReader(csvfile)
# Loop through the rows
for row in reader:
# Add the row to our data list
datalist.append([
year,
row['language'],
row['salary'],
row['city'],
row['years'],
row['level']
])

Now, if we run the script, we will get an error:

Traceback (most recent call last):
File "/transforming-csv-to-a-common-format/mergeCSV.py", line 52, in <module>
row['years'],
KeyError: 'years'

This is because we have different columns in our files. Just take a look at the 2016 and 2023 files:

2016

"language",junior,mid,senior,"years_0","years_1","years_2","years_3","years_4","years_5","years_6","years_7","years_8","years_9","years_10","city","salary"
,Junior,Intermediate,Senior,0,1,2,3,4,5,6,7,8,9,10+,,
PHP,,1,,,,,1,,,,,,,,Vilnius,1000
Java,,1,,,,,1,,,,,,,,Vilnius,900

2023

language,city,level,years,salary
php,UK,3,10,3000
.net,Danija,2,10,2700
php,Vilnius,3,10,2800

As you can see, we have different columns in both files. Our script currently expects the 2023 file, but that failed in 2016.


Step 2. Functions to Transform Years/Level

We can define the functions to change the values based on other values:

# ...
datalist = [[
# ...
]]
 
def get_level(row):
if row['junior']:
return 1
 
if row['mid']:
return 2
 
if row['senior']:
return 3
 
return 0
 
 
def get_years(row):
for y in range(0, 11):
if row[f'years_{y}']:
return y
 
 
files = {
2016: 'data/rawSource/salary-2016.csv',
2017: 'data/rawSource/salary-2017.csv',
2018: 'data/rawSource/salary-2018.csv',
2019: 'data/rawSource/salary-2019.csv',
2020: 'data/rawSource/salary-2020.csv',
2021: 'data/rawSource/salary-2021.csv',
2022: 'data/rawSource/salary-2022.csv',
2023: 'data/rawSource/salary-2023.csv',
}
 
for year, file in files.items():
with open(file, newline='') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
datalist.append([
year,
row['language'],
row['salary'],
row['city'],
get_years(row) if 'years' not in row else row['years'],
get_level(row) if 'years' not in row else row['level']
])

Here's what we did:

  • We created a get_level() function that returns the level of the developer based on the columns (this applies to files that match the 2016 format)
  • We created a get_years() function that returns the years of the developer based on the columns (this applies to files that match the 2016 format)
  • We check if the years column exists in our loop. If it does not - we use our custom functions to get the data. If it does - we use the column directly.

Running this now, we should not see any errors. However, we still have more work to do.


Step 3. Merging Similar Cities and Languages

Here's the plan of the next step:

  • Similar city names should be merged
  • Similar language names should be merged
  • Multi-language developers should get their own row (or not, based on the settings)

We've manually identified the typos and different ways of writing the city/language names, and we create lists from them.

# ...
 
# Set this to control how the data is split
# True -> Split language into individual languages
# False -> Take the first language and ignore the rest
split_language_into_individual_languages = True
 
# Normalize the city names
vilnius_names = ['Vilniuj', 'Vilniua', 'VILNIUJE', 'VILNIUS', 'vilnius', 'Vilniuje']
kaunas_names = ['KAUNAS', 'kaunas', 'Kaune', 'Kns']
klaipeda_names = ['Klaipėda', 'Klaipeda', 'Klaipėdoje', 'Klaipedoje']
 
# Normalize the language names
php_names = ['PHP', 'php', 'Php']
js_names = ['js', 'JS', 'JavaScript', 'javascript']
dotnet_names = ['.net', 'dotnet', 'Dotnet', 'DotNet']
java_names = ['java', 'Java', 'JAVA', 'Java (Android)']
 
# Clean the data:
# Check if salary is a number. If not - remove the row (people who provided a salary range)
# Split languages into individual languages (if enabled)
for i, row in enumerate(datalist):
if i == 0:
continue
 
if type(row[2]) is not int:
salary = [x for x in row[2] if x.isdigit()]
salary = ''.join(salary)
if salary.isdigit():
datalist[i][2] = int(salary)
else:
datalist.pop(i)
continue
 
# Split languages into individual languages
if split_language_into_individual_languages:
languages = row[1].split(',')
if len(languages) > 1:
for language in languages:
datalist.append([
row[0],
language.strip(),
row[2],
row[3],
row[4],
row[5]
])
else:
# Take the first language and ignore the rest
languages = row[1].split(',')[0]
datalist[i][1] = languages
 
# Loop through the data and normalize the city names and language names
for i, row in enumerate(datalist):
if i == 0:
continue
if row[1] == '':
datalist.pop(i)
if row[3] in vilnius_names:
datalist[i][3] = 'Vilnius'
elif row[3] in kaunas_names:
datalist[i][3] = 'Kaunas'
elif row[3] in klaipeda_names:
datalist[i][3] = 'Klaipeda'
 
if row[1] in php_names:
datalist[i][1] = 'PHP'
elif row[1] in js_names:
datalist[i][1] = 'js'
elif row[1] in dotnet_names:
datalist[i][1] = '.net'
elif row[1] in java_names:
datalist[i][1] = 'java'

Step 4. Write Data into a New CSV

Last but not least - we need to export the data:

# ...
 
with open('data/salaries.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerows(datalist)
 
print('Done')

And now, if we run the script - we should get a salaries.csv file with all the data:


Full Code

import csv
 
datalist = [[
'survey_year',
'language',
'salary',
'city',
'years',
'level'
]]
 
 
def get_level(row):
if row['junior']:
return 1
 
if row['mid']:
return 2
 
if row['senior']:
return 3
 
return 0
 
 
def get_years(row):
for y in range(0, 11):
if row[f'years_{y}']:
return y
 
 
files = {
2016: 'data/rawSource/salary-2016.csv',
2017: 'data/rawSource/salary-2017.csv',
2018: 'data/rawSource/salary-2018.csv',
2019: 'data/rawSource/salary-2019.csv',
2020: 'data/rawSource/salary-2020.csv',
2021: 'data/rawSource/salary-2021.csv',
2022: 'data/rawSource/salary-2022.csv',
2023: 'data/rawSource/salary-2023.csv',
}
 
for year, file in files.items():
with open(file, newline='') as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
datalist.append([
year,
row['language'],
row['salary'],
row['city'],
get_years(row) if 'years' not in row else row['years'],
get_level(row) if 'years' not in row else row['level']
])
 
# Set this to control how the data is split
# True -> Split language into individual languages
# False -> Take the first language and ignore the rest
split_language_into_individual_languages = False
 
# Normalize the data
vilnius_names = ['Vilniuj', 'Vilniua', 'VILNIUJE', 'VILNIUS', 'vilnius', 'Vilniuje']
kaunas_names = ['KAUNAS', 'kaunas', 'Kaune', 'Kns']
klaipeda_names = ['Klaipėda', 'Klaipeda', 'Klaipėdoje', 'Klaipedoje']
 
php_names = ['PHP', 'php', 'Php']
js_names = ['js', 'JS', 'JavaScript', 'javascript']
dotnet_names = ['.net', 'dotnet', 'Dotnet', 'DotNet']
java_names = ['java', 'Java', 'JAVA', 'Java (Android)']
 
for i, row in enumerate(datalist):
if i == 0:
continue
 
if type(row[2]) is not int:
salary = [x for x in row[2] if x.isdigit()]
salary = ''.join(salary)
if salary.isdigit():
datalist[i][2] = int(salary)
else:
datalist.pop(i)
continue
 
# Split languages into individual languages
if split_language_into_individual_languages:
languages = row[1].split(',')
if len(languages) > 1:
for language in languages:
datalist.append([
row[0],
language.strip(),
row[2],
row[3],
row[4],
row[5]
])
datalist.pop(i)
else:
# Take the first language and ignore the rest
languages = row[1].split(',')[0]
datalist[i][1] = languages
 
for i, row in enumerate(datalist):
if i == 0:
continue
if row[1] == '':
datalist.pop(i)
if row[3] in vilnius_names:
datalist[i][3] = 'Vilnius'
elif row[3] in kaunas_names:
datalist[i][3] = 'Kaunas'
elif row[3] in klaipeda_names:
datalist[i][3] = 'Klaipeda'
 
if row[1] in php_names:
datalist[i][1] = 'PHP'
elif row[1] in js_names:
datalist[i][1] = 'js'
elif row[1] in dotnet_names:
datalist[i][1] = '.net'
elif row[1] in java_names:
datalist[i][1] = 'java'
 
with open('data/salaries.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerows(datalist)
 
print('Done')

No comments or questions yet...