Python: Split Large CSV File into Smaller Ones (tasks from Upwork)

I recently saw a few jobs on Upwork that involve splitting large CSV files into smaller ones. So, I decided to write this tutorial and show how to do it in Python.

Here are the real jobs that you would get paid for:

Now, let's perform the task.

Imagine you have a CSV file with 500,000 rows. To split it, we define two functions:

The function of reading the rows and building the "current" list until its size is 50,000
The function to write that current list into a new CSV

import csv
 
 
def write_csv(filename, data, header=None):
    with open(filename, 'w') as file:
        writer = csv.writer(file)
        if header:
            writer.writerow(header)
        writer.writerows(data)
 
 
def split_csv(filename, num_rows, has_header=True):
    name, extension = filename.split('.')
    file_no = 1
    chunk = []
    row_count = 0
    header = ''
 
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            if has_header:
                header = row
                has_header = False
                continue
            chunk.append(row)
            row_count += 1
            if row_count >= num_rows:
                write_csv(f'{name}-{file_no}.{extension}', chunk, header)
                chunk = []
                file_no += 1
                row_count = 0
        if chunk:
            write_csv(f'{name}-{file_no}.{extension}', chunk, header)

As you can see, we also have a few helper variables:

name - to build the salaries-1.csv, salaries-2.csv, etc.
file_no - to build the salaries-1.csv, salaries-2.csv, etc
header - to write that in each of the resulted files, on top
chunk - the current chunk which is filled in until reading the num_rows size
row_count - iteration variable to compare against the num_rows

Now, we can call that main function with two parameters - the original filename and how many rows should be in each of the resulting files:

split_csv('salaries.csv', 50000)

In this case, we have split a file with 500,000 rows into 10 files with 50,000 rows each: