Python: Generate CSV File with Dummy Data - Three Options

3 minutes read

If you want to quickly generate a CSV file with Python for ML model testing, this tutorial will show you how.

In this case, we will generate a CSV of employee salaries with two columns:

  • Years of experience (random from 1 to 10)
  • Salary (depending on the years of experience)

This is the expected result:

The salary will mimic a Linear Regression model that we want to test:

salary = 2000 + years * 200 + random.randint(1, 400)
# Base salary is 2000
# Then, +200 for every year
# And extra random number from 1 to 400

Now, I will show you three ways to generate such a file.


Option 1. While Loop Writing Line-by-Line.

import csv
import random
 
 
with open('salaries.csv', 'w') as file:
writer = csv.writer(file)
writer.writerow(('years', 'salary'))
i = 0
while i < 1000: # generating 1000 rows
years = random.randint(1, 10)
salary = years * 200 + 2000 + random.randint(1, 400)
writer.writerow((years, salary))
i += 1

We import two libraries and generate each row on the fly, writing it to the CSV immediately.


Option 2. Generate Rows Function, Write All At Once

Some developers would want to generate all the rows separately and then write all of them once.

For that, we can separate it into a function.

Or, in fact, if we're separating into functions, let's create a separate function for the salary formula, too.

import csv
import random
 
 
def years_to_salary(years):
return years * 200 + 2000 + random.randint(1, 400)
 
 
def entry_generator(num_rows):
i = 0
while i < num_rows:
years = random.randint(1, 10)
yield years, years_to_salary(years)
i += 1
 
 
with open('salaries.csv', 'w') as file:
writer = csv.writer(file)
writer.writerow(('years', 'salary'))
writer.writerows(entry_generator(count))

Option 3. Use NumPy and Generate All

Machine Learning developers often use a library called NumPy for various array operations. In our case, it would allow us to make our code shorter.

import csv
import numpy as np
 
 
years = np.random.randint(1, 10 + 1, 1000)
salaries = years * 200 + 2000 + np.random.randint(1, 400 + 1, 1000)
 
with open('salaries.csv', 'w') as file:
writer = csv.writer(file)
writer.writerow(('years', 'salary'))
writer.writerows(zip(years, salaries))

On top, we replaced the random library with numpy as np, which has its own random function.

But the difference is that numpy allows us to generate multiple records at once: not only for the years, but also for the salaries, see the code above.

Notice: np.random.randint() has a different second parameter logic than random.randint() - the NumpPy version will generate a number from 1 to 10 excluding the last number 10. That's why we need to pass 10 + 1 or 400 + 1.


So yeah, the resulting CSV is identical, but which code option you like the most?


Mykhailo Kazymyrskyi avatar
Mykhailo Kazymyrskyi

Hello, Povilas! I think, there is a small bug in option 3 - np.random.randint(1, 400 + 1) generates only one value, and then this value is added during calculation of every salary - it is not regenerated row after row. So in this case the salary values do not have any randomness - all rows with the same "year" value have the same "salary" value. At least, this is what I've seen in the csv, generated by the code in option 3 :)

Povilas avatar

Hmmm, good point... I haven't noticed it while testing, good catch! So yeah, probably not the best option if you want randomness.

Still, will leave it in the tutorial, as maybe for some people this randomness of years is fine.

MrCrayon avatar

It can be fixed easily adding one array of random values instead of a single random value:

salaries = years * 200 + 2000 + np.random.randint(1, 400 + 1, 1000)
Povilas avatar

Of course, how could I not think about it. Thanks a lot, fixed the article!

MrCrayon avatar

You are welcome! It was the very least I could do after all the value I get from your videos.