If you want to quickly generate a CSV file with Python for ML model testing, this tutorial will show you how.
In this case, we will generate a CSV of employee salaries with two columns:
- Years of experience (random from 1 to 10)
- Salary (depending on the years of experience)
This is the expected result:
The salary will mimic a Linear Regression model that we want to test:
salary = 2000 + years * 200 + random.randint(1, 400)# Base salary is 2000# Then, +200 for every year# And extra random number from 1 to 400
Now, I will show you three ways to generate such a file.
Option 1. While Loop Writing Line-by-Line.
import csvimport random with open('salaries.csv', 'w') as file: writer = csv.writer(file) writer.writerow(('years', 'salary')) i = 0 while i < 1000: # generating 1000 rows years = random.randint(1, 10) salary = years * 200 + 2000 + random.randint(1, 400) writer.writerow((years, salary)) i += 1
We import two libraries and generate each row on the fly, writing it to the CSV immediately.
Option 2. Generate Rows Function, Write All At Once
Some developers would want to generate all the rows separately and then write all of them once.
For that, we can separate it into a function.
Or, in fact, if we're separating into functions, let's create a separate function for the salary formula, too.
import csvimport random def years_to_salary(years): return years * 200 + 2000 + random.randint(1, 400) def entry_generator(num_rows): i = 0 while i < num_rows: years = random.randint(1, 10) yield years, years_to_salary(years) i += 1 with open('salaries.csv', 'w') as file: writer = csv.writer(file) writer.writerow(('years', 'salary')) writer.writerows(entry_generator(count))
Option 3. Use NumPy and Generate All
Machine Learning developers often use a library called NumPy for various array operations. In our case, it would allow us to make our code shorter.
import csvimport numpy as np years = np.random.randint(1, 10 + 1, 1000)salaries = years * 200 + 2000 + np.random.randint(1, 400 + 1, 1000) with open('salaries.csv', 'w') as file: writer = csv.writer(file) writer.writerow(('years', 'salary')) writer.writerows(zip(years, salaries))
On top, we replaced the random
library with numpy as np
, which has its own random function.
But the difference is that numpy
allows us to generate multiple records at once: not only for the years
, but also for the salaries
, see the code above.
Notice: np.random.randint()
has a different second parameter logic than random.randint()
- the NumpPy version will generate a number from 1 to 10 excluding the last number 10. That's why we need to pass 10 + 1
or 400 + 1
.
So yeah, the resulting CSV is identical, but which code option you like the most?
Hello, Povilas! I think, there is a small bug in option 3 - np.random.randint(1, 400 + 1) generates only one value, and then this value is added during calculation of every salary - it is not regenerated row after row. So in this case the salary values do not have any randomness - all rows with the same "year" value have the same "salary" value. At least, this is what I've seen in the csv, generated by the code in option 3 :)
Hmmm, good point... I haven't noticed it while testing, good catch! So yeah, probably not the best option if you want randomness.
Still, will leave it in the tutorial, as maybe for some people this randomness of years is fine.
It can be fixed easily adding one array of random values instead of a single random value:
salaries = years * 200 + 2000 + np.random.randint(1, 400 + 1, 1000)
Of course, how could I not think about it. Thanks a lot, fixed the article!
You are welcome! It was the very least I could do after all the value I get from your videos.