Lesson 4: Split into Training and Testing Data

Next, we need to train the model and make a prediction for some value(s), like the salary for X, Y, or Z years of experience.

But how can we test if our model is actually making the correct predictions?

For that, developers split the data into two parts:

  • Training data (80% of the initial data)
  • Testing data (the remaining 20%)

An 80-20 split is probably the most common one in ML projects, but it depends on the situation - some developers prefer 70-30. In the end, it comes down to the project and personal preference.

What happens then:

  1. The model analyzes the training data (is "trained" with it)
  2. Then, the model tries to predict the salaries for the years of experience in the testing data
  3. And since we actually have the salaries in the testing data, we can compare the model's predictions with the actual values and see how accurate the predictions were
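The three steps above can be sketched end to end on synthetic data. This is only a preview: actually training the model comes in the next lesson, and the use of LinearRegression here is my assumption of where the course is heading, not part of this lesson's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # assumed model, covered later in the course

# Synthetic data: salary grows linearly with years of experience
x = np.arange(1, 101).reshape(-1, 1)   # years of experience, 100 rows x 1 column
y = 30000 + 2000 * x.ravel()           # salary

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(x_train, y_train)            # step 1: train on the training data
predictions = model.predict(x_test)    # step 2: predict salaries for the testing data
errors = predictions - y_test          # step 3: compare with the actual salaries

print(x_train.shape, x_test.shape)     # (80, 1) (20, 1)
print(abs(errors).max())               # near zero, since the synthetic data is perfectly linear
```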

To split the data, we use the train_test_split() function from the scikit-learn library. This is another new library we will import in addition to pandas + numpy + matplotlib.

Note: If you don't have it on your computer, install it with pip3 install scikit-learn.

main.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
 
df = pd.read_csv("salaries.csv")
 
x = df.iloc[:,:-1].values # get all rows with all columns except the last one
y = df.iloc[:,-1].values # get all rows with only the last column
 
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
 
plt.scatter(x, y)
plt.show()

We have imported a single function from the sklearn.model_selection module. The syntax to import only a specific member is from library.module import member. This is used when you only want that specific function from the library and want to avoid typing the full library path each time. You can read the summary of importing syntax options in this tutorial.
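To see the difference between the two import styles, here is a small sketch. Both forms load the same function; the second one just binds it to a shorter name:

```python
# Option 1: import the whole module and use the full path each time
import sklearn.model_selection
split = sklearn.model_selection.train_test_split

# Option 2: import only the function you need (the style used in this lesson)
from sklearn.model_selection import train_test_split

# Both names refer to the exact same function object
print(split is train_test_split)  # True
```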

Speaking of the train_test_split() function, these are the typical variable names for training/testing data:

  • x_train
  • x_test
  • y_train
  • y_test

The parameter test_size sets the fraction of the data reserved for testing (0.2 means 20%; you can experiment with different values).
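If you want to experiment with test_size, a quick sketch on dummy data shows how the split sizes change:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 rows, matching the dataset size in this lesson
x = np.arange(100).reshape(-1, 1)
y = np.arange(100)

sizes = []
for size in (0.2, 0.3):
    x_train, x_test, _, _ = train_test_split(x, y, test_size=size, random_state=0)
    sizes.append((len(x_train), len(x_test)))
    print(len(x_train), len(x_test))
# 80 20
# 70 30
```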

The parameter random_state accepts any integer; the default value is None. I've explained it in-depth in a separate tutorial, but basically, if you leave it unset, the split is randomized differently on every run, so you get a different accuracy result each time you run the model. For this tutorial, I want reproducible results, so I will set it to 0.
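A minimal sketch of what a fixed random_state buys you: two calls with the same integer always produce an identical split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Same random_state -> identical split on every call
a_train, _, _, _ = train_test_split(x, y, test_size=0.2, random_state=0)
b_train, _, _, _ = train_test_split(x, y, test_size=0.2, random_state=0)
print(np.array_equal(a_train, b_train))  # True

# With random_state=None (the default), the split would usually
# differ between calls, and so would the model's accuracy.
```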

Next, we can check how many items are in those arrays using the .shape attribute of numpy arrays.

main.py

# ...
plt.show()
 
print(x_train.shape)
print(x_test.shape)
# Output:
# (80, 1)
# (20, 1)

The .shape attribute shows the number of rows/columns, so from the initial 100 rows and 2 columns, we get training data of 80 rows and testing data of 20 rows.
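One detail worth knowing: .shape is a tuple, and the y arrays from our split are one-dimensional, so their shape has no column count. A small sketch on dummy data:

```python
import numpy as np

a = np.zeros((100, 2))        # dummy data shaped like our dataset: 100 rows, 2 columns
print(a.shape)                # (100, 2)
print(a[:, :-1].shape)        # (100, 1): all columns except the last, still 2-D
print(a[:, -1].shape)         # (100,): only the last column, a 1-D array
```

This is why x_train.shape prints as (80, 1) while y_train.shape would print as (80,).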

Now, as we have the data split, we can move to actually training the model and predicting the salaries.
