Lesson 4: Split into Training and Testing Data
Next, we need to train the model and make a prediction for some value(s), like the salary for X, Y, or Z years of experience.
But how can we test if our model is actually making the correct predictions?
For that, developers split the data into two parts:
- Training data (roughly 80% of the initial data)
- Testing data (the remaining 20%)
Such an 80-20 split is probably the most common one in ML projects, but it depends on the situation: some developers prefer 70-30. It comes down to personal preference.
What happens then:
- The model analyzes the training data (is "trained" with it)
- Then, the model tries to predict the salaries for the years of experience in the testing data
- And since we actually have the salaries in the testing data, we can compare the model's predictions with the actual values and see how accurate the predictions were
To split the data, we use the train_test_split() method from the scikit-learn library. This is another new library we will import, in addition to pandas, numpy, and matplotlib.
Note: If you don't have it on your computer, install it with pip3 install scikit-learn.
main.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("salaries.csv")

x = df.iloc[:, :-1].values  # get all rows with all columns except the last one
y = df.iloc[:, -1].values   # get all rows with only the last column

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

plt.scatter(x, y)
plt.show()
We have imported a single method from the sklearn.model_selection module. The syntax to import only a specific method is from library.member import method. This is used when you only want that specific method from the library and want to avoid typing the full library name each time. You can read a summary of the importing syntax options in this tutorial.
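As a quick illustration of those import styles, here they are side by side with numpy (which we already have installed); all three calls do the same thing:

```python
import numpy                # full name needed: numpy.mean(...)
import numpy as np          # alias: np.mean(...)
from numpy import mean      # direct: mean(...)

data = [10, 20, 30]
print(numpy.mean(data))  # 20.0
print(np.mean(data))     # 20.0
print(mean(data))        # 20.0
```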
Speaking of the train_test_split() method, these are the typical variable names for the training/testing data:
- x_train
- x_test
- y_train
- y_test
The parameter test_size sets the size of the testing data (0.2 means 20%; you can experiment with different values).
The parameter random_state accepts any integer number; the default value is None. I've explained it in depth in a separate tutorial, but in short: if you leave it unset, the split will be randomized differently on every run, and you will get a different accuracy result every time you run the model. For this tutorial, I want to keep the same accuracy between runs, so I will set it to 0.
Next, we can check how many items are in those arrays using the .shape property of numpy arrays.
main.py
# ...
plt.show()

print(x_train.shape)
print(x_test.shape)

# Output:
# (80, 1)
# (20, 1)
This property shows the number of rows/columns, so from the initial 100 rows and 2 columns, we get training data of 80 rows and testing data of 20 rows.
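If you don't have the salaries.csv file at hand, you can reproduce the same shapes with synthetic stand-in data (the values below are random, not the real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.rand(100, 1)  # 100 rows, 1 feature column (like years of experience)
y = np.random.rand(100)     # 100 target values (like salaries)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print(x_train.shape)  # (80, 1)
print(x_test.shape)   # (20, 1)
```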
Now, as we have the data split, we can move to actually training the model and predicting the salaries.