Method `train_test_split()`: Best Value for `random_state` Parameter?
When using the function train_test_split() from the scikit-learn library, you may notice a parameter random_state, which confuses a lot of beginners. What should be its "ideal" value, and how does it work?
Typical examples you may see in online tutorials:
train_test_split(x,y,test_size=0.2,random_state=0)
train_test_split(x,y,test_size=0.2,random_state=1)
train_test_split(x,y,test_size=0.2,random_state=42)
When should we use 0 or 1? Or, even weirder, 42?
1. No Value At All (None)
In some tutorials, you may find that the parameter isn't set at all.
train_test_split(x,y,test_size=0.2)
Let's start with exactly that case: the None value.
If you don't specify random_state, the function shuffles the data differently on every call.
In other words, it will be unpredictable which 20% of the rows will be taken for testing.
Consequently, you will get different predictions and model accuracy results from run to run.
It's not necessarily wrong; sometimes you want exactly that. But in many cases, it's better to have the same stable data for testing, so that results are reproducible.
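To see this behaviour for yourself, here is a minimal sketch. The toy arrays x and y are made up for illustration (100 rows whose label is simply the row index), so that it is easy to see which rows ended up in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature, label = row index
x = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# No random_state: every call shuffles with a fresh, unpredictable seed
_, x_test_a, _, y_test_a = train_test_split(x, y, test_size=0.2)
_, x_test_b, _, y_test_b = train_test_split(x, y, test_size=0.2)

# The size of the test set is always 20% of the rows...
print(len(y_test_a), len(y_test_b))

# ...but which rows land in it will usually differ between the two calls
print(np.sort(y_test_a)[:5])
print(np.sort(y_test_b)[:5])
```

If you run this script several times, the last two printed lines should keep changing, while the sizes stay the same.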
2. Any Value: 0, 1, ...
If you set random_state to any integer value, that number is used as the seed for the pseudo-random number generator that shuffles the data. Nothing is stored in memory: the same seed simply produces the same sequence of "random" choices.
Next time, if you run the method with the same value on the same data, you will get the same split.
So, wait, is 0 better than 1? Or worse?
The answer is neither better nor worse. It's just a different seed, that's it.
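One way to picture what the integer does is to look at a seeded random generator directly. The sketch below uses NumPy's RandomState to shuffle ten indices. It is only a conceptual illustration of seeding, not the actual internals of train_test_split().

```python
import numpy as np

# Two generators created with the same seed produce the same "random" shuffle
first = np.random.RandomState(0).permutation(10)
second = np.random.RandomState(0).permutation(10)

# A generator with another seed produces its own, equally repeatable shuffle
other = np.random.RandomState(1).permutation(10)

print(first)
print(second)
print(other)
print(np.array_equal(first, second))  # same seed, same order
```

The number itself carries no meaning: it only selects which repeatable shuffle you get.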
For example:
train_test_split(x,y,test_size=0.2,random_state=0)
train_test_split(x,y,test_size=0.2,random_state=1)
Then, whenever you re-run the script and call the function with random_state=0, you will get the same split that 0 produced before.
If you pass in the value 1, you get a different split, and that one is just as repeatable on future runs with 1.
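Here is a small sketch that checks this repeatability. As before, the arrays x and y are made-up toy data, not anything from a real project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature, label = row index
x = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Same seed twice, then a different seed
_, _, _, y_test_0a = train_test_split(x, y, test_size=0.2, random_state=0)
_, _, _, y_test_0b = train_test_split(x, y, test_size=0.2, random_state=0)
_, _, _, y_test_1 = train_test_split(x, y, test_size=0.2, random_state=1)

# Identical seeds give identical test sets
print(np.array_equal(y_test_0a, y_test_0b))

# A different seed gives a different selection of rows of the same size
print(np.sort(y_test_0a)[:5])
print(np.sort(y_test_1)[:5])
```

Note that the guarantee only holds as long as the input data and test_size stay the same. If you add or reorder rows, the same seed will select different samples.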
3. But Wait, Why 42?!
In tutorials, you will often find that developers specifically use the number 42.
This is a piece of developer humor. The number 42 is "The Answer to the Ultimate Question of Life, the Universe, and Everything" in The Hitchhiker's Guide to the Galaxy. So developers often use this number in the sense of "if you can use any number, then why not the ultimate answer, 42?". From the point of view of train_test_split(), 42 is no better than 0, 1, 2, or any other integer.
I do not understand why there is a line in step 2 that says "The official definition of this function looks like this:". Perhaps it is incomplete, in the wrong location, or decided later not to include the definition?
Well noticed! And, to be honest, I don't remember now what I meant here 2 months ago as I was writing... Shame on me :)
Removed that sentence altogether, thanks for the proof-reading.