Lesson 12: Correlation Heatmap

Next, we must decide what our independent variables (or features) will be.


Do We Need All Columns For x?

The answer to the question above looks simple: everything except salary, right?

  • Programming language
  • City
  • Level
  • Years of experience

But actually, the question is how much the salary really depends on any of those variables.

In other words, we should raise this question:

Is the salary bigger/smaller based on the programming language? On the city? On the level? On the years of experience?

We may find out that some of those values have a very low correlation with the salary.

To help us see it clearly, the seaborn library has a great feature called a correlation heatmap.

In the earlier lesson, we imported the library with import seaborn as sns at the top, so now we can call it like this:

main.py

import seaborn as sns
# ...
 
print(df.head(10))
 
sns.heatmap(df.corr(), annot=True)
plt.show()

Here's the result:

It looks cool, but... what does it actually mean? How do we read it?


How To Read a Correlation Heatmap

This is a matrix in which every column of the dataset appears both as a row and as a column, and each cell shows the correlation between that pair of columns.

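The correlation numbers range from -1 to 1. We're looking for numbers as close as possible to 1 or to -1: both mean a strong correlation. A positive number means the two columns rise together, and a negative one means one rises as the other falls.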

And the opposite is also true: we're also looking for numbers close to 0. They mean that this particular column combination has almost no linear correlation, so we can usually drop those columns from the model without losing much.
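To get a feel for the scale, here is a minimal sketch that computes the correlation for three tiny made-up series. The data and the helper name pearson are hypothetical and are not part of the salary survey; the only library call used is NumPy's np.corrcoef.

```python
import numpy as np

# Hypothetical toy data, not taken from the salary survey.
years = np.array([1, 2, 3, 4, 5, 6])
salary_up = np.array([1000, 1500, 2000, 2500, 3000, 3500])    # rises with years
salary_down = np.array([3500, 3000, 2500, 2000, 1500, 1000])  # falls with years
unrelated = np.array([2, -1, -1, -1, -1, 2])                  # no linear trend


def pearson(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal cell is the
    # correlation between a and b.
    return float(np.corrcoef(a, b)[0, 1])


print(pearson(years, salary_up))    # close to 1: strong positive correlation
print(pearson(years, salary_down))  # close to -1: strong negative correlation
print(pearson(years, unrelated))    # close to 0: no linear correlation
```

Note that the second series is just as predictable from years as the first one, which is why a value near -1 is as useful as a value near 1.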

In this color scheme, the lighter the cell is, the higher the correlation value. The darkest cells are the most negative ones, so always check the number and not only the color.
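If the colors are hard to interpret, one option is to pin the color scale to the full -1 to 1 range and use a diverging palette. The sketch below is only an illustration on a made-up mini-dataset (the values and the years column name are assumptions); vmin, vmax, cmap, and fmt are regular parameters of sns.heatmap.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical mini-dataset; values and the 'years' column name are assumptions.
df = pd.DataFrame({
    'years': [1, 2, 4, 5, 8, 10],
    'salary': [1200, 1500, 2100, 2400, 3300, 3400],
    'city_Vilnius': [1, 0, 1, 1, 0, 1],
    'city_Kaunas': [0, 1, 0, 0, 1, 0],
})

ax = sns.heatmap(
    df.corr(),
    annot=True,        # print the number in every cell
    fmt='.2f',         # two decimals are enough to compare cells
    vmin=-1, vmax=1,   # pin the color scale to the full correlation range
    cmap='coolwarm',   # diverging palette: blue = negative, red = positive
)
plt.tight_layout()
```

Call plt.show() afterwards, as in main.py, to display the figure. With this palette, strong correlations of either sign get saturated colors, and values near 0 stay pale.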

Naturally, the correlation of a column to itself is always 1.

Also, when a column has only two possible values, the correlation between its two one-hot columns is always -1: if the city is Vilnius, it is never Kaunas.
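This is easy to verify on a small example. The sketch below uses a made-up list of five cities (only the two city names come from this lesson) and the same pd.get_dummies call with a prefix as in the earlier steps.

```python
import pandas as pd

# Hypothetical mini-sample: only the two city names come from the lesson.
cities = pd.Series(
    ['Vilnius', 'Kaunas', 'Vilnius', 'Vilnius', 'Kaunas'], name='city'
)

# dtype=float keeps the example independent of pandas' default dummy dtype
one_hot = pd.get_dummies(cities, prefix='city', dtype=float)
print(one_hot)

# Whenever city_Vilnius is 1, city_Kaunas is 0, and vice versa,
# so the two columns are perfect mirror images of each other.
corr = one_hot['city_Vilnius'].corr(one_hot['city_Kaunas'])
print(round(corr, 2))  # -1.0
```

Because one of the two columns carries no extra information, pd.get_dummies also accepts drop_first=True to keep only one of them. That option is not used in this course's code.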

But then, between other columns, we need to identify the strongest correlations for the salary column we want to predict.


Looking at Correlations with Salary

We need to look at the salary column and see which rows have the lightest color and the highest numbers.

The correlation of 0.77 or 0.73 is a pretty strong one. It means that salary depends quite strongly on the years of experience and the level.

But the numbers for the other columns are extremely low: 0.06, 0.01, or even slightly negative. It means that those columns have almost no linear relationship with the salary.

Yes, it may surprise some of you, but salary doesn't depend strongly on the programming language or the city you live in. At least in this survey of local developers in Lithuania.
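The same reading can also be done without the picture by taking the salary column of the correlation matrix and sorting it. This is a minimal sketch on a made-up mini-dataset: the values and the column names level and years are assumptions, so the printed numbers will not match the heatmap above.

```python
import pandas as pd

# Hypothetical mini-dataset; 'level' and 'years' are assumed column names.
df = pd.DataFrame({
    'level': [1, 1, 2, 2, 3, 3, 3, 2],
    'years': [1, 2, 4, 5, 8, 10, 12, 3],
    'salary': [1200, 1500, 2100, 2400, 3300, 3400, 4000, 2000],
    'lang_php': [1, 0, 1, 0, 1, 0, 1, 0],
    'city_Vilnius': [1, 1, 0, 1, 0, 1, 1, 0],
})

# Correlation of every other column with salary
salary_corr = df.corr()['salary'].drop('salary')

# Order by absolute value so strong negative correlations are not overlooked
order = salary_corr.abs().sort_values(ascending=False).index
ranked = salary_corr.reindex(order)
print(ranked.round(2))
```

The columns at the top of this list are the candidates for x, and the ones at the bottom are the candidates for dropping.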

That means we can drop city and programming language columns and take only years of experience and level as x values, as they are the only ones with solid correlation.

main.py

# ...
 
x = df.iloc[:, 0:2].values # we take only level and years (the first two columns)
y = df.iloc[:,2].values # we take the salary
print(x[0:5])

main.py

# Result:
([[ 3, 10],
[ 3, 10],
[ 3, 10],
[ 2, 4],
[ 1, 1]])

main.py

# ...
 
print(y[0:5])

main.py

# Result:
([2800, 3400, 2500, 2100, 3500])
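Selecting by position works here because level, years of experience, and salary happen to be the first three columns after the one-hot steps. A sketch of a more explicit alternative is to select the columns by name. The names level and years below are assumptions, since the real CSV column names are not shown in this lesson, and the five rows simply reuse the values printed above.

```python
import pandas as pd

# Hypothetical frame mimicking the column order after one-hot encoding;
# the names 'level' and 'years' are assumptions.
df = pd.DataFrame({
    'level': [3, 3, 3, 2, 1],
    'years': [10, 10, 10, 4, 1],
    'salary': [2800, 3400, 2500, 2100, 3500],
    'lang_php': [1, 0, 0, 1, 0],
    'city_Vilnius': [1, 1, 0, 1, 1],
})

feature_columns = ['level', 'years']  # the two strongly correlated columns
x = df[feature_columns].values        # 2-D array: one row per developer
y = df['salary'].values               # 1-D array: the value we predict

print(x[:5])
print(y[:5])
```

Selecting by name keeps working even if the column order changes later, for example when another column is added before salary.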

Ok, we got rid of a few columns. But does that also mean that all the previous work filtering out the cities and languages was kind of pointless? No, not at all. Without that initial analysis, we couldn't come to this conclusion.

I will also note that many more transformations may be needed to prepare the data. In other tutorials/courses, you may find things like data scaling, feature engineering, etc. They were just not required for this particular case.

Ok, great, now we can finally build our model in the next lesson.


Final File

If you felt lost in the code above, here's the final file with all the code from this lesson:

main.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
df = pd.read_csv('salaries-2023.csv')
 
print(df.head())
print(df.shape)
df.info()
print(df.describe())
 
allowed_languages = ['php', 'js', '.net', 'java']
df = df[df['language'].isin(allowed_languages)]
 
vilnius_names = ['Vilniuj', 'Vilniua', 'VILNIUJE', 'VILNIUS', 'vilnius', 'Vilniuje']
condition = df['city'].isin(vilnius_names)
df.loc[condition, 'city'] = 'Vilnius'
 
kaunas_names = ['KAUNAS', 'kaunas', 'Kaune']
condition = df['city'].isin(kaunas_names)
df.loc[condition, 'city'] = 'Kaunas'
 
print(df.city.value_counts())
 
allowed_cities = ['Vilnius', 'Kaunas']
df = df[df['city'].isin(allowed_cities)]
print(df.shape)
 
df_sorted = df.sort_values(by='salary', ascending=False)
print(df_sorted.head(20))
 
x = df.iloc[:, -2:-1]
y = df.iloc[:, -1].values
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.scatter(x, y)
plt.show()
 
df = df[df['salary'] <= 6000]
print(df.shape)
 
x = df.iloc[:, -2:-1]
y = df.iloc[:, -1].values
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.scatter(x, y)
plt.show()
 
one_hot = pd.get_dummies(df['language'], prefix='lang')
df = df.join(one_hot)
df = df.drop('language', axis=1)
 
one_hot = pd.get_dummies(df['city'], prefix='city')
df = df.join(one_hot)
df = df.drop('city', axis=1)
 
print(df.head(10))
 
sns.heatmap(df.corr(), annot=True)
plt.show()
 
x = df.iloc[:, 0:2].values # we take only level and years (the first two columns)
y = df.iloc[:, 2].values # we take the salary
print(x[0:5])
 
print(y[0:5])
