Filters by City and Language

Next, we will use the pandas library to show us the data in various shapes and identify some data that we can remove as insignificant.

Filter Programming Languages

First, what different programming languages do we have, and how many of each?

main.py

print(df.language.value_counts())

main.py

# Result:
language
php           217
js            181
.net          105
java           93
python         37
go             26
c++            24
node           19
kotlin         15
ruby           14
swift           6
rust            4
flutter         4
sql             2
sap             2
scala           2
salesforce      1
vue             1
blockchain      1
devops          1
progress        1
openedge        1
pascal          1
uipath          1
unity           1
Name: count, dtype: int64

Ok, so the distribution is very uneven. Only three languages are represented by more than 100 people.

So, this is when you must decide where to "draw the line".

It's pretty obvious that we can't make any statistical conclusions or predictions based on 1-2 people with unity, pascal, scala, and other languages at the bottom.

But I would argue that even ~20-30 data points are insufficient for analysis. So, my personal choice from this data would be to drop everything except for the first four languages.

Here's one of the options of how to write it in Python.

main.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score
 
df = pd.read_csv('salaries-2023.csv')
 
allowed_languages = ['php', 'js', '.net', 'java']
df = df[df['language'].isin(allowed_languages)]
print(df.shape)
 
# Result: (596, 5)

Notice: in this case, I decided to override the same df variable, but maybe you want to come back to the original data later, so you may choose a different variable, like df_filtered or something.

As you can see, we now have 596 rows of data instead of 760.

Filter Cities

In the same way, let's take a look at the cities represented. In the list below, you will see the names of the cities in the Lithuanian language.

main.py

# ...
print(df.city.value_counts())

main.py

# Result:
city
Vilnius                     337
Kaunas                      155
Klaipėda                     11
Vilniuje                      9
JAV                           9
UK                            8
Kaune                         7
Šiauliai                      7
Vokietija                     6
vilnius                       6
Klaipeda                      4
Danija                        4
VILNIUS                       3
Panevėžys                     2
kaunas                        2
Mažeikiai                     2
Norvegija                     2
Olandija                      2
Šveicarija                    2
Marijampole                   1
Prancūzija                    1
Rumsiskes                     1
Birštonas                     1
Jonava                        1
Klaipėdoje                    1
Pietų Afrikos Respublika      1
KAUNAS                        1
Baltarusija                   1
Siauliai                      1
VILNIUJE                      1
Suomija                       1
Panevezys                     1
Alytus                        1
Skandinavija                  1
Tauragė                       1
Vilniua                       1
Vilniuj                       1
Name: count, dtype: int64

In this case, the leaders of the list are even more evident than in programming languages. The majority are from the two biggest cities in Lithuania: Vilnius and Kaunas. We could safely drop all the other values.

But not so fast.

I remind you that this was a survey question with a free-form text field and not a choice from a dropdown, so respondents could type (or mistype) whatever they wanted.

Some of them phrased the city names in another form of the word, made them uppercase, etc.

So, we need to "fix" some of the values that also actually mean "Vilnius" or "Kaunas".

You can perform the condition in many possible ways. I chose to just manually list the non-standard names I saw above:

main.py

# ...
 
print(df.describe())
 
allowed_languages = ['php', 'js', '.net', 'java']
df = df[df['language'].isin(allowed_languages)]
 
vilnius_names = ['Vilniuj', 'Vilniua', 'VILNIUJE', 'VILNIUS', 'vilnius', 'Vilniuje']
condition = df['city'].isin(vilnius_names)
df.loc[condition, 'city'] = 'Vilnius'
 
kaunas_names = ['KAUNAS', 'kaunas', 'Kaune']
condition = df['city'].isin(kaunas_names)
df.loc[condition, 'city'] = 'Kaunas'
 
print(df.city.value_counts())

And here's the new list:

main.py

Vilnius                     358
Kaunas                      165
Klaipėda                     11
JAV                           9
UK                            8
Šiauliai                      7
Vokietija                     6
Danija                        4
Klaipeda                      4
Norvegija                     2
Olandija                      2
Šveicarija                    2
Panevėžys                     2
Mažeikiai                     2
Birštonas                     1
Jonava                        1
Siauliai                      1
Pietų Afrikos Respublika      1
Baltarusija                   1
Prancūzija                    1
Marijampole                   1
Suomija                       1
Panevezys                     1
Alytus                        1
Tauragė                       1
Klaipėdoje                    1
Skandinavija                  1
Rumsiskes                     1
Name: count, dtype: int64

As you can see, Vilnius went from 337 to 358 values, and Kaunas went from 155 to 165. It's pretty significant if you ask me.

And now we can actually drop all the other cities, as they are not represented enough to have statistical significance.

main.py

# ...
 
print(df.city.value_counts())
 
allowed_cities = ['Vilnius', 'Kaunas']
df = df[df['city'].isin(allowed_cities)]
print(df.shape)
 
# Result: (523, 5)

In other words, now we're working with 523 data rows representing salaries of four programming languages in two cities.

In the next lesson, we will filter out some other values represented by so-called "outliers": the numbers that are significantly different and visually far from the "general" straight line.

Final File

If you felt lost in the code above, here's the final file with all the code from this lesson:

main.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
df = pd.read_csv('salaries-2023.csv')
 
print(df.head())
print(df.shape)
df.info()
print(df.describe())
 
allowed_languages = ['php', 'js', '.net', 'java']
df = df[df['language'].isin(allowed_languages)]
 
vilnius_names = ['Vilniuj', 'Vilniua', 'VILNIUJE', 'VILNIUS', 'vilnius', 'Vilniuje']
condition = df['city'].isin(vilnius_names)
df.loc[condition, 'city'] = 'Vilnius'
 
kaunas_names = ['KAUNAS', 'kaunas', 'Kaune']
condition = df['city'].isin(kaunas_names)
df.loc[condition, 'city'] = 'Kaunas'
 
print(df.city.value_counts())
 
allowed_cities = ['Vilnius', 'Kaunas']
df = df[df['city'].isin(allowed_cities)]
print(df.shape)