Lesson 9: Filters by City and Language
Next, we will use the pandas
library to show us the data in various shapes and identify some data that we can remove as insignificant.
Filter Programming Languages
First, what different programming languages do we have, and how many of each?
main.py
print(df.language.value_counts())
main.py
# Result:languagephp 217js 181.net 105java 93python 37go 26c++ 24node 19kotlin 15ruby 14swift 6rust 4flutter 4sql 2sap 2scala 2salesforce 1vue 1blockchain 1devops 1progress 1openedge 1pascal 1uipath 1unity 1Name: count, dtype: int64
Ok, so the distribution is very uneven. Only three languages are represented by more than 100 people.
So, this is when you must decide where to "draw the line".
It's pretty obvious that we can't make any statistical conclusions or predictions based on 1-2 people with unity
, pascal
, scala
, and other languages at the bottom.
But I would argue that even ~20-30 data points are insufficient for analysis. So, my personal choice from this data would be to drop everything except for the first four languages.
Here's one of the options of how to write it in Python.
main.py
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsfrom sklearn.metrics import r2_score df = pd.read_csv('salaries-2023.csv') allowed_languages = ['php', 'js', '.net', 'java']df = df[df['language'].isin(allowed_languages)]print(df.shape) # Result: (596, 5)
Notice: in this case, I decided to override the same df
variable, but maybe you want to come back to the original data later, so you may choose a different variable, like df_filtered
or something.
As you can see, we now have 596 rows of data instead of 760.
Filter Cities
In the same way, let's take a look at the cities represented. In the list below, you will see the names of the cities in the Lithuanian language.
main.py
# ...print(df.city.value_counts())
main.py
# Result:cityVilnius 337Kaunas 155Klaipėda 11Vilniuje 9JAV 9UK 8Kaune 7Šiauliai 7Vokietija 6vilnius 6Klaipeda 4Danija 4VILNIUS 3Panevėžys 2kaunas 2Mažeikiai 2Norvegija 2Olandija 2Šveicarija 2Marijampole 1Prancūzija 1Rumsiskes 1Birštonas 1Jonava 1Klaipėdoje 1Pietų Afrikos Respublika 1KAUNAS 1Baltarusija 1Siauliai 1VILNIUJE 1Suomija 1Panevezys 1Alytus 1Skandinavija 1Tauragė 1Vilniua 1Vilniuj 1Name: count, dtype: int64
In this case, the leaders of the list are even more evident than in programming languages. The majority are from the two biggest cities in Lithuania: Vilnius
and Kaunas
. We could safely drop all the other values.
But not so fast.
I remind you that this was a survey question with a free-form text field and not a choice from a dropdown, so respondents could type (or mistype) whatever they wanted.
Some of them phrased the city names in another form of the word, made them uppercase, etc.
So, we need to "fix" some of the values that also actually mean "Vilnius" or "Kaunas".
You can perform the condition in many possible ways. I chose to just manually list the non-standard names I saw above:
main.py
# ... print(df.describe()) allowed_languages = ['php', 'js', '.net', 'java']df = df[df['language'].isin(allowed_languages)] vilnius_names = ['Vilniuj', 'Vilniua', 'VILNIUJE', 'VILNIUS', 'vilnius', 'Vilniuje']condition = df['city'].isin(vilnius_names)df.loc[condition, 'city'] = 'Vilnius' kaunas_names = ['KAUNAS', 'kaunas', 'Kaune']condition = df['city'].isin(kaunas_names)df.loc[condition, 'city'] = 'Kaunas' print(df.city.value_counts())
And here's the new list:
main.py
Vilnius 358Kaunas 165Klaipėda 11JAV 9UK 8Šiauliai 7Vokietija 6Danija 4Klaipeda 4Norvegija 2Olandija 2Šveicarija 2Panevėžys 2Mažeikiai 2Birštonas 1Jonava 1Siauliai 1Pietų Afrikos Respublika 1Baltarusija 1Prancūzija 1Marijampole 1Suomija 1Panevezys 1Alytus 1Tauragė 1Klaipėdoje 1Skandinavija 1Rumsiskes 1Name: count, dtype: int64
As you can see, Vilnius
went from 337 to 358 values, and Kaunas
went from 155 to 165. It's pretty significant if you ask me.
And now we can actually drop all the other cities, as they are not represented enough to have statistical significance.
main.py
# ... print(df.city.value_counts()) allowed_cities = ['Vilnius', 'Kaunas']df = df[df['city'].isin(allowed_cities)]print(df.shape) # Result: (523, 5)
In other words, now we're working with 523 data rows representing salaries of four programming languages in two cities.
In the next lesson, we will filter out some other values represented by so-called "outliers": the numbers that are significantly different and visually far from the "general" straight line.
Final File
If you felt lost in the code above, here's the final file with all the code from this lesson:
main.py
import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as sns df = pd.read_csv('salaries-2023.csv') print(df.head())print(df.shape)df.info()print(df.describe()) allowed_languages = ['php', 'js', '.net', 'java']df = df[df['language'].isin(allowed_languages)] vilnius_names = ['Vilniuj', 'Vilniua', 'VILNIUJE', 'VILNIUS', 'vilnius', 'Vilniuje']condition = df['city'].isin(vilnius_names)df.loc[condition, 'city'] = 'Vilnius' kaunas_names = ['KAUNAS', 'kaunas', 'Kaune']condition = df['city'].isin(kaunas_names)df.loc[condition, 'city'] = 'Kaunas' print(df.city.value_counts()) allowed_cities = ['Vilnius', 'Kaunas']df = df[df['city'].isin(allowed_cities)]print(df.shape)
- Intro
- Example 1: More Simple
- Example 2: More Complex
Well explained