import pandas as pd
Learning pandas
This section is about learning how to work with pandas
by using it without much guidance. You will play with the dataset of the Scottish Index of Multiple Deprivation (SIMD). Specifically, you will load a table containing the rank of each data zone according to the overall SIMD and according to its subdomains (check the links above if you would like to learn more).
Data Preparation
Before you start playing with the data, you need to figure out how to read the data. It is not a CSV file this time, but XLSX. pandas
is capable of reading Excel files like this one. Instead of the read_csv
you used before, you just need to use read_excel
this time.
Take a second to download the dataset and check its structure in Excel before moving forward.
You have probably figured out that there are two sheets in this file, and you need to read the second one, called "SIMD 2020v2 ranks"
.
Let’s start by importing pandas
.
Since you have the file in your Downloads folder now, you can read it from there. Or you can read it directly from gov.scot
.
= "https://www.gov.scot/binaries/content/documents/govscot/publications/statistics/2020/01/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/documents/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/govscot%3Adocument/SIMD%2B2020v2%2B-%2Branks.xlsx" url
To read the proper sheet, you need to specify sheet_name
to read_excel
. pandas
will download the file for you and open it as a DataFrame
.
= pd.read_excel(url, sheet_name="SIMD 2020v2 ranks") simd
If you want to read the file from your downloads folder, you will need to pass its location. That can look like this if you are on macOS:
= pd.read_excel(
simd "/Users/martin/Downloads/SIMD+2020v2+-+ranks.xlsx",
="SIMD 2020v2 ranks",
sheet_name )
Or like this if you are on Windows. Note the r
in front of the string, ensuring that the backlashes (\
), common in Windows file paths, are properly interpreted. Or you can replace them with /
:
= pd.read_excel(
simd r"C:\Users\martin\Downloads\SIMD+2020v2+-+ranks.xlsx",
="SIMD 2020v2 ranks",
sheet_name )
Alternatively, you can read the file from the data
folder attached to this lecture.
= pd.read_excel(
simd "https://martinfleischmann.net/sds/data_wrangling/data/SIMD+2020v2+-+ranks.xlsx",
="SIMD 2020v2 ranks",
sheet_name )
And you should be good to go!
simd.head()
Data_Zone | Intermediate_Zone | Council_area | Total_population | Working_age_population | SIMD2020v2_Rank | SIMD2020v2_Income_Domain_Rank | SIMD2020_Employment_Domain_Rank | SIMD2020_Health_Domain_Rank | SIMD2020_Education_Domain_Rank | SIMD2020_Access_Domain_Rank | SIMD2020_Crime_Domain_Rank | SIMD2020_Housing_Domain_Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | S01006506 | Culter | Aberdeen City | 894 | 580 | 4691 | 3936.0 | 3220.0 | 5174 | 5887 | 4724 | 4664.0 | 3248.0 |
1 | S01006507 | Culter | Aberdeen City | 793 | 470 | 4862 | 4829.0 | 4481.0 | 5051 | 4384 | 2148 | 4602.0 | 3486.0 |
2 | S01006508 | Culter | Aberdeen City | 624 | 461 | 5686 | 4460.0 | 5110.0 | 5942 | 5915 | 4200 | 4563.5 | 5342.0 |
3 | S01006509 | Culter | Aberdeen City | 537 | 307 | 4332 | 3481.0 | 3229.0 | 3871 | 6401 | 3982 | 5626.0 | 4394.5 |
4 | S01006510 | Culter | Aberdeen City | 663 | 415 | 3913 | 3344.0 | 3448.0 | 3049 | 4092 | 5588 | 3885.0 | 3736.0 |
Learn about Scotland
In the previous chapter, you learned the basics of pandas
. Now is the time to use this knowledge and learn something about Scotland. The SIMD rankings are similar in structure to Chicago Influenza rates you have seen before, but not too much. Each row is allocated to a single area, but the table does not follow the “tidy” principles.
Try answering the following questions about the dataset:
- Which data zone is the most deprived (lowest ranking?) Which council area is it?
- Can you add a column with a proportion of the working age population in each data zone?
- Which council area is, on average, the least deprived based on the Income domain?
- What is the total population of the “Glasgow City” council?
- Can you plot the ranking distribution in “Glasgow City”? How does it compare to the “City of Edinburgh”?
- Can you figure out how to create a scatter plot of a relationship between the education and health domain rankings?
- What else can you derive from the data?