Learning pandas

This section is about learning how to work with pandas by using it without much guidance. You will play with the dataset of the Scottish Index of Multiple Deprivation (SIMD). Specifically, you will load a table containing the rank of each data zone according to the overall SIMD and according to its subdomains (check the links above if you would like to learn more).

Data Preparation

Before you start playing with the data, you need to figure out how to read the data. It is not a CSV file this time, but XLSX. pandas is capable of reading Excel files like this one. Instead of the read_csv you used before, you just need to use read_excel this time.

Take a second to download the dataset and check its structure in Excel before moving forward.

You have probably figured out that there are two sheets in this file, and you need to read the second one, called "SIMD 2020v2 ranks".

Let’s start by importing pandas.

import pandas as pd

Since you have the file in your Downloads folder now, you can read it from there. Or you can read it directly from gov.scot.

url = "https://www.gov.scot/binaries/content/documents/govscot/publications/statistics/2020/01/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/documents/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/govscot%3Adocument/SIMD%2B2020v2%2B-%2Branks.xlsx"

To read the proper sheet, you need to specify sheet_name to read_excel. pandas will download the file for you and open it as a DataFrame.

simd = pd.read_excel(url, sheet_name="SIMD 2020v2 ranks")

If you want to read the file from your downloads folder, you will need to pass its location. That can look like this if you are on macOS:

simd = pd.read_excel(
    "/Users/martin/Downloads/SIMD+2020v2+-+ranks.xlsx",
    sheet_name="SIMD 2020v2 ranks",
)

Or like this if you are on Windows. Note the r in front of the string, ensuring that the backlashes (\), common in Windows file paths, are properly interpreted. Or you can replace them with /:

simd = pd.read_excel(
    r"C:\Users\martin\Downloads\SIMD+2020v2+-+ranks.xlsx",
    sheet_name="SIMD 2020v2 ranks",
)

Alternatively, you can read the file from the data folder attached to this lecture.

simd = pd.read_excel(
    "https://martinfleischmann.net/sds/data_wrangling/data/SIMD+2020v2+-+ranks.xlsx",
    sheet_name="SIMD 2020v2 ranks",
)

And you should be good to go!

simd.head()
Data_Zone Intermediate_Zone Council_area Total_population Working_age_population SIMD2020v2_Rank SIMD2020v2_Income_Domain_Rank SIMD2020_Employment_Domain_Rank SIMD2020_Health_Domain_Rank SIMD2020_Education_Domain_Rank SIMD2020_Access_Domain_Rank SIMD2020_Crime_Domain_Rank SIMD2020_Housing_Domain_Rank
0 S01006506 Culter Aberdeen City 894 580 4691 3936.0 3220.0 5174 5887 4724 4664.0 3248.0
1 S01006507 Culter Aberdeen City 793 470 4862 4829.0 4481.0 5051 4384 2148 4602.0 3486.0
2 S01006508 Culter Aberdeen City 624 461 5686 4460.0 5110.0 5942 5915 4200 4563.5 5342.0
3 S01006509 Culter Aberdeen City 537 307 4332 3481.0 3229.0 3871 6401 3982 5626.0 4394.5
4 S01006510 Culter Aberdeen City 663 415 3913 3344.0 3448.0 3049 4092 5588 3885.0 3736.0

Learn about Scotland

In the previous chapter, you learned the basics of pandas. Now is the time to use this knowledge and learn something about Scotland. The SIMD rankings are similar in structure to Chicago Influenza rates you have seen before, but not too much. Each row is allocated to a single area, but the table does not follow the “tidy” principles.

Try answering the following questions about the dataset:

  • Which data zone is the most deprived (lowest ranking?) Which council area is it?
  • Can you add a column with a proportion of the working age population in each data zone?
  • Which council area is, on average, the least deprived based on the Income domain?
  • What is the total population of the “Glasgow City” council?
  • Can you plot the ranking distribution in “Glasgow City”? How does it compare to the “City of Edinburgh”?
  • Can you figure out how to create a scatter plot of a relationship between the education and health domain rankings?
  • What else can you derive from the data?