Learning pandas

This section is about learning how to work with pandas by using it without much guidance. You will play with the dataset of the Scottish Index of Multiple Deprivation (SIMD). Specifically, you will load a table containing the rank of each data zone according to the overall SIMD and according to its subdomains (check the links above if you would like to learn more).

Data Preparation

Before you start playing with the data, you need to figure out how to read the data. It is not a CSV file this time, but XLSX. pandas is capable of reading Excel files like this one. Instead of the read_csv you used before, you just need to use read_excel this time.

Take a second to download the dataset and check its structure in Excel before moving forward.

You have probably figured out that there are two sheets in this file, and you need to read the second one, called "SIMD 2020v2 ranks".

Let’s start by importing pandas.

import pandas as pd

Since you have the file in your Downloads folder now, you can read it from there. Or you can read it directly from gov.scot.

url = "https://www.gov.scot/binaries/content/documents/govscot/publications/statistics/2020/01/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/documents/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/scottish-index-of-multiple-deprivation-2020-ranks-and-domain-ranks/govscot%3Adocument/SIMD%2B2020v2%2B-%2Branks.xlsx"

To read the proper sheet, you need to specify sheet_name to read_excel. pandas will download the file for you and open it as a DataFrame.

simd = pd.read_excel(url, sheet_name="SIMD 2020v2 ranks")

Reading a local file

If you want to read the file from your downloads folder, you will need to pass its location. That can look like this if you are on macOS:

simd = pd.read_excel(
    "/Users/martin/Downloads/SIMD+2020v2+-+ranks.xlsx",
    sheet_name="SIMD 2020v2 ranks",
)

Or like this if you are on Windows. Note the r in front of the string, ensuring that the backlashes (\), common in Windows file paths, are properly interpreted. Or you can replace them with /:

simd = pd.read_excel(
    r"C:\Users\martin\Downloads\SIMD+2020v2+-+ranks.xlsx",
    sheet_name="SIMD 2020v2 ranks",
)

Alternatively, you can read the file from the data folder attached to this lecture.

simd = pd.read_excel(
    "https://martinfleischmann.net/sds/data_wrangling/data/SIMD+2020v2+-+ranks.xlsx",
    sheet_name="SIMD 2020v2 ranks",
)

And you should be good to go!

simd.head()

	Data_Zone	Intermediate_Zone	Council_area	Total_population	Working_age_population	SIMD2020v2_Rank	SIMD2020v2_Income_Domain_Rank	SIMD2020_Employment_Domain_Rank	SIMD2020_Health_Domain_Rank	SIMD2020_Education_Domain_Rank	SIMD2020_Access_Domain_Rank	SIMD2020_Crime_Domain_Rank	SIMD2020_Housing_Domain_Rank
0	S01006506	Culter	Aberdeen City	894	580	4691	3936.0	3220.0	5174	5887	4724	4664.0	3248.0
1	S01006507	Culter	Aberdeen City	793	470	4862	4829.0	4481.0	5051	4384	2148	4602.0	3486.0
2	S01006508	Culter	Aberdeen City	624	461	5686	4460.0	5110.0	5942	5915	4200	4563.5	5342.0
3	S01006509	Culter	Aberdeen City	537	307	4332	3481.0	3229.0	3871	6401	3982	5626.0	4394.5
4	S01006510	Culter	Aberdeen City	663	415	3913	3344.0	3448.0	3049	4092	5588	3885.0	3736.0

Learn about Scotland

In the previous chapter, you learned the basics of pandas. Now is the time to use this knowledge and learn something about Scotland. The SIMD rankings are similar in structure to Chicago Influenza rates you have seen before, but not too much. Each row is allocated to a single area, but the table does not follow the “tidy” principles.

Try answering the following questions about the dataset:

Which data zone is the most deprived (lowest ranking?) Which council area is it?
Can you add a column with a proportion of the working age population in each data zone?
Which council area is, on average, the least deprived based on the Income domain?
What is the total population of the “Glasgow City” council?
Can you plot the ranking distribution in “Glasgow City”? How does it compare to the “City of Edinburgh”?
Can you figure out how to create a scatter plot of a relationship between the education and health domain rankings?
What else can you derive from the data?