Fit that model
In this part, you will try to fit some regression models on your own.
Scottish Index of Multiple Deprivation yet again
You already know the Scottish Index of Multiple Deprivation (SIMD) from the exercises on pandas and on spatial autocorrelation. This time, you will use it to look at the impact of individual components on the overall deprivation ranking.
- Go back to the Does it correlate? exercise and load the dataset into your Jupyter Notebook. Do not filter for Glasgow this time; use the whole dataset.
- Let’s try to understand the effect of the proportion of youths entering university ("EduUniver"), crime rate ("CrimeRate"), hospital stays related to alcohol use ("HlthAlcSR"), hospital stays related to drug use ("HlthDrugSR") and mortality ("HlthSMR").
- Check the columns and ensure they are all float or int columns (with numbers, no text).
A few hints
You will need to remove some characters from strings and convert dtypes.
More hints
Check the .str accessor on a pandas.Series.
Even more hints
The .str.rstrip() method will be particularly useful.
Okay, here’s the code
This is how the pre-processing could look.
simd["EduUniver"] = simd["EduUniver"].str.rstrip('%').astype(float)
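If more than one column turns out to hold text, the same idea can be wrapped in a small helper and applied column by column. The sketch below runs on a toy table, not on the real SIMD file: the values, the "*" placeholder and the helper name `to_numeric_column` are assumptions made for illustration, so check what your own columns actually contain before reusing it.

```python
import pandas as pd

# Toy stand-in for the SIMD table. The values and the "%" / "*" residue are
# invented for illustration and may not match the real file.
simd = pd.DataFrame(
    {
        "EduUniver": ["10%", "8%", "*"],
        "CrimeRate": ["123", "*", "456"],
        "HlthSMR": [98, 110, 75],
    }
)


def to_numeric_column(series: pd.Series) -> pd.Series:
    """Strip a trailing '%' from a text column and coerce it to numbers."""
    if pd.api.types.is_numeric_dtype(series):
        # Already int or float, nothing to do.
        return series
    cleaned = series.astype(str).str.rstrip("%")
    # errors="coerce" turns anything that is still not a number into NaN.
    return pd.to_numeric(cleaned, errors="coerce")


columns = ["EduUniver", "CrimeRate", "HlthSMR"]
for column in columns:
    simd[column] = to_numeric_column(simd[column])

print(simd.dtypes)
```

Coercing to NaN hides the rows that could not be parsed, so it is worth counting the missing values per column afterwards and deciding whether to drop or fill them before fitting a model.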
- Create a standard OLS regression predicting "Rankv2" based on these 5 variables. What can you tell about them? How good is the model?
- Compare the prediction with the original data on a map. Can you spot the difference?
- Plot residuals. Is there a geographical pattern?
- Check for geographical patterns in residuals using the spatial autocorrelation analysis of your choice.
- Create another OLS model and include local authority "LAName" in the formula. Are there significant spatial fixed effects? Did the model improve? How much?
- Create geographically weighted regression using the set of variables from the first model. Use adaptive kernel with bandwidth=150 or figure out the optimal adaptive bandwidth yourself. Is the model better than those before?
- Explore GWR results. What is the distribution of local R2? Can you say anything about the significance of individual variables?