Plant a forest

Hedonic prices in Busan

You will investigate the association between green amenities and property prices in a metropolitan city. The dataset was obtained from https://zenodo.org/records/11092589. The provided data file contains a total of 2.404 observations with 18 variables, retrieved from the Korea Transport Database, Statistics Korea, Ministry of Land, Infrastructure and Transport, and the Spatial Information Portal. Specifically, the green index is quantified using a large amount of Google street view images, obtained from the download tool (https://svd360.istreetview.com/).

Here is the URL for the CSV you should use.

url = "https://martinfleischmann.net/sds/tree_regression/data/prices_data.csv"

Load the data and convert it to a geodataframe, make sure you have the correct local projection.

Hint

You can get the local UTM zone.

GeoDataFrame.estimate_utm_crs()

The target variable is ‘Property Prices’. Choose at least five independent variables that you think will be good predictors of the target variable.
Prepare you data for modeling - split it into training and testing sets.
Try to fit a linear regression model and compare it with a tree based model of your choice, is there a difference?
Cross validate your model and evaluate its performance. Plot the residuals.
Perform spatial evaluation using the "hex_id".
Compare blocked spatial metrics based on "hex_id" with the smoothed spatial metrics using graph.
Check the spread of the asummed auotocorrelation using variogram.
Perform LISA on the residuals - locate underpriced and overpriced properties, and clusters of consistently correct and consistently wrong locations.
Build a model which avoids spatial leakage using the "hex_id" and tune its hyperparameters. Is the performace different from the previous models? Which features are important in this model?