# Assignment

This course is assessed based on work during the semester and a final assignment: a computational essay.
## Structure of the final mark

The final mark is derived from the total number of points collected during the semester and from the final assignment. You can get up to 24 points during the semester for submitted exercises and up to 76 points for the final assignment.
## Exercises

Each lesson contains an exercise to practise the topic covered during the class. There will be some time to work on it in class, but you are not expected to finish it there. For each exercise submitted within one week, students will receive 2 points. Correctness is not verified; we just want to see that you have put some effort into it. Submissions need to happen via GitHub Classroom.
## Computational essay

A computational essay is an essay whose narrative is supported by code and its results, which form part of the essay. Think of a Jupyter Notebook combining Markdown cells that explain the process and interpret its results with executed code cells that do the computation.
One nice example of a computational essay is the Age Capsule by Dani Arribas-Bel. The code in there is a bit more advanced than what you are asked to do, and you will need to include some maps and possibly tables, but you get the gist.
The essay corresponds to a range of 2,500-5,000 words. That does not mean you have to write that many words. Since you will produce not only text (in English or Czech) but also code and its outputs, the following requirements are specified:

- The approximate number of words in Markdown cells is 1,500 (the bibliography, if provided, does not count towards the word count). Try to stay within a 20% margin.
- The approximate number of maps or other graphic outputs is 5 (one output may contain more than one map and will count as one, but all maps must be part of the same `matplotlib` figure; see the sketch below).
- The approximate number of tables is 2, where a table is considered an output of a `DataFrame` automatically rendered by the Notebook.
The rest of the word count is assumed to be consumed by code.
Treat these as guidelines, but do not hesitate to deviate if you think it will help the narrative.
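As an illustration of the map requirement, here is a minimal sketch of two maps combined into a single `matplotlib` figure, which counts as one graphic output (the file and column names are hypothetical):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical dataset of census areas with two variables
gdf = gpd.read_file("census_areas.gpkg")

# Two maps as subplots of the same figure
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
gdf.plot(column="pop_density", legend=True, ax=axs[0])
axs[0].set_title("Population density")
gdf.plot(column="mean_age", legend=True, ax=axs[1])
axs[1].set_title("Mean age")
plt.show()
```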
You have two options regarding the topic. The first is to work with your own data on a topic of your choice, while the second is a semi-defined task, if you prefer that.
### Define your task
Option one is to come up with your own idea for an essay, supported by data you either already have or can gather from openly available sources.
The requirement is to cover:
- Initial data exploration and visualisation[^1]
- Exploration of the degree of randomness in the data (e.g. spatial autocorrelation, point pattern analysis)[^2]
- At least one other technique of your choice (clustering, interpolation, regression, prediction)[^3]
Consult the second option to get a better sense of the extent.
If you decide to define your own task, the topic and the extent need to be approved by the tutor. Essays on custom topics without prior approval will not be accepted and will be marked with 0%. Reach out via Discord or email to get approval.
This option is recommended for students of Master's and Postgraduate courses enrolled in MZ340V17.
The second option is more defined but still leaves some space for your creativity.
### A Barcelona case
You will take the role of a real-world data scientist tasked with exploring a dataset on the city of Barcelona (Spain) and finding useful insights for a variety of decision-makers. It does not matter if you have never been to Barcelona. In fact, this will help you focus on what you can learn about the city through the data, without the influence of prior knowledge. Furthermore, the essay will not be marked based on how much you know about Barcelona but on how much you can show you have learned through analysing the data.
#### Part one
In the first part, you are asked to provide an overview of the socio-economic structure of Barcelona.
##### Data
Head to the Open Data BCN data service of Barcelona's City Hall and download data reflecting two aspects (two variables) of the population structure of the city at the level of census areas (*Secció censal* in Catalan), find the relevant geometry, and link them together.
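A minimal sketch of this step, assuming hypothetical file names and column labels (the actual Open Data BCN downloads will differ):

```python
import geopandas as gpd
import pandas as pd

# Tabular variables downloaded from Open Data BCN (hypothetical file name)
population = pd.read_csv("population_by_census_area.csv")

# Geometry of the census areas (hypothetical file name)
areas = gpd.read_file("seccio_censal.geojson")

# Link the table to the geometry on a shared census-area identifier
# (hypothetical column names)
gdf = areas.merge(population, left_on="SECCIO", right_on="seccio_censal")
```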
- Explore the spatial distribution of the data using choropleths. Comment on the details of your maps and interpret the results.
- Explore the degree of spatial autocorrelation. Describe the concepts behind your approach and interpret your results.
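Both steps rely on tools covered in the course. A minimal sketch, assuming the `gdf` GeoDataFrame from the sketch above with a hypothetical `"pop_density"` column:

```python
import esda
import matplotlib.pyplot as plt
from libpysal import weights

# Choropleth with quantile classification (requires mapclassify)
gdf.plot(column="pop_density", scheme="quantiles", k=5, legend=True)
plt.show()

# Queen contiguity spatial weights, row-standardised
w = weights.Queen.from_dataframe(gdf)
w.transform = "r"

# Global Moran's I with a permutation-based pseudo p-value
moran = esda.Moran(gdf["pop_density"], w)
print(moran.I, moran.p_sim)
```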
#### Part two
For this one, you need to pick one of the following three options.
- Create a classification (clustering) of Barcelona based on your socio-economic data and interpret the results (see the sketch after this list). In the process, answer the following questions:
    - What are the main types of neighbourhoods you identify?
    - Which characteristics help you delineate this typology?
    - If you had to use this classification to target the areas most in need, how would you use it? Why?
    - How is the city partitioned by your data?
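A minimal sketch of one possible approach, assuming the linked socio-economic GeoDataFrame `gdf` from part one with hypothetical variable columns:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

variables = ["pop_density", "mean_age"]  # hypothetical column names

# Standardise so that variables on different scales contribute equally
scaled = StandardScaler().fit_transform(gdf[variables])

# K-Means with an illustrative choice of five clusters
gdf["cluster"] = KMeans(n_clusters=5, random_state=0).fit_predict(scaled)

# Map the resulting typology
gdf.plot(column="cluster", categorical=True, legend=True)
```

The number of clusters is a modelling choice you are expected to justify in the text.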
The other two options share the basics:
- Download listings for Barcelona from Inside Airbnb. You have already used Airbnb data in the course, so you can refer to the code used there.
- Barcelona is known for its issue with Airbnb density. Visualise the data appropriately and discuss why you have taken your specific approach.
- Assess the distribution of Airbnbs in Barcelona (see the sketch after this list).
    - Are the Airbnb listings distributed equally across the city? Does it depend on the type of listing or its price?
    - Can you create a regionalisation of Barcelona census areas based on the presence of Airbnbs? What does it say about the city?
- Assess the relationship between the socio-economic profile of Barcelona and the presence of Airbnb.
    - Use regression techniques to assess the link between the socio-economic data from part one and a variable of your choice from the Airbnb dataset. Think of the density of listings or the average price (see the sketch after this list).
    - Discuss the implications of the results. What does it mean for policy?
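A minimal sketch connecting the two options, assuming the `areas` GeoDataFrame from the part one sketch and hypothetical socio-economic column names (Inside Airbnb's `listings.csv` does include `longitude` and `latitude` columns):

```python
import geopandas as gpd
import pandas as pd
import statsmodels.formula.api as smf

# Turn the Inside Airbnb listings into a point GeoDataFrame
listings = pd.read_csv("listings.csv")
points = gpd.GeoDataFrame(
    listings,
    geometry=gpd.points_from_xy(listings["longitude"], listings["latitude"]),
    crs="EPSG:4326",
)

# Count listings per census area using a spatial join
joined = gpd.sjoin(points, areas.to_crs("EPSG:4326"), predicate="within")
areas["n_airbnb"] = (
    joined.groupby("index_right").size().reindex(areas.index).fillna(0)
)

# Regress the count on socio-economic variables (hypothetical column names)
model = smf.ols("n_airbnb ~ pop_density + mean_age", data=areas).fit()
print(model.summary())
```

The count per area is only one possible dependent variable; density of listings or average price would follow the same pattern.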
## Submission
The submission will contain an executed Jupyter Notebook. The code needs to be reproducible. That means that all the data used in the essay need to be available online (ideally fetched directly from the notebook, but a link to a download page is also fine, although data manipulation outside of the Notebook is not allowed) or shared as part of the submission. Any additional Python packages apart from those available in the provided `sds` environment need to be explicitly specified at the top of the notebook. However, it is not expected that you will need any.
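For example, one common way of declaring an extra dependency is a first cell with an install command (the package name here is purely illustrative):

```python
# Install an additional package into the current environment
# (only needed if you go beyond the provided sds environment)
%pip install contextily
```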
## Evaluation criteria
The essay is primarily evaluated on a point scale of 0-76:
- 0-14: the code does not work and there is no documentation for it.
- 15-24: the code does not work or works but does not lead to the expected result. There is some documentation explaining its logic.
- 25-34: the code runs and produces the expected output. There is some documentation explaining its logic.
- 35-44: the code runs and produces the expected output. There is extensive documentation explaining its logic.
- 45-54: the code runs and produces the expected output. There is extensive documentation, properly formatted, explaining its logic.
- 55-64: everything as above, plus the code design includes clear evidence of skills presented in advanced parts of the course (e.g., custom methods, list comprehension, etc.).
- 65-76: everything as above, plus the code includes new knowledge that extends/improves the functionality provided to the student (e.g., algorithm optimization, new methods to perform the task, etc.).
Additional point deductions may happen in case of issues in the code or text. Note that the documentation should be formatted using Markdown syntax, not HTML.
To successfully complete the class, a minimum of 40 points in total is required. The expected relationship between the score and the final mark (any potential change will be discussed in advance):
- 0-39: 4
- 40-49: 3
- 50-69: 2
- 70-100: 1
## Use of generative AI
You are free to use any tool available to you. Please keep in mind the outcome of the following experiment described by Esther Shein:
> Last fall, Eric Klopfer decided to conduct an experiment in his undergraduate computer science class at the Massachusetts Institute of Technology (MIT). He divided the class into three groups and gave them a programming task to solve in the Fortran language, which none of them knew.
>
> One group was allowed to use ChatGPT to solve the problem, the second group was told to use Meta's Code Llama large language model (LLM), and the third group could only use Google. The group that used ChatGPT, predictably, solved the problem quickest, while it took the second group longer to solve it. It took the group using Google even longer, because they had to break the task down into components.
>
> Then, the students were tested on how they solved the problem from memory, and the tables turned. The ChatGPT group "remembered nothing, and they all failed," recalled Klopfer, a professor and director of the MIT Scheller Teacher Education Program and The Education Arcade.
>
> Meanwhile, half of the Code Llama group passed the test. The group that used Google? Every student passed.
## Acknowledgements
The assignment structure is partially derived from A Course on Geographic Data Science by Arribas-Bel (2019), licensed under CC-BY-SA 4.0.
## References

Arribas-Bel, Dani. 2019. *A Course on Geographic Data Science*.
[^1]: Based on sessions Introduction - Spatial data.
[^2]: Based on sessions Spatial weights - Point patterns.
[^3]: Based on sessions Clustering - Machine learning.