Assignment - Blog Post 1

In this blog post, you’ll create several interesting, interactive data graphics using the NOAA climate data that we’ve explored in the first several weeks of lectures.

General Requirements

Your post should include not only code but also outputs (i.e. tables, figures) and expository writing that explains what you’re doing. Your target audience is a a student who has completed PIC16A but hasn’t taken PIC16B yet (i.e. you before the start of the quarter). The details of what you should do are in the “Prompt” section.

Throughout, you should:

Optimize code concision by minimizing repetition. Repeated operations should be enclosed in functions.
Minimize for-loops by making full use of vectorized operations for Numpy arrays and Pandas data frames.
Supply helpful comments throughout your code, and docstrings for any functions and classes you define.

Submit your blog post (both draft and revised) by submitting a URL to the post on CCLE.

Draft: submit to the Draft Blog Post 1 workshop activity.
Revised: submit to the Blog Post 1 (Revised) assignment.

Grading

Generally speaking, it’s ok for your draft to be a little rough around the edges, and maybe you haven’t fully optimized your code or made your plots maximally shiny yet. That’s ok! The important part is:

Make enough progress that you can benefit from your peers’ feedback.

This means that you should have “decent” versions of all the required components by the time you’re submitting your draft. This is enough for full credit on the draft stage. In the revised submission, I’ll also be grading on code quality, style, and expository writing.

Prompt

§1. Create a Database

First, create a database with three tables: temperatures, stations, and countries. Information on how to access country names and relate them to temperature readings is in this lecture. Rather than merging, as we did in the linked lecture, you should keep these as three separate tables in your database.

Make sure to close the database connection after you are finished constructing it.

§2. Write a Query Function

Write a function called query_climate_database() which accepts four arguments:

country, a string giving the name of a country for which data should be returned.
year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
month, an integer giving the month of the year for which should be returned.

The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:

The station name.
The latitude of the station.
The longitude of the station.
The name of the country in which the station is located.
The year in which the reading was taken.
The month in which the reading was taken.
The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already averages by month, so you don’t have to do any aggregation at this stage.)

For example:

query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp
0	PBO_ANANTAPUR	14.5830	77.6330	India	1973	1	24.97
1	PBO_ANANTAPUR	14.5830	77.6330	India	1974	1	24.16
2	PBO_ANANTAPUR	14.5830	77.6330	India	1975	1	23.22
3	PBO_ANANTAPUR	14.5830	77.6330	India	1977	1	24.00
4	PBO_ANANTAPUR	14.5830	77.6330	India	1978	1	24.70
...	...	...	...	...	...	...	...
7361	DIU	20.7167	70.9167	India	1955	1	22.75
7362	DIU	20.7167	70.9167	India	1956	1	22.50
7363	DIU	20.7167	70.9167	India	1957	1	22.20
7364	DIU	20.7167	70.9167	India	1958	1	22.85
7365	DIU	20.7167	70.9167	India	1960	1	21.70

7366 rows × 7 columns

§3. Write a Geographic Scatter Function for Yearly Temperature Increases

In this part, you will write a function to create visualizations that address the following question:

How does the average yearly change in temperature vary within a given country?

Write a function called temperature_coefficient_plot(). This function should accept five explicit arguments, and an undetermined number of keyword arguments.

country, year_begin, year_end, and month should be as in the previous part.
min_obs, the minimum required number of years of data for any given station. Only data for stations with at least min_obs years worth of data in the specified month should be plotted; the others should be filtered out. df.transform() plus filtering is a good way to achieve this task.
**kwargs, additional keyword arguments passed to px.scatter_mapbox(). These can be used to control the colormap used, the mapbox style, etc.

The output of this function should be an interactive geographic scatterplot, constructed using Plotly Express, with a point for each station, such that the color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station. A reasonable way to do this is to compute the first coefficient of a linear regression model at that station, as illustrated in these lecture notes.

For example, after writing your function, you should be able to create a plot of estimated yearly increases in temperature during the month of January, in the interval 1980-2020, in India, as follows:

# assumes you have imported necessary packages
color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

Please pay attention to the following details:

The station name is shown when you hover over the corresponding point on the map.
The estimates shown in the hover are rounded to a sober number of significant figures.
The colorbar and overall plot have professional titles.

It’s not necessary for your plot to look exactly like mine, but please attend to details such as these. Feel free to be creative about these labels, as well as the choice of colors, as long as your result is polished overall.

You are free (and indeed encouraged) to define additional functions as needed.

§4. Create Two More Interesting Figures

Create at least two more complex and interesting interactive data visualizations using the same data set. In each case, you should construct your visualization from data obtained by querying the database that you created in §1. The code to construct each visualization should be wrapped in functions, such that a user could create visualizations for different parts of the data by calling these functions with different arguments.

Alongside the plots, you should clearly state a question that the plot addresses, similar to the question that we posed in §3. The questions for your two additional plots should be meaningfully different from each other and from the §3 question. You will likely want to define different query functions for extracting data for these new visualizations.

It is not necessary to create geographic plots for this part. Scatterplots, histograms, and line plots (among other choices) are all appropriate. Please make sure that they are complex, engaging, professional, and targeted to the questions you posed. In other words, push yourself! Don’t hesitate to ask your peers or talk to me if you’re having trouble coming up with questions or identifying plots that might be suitable for addressing those questions.

Written on March 31, 2021