Blog 1

Goal: climate data with pandas and SQLite3

1. Create a Database

Goal:

  • Create a database with three tables: temperatures, stations, and countries.
  • Keep these as three separate tables in your database.
  • Make sure to close the database connection after constructing it.
  1. We first import all data packages we need
  2. We then load use sqlite3 to connect to a new database
    • note: if the database does not exist, it will create a new database in the current directory with the name that we specified
import sqlite3
import pandas as pd
import numpy as np
from plotly import express as px
from sklearn.linear_model import LinearRegression
import calendar
conn = sqlite3.connect("temps.db") 
  1. We then load in every tables we have into the database by using to_sql method
countries = pd.read_csv("https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv")
# new table: countries
countries = countries.rename(columns = {"FIPS 10-4"  : "FIPS"})
countries.to_sql("countries", conn, if_exists = "replace", index = False)
countries.head()
C:\Users\ymf\anaconda3\envs\PIC16B\lib\site-packages\pandas\core\generic.py:2789: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  method=method,
FIPS ISO 3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
# new table stations
stations.to_sql("stations", conn, if_exists = "replace", index = False)
stations.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR
  1. since the temperatures data file contains all the temperatures of 12 months in a row which is not efficient for us to do data analysis, we first do some data cleaning and revising.
def prepare_df(df):
    df["FIPS"] = df["ID"].str[0:2] #Creates a new column corresponding to the station's country's code
    df = df.set_index(keys=["ID", "Year", "FIPS"]) #Sets our indexing
    df = df.stack() #Creates a multi-level index dependent on the above
    df = df.reset_index() #Reverts to normal column indexing in the same order as above
    df = df.rename(columns = {"level_3"  : "Month" , 0 : "Temp"}) #Renames column appropriately
    df["Month"] = df["Month"].str[5:].astype(int) #Change the month column to reflect the integer value of the month
    df["Temp"]  = df["Temp"] / 100 #Converts from a hundreths of a Celsius to whole Celsius
    return(df) #Returns our dataframe

Because the temps.csv file is so large that sometimes the computer will take a long time to load this file(sometimes it would crush just like on my computer). So we can specify the chunksize parameter in the read_csv method. It will load in the data as a iterator instead of dataframe. This method will be more convenient to load in the data.

df_iter = pd.read_csv("temps.csv", chunksize = 100000)
for df in df_iter:
    df = prepare_df(df)
    df.to_sql("temperatures", conn, if_exists = "append", index = False)
df.head()
ID Year FIPS Month Temp
0 USW00014924 2016 US 1 -13.69
1 USW00014924 2016 US 2 -8.40
2 USW00014924 2016 US 3 -0.20
3 USW00014924 2016 US 4 3.21
4 USW00014924 2016 US 5 13.85
  1. In order to access our database, we use Cursor on our Connection. The cursor interacts with our database and executes the SQL commands. The code below use cursor and SQL commands to see what tables populate the database.Now we can see that the databased has three seperate tables named countries, stations, temperatures as required.
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
#fetchall() returns a list containing all the items returned
[('countries',), ('stations',), ('temperatures',)]

2. Write a Query Function

query_climate_database() which accepts four arguments:

  • country, a string giving the name of a country for which data should be returned.
  • year_begin and year_end, two integers giving the earliest and latest years for which should be returned.
  • month, an integer giving the month of the year for which should be returned.

The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:

  • The station name.
  • The latitude of the station.
  • The longitude of the station.
  • The name of the country in which the station is located.
  • The year in which the reading was taken.
  • The month in which the reading was taken.
  • The average temperature at the specified station during the specified year and month.

Since using SQL commands would be convenient to select data columns that we want from different tables, we first introduce the SQL commands:

  • SELECT : what columns are to be displayed
  • FROM : from what table are we looking at
  • LEFT JOIN : Coorresponds data from matching values in table
  • WHERE : specify restrictions to data shown

Note: Since we declare requirements in the function arguments, we must pass our desired parameters to the commands through param into the read_sql_query() method.

def query_climate_database(country, year_begin, year_end, month):
    conn = sqlite3.connect("temps.db") #Connects to our database
    cmd = \
    """
    SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C on T.fips = C.fips
    WHERE C.name = ? AND T.year >= ? AND T.year <= ? AND T.month = ?
    """
    param = (country, year_begin, year_end, month,) 
    df = pd.read_sql_query(cmd, conn, params = param)
    conn.close() #we must close connection to database to prevent data corruption
    
    return df

Now that we have specified the function, we can now assign values to each argument and see if the result contains all the information as desired by the arguments.

query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
NAME LATITUDE LONGITUDE Name Year Month Temp
0 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
1 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
2 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
3 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
4 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
3147 DARJEELING 27.050 88.270 India 1983 1 5.10
3148 DARJEELING 27.050 88.270 India 1986 1 6.90
3149 DARJEELING 27.050 88.270 India 1994 1 8.10
3150 DARJEELING 27.050 88.270 India 1995 1 5.60
3151 DARJEELING 27.050 88.270 India 1997 1 5.70

3152 rows × 7 columns

3. Write a Geographic Scatter Function for Yearly Temperature Increases

Goal: Write a function to address this questions:How does the average yearly change in temperature vary within a given country? Write a function called temperature_coefficient_plot(). This function should accept five explicit arguments, and an undetermined number of keyword arguments.

  • country, year_begin, year_end, and month should be as in the previous part.
  • min_obs, the minimum required number of years of data for any given station. Only data for stations with at least min_obs years worth of data in the specified month should be plotted; the others should be filtered out. df.transform() plus filtering is a good way to achieve this task.
  • **kwargs, additional keyword arguments passed to px.scatter_mapbox(). These can be used to control the colormap used, the mapbox style, etc.

The output of this function should be an interactive geographic scatterplot, constructed using Plotly Express, with a point for each station, such that the color of the point reflects an estimate of the yearly change in temperature during the specified month and time period at that station.

We need to pay attention to the following details:

  • The station name is shown when you hover over the corresponding point on the map.
  • The estimates shown in the hover are rounded to a sober number of significant figures.
  • The colorbar and overall plot have professional titles.

Below is a complementary function to answer the question:

# this coef function computes the first coefficient of a linear regression model at each station
# the result helps to determine the color of the point
# the color reflects an estimate of the yearly change in temperature during the specified month and time period at that station
def coef(data_group):
    x = data_group[["Year"]]
    y = data_group["Temp"]  
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    df = query_climate_database(country, year_begin, year_end, month)
        
    count_year = df.groupby(["NAME"])["Year"].transform(len)
    df = df[count_year >= min_obs]

    coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef)
    coefs = coefs.reset_index()
    coefs = coefs.rename(columns = {0 : "Estimated Yearly Increase (°C)"})
    coefs["Estimated Yearly Increase (°C)"] = coefs["Estimated Yearly Increase (°C)"].round(4)
    

    
    fig = px.scatter_mapbox(coefs, 
                            lat = "LATITUDE", 
                            lon = "LONGITUDE", 
                            hover_name = "NAME",
                            hover_data = ["Estimated Yearly Increase (°C)"],
                            color = "Estimated Yearly Increase (°C)",
                            title="Estimates of yearly temperature increases in " + calendar.month_name[month] + f" for stations in {country}, years {year_begin}-{year_end}",
                            **kwargs)
    return fig

color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

We break down the whole function code into pieces to understand the process:

Because the length of the output of transform is the same as that of the original data, we can use transform to create new columns. Here we first use transform and groupby to retrieve how many years are under the the year column for the given station. Then the > operator will return us results in True or False. If the number of years for the any given station is at least min_obs years, then the operator return us a True; otherwise, it returns false. Then the df[]will return us only the data such that the binary result inside [] is true. Therefore, through these two following lines, we get the data for stations with at least min_obs years worth of data.

count_year = df.groupby(["NAME"])["Year"].transform(len)
df = df[count_year >= min_obs]

Next lines of code use apply, which is a perfectly good way to compute data summaries.It takes in two data columns and spits out a number for each row, which will be a new column. Since we define the coef funciton earlier, so the apply can apply this function to the data we select. Then we rename the new column to the Estimated Yearly Increase.

  • Since the estimates should be rounded to a sober number of significant digits, here we round the result to 5 significant digits.
coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef)
coefs = coefs.reset_index()
coefs = coefs.rename(columns = {0 : "Estimated Yearly Increase (°C)"})
 coefs["Estimated Yearly Increase (°C)"] = coefs["Estimated Yearly Increase (°C)"].round(5)

Now we are good to go with plotting:

  • we use the scatter_mapbox to create the scatter plot over the map format that we want
fig = px.scatter_mapbox(coefs, 
                        lat = "LATITUDE", 
                        lon = "LONGITUDE", 
                        hover_name = "NAME",
                        hover_data = ["Estimated Yearly Increase (°C)"],
                        color = "Estimated Yearly Increase (°C)",
                        title="Estimates of yearly temperature increases in " + calendar.month_name[month] + f" for stations in {country}, years {year_begin}-{year_end}",
                        **kwargs)

4. Two more interesting graphs

  1. The first question that the following codes and graph are going to answer is How can visualize the median temperature change in a given period for several years in a given country?
def temperature_changes_country(country,begin_month,end_month,begin_year,end_year):
    conn = sqlite3.connect("temps.db") #Connects to our database
    cmd = \
    """
    SELECT C.name, T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C on T.fips = C.fips
    WHERE C.name = ? AND  T.month >= ? AND T.month <= ? AND T.year >= ? AND T.year <= ? 
    """
    param = (country,begin_month,end_month,begin_year,end_year) 
    df = pd.read_sql_query(cmd, conn, params = param)
    df = df.groupby(["Year"])["Temp"].aggregate([np.median])
    df = df.reset_index()
    conn.close() #we must close connection to database to prevent data corruption
    fig = px.line(df, 
                     x = "Year", 
                     y = "median", 
                     labels = {"Temp" : "Median Temperatures(in given periods, years, and country)"}, 
                     title = f"Median Temperatures(C) in {country} between" + 
                             calendar.month_name[begin_month] + " and " + calendar.month_name[end_month] + 
                             ", " + str(begin_year) + "-" + str(end_year))
    
    return fig
fig = temperature_changes_country(country = "India",
                                  begin_month = 1,
                                  end_month = 12,
                                  begin_year = 1990,
                                  end_year = 2010)
fig.show()

Another question that we can address given the dataset we have is: what regions have the most substantial variation in temperatures?

def temperatures_variation(country,begin_month,end_month,begin_year,end_year):
    conn = sqlite3.connect("temps.db") #Connects to our database
    cmd = \
    """
    SELECT C.name, T.year, T.month, T.temp, S.latitude, S.longitude
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C on T.fips = C.fips
    WHERE C.name = ? AND  T.month >= ? AND T.month <= ? AND T.year >= ? AND T.year <= ? 
    """
    para = (country,begin_month,end_month,begin_year,end_year) 
    df = pd.read_sql_query(cmd, conn, params = para)
    df = df.groupby(["Year","LATITUDE","LONGITUDE"])["Temp"].aggregate([np.var])
    df = df.reset_index()
    conn.close()
    
    fig = px.scatter_mapbox(df, 
                            lat = "LATITUDE", 
                            lon = "LONGITUDE", 
                            hover_name = "Year",
                            hover_data = ["var"],
                            color = "var",
                            zoom = 1,
                            height = 300,
                            mapbox_style="carto-positron",
                            opacity = 0.2)
#grey data points are na values
    fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    return fig

fig = temperatures_variation("India", 2, 3, 1990, 1995)
fig.show()
Written on April 12, 2021