For this tutorial we will be taking a look at climate data for Washington D.C. We will also be looking at data relating to greenhouse gas emissions, population size, and total energy usage for the district. The primary goal is to estimate the rate of temperature change we can expect in the area due to greenhouse gas emissions (GHGE) as well as overall climate change. The secondary goal is to test the correlation of D.C.'s temperature change with various factors. These goals will be accomplished using linear regression techniques along with the Pandas DataFrame structure.
Going into this, it is recommended to have some familiarity with the Pandas DataFrame structure; a tutorial on its usage can be found here and the documentation can be found here. Essentially, the DataFrame is a way of structuring and holding data in a table format for easier analysis.
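If you have never worked with one before, here is a minimal sketch (with made-up city and temperature values, not part of the climate analysis) showing how a DataFrame is built from a dictionary and how a column is pulled out by name.
import pandas as pd
demo = pd.DataFrame({'CITY': ['DC', 'NYC'], 'AVG TEMP (°F)': [58.5, 55.1]}) #Each dictionary key becomes a column
demo['AVG TEMP (°F)'] #Columns are accessed by name, much like a dictionary lookup
demo.head() #.head() previews the first few rows; we will use it throughout this tutorial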
The second important library we will be using is sklearn's linear regression module. Its usage is fairly simple to pick up, and there will be instructions on its usage in this tutorial, but in case you would like some experience with it before going in, here is a simple tutorial using preloaded data.
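As a warm-up, the sketch below fits a LinearRegression on a handful of invented points just to show the fit/predict pattern we will rely on later; the numbers are arbitrary and only for demonstration.
import numpy as np
from sklearn import linear_model
X_demo = np.array([[1], [2], [3], [4]]) #sklearn expects X as a 2D array: one row per sample
y_demo = np.array([2.1, 3.9, 6.2, 7.8]) #Toy targets that roughly follow y = 2x
demo_regr = linear_model.LinearRegression().fit(X_demo, y_demo)
demo_regr.predict([[5]]) #Should come out close to 10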
For this particular tutorial we will be using Matplotlib's pyplot module, for which documentation may be found here. There are a number of plotting libraries available, but pyplot allows a more seamless use of the linear regression tool when customizing regression results. We will be using three distinct pyplot functions in this tutorial: subplot, scatter, and plot. There will be comments in the code about how and why we use these, but for now it's good to have a basic familiarity with what they do.
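If you want a feel for those three functions before we use them on real data, here is a small sketch with made-up numbers: subplot carves the figure into panels, scatter draws individual points, and plot draws connected lines.
import matplotlib.pyplot as plt
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
plt.subplot(1, 2, 1) #A grid of 1 row and 2 columns; this call selects panel 1
plt.scatter(xs, ys) #Individual points, like the temperature readings later on
plt.subplot(1, 2, 2) #Panel 2 of the same grid
plt.plot(xs, ys) #A connected line, like the regression lines later on
plt.show()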
import pandas as pd
import matplotlib.pyplot as plt
import pylab
import numpy as np
from sklearn import model_selection, linear_model, datasets
import operator
import random
import statsmodels.api as sm
from scipy import stats
from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot
The D.C. climate data comes from the NOAA website, where it was available as a PDF. That PDF was downloaded and converted into a .csv file using an external program prior to being used in this project. The .csv file will be included in the project links. The columns giving seasonal averages per year, along with the semi-annual averages, were dropped because this project will be looking at either monthly data specifically or yearly averages.
The second dataset is from the CAIT (Climate Analysis Indicators Tool) website. It holds greenhouse gas emissions for each US state over a period of about 20 years.
#Setting up Green House Gas table
gas_data = pd.read_csv("CAIT 2.0 U.S. States GHG Emissions - csv.csv")
#Pulling relevant columns
rep_data = gas_data[['State', 'Year', 'Total GHG Emissions Including LUCF (MtCO2e)', 'Total CO2 (excluding LUCF) (MtCO2e)'
,'Total CH4 (MtCO2e)', 'Total N2O (MtCO2e)', 'State GDP (Million US$ (chained 1997/2005))',
'Population (People)', 'Total Energy Use (Thous. tonnes oil eq. (ktoe))']]
rep_data = rep_data.dropna(axis='columns')
#Getting DC specific data
for index, row in rep_data.iterrows():
    if row['State'] != 'District Of Columbia':
        rep_data = rep_data.drop(index)
gas_data = rep_data
#Setting up Temperature Table
temp_data = pd.read_csv("dcatemps.csv")
temp_data = temp_data.drop(['ANN','WINTER','SPRING','SUMMER','AUTUMN', '1ST HALF', '2ND HALF'], axis = 1)
temp_data = temp_data.dropna()
months = list(temp_data.columns.values)
years = temp_data['YEAR']
del months[0]
del years[len(years) - 1]
frame = pd.DataFrame()
#Rearranging the table for line graphing
for mon in months:
    data = temp_data[mon]
    del data[len(data) - 1] #Dropping the last value so data and years stay the same length
    data = np.array(data)
    sers = pd.Series(data, index = years)
    frame = pd.concat([frame, sers.to_frame().T], ignore_index = True) #DataFrame.append was removed in newer pandas, so concat is used to add each month as a row
frame['MONTH'] = months
frame['MONTH LABEL'] = [0,1,2,3,4,5,6,7,8,9,10,11]
frame = frame.set_index('MONTH') #Setting the index allows the graph later on to set Y-values as month
frame.head()
This table (frame) is a reorganization of the data so that each column contains a year's worth of temperature data. This was done in order to graph each year as a line later on; graphing rows in the structure of temp_data is difficult for matplotlib to do.
temp_data.head()
This second table (temp_data) holds the data in its original format, where each row contains the year and that year's twelve monthly averages.
gas_data.head()
This third table holds the District of Columbia greenhouse gas output data from 1990 - 2011, the population changes during that time, the GDP, and the total energy usage. In this table the unit of measurement for GHG emissions is million metric tons of carbon dioxide equivalent (MtCO2e), and thousand tonnes of oil equivalent (ktoe) for the energy usage. Using equivalent measurements allows for easier standardization across values and emission types. For instance, the GHG emissions are all measured in CO2 equivalent even though we are also measuring CH4 and N2O; roughly speaking, a tonne of CH4 is counted as the number of tonnes of CO2 that would produce a comparable warming effect. Knowing that all these values are measured in this same way makes the data comparable even if they're measuring strictly different things.
For this first part we will do a simple line graph plotting the changes in monthly temperature by year. This section will use our rearranged dataset "frame". As discussed, we changed its format so that each year's monthly averages can be pulled out as a column instead of a row, which makes this graph easier to draw. We will also be sampling the dataset: instead of using every year, we will be selecting a subset (in this case 35 years) to represent the whole dataset. We do this because showing every year we have data for would require almost 150 lines of varying colors, plus markers to identify those colors. That sheer amount of data wouldn't provide more insight and would instead hurt reader comprehension. When presenting data, take into account whether or not you need to show all of it in order to get your point across.
plt.rcParams['figure.figsize'] = 15, 10
#Line plots the graph data based on set of years
def plot_parts(years):
    x = months
    for year in years:
        y = frame[year]
        plt.plot(x, y, label = year)
    pylab.legend(loc = 'upper left')
    plt.title("Washington DC Average Monthly Temps from {} - {}".format(min(years), max(years)))
    plt.xlabel("Month")
    plt.ylabel('Temperature (°F)')
    plt.show()
#Selecting a sample of 35 years because adding all of them inhibits graph comprehension
select = []
while len(select) < 35:
    val = random.choice(years.tolist())
    if val not in select:
        select.append(val)
plot_parts(sorted(select))
This section will have our introduction to linear regression! For this part of the project we will be graphing each month's data individually, as well as plotting a regression line to illustrate how the temperature is trending. We will also be making our first scatter plot and using subplot for the first time! The purpose of this section is to practice running linear regression over a single-variable dataset, as well as plotting that single variable along with the regression line. The regression object has a method .predict(X) that, given an input value X, outputs the corresponding estimated value. This will be key for graphing our regression line later.
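Before running this on the temperature data, here is a small sketch of the fit/predict/plot pattern on invented numbers, showing how predicting over the X values produces the points of the regression line; none of these values come from the climate dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
X_line = np.array([[0], [1], [2], [3], [4]]) #Toy single-variable data, reshaped to one column per sample
y_line = np.array([1.0, 2.9, 5.2, 7.1, 8.8]) #Made-up targets that roughly follow y = 2x + 1
line_regr = linear_model.LinearRegression().fit(X_line, y_line)
plt.scatter(X_line, y_line) #The raw points
plt.plot(X_line, line_regr.predict(X_line), color = "black") #The regression line: predicted y for each x
plt.show()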
#This function runs 100 linear regressions and returns a dictionary with each score as the key and a list containing the
#regression and the X & y test and training sets
def regr_lst(X, y):
    best_regr = {}
    for x in range(0, 100):
        lst = []
        #It's important to split the data into testing and training data, training data is what will be used to create the
        #regression while testing data is what's used to determine how well that regression fits.
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.3)
        regr = linear_model.LinearRegression().fit(X_train, y_train)
        lst.append(X_train)
        lst.append(X_test)
        lst.append(y_train)
        lst.append(y_test)
        lst.append(regr)
        best_regr[regr.score(X_test, y_test)] = lst
    return best_regr
try:
    i = 1
    for column in temp_data:
        if column != "YEAR":
            X = temp_data[column] #X is formatted like this [x1, x2,...,xn]
            X = np.array(X).reshape((-1, 1)) #for single value regression we need X to look like this [[x1],[x2],...,[xn]]
            y = temp_data['YEAR']
            best_regr = regr_lst(X, y) #Getting the dictionary of regressions
            best_lst = best_regr[max(best_regr.keys())] #Getting the regression with the highest score
            X_test = best_lst[1]
            regr_l = best_lst[4]
            plt.subplot(3, 4, i) #For subplot(rows, cols, i), rows and cols set the grid shape, so 3 rows of 4 plots
            plt.subplots_adjust(wspace=.4, hspace = .4)
            i += 1
            plt.scatter(x = temp_data['YEAR'], y = temp_data[column]) #Graphing the given data
            plt.plot(regr_l.predict(X_test), X_test, color = "black") #Graphing the black regression line
            plt.title(column)
            plt.xlabel("Year")
            if column == 'JAN' or column == 'MAY' or column == 'SEP':
                plt.ylabel("Temp (°F)") #Having a y-label on each row instead of on each graph is easier to read
except ValueError:
    pass
plt.show()
Now that we've plotted all the data by month, it's time to plot all the data by year. The above graphs all appear to be trending upwards, but are they? If so, by how much? That's what analyzing the yearly changes can help us figure out. We will also be making use of sklearn's score function for linear regression. This function gives a score for how well the regression fits the actual data: it runs an R^2 analysis to determine "goodness-of-fit". In addition to that analysis, we will be making a residuals plot to give visual evidence of fit. If the residuals are randomly distributed, that is evidence that our model is a good fit; if they show a clear pattern, then we have a problem. You can read more about residuals plots here and here.
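For a sense of what .score reports, here is a rough hand-computed sketch of R^2 on made-up numbers; sklearn does this calculation for us, so this block is only illustrative.
import numpy as np
y_actual = np.array([60.0, 61.5, 62.0, 63.5]) #Made-up observed values
y_pred = np.array([60.5, 61.0, 62.5, 63.0]) #Made-up predictions from some regression
ss_res = np.sum((y_actual - y_pred) ** 2) #Residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2) #Total sum of squares around the mean
print(1 - ss_res / ss_tot) #R^2, i.e. what regr.score(X_test, y_test) reports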
#Adds average yearly temp column to table
def add_avg_col(table):
    avgs = []
    for index, row in table.iterrows():
        total = row.sum() - row['YEAR'] #Sum of the 12 monthly values (subtracting out the YEAR column)
        avg = total / 12
        avgs.append(avg)
    return avgs
temp_data['YEAR AVG'] = add_avg_col(temp_data)
X1 = temp_data['YEAR AVG'] #Setup for this regression is similar to the first due to the use of only one variable
X1 = np.array(X1).reshape((-1, 1))
y1 = temp_data['YEAR'].values
best_regr1 = regr_lst(X1, y1)
max_lst1 = best_regr1[max(best_regr1.keys())]
X_train1 = max_lst1[0]
X_test1 = max_lst1[1]
y_train1 = max_lst1[2]
y_test1 = max_lst1[3]
regr_l1 = max_lst1[4]
#We will be adding the score of the regression to the graph to provide some context for how well it fits the data
text = "".join(("D.C. SCORE USING R^2 TEST: %.2f"%(100 * regr_l1.score(X_test1,y_test1)), "%"))
ax = plt.scatter(X_test1, y_test1, color = "red")
ax = plt.plot(X_test1, regr_l1.predict(X_test1), color = "black")
ax = plt.title("Washington D.C. Average Yearly Temperature")
ax = plt.xlabel("Average Temperature(°F)")
ax = plt.ylabel("Year")
plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize = 10)
plt.show()
#This section will allow us to plot the residuals to help determine if our regression model is a good fit
ridge = Ridge()
visualizer = ResidualsPlot(ridge, hist = False)
visualizer.fit(X_train1, y_train1)
visualizer.score(X_test1, y_test1)
visualizer.poof()
For this section we will be using regression to predict future temperatures given just past temperatures. New to us this section is the .coef_ attribute. This returns the coefficients of the regression: in the single-variable case it returns one coefficient, with two variables it returns two coefficients, and so on. Basically it gives us the rate of change for the regression line. Keep in mind that this estimate is based purely on the historical temperature data; it doesn't account for outside factors such as how emissions or the pace of climate change might shift going forward.
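As a quick sketch of what .coef_ looks like on invented data: with one feature it is a one-element array holding the slope, and with two features it holds two coefficients, one per feature.
import numpy as np
from sklearn import linear_model
X_one = np.array([[1], [2], [3], [4]])
y_one = np.array([3.0, 5.0, 7.0, 9.0]) #Exactly y = 2x + 1
print(linear_model.LinearRegression().fit(X_one, y_one).coef_) #One coefficient, about 2
X_two = np.array([[1, 10], [2, 20], [3, 35], [4, 38]]) #Two made-up features per sample
y_two = np.array([12.0, 23.0, 39.0, 45.0])
print(linear_model.LinearRegression().fit(X_two, y_two).coef_) #Two coefficients, one per feature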
def predict_range(regr_l):
    years_lst = []
    pred_lst = []
    for x in range(2018, 2100):
        years_lst.append(x)
        pred_lst.append(regr_l.predict([[x]]))
    pred_lst = [item for sublist in pred_lst for item in sublist]
    text = "".join(("SCORE USING R^2 TEST: %.2f"%(100 * regr_l.score(X_test,y_test)), "%"))
    avg = "".join(("RATE OF PREDICTED TEMP INCREASE: %.2f"%regr_l.coef_[0], "°F", "/YEAR")) #.coef_ is an array, so we take its single element
    plt.plot(pred_lst, years_lst, '.r-', markersize = 12)
    plt.title("Predicted Yearly Average Temperature of Washington DC from 2018 - 2099")
    plt.ylabel("Year")
    plt.xlabel("Predicted Temperature (°F)")
    plt.annotate(text, xy=(0.05, 0.95), xycoords='axes fraction', fontsize = 10)
    plt.annotate(avg, xy=(0.05, 0.9), xycoords='axes fraction', fontsize = 10)
    plt.show()
X = temp_data['YEAR']
X = np.array(X).reshape((-1, 1))
y = temp_data['YEAR AVG'].values
best_regr = regr_lst(X, y)
max_lst = best_regr[max(best_regr.keys())]
X_test = max_lst[1]
y_test = max_lst[3]
regr_l = max_lst[4]
predict_range(regr_l)
Okay, now that we've looked at past and future temperatures around D.C., it's time to look at external factors that could contribute to these temperature changes! We will be using subplots again, and this time we will be looking at greenhouse gases, the population of D.C., and the amount of energy used by the district. We will be visualizing this data in order to get an idea of how these factors have changed over the last few decades and to provide context for our measured temperature data if it turns out that they have some correlation.
transplant = {}
#Pulling temperature data from the other table
for index, row in temp_data.iterrows():
    if row['YEAR'] >= 1990:
        transplant[row['YEAR']] = row['YEAR AVG']
rows = []
for index, row in gas_data.iterrows():
    if row['Year'] in transplant:
        rows.append(transplant[row['Year']])
gas_data['Year Average Temp °F'] = rows
#Setting up the graphs (some repetition)
#GHG Plot
plt.subplot(3, 1, 1)
plt.subplots_adjust(wspace=.4, hspace = .4)
plt.scatter(x = gas_data['Year'], y = gas_data['Total CO2 (excluding LUCF) (MtCO2e)'], color = "blue", label = "Total CO2")
plt.scatter(x = gas_data['Year'], y = gas_data['Total N2O (MtCO2e)'], color = "black", label = "Total N2O")
plt.scatter(x = gas_data['Year'], y = gas_data['Total CH4 (MtCO2e)'], color = "red", label = "Total CH4")
plt.gca().legend(('Total CO2','Total N2O', 'Total CH4'))
plt.title('Top 3 GHG Emissions in D.C.')
plt.xlabel('Year')
plt.ylabel('MtCO2e')
#Population Plot
plt.subplot(3,1,2)
plt.plot(gas_data['Year'],gas_data['Population (People)'])
plt.title('D.C. Population Over Time')
plt.xlabel('Year')
plt.ylabel('Population')
#Energy Usage Plot
plt.subplot(3,1,3)
plt.plot(gas_data['Year'], gas_data['Total Energy Use (Thous. tonnes oil eq. (ktoe))'])
plt.title('Total Energy Usage of D.C. (In Thousand Tons of Oil Equivalent)')
plt.xlabel('Year')
plt.ylabel('KTOE')
plt.show()
Now it's time to do regression on multiple variables! Because of the limitations of visualization we won't be making a graph for this section. Instead we will be using the statsmodels API, which provides its results in the form of summary charts. From these summary charts we can get numerical data to validate or contradict our hypothesis. Our current hypothesis is that at least one, if not many, of these factors affect temperature averages in the D.C. area.
gasses = gas_data.loc[:, 'Total CO2 (excluding LUCF) (MtCO2e)':'Total N2O (MtCO2e)'].values
pop = gas_data['Population (People)'].values
energy = gas_data['Total Energy Use (Thous. tonnes oil eq. (ktoe))'].values
each = gas_data.loc[:, 'Total CO2 (excluding LUCF) (MtCO2e)':'Total Energy Use (Thous. tonnes oil eq. (ktoe))'].values
y = gas_data['Year Average Temp °F'].values
test = sm.add_constant(gasses) #Creating the design matrix (with an intercept term) from our dataset
estimate = sm.OLS(y, test) #Generating regressor
res = estimate.fit() #Fitting our regressor
print("\t\t TESTING GHG EMISSIONS ON TEMP INFLUENCE\n", '-' * 75, "\n\n", res.summary())
In the summary we can see x1, x2, and x3; these represent the coefficients in the regression, where x1 is the Total CO2, x2 is the Total CH4, and x3 is the Total N2O. The P > |t| column gives each coefficient's p-value. If a p-value is <= .05 then, by the usual convention, you can reject the null hypothesis, which suggests that the corresponding gas has a relationship to the temperature. In this case, none of the values were < .05, which seems to indicate that, taken together, there is no detectable relationship. The conclusion will explain why this may or may not be entirely true.
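If you prefer to pull those p-values out programmatically instead of reading them off the summary table, the statsmodels results object exposes them directly; this sketch assumes the res object from the cell above is still in scope, and the gas names are just labels matching the column order used to build the test matrix.
#res.pvalues lines up with [const, x1, x2, x3] from the summary above
for name, p in zip(['const', 'Total CO2', 'Total CH4', 'Total N2O'], res.pvalues):
    print("%s p-value: %.3f" % (name, p), "-> reject null at .05" if p <= .05 else "-> fail to reject")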
test = sm.add_constant(pop)
estimate = sm.OLS(y, test)
res = estimate.fit()
print("\t\t TESTING POPULATION ON TEMP INFLUENCE\n", '-' * 75, "\n\n", res.summary())
test = sm.add_constant(energy)
estimate = sm.OLS(y, test)
res = estimate.fit()
print("\t\t TESTING ENERGY CONSUMPTION ON TEMP INFLUENCE\n", '-' * 75, "\n\n", res.summary())
This section tested the change in energy usage in the D.C. area, and its p-value is < .05, which would seem to indicate that it has a relationship to the change in temperature. This is the only variable to indicate such a relationship so far.
test = sm.add_constant(each)
estimate = sm.OLS(y, test)
res = estimate.fit()
print("\t\t TESTING ALL ON TEMP INFLUENCE\n", '-' * 75, "\n\n", res.summary())
All of the sets, with the exception of energy usage over time, have a p-value > .05 when tested individually. This doesn't necessarily confirm or reject the null hypothesis for the different sets: because we started measuring greenhouse gas emissions relatively recently, we have limited data for specific cities like D.C. When all the factors are included in one model, CO2 has a p-value of .01, which would reject the null hypothesis; N2O, population, and total energy consumption also all fall under .05. This can be misleading because, once again, we have a relatively small dataset, and adding multiple variables to a small dataset can skew the results. However, just based on the results of this limited test, it would appear that tracking D.C. energy usage would be a good predictor of the rate of temperature increase.
Thank you so much for looking through this tutorial. I hope you learned something from it, whether something simple like how to plot data or something more complex like regression testing. Interpreting climate data is a field that is becoming more and more important as the effects of climate change accelerate, and being able to correlate the factors driving that acceleration is an important step in slowing it down. Furthermore, the ability to accurately predict future outcomes from past events is a useful skill to have in general.