Gina Schmalzle -- geodesygina.com -- An Introduction to Plotting and Mapping in Python (2015-04-14)<div class="section" id="tutorial-on-matplotlib-and-basemap">
<h2><strong>Tutorial on Matplotlib and Basemap</strong></h2>
<p>On January 29, 2015 <a class="reference external" href="https://www.linkedin.com/pub/mark-blunk/5a/574/222">Mark Blunk</a> and I prepared a workshop on <a class="reference external" href="http://ipython.org/notebook.html">IPython Notebooks</a>, <a class="reference external" href="http://matplotlib.org/">Matplotlib</a> and <a class="reference external" href="http://matplotlib.org/basemap/">Basemap</a> held at <a class="reference external" href="http://adadevelopersacademy.org/">Ada Developers Academy</a> and sponsored by <a class="reference external" href="http://www.meetup.com/Seattle-PyLadies/">PyLadies Seattle</a>. This blog goes over the Matplotlib and Basemap components of the workshop. The code, contained within IPython notebooks, is located in <a class="reference external" href="https://github.com/ginaschmalzle/pyladies_matplotlib_ipython_notebooks">this Github Repo</a>.</p>
<p>The Matplotlib/Basemap part of the workshop focuses on:</p>
<p><a class="reference internal" href="#getting-to-the-basics-data-structures">1. Getting to the Basics -- Data Structures</a> -- Brief overview of the data structures used in this workshop.</p>
<p><a class="reference internal" href="#prepare-the-data">2. Prepare the data</a> -- Prepare our data for plotting.</p>
<p><a class="reference internal" href="#time-to-plot-general-scatter-plots">3. Time to Plot! General Scatter Plots</a> -- Make some simple scatter plots and learn how to change their attributes.</p>
<p><a class="reference internal" href="#histograms">4. Histograms!</a> -- Make some simple histograms and learn about how to extract information from them.</p>
<p><a class="reference internal" href="#mapping">5. Mapping</a> -- Make some maps, and let's throw some data on them too!</p>
<div class="section" id="the-data">
<h3>The Data</h3>
<p>I thought it would be fun to work with real data instead of some randomly generated data. The data we will use are <a class="reference external" href="https://raw.githubusercontent.com/ginaschmalzle/pyladies_matplotlib_ipython_notebooks/master/target_day_20140422.dat">modeled weather forecasts at weather stations across the United States</a>. This information was collected from the <a class="reference external" href="http://openweathermap.org/">OpenWeatherMap project</a>, which provides an API service to download weather forecasts but, unfortunately, does not keep a historical record of the forecasts (actual observations, yes, but modeled forecasts no). <a class="reference external" href="https://brannerchinese.com/">David Branner</a> and I were curious about how accurate the forecasts were, and wanted to keep the forecasts to see how well they perform over time. Hence, we created a <a class="reference external" href="https://github.com/WeatherStudy/weather_study">database</a> that collects the weather forecasts for these stations. The file target_day_20140422.dat that is in the <a class="reference external" href="https://github.com/ginaschmalzle/pyladies_matplotlib_ipython_notebooks">Github repo for this workshop</a> was extracted from our database and contains weather forecasts for each station in the United States for the 'target day' of April 22, 2014. The stations themselves are defined by their latitude and longitude, and the file contains forecasts that were made 0 to 7 days out, where day zero is the forecast made on April 22, 2014. Hence a forecast made one day out was made on April 21, two days out on April 20, etc.</p>
</div>
</div>
<div class="section" id="getting-to-the-basics-data-structures">
<h2>1. Getting to the Basics -- Data Structures</h2>
<p>A basic understanding of data structures is useful when playing with and visualizing data. If you are already familiar with data structures you can skip ahead to <a class="reference internal" href="#prepare-the-data">2. Prepare the data</a>.</p>
<p>In computer science, a data structure is a way of organizing data in a computer so that it can be used efficiently. Three basic data structures are used in this workshop: <em>lists</em>, <em>tuples</em> and <em>dictionaries</em>.</p>
<div class="section" id="lists">
<h3>Lists</h3>
<p>Lists represent a sequence of values. In Python, a list is designated with square brackets []. The following are examples of lists:</p>
<pre class="literal-block">
a = []
b = ['a', 'b', 'c']
c = [4,1,6,9,2,10]
d = [[1,2,3],['a','n','q']]
</pre>
<p>The items in these lists are called elements or items. You can find out how many elements a list contains by asking for its length:</p>
<pre class="literal-block">
print (len(d))
</pre>
<p>The example d above has two lists as elements. d is called a list of lists.</p>
<p>So how do you retrieve an element of a list? Each element is assigned a number, starting at 0, that represents where it sits in the list. For example, element 0 of b is 'a'. It can be retrieved like this:</p>
<pre class="literal-block">
b[0]
</pre>
<p>Now you try -- What is c[4]? How about d[2]?</p>
<p>Lists are great because they are very simple to understand and take up relatively little memory. However, they do have limitations. Say you have a long list of values and you want to check whether a certain value is in it: you potentially have to read through every item in the list before you find out. Hence, membership tests on long lists can be computationally slow.</p>
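<p>Putting indexing, length, and the membership test together in one quick sketch (using the example lists above):</p>
<pre class="literal-block">
b = ['a', 'b', 'c']
c = [4, 1, 6, 9, 2, 10]

print(b[0])    # 'a' -- the first element
print(c[4])    # 2 -- the fifth element
print(len(c))  # 6 elements

# A membership test scans the list from the front, so on a very
# long list this can be slow:
print(10 in c)  # True, found only after checking every earlier element
</pre>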
</div>
<div class="section" id="tuples">
<h3>Tuples</h3>
<p>Tuples are similar to lists in that they also represent a sequence of values, but they have one very special property: they are immutable. This means that once they are created they cannot be changed. They are represented by parentheses () rather than square brackets. So, in Python, you could define a tuple like this:</p>
<pre class="literal-block">
a = ()
b = (32, 41)
c = ('x', 'y')
</pre>
<p>Similar to lists, you can access a specific element like so:</p>
<pre class="literal-block">
b[1]
</pre>
<p>This would produce the output of 41.</p>
<p>Tuples seem a lot more restrictive than a list, so you may ask, why would you ever use a tuple? Tuples are useful when you would like to describe something that needs multiple values to make sense, and these values cannot change. For example, you can create a tuple of a location on the surface of the earth that contains a latitude and longitude. The location would not make sense if one of those values were wrong or missing. Hence, having an immutable property that describes its location is appropriate in this case.</p>
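<p>As a quick sketch of that lat/lon idea, here is a coordinate tuple (the values are the sample station used later in this post), along with what happens if you try to change it:</p>
<pre class="literal-block">
location = (40.51218, -111.47435)  # (latitude, longitude)
print(location[0])  # 40.51218

# Tuples are immutable -- assigning to an element raises a TypeError:
try:
    location[0] = 0.0
except TypeError as err:
    print('Cannot modify a tuple:', err)
</pre>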
</div>
<div class="section" id="dictionaries">
<h3>Dictionaries</h3>
<p>Also known as associative arrays, maps, symbol tables or hash tables, this data structure is computationally fast but uses a lot of memory. A dictionary consists of key-value pairs, where the keys are all unique and each refers to a specific value. Different keys can map to identical values, however. Dictionaries are designated with curly brackets {}. Here are examples of dictionaries:</p>
<pre class="literal-block">
dict_a = {}
dict_b = {'Hello beautiful': 'Ew, Gross', 'Goodbye Gorgeous':'Finally'}
dict_c = {'Bad Pickup Lines': {'example 1': 'Did it hurt when you fell from heaven?',
                               'example 2': 'Do you always wear your shoes over your socks?'}}
</pre>
<p>For dict_b, you can think of a bad pickup line as the 'key' to your response, or 'value'. For example, if someone said:</p>
<pre class="literal-block">
dict_b['Hello beautiful']
</pre>
<p>the response would be:</p>
<pre class="literal-block">
'Ew, Gross'
</pre>
<p>For dict_c, we have a dictionary of dictionaries. Here we have a dictionary of bad pickup lines that contain examples. To get to a nested dictionary, say you want the value for 'example 2', you would type:</p>
<pre class="literal-block">
dict_c['Bad Pickup Lines']['example 2']
</pre>
<p>Get it? If you need more help, I've put together a <a class="reference external" href="http://geodesygina.com/dict.html">post on dictionaries here</a>.</p>
<p>The great thing about dictionaries is that even with a lot of data, if we know the key we can very quickly get the associated value. If this information were in a list, it <em>could</em> take a long time to read through the list to find the value you want. The downside is that dictionaries can take up a lot of memory, but that's not a problem for this exercise on most modern computers.</p>
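<p>To see that lookup speed in action (dict_b is from the examples above; the 'Hey there' key and 'No comment' default are made up for illustration):</p>
<pre class="literal-block">
dict_b = {'Hello beautiful': 'Ew, Gross', 'Goodbye Gorgeous': 'Finally'}

# Direct lookup by key is fast no matter how big the dictionary gets:
print(dict_b['Hello beautiful'])  # 'Ew, Gross'

# A missing key raises a KeyError; .get() returns a default instead:
print(dict_b.get('Hey there', 'No comment'))  # 'No comment'
</pre>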
</div>
</div>
<div class="section" id="prepare-the-data">
<h2>2. Prepare the data</h2>
<div class="section" id="retrieving-the-data">
<h3>Retrieving the data</h3>
<p>In this section we focus on reading in data and putting it into an appropriate data structure. These <a class="reference external" href="https://raw.githubusercontent.com/ginaschmalzle/pyladies_matplotlib_ipython_notebooks/master/target_day_20140422.dat">'data'</a> are modeled weather forecasts for individual weather stations across the United States. (I put quotes on data because these are modeled solutions, not actual observations). The file that will be read contains the forecast for one day (April 22, 2014) for 0 to 7 days prior, where the 0th day is the forecast on April 22nd:</p>
<pre class="literal-block">
# Read file
filename='target_day_20140422.dat'
f = open(filename, 'r')
contents = f.readlines()
</pre>
<p>Where contents looks like this:</p>
<pre class="literal-block">
['Lat, Lon, days_out, MaxT, MinT \n',
'38.576698 -92.173523 0 18.71 6.97\n',
'38.576698 -92.173523 1 21.03 8.7\n',
'38.576698 -92.173523 2 20.67 9.72\n',
'38.576698 -92.173523 3 19.01 7.23\n',
'38.576698 -92.173523 4 22.08 9.07\n',
'38.576698 -92.173523 5 21.68 9.53\n',
'38.576698 -92.173523 6 22.33 10.22\n',
'38.576698 -92.173523 7 16.18 12.14\n',
'34.154179 -117.344208 0 17.37 6.16\n',
'34.154179 -117.344208 1 19.66 7.48\n',
'34.154179 -117.344208 2 21.24 6.27\n',
'34.154179 -117.344208 3 21.71 5.5\n',
'34.154179 -117.344208 4 18.34 8.88\n', ...]
</pre>
<p>A couple of things here -- we have a list of strings, where the end of each string is marked with a '\n' (newline) character. This marker indicates the end of a line in the file and will need to be accounted for when we ingest the data into a usable form.</p>
<p>Let's make a dictionary of values, where lat, long are the keys (in tuple form). The values are also dictionaries, where the number of days out are the keys, and MaxT and MinT are the values:</p>
<pre class="literal-block">
forecast_dict = {}
for line in range(1, len(contents)):
    line_split = contents[line].split(' ')
    try:
        forecast_dict[line_split[0], line_split[1]][line_split[2]] = {'MaxT': float(line_split[3]),
                                                                      'MinT': float(line_split[4][:-1])}
    except KeyError:
        forecast_dict[line_split[0], line_split[1]] = {}
        forecast_dict[line_split[0], line_split[1]][line_split[2]] = {'MaxT': float(line_split[3]),
                                                                      'MinT': float(line_split[4][:-1])}
</pre>
<p>Here forecast_dict looks like this:</p>
<pre class="literal-block">
{('19.068609', '-155.764999'): {'0': {'MaxT': 25.67, 'MinT': 24.45},
'1': {'MaxT': 25.88, 'MinT': 24.66},
'2': {'MaxT': 25.17, 'MinT': 24.49},
'3': {'MaxT': 25.67, 'MinT': 24.37},
'4': {'MaxT': 25.35, 'MinT': 23.76},
'5': {'MaxT': 24.57, 'MinT': 23.27},
'6': {'MaxT': 24.26, 'MinT': 23.33},
'7': {'MaxT': 24.71, 'MinT': 23.78}},
('19.43083', '-155.237778'): {'0': {'MaxT': 25.38, 'MinT': 23.41},
'1': {'MaxT': 25.39, 'MinT': 22.47},
'2': {'MaxT': 24.77, 'MinT': 23.35},
'3': {'MaxT': 25.38, 'MinT': 22.45},
'4': {'MaxT': 24.36, 'MinT': 22.5},
'5': {'MaxT': 23.92, 'MinT': 22.57},
'6': {'MaxT': 23.21, 'MinT': 22.45},
'7': {'MaxT': 23.56, 'MinT': 22.68}},...
</pre>
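<p>As an aside, the try/except dance above can also be written with dict.setdefault, which fetches the per-station dictionary or creates it if it's missing. This is just a sketch against a couple of lines from the file (note that split() with no argument also swallows the trailing newline, so no slicing is needed):</p>
<pre class="literal-block">
contents = ['Lat, Lon, days_out, MaxT, MinT \n',
            '38.576698 -92.173523 0 18.71 6.97\n',
            '38.576698 -92.173523 1 21.03 8.7\n']

forecast_dict = {}
for line in contents[1:]:  # skip the header line
    lat, lon, day, maxt, mint = line.split()
    station = forecast_dict.setdefault((lat, lon), {})
    station[day] = {'MaxT': float(maxt), 'MinT': float(mint)}

print(forecast_dict[('38.576698', '-92.173523')]['1']['MaxT'])  # 21.03
</pre>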
<p>So now we have, for each site (defined by its latitude and longitude), the Maximum Temperature (MaxT) and Minimum Temperature (MinT) for each forecast made from the day of (day '0') to 7 days prior. It's pretty easy to retrieve the stations (and hence the latitudes and longitudes) by typing:</p>
<pre class="literal-block">
forecast_dict.keys()
</pre>
<p>which gives:</p>
<pre class="literal-block">
[('37.224239', '-95.708313'),
('27.53587', '-82.561211'),
('32.709301', '-96.008301'),
('42.09808', '-88.28286'),
('36.424229', '-89.057007'),
('36.98801', '-121.956627'),
('43.02496', '-108.380096'),
('41.802601', '-71.88591'),
('37.99548', '-122.332748'),
('43.416679', '-86.35701'),
('41.85371', '-71.758118'),...
</pre>
<p>And you can extract values for a random station by selecting one of these keys, e.g.:</p>
<pre class="literal-block">
forecast_dict[('40.51218', '-111.47435')]
</pre>
<p>gives you:</p>
<pre class="literal-block">
{'0': {'MaxT': 17.45, 'MinT': 2.04},
'1': {'MaxT': 17.95, 'MinT': 5.84},
'2': {'MaxT': 18.33, 'MinT': 7.99},
'3': {'MaxT': 18.16, 'MinT': 7.7},
'4': {'MaxT': 13.75, 'MinT': 3.62},
'5': {'MaxT': 14.58, 'MinT': 9.23},
'6': {'MaxT': 14.58, 'MinT': 9.23},
'7': {'MaxT': 13.08, 'MinT': -2.99}}
</pre>
<p>The output above shows the forecasted MaxT and MinT values for 0-7 days prior for a specific station at Latitude 40.51218N, Longitude 111.47435W.</p>
</div>
<div class="section" id="prepare-our-data-for-plotting">
<h3>Prepare our data for Plotting</h3>
<p>The plot will be Max T vs. day out for this one station. It will be a simple plot, but first, we need to make some lists that matplotlib can use to do the plotting. We will need a list of days, and a list of corresponding Max T values:</p>
<pre class="literal-block">
# First retrieve the days
day_keys = forecast_dict[('40.51218', '-111.47435')].keys()
</pre>
<p>day_keys gives you:</p>
<pre class="literal-block">
['1', '0', '3', '2', '5', '4', '7', '6']
</pre>
<p>Dictionary keys are not stored in alphabetical or numerical order, so let's sort them:</p>
<pre class="literal-block">
day_keys.sort()
</pre>
<p>which sorts day_keys in place (sort() itself returns None), leaving:</p>
<pre class="literal-block">
['0', '1', '2', '3', '4', '5', '6', '7']
</pre>
<p>Matplotlib plots lists of one thing against another. So, let's make our lists:</p>
<pre class="literal-block">
# First define the variables as lists
day_list = []; maxt_list = []
# Then populate the lists
for day_key in day_keys:
    day_list.append(float(day_key))
    maxt_list.append(float(forecast_dict[('40.51218', '-111.47435')][day_key]['MaxT']))
</pre>
<p>Now, for a given index, the element in one list corresponds to the element in the other. For example, day_list[0] corresponds to maxt_list[0].</p>
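<p>The fetch-sort-populate steps above can also be condensed with sorted(), which returns a new sorted list of the keys and leaves the dictionary alone (a sketch with a trimmed-down station dictionary standing in for forecast_dict):</p>
<pre class="literal-block">
station = {'1': {'MaxT': 17.95}, '0': {'MaxT': 17.45}, '2': {'MaxT': 18.33}}

day_list = []
maxt_list = []
for day_key in sorted(station):  # iterates over the keys in sorted order
    day_list.append(float(day_key))
    maxt_list.append(station[day_key]['MaxT'])

print(day_list)   # [0.0, 1.0, 2.0]
print(maxt_list)  # [17.45, 17.95, 18.33]
</pre>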
</div>
</div>
<div class="section" id="time-to-plot-general-scatter-plots">
<h2>3. Time to Plot! General Scatter Plots</h2>
<p>First let's import everything we will need:</p>
<pre class="literal-block">
# In IPython or an IPython notebook only:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
</pre>
<p>Our most simple scatter plot can be made by typing:</p>
<pre class="literal-block">
plt.scatter(day_list, maxt_list)
# Let's add a line --
plt.plot(day_list, maxt_list)
</pre>
<p>This gives you:</p>
<img alt="simple_scatter" class="align-right" src="/images/simple_scatter.png" style="width: 400.0px; height: 300.0px;" />
<p>Now let's jazz it up a bit -- let's make the lines red and dashed, change the markers to stars, make them green, and change their size. Also, how is one to know what you just plotted? Let's add axis labels and a title:</p>
<pre class="literal-block">
plt.plot(day_list, maxt_list, '.r--')
plt.scatter(day_list, maxt_list, s = 400, color='green', marker='*')
plt.ylabel ('Forecasted Max Temperature, Deg C')
plt.xlabel ('Days from Target day April 22, 2014')
plt.title ('Forecasted Max Temperature')
plt.show()
</pre>
<p>This will give you:</p>
<img alt="fancy_scatter" class="align-right" src="/images/fancy_scatter.png" style="width: 400.0px; height: 300.0px;" />
<p>Click <a class="reference external" href="http://matplotlib.org/api/markers_api.html">here for more marker fun</a>, and more <a class="reference external" href="http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot">info on pretty-ing up lines can be found here</a>.</p>
<p>Getting the idea?</p>
<p>Let's do another plot and this time look at all of the Max Temperature forecasts 2 days out, and plot them with respect to Latitude. We will need to pick out from forecast_dict all the Max T values for all of the weather stations made 2 days before April 22, 2014. First, we will need to get all the Latitudes and Longitudes for each site, then we will need to pick out all the Max T values for each of the stations for that day.</p>
<p>We will keep in mind that maybe in the future you might want to look at Min T, or a different day:</p>
<pre class="literal-block">
# Get keys of forecast_dict (lats and lons):
keys = forecast_dict.keys()
# Cycle through all the keys to get the values for the 2nd-day maximum temperature
# and the corresponding lats and lons
day_out = '2' # 0-7
temp = 'MaxT' # MaxT or MinT
temperature = []; lat = []; lon = []
for key in keys:
    temperature.append(float(forecast_dict[key][day_out][temp]))
    lat.append(float(key[0]))
    lon.append(float(key[1]))
# Now that those are collected, let's see what the temperature as a function of latitude is:
plt.scatter(temperature, lat)
</pre>
<p>This will give you:</p>
<img alt="blue_t_v_lon" class="align-right" src="/images/blue_t_v_lon.png" style="width: 400.0px; height: 300.0px;" />
<div class="section" id="coloring-points-in-a-scatter-plot">
<h3>Coloring Points in a Scatter Plot</h3>
<p>Let's try again, but this time, color according to Longitude. Again, let's keep in mind we may want to color by something else. You can try playing with these:</p>
<pre class="literal-block">
color_by = lon
label = 'Long' # Need to rename if 'color_by' is changed
max_color_by = max(color_by)
min_color_by = min(color_by)
fig, ax = plt.subplots()
s = ax.scatter(temperature, lat,
               c=color_by,
               s=200,
               marker='o',            # Plot circles
               # alpha = 0.2,
               cmap=plt.cm.coolwarm,  # Color palette
               vmin=min_color_by,     # Min value
               vmax=max_color_by)     # Max value
cbar = plt.colorbar(mappable = s, ax = ax) # Mappable 'maps' the values of s to an array of RGB colors defined by a color palette
cbar.set_label(label)
plt.xlabel('{0} in Deg C, forecasted {1} days out'.format(temp,day_out))
plt.ylabel('Latitude, Deg N')
plt.title('{0} forecasted {1} Days out from target day April 22, 2014'.format(temp,day_out))
plt.show()
</pre>
<p>And now you have color:</p>
<img alt="color_t_v_lon" class="align-right" src="/images/color_t_v_lon.png" style="width: 400.0px; height: 300.0px;" />
<p><a class="reference external" href="http://matplotlib.org/users/colormaps.html">Click here for more color mapping fun</a>.</p>
<p>Any ideas what the blue blobs are? (Hint: they are not part of the contiguous United States!)</p>
</div>
</div>
<div class="section" id="histograms">
<h2>4. Histograms!</h2>
<p>Let's take a step back and work on a histogram.
What we are going to plot is the distribution of forecasted temperatures.
Let's start with a very simple histogram of the temperature we left off with:</p>
<pre class="literal-block">
plt.hist(temperature)
plt.ylabel ('Counts')
plt.xlabel(temp)
plt.show()
</pre>
<p>This gives you a very simple histogram that looks like this:</p>
<img alt="simple_hist" class="align-right" src="/images/simple_hist.png" style="width: 400.0px; height: 300.0px;" />
<p>Now let's try again and jazz it up... Let's increase the number of bins (the bin size is the difference between the min and max values divided by the number of bins). Let's also change the color of the bars and make them a little translucent.</p>
<img alt="green_hist" class="align-right" src="/images/green_hist.png" style="width: 400.0px; height: 300.0px;" />
<p>Matplotlib histograms also hand back some information about themselves. Let's explore:</p>
<pre class="literal-block">
n, bins, patches = plt.hist(temperature, 10, color='green', alpha=0.2)
</pre>
<p>Note that I've fattened up the bins again for this example...
n holds the count in each bin:</p>
<pre class="literal-block">
[ 69., 322., 1078., 1732., 2243., 2285., 2421., 1267., 275., 38.]
</pre>
<p>bins holds the bin edges -- note there are 11 edges bounding the 10 bins:</p>
<pre class="literal-block">
[ 0.91 , 4.425, 7.94 , 11.455, 14.97 , 18.485, 22., 25.515, 29.03 , 32.545, 36.06 ]
</pre>
<p>And patches is a list of the matplotlib rectangle shapes that draw the bars.</p>
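<p>Since bins holds the 11 edges that bound the 10 bars, the center of each bar is the midpoint of consecutive edges. A small sketch using the values printed above:</p>
<pre class="literal-block">
n = [69, 322, 1078, 1732, 2243, 2285, 2421, 1267, 275, 38]
bins = [0.91, 4.425, 7.94, 11.455, 14.97, 18.485, 22.0,
        25.515, 29.03, 32.545, 36.06]

# Midpoint of each pair of consecutive edges:
centers = [(left + right) / 2.0 for left, right in zip(bins[:-1], bins[1:])]

print(len(centers))  # 10 -- one center per count in n
print(sum(n))        # 11730 forecasts binned in total
</pre>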
</div>
<div class="section" id="mapping">
<h2>5. Mapping</h2>
<p>Now that we have the basics down, let's start with mapping!
We will be using Matplotlib's basemap: <a class="reference external" href="http://matplotlib.org/basemap/">http://matplotlib.org/basemap/</a>.</p>
<p>Let's make a simple Mercator Projection Map. The code in the next cell is straight from the Basemap example section -- <a class="reference external" href="http://matplotlib.org/basemap/users/merc.html">http://matplotlib.org/basemap/users/merc.html</a>:</p>
<pre class="literal-block">
# Define the projection, scale, the corners of the map, and the resolution.
m = Basemap(projection='merc',llcrnrlat=-80,urcrnrlat=80,\
llcrnrlon=-180,urcrnrlon=180,lat_ts=20,resolution='c')
# Draw the coastlines
m.drawcoastlines()
# Color the continents
m.fillcontinents(color='coral',lake_color='aqua')
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,91.,30.))
m.drawmeridians(np.arange(-180.,181.,60.))
# fill in the oceans
m.drawmapboundary(fill_color='aqua')
plt.title("Mercator Projection")
plt.show()
</pre>
<p>llcrnrlat,llcrnrlon,urcrnrlat,urcrnrlon are the lat/lon values of the lower left and upper right corners of the map.
lat_ts is the latitude of true scale.
resolution = 'c' means use crude resolution coastlines.</p>
<p>And here is the result:</p>
<img alt="default_map" class="align-right" src="/images/default_map.png" style="width: 400.0px; height: 300.0px;" />
<p>Now let's change this map to do what we need. Let's:</p>
<ol class="arabic simple">
<li>Change the area to the continental United States</li>
<li>Increase the resolution to intermediate ('i')</li>
<li>Remove the horrific ocean/land colors provided above</li>
</ol>
<pre class="literal-block">
m = Basemap(projection='merc',llcrnrlat=20,urcrnrlat=50,\
llcrnrlon=-130,urcrnrlon=-60,lat_ts=20,resolution='i')
m.drawcoastlines()
m.drawcountries()
#m.drawstates()
# draw parallels and meridians.
parallels = np.arange(-90.,91.,5.)
# Label the meridians and parallels
m.drawparallels(parallels,labels=[False,True,True,False])
# Draw Meridians and Labels
meridians = np.arange(-180.,181.,10.)
m.drawmeridians(meridians,labels=[True,False,False,True])
m.drawmapboundary(fill_color='white')
plt.title("Forecast {0} days out".format(day_out))
plt.show()
</pre>
<p>Now the map looks like this:</p>
<img alt="us" class="align-right" src="/images/us.png" style="width: 500.0px; height: 300.0px;" />
<p>Awesome, now we have the area of our interest -- a map of the contiguous United States. Let's put some data on this map. First, let's just start by putting the points on the map. Here I am just going to make some small changes to the code in the previous code block -- namely, I am going to take the latitudes and longitudes from our dataset and convert them into the map's projection. In this case, they will be converted into the Mercator projection I've defined:</p>
<pre class="literal-block">
m = Basemap(projection='merc',llcrnrlat=20,urcrnrlat=50,\
llcrnrlon=-130,urcrnrlon=-60,lat_ts=20,resolution='i')
m.drawcoastlines()
m.drawcountries()
# draw parallels and meridians.
parallels = np.arange(-90.,91.,5.)
# Label the meridians and parallels
m.drawparallels(parallels,labels=[False,True,True,False])
# Draw Meridians and Labels
meridians = np.arange(-180.,181.,10.)
m.drawmeridians(meridians,labels=[True,False,False,True])
m.drawmapboundary(fill_color='white')
plt.title("Forecast {0} days out".format(day_out))
x,y = m(lon, lat) # This is the step that transforms the data into the map's projection
m.plot(x,y, 'bo', markersize=5)
plt.show()
</pre>
<p>Now we have a map with the location of the weather stations mapped:</p>
<img alt="us" class="align-right" src="/images/blue_us.png" style="width: 500.0px; height: 300.0px;" />
<p>This is nice and all, but it would be great if we could color each of the points by their forecasted maximum temperature -- so let's do that! Here we have to define what points we want to color, and what we want to color them by:</p>
<pre class="literal-block">
m = Basemap(projection='merc',llcrnrlat=20,urcrnrlat=50,\
llcrnrlon=-130,urcrnrlon=-60,lat_ts=20,resolution='i')
m.drawcoastlines()
m.drawcountries()
# draw parallels and meridians.
parallels = np.arange(-90.,91.,5.)
# Label the meridians and parallels
m.drawparallels(parallels,labels=[True,False,False,False])
# Draw Meridians and Labels
meridians = np.arange(-180.,181.,10.)
m.drawmeridians(meridians,labels=[True,False,False,True])
m.drawmapboundary(fill_color='white')
plt.title("Forecast {0} days out".format(day_out))
# Define a colormap
jet = plt.cm.get_cmap('jet')
# Transform points into Map's projection
x,y = m(lon, lat)
# Color the transformed points!
sc = plt.scatter(x,y, c=temperature, vmin=0, vmax =35, cmap=jet, s=20, edgecolors='none')
# And let's include that colorbar
cbar = plt.colorbar(sc, shrink = .5)
cbar.set_label(temp)
plt.show()
</pre>
<p>And finally, now we have a map with colored points:</p>
<img alt="us" class="align-right" src="/images/color_us.png" style="width: 500.0px; height: 300.0px;" />
<p>Interested in playing with this more on your own? Here are a few exercises you can try:</p>
<blockquote>
<ol class="arabic simple">
<li>In the first graph -- include the weather forecast through time for multiple stations. Color each set of lines differently for each weather station. Also color the points differently for each.</li>
<li>In the second graph -- Try creating a figure with subplots and show the forecasted Max Temperature and forecasted Min Temperature as a function of Latitude side by side.</li>
<li>In the histogram -- Try overlaying the distribution of Max T values for day 2 with the distribution of Min T values for the same day.</li>
<li>For the map -- Create a figure with multiple maps, where each map shows the forecasted distribution of temperature for each day out. Change the location of labels.</li>
<li>What is the difference between the temperature forecast made on April 22, 2014 and the forecasts made on previous days? Can you map the differences?</li>
</ol>
</blockquote>
<p>That's it for this workshop! Hope you had fun, and I would love to see what you come up with!</p>
</div>
<div class="section" id="more-info-on-my-code">
<h2><strong>More Info on My Code</strong></h2>
<p>Interested in using the notebooks? Check out my <a class="reference external" href="https://github.com/ginaschmalzle/pyladies_matplotlib_ipython_notebooks">Github page</a> which includes the codes, data and instructions on how to use them. Any comments or suggestions are welcome!</p>
</div>
<div class="section" id="acknowledgements">
<h2><strong>Acknowledgements</strong></h2>
<p>Thanks to <a class="reference external" href="http://www.meetup.com/Seattle-PyLadies/">PyLadies Seattle</a>, specifically <a class="reference external" href="http://www.erinshellman.com/">Erin Shellman</a> and <a class="reference external" href="https://www.linkedin.com/pub/wendy-grus/12/1a6/8ba">Wendy Grus</a> for organizing this fun little workshop! Also many thanks to <a class="reference external" href="http://adadevelopersacademy.org/">Ada Developers Academy</a> for providing the space.</p>
</div>
The Million Song Database and Recommendation Systems (2014-07-27)<div class="section" id="building-recommendation-systems">
<h2><strong>Building Recommendation Systems</strong></h2>
<p>Recommender systems filter information to predict how much a user would like a given item. Companies like Netflix and Tivo use these types of filtering algorithms to try to figure out what a person will want. Unfortunately, these systems are not perfect, and sometimes can go horribly wrong, as elegantly described by Patton Oswalt on the Conan O'Brien show:</p>
<div class="youtube youtube-16x9"><iframe src="https://www.youtube.com/embed/tdzIXkj1OfA?start=195&end=272&version=3" allowfullscreen seamless frameBorder="0"></iframe></div><p>Yes, bad Tivo.</p>
<p>So how do we improve recommender systems? Companies as well as academics are trying hard to figure this out. Fortunately, some groups have released large datasets so that anyone can play with them and try to solve these issues. One such publicly available dataset is <a class="reference external" href="http://labrosa.ee.columbia.edu/millionsong/">The Million Song Dataset</a> -- a perfect dataset for building recommender systems! So, I thought I would give it a try.</p>
<p>For this project, I focused on the <a class="reference external" href="http://labrosa.ee.columbia.edu/millionsong/tasteprofile">Taste Profile subset</a> provided by Echonest, which includes information on user play lists, to build the recommenders located on my <a class="reference external" href="https://github.com/ginaschmalzle/million_song">Github page</a>. I built two recommenders: one that figures out what songs a user would like given an input of a selected song, and another that recommends songs based on what the user already has in their play list.</p>
<p>Both recommenders use a combination of <a class="reference external" href="http://en.wikipedia.org/wiki/Collaborative_filtering">collaborative filtering techniques</a> with vote counting. Collaborative filtering makes recommendations by collecting taste preferences and comparing them to other users. Here we assume that others who have the same song in their play list have similar tastes. Therefore, songs in those other users' play lists would be good ones to recommend. In these recommenders I ultimately get to a list of songs that were provided by other users. I then count up how many times a song appears in other people's play lists (vote counting) and spit out the top counted songs as the top recommended songs. In this blog I briefly describe the approach for both the simple, single-song recommender and the slightly more complex recommender for users with a play list.</p>
</div>
<div class="section" id="the-data">
<h2><strong>The Data</strong></h2>
<p>The <a class="reference external" href="http://labrosa.ee.columbia.edu/millionsong/tasteprofile">Taste Profile subset</a> contains over a million users with over 380,000 unique songs. I only use a very small subset of data that includes:</p>
<ol class="arabic simple">
<li>A unique user ID</li>
<li>All the songs in the user's play list, including:</li>
</ol>
<blockquote>
<ul class="simple">
<li>Song name and id</li>
<li>Artist name and id</li>
<li>The number of times the song was played by the user</li>
</ul>
</blockquote>
</div>
<div class="section" id="the-simple-recommender">
<h2><strong>The Simple Recommender</strong></h2>
<p>For my simple recommender I don't know anything about the person selecting the song. All I know is the selected artist and song. The steps for this recommender include:</p>
<ol class="arabic simple">
<li>Find all users that have the song in their play list</li>
<li>Make a list of all songs from each person's play list</li>
<li>Count how many times a unique song appears in the list</li>
<li>Print out the songs, excluding the original input song, in order of most counts</li>
</ol>
<p>Easy cheesy, right?</p>
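<p>The four steps can be sketched in a few lines with collections.Counter. The play lists below are made-up toy data, not the actual Taste Profile records:</p>
<pre class="literal-block">
from collections import Counter

# Hypothetical play lists: user id -> list of song ids
playlists = {
    'u1': ['yeah', 'burn', 'confessions'],
    'u2': ['yeah', 'burn', 'hey_ya'],
    'u3': ['hey_ya', 'milkshake'],
}

def recommend_from_song(song, playlists, top_n=3):
    # 1. Find all users that have the song in their play list
    fans = [user for user, songs in playlists.items() if song in songs]
    # 2. and 3. Pool those users' other songs and count the votes
    votes = Counter(s for user in fans for s in playlists[user] if s != song)
    # 4. Most-counted songs first, with the input song excluded
    return [s for s, count in votes.most_common(top_n)]

print(recommend_from_song('yeah', playlists))  # 'burn' gets two votes and tops the list
</pre>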
<p>To illustrate the outcome of this recommender, here is a plot of the top 10 most counted songs from other people's play lists given the song Yeah! by Usher (keep in mind these are the counts for my much smaller subset of data):</p>
<img alt="top10" class="align-center" src="/images/top10.jpg" style="width: 700.0px; height: 700.0px;" />
</div>
<div class="section" id="adding-user-play-list-into-a-recommender">
<h2><strong>Adding User Play List into a Recommender</strong></h2>
<p>Adding a user play list into a recommender is slightly more complex. Here, I want to know what other users are most similar to the recommendee (for lack of a better term, I define the recommendee as the person who is going to get the recommendation), then suggest songs from the similar users' play lists. The steps for this recommender include:</p>
<ol class="arabic simple">
<li>For each song in the recommendee play list, make a list of all users that also have that song in their play list.</li>
<li>Count the number of times a unique user is in the list. The user with the most counts is the most similar to the recommendee.</li>
<li>Pick the most similar users and concatenate a list of songs that were not in the recommendee's play list.</li>
<li>Count the number of times a song shows up in the list</li>
<li>Print out the songs in order of most counted</li>
</ol>
<p>Slightly more complicated than the simple recommender, but generally the same idea.</p>
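<p>A sketch of this version too, again with made-up play lists; the similarity step is just another round of counting:</p>
<pre class="literal-block">
from collections import Counter

# Hypothetical play lists: user id -> set of song titles.
play_lists = {
    'me':     {'Song A', 'Song B'},
    'user_1': {'Song A', 'Song B', 'Song C'},
    'user_2': {'Song A', 'Song D'},
    'user_3': {'Song E'},
}

def recommend_for_user(user, play_lists, top_users=2, top_n=10):
    """Recommend songs from the play lists of the most similar users."""
    mine = play_lists[user]
    # Steps 1-2: count how many of my songs each other user shares.
    overlap = Counter()
    for other, songs in play_lists.items():
        if other != user:
            overlap[other] = len(mine.intersection(songs))
    # Step 3: gather songs from the most similar users that I do not have.
    votes = Counter()
    for other, shared in overlap.most_common(top_users):
        if shared:  # Skip users with nothing in common
            votes.update(play_lists[other] - mine)  # Step 4: count them
    return [song for song, n in votes.most_common(top_n)]  # Step 5
</pre>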
</div>
<div class="section" id="pitfalls">
<h2><strong>Pitfalls</strong></h2>
<p>There are issues with these simple approaches. They work well for the small data set that I downloaded, but as the dataset gets larger, the lists and dictionaries that I make in my code also get larger. So, this approach will take up increasing amounts of memory to make my lists, and increasing amounts of time to sort the lists and count the number of songs. <a class="reference external" href="http://en.wikipedia.org/wiki/Collaborative_filtering">Model-based approaches</a> help to minimize these issues. Another issue is making recommendations based on new songs, or songs that very few people have listened to. In these cases other information about the song, such as genre, would be needed to make recommendations.</p>
</div>
<div class="section" id="more-info-on-my-code">
<h2><strong>More Info on My Code</strong></h2>
<p>Interested in using my recommenders? Check out my <a class="reference external" href="https://github.com/ginaschmalzle/million_song">Github page</a> which includes the codes, instructions on how to use them, and some more information on how the codes work. Any comments or suggestions are welcome!</p>
</div>
<div class="section" id="acknowledgements">
<h2><strong>Acknowledgements</strong></h2>
<p>Thanks to <a class="reference external" href="http://www.linkedin.com/pub/stella-rowlett/0/797/118">Stella Rowlett</a>, <a class="reference external" href="http://jasongowans.net/">Jason Gowans</a> and <a class="reference external" href="http://www.linkedin.com/in/manjudotorg">Manju Muthukumaresan</a> for suggesting this project!</p>
</div>
My big fat shoe-shopping adventure: Iterative sampling in R2014-07-27T14:56:00-04:00Gina Schmalzle and Craig Fauncetag:geodesygina.com,2014-07-27:sampling.html<div class="section" id="r-helped-me-figure-out-how-many-shoes-i-can-buy">
<h2><strong>R helped me figure out how many shoes I can buy</strong></h2>
<p>One of the things I love about coding and data science is that I get to work on a lot of interesting problems. One of my good friends, <a class="reference external" href="https://www.linkedin.com/pub/craig-h-faunce/66/789/1ba">Craig Faunce</a>, approached me over a beer with a problem. It seems he had been asked to determine how many items he could buy given a certain budget. Ok, if each and every item costs the same this is simple math, which had me puzzled as to why he was asking. Of course it’s not that easy, since each and every item has a different cost. Ok, still not that difficult. It only becomes something that I think you would be interested in when he gets to the next part, where he says: "I'm asked to sample one population of items at a given rate, and then with my left-over money, determine at what rate I can afford to sample a second, totally different population of items with totally different costs per item."</p>
<p>Ok! We have an interesting little sampling project. Since Craig works for a large employer, he can't really divulge every gory detail about this issue, and obviously getting the real data isn't going to happen here. Besides, it sounded pretty boring to me, so I thought about something that I can relate to - shoes!</p>
<img alt="if_the_shoe_fits" class="align-center" src="/images/shoefits.jpg" style="width: 600.0px; height: 500.0px;" />
<p><a class="reference external" href="http://www.kulfoto.com/funny-pictures/49597/if-the-shoe-fits-buy-it-in-every-color">Figure 1</a> Ahh, too cute...</p>
<p>So I reframed the questions.</p>
<p>My first question is: If this year (hopefully during a big Sale) I were to blindly have an assigned shopper (or better yet, a blind assigned shopper) randomly buy a set percentage of the store, how much money would I spend? The reason we want to sample in this exercise is due to the fact that the answer depends on which shoes are purchased, since each one has a different price. So we are interested in building a distribution of potential outcomes from shoe-shopping, so we can build a range of likely outcomes from the adventure.</p>
<p>We will need the following libraries:</p>
<pre class="literal-block">
require(plyr)
require(ggplot2)
</pre>
<p>The actual data doesn't really matter for this exercise, so let's generate some with these parameters:</p>
<pre class="literal-block">
nshoe1 <- 1000 # Number of shoes in the store in the first year.
meanprice1 <- 100 # Mean price of shoes in the first year.
pricesd1 <- 50 # Standard deviation of the price of shoes in the first year.
R <- 0.01 # The sampling rate of my shopper in the first year.
it <- 200 # The number of iterations to build our distribution of outcomes.
</pre>
<p>I created a makedata function to create a dataframe in R consisting of nshoe rows with the associated price (called bucks) generated from a known distribution (in this case the normal, but who cares?) with a mean price of meanprice1 and a standard deviation of pricesd1:</p>
<pre class="literal-block">
makedata <- function (numberofshoes, dm, sdv){
  # Assign number of shoes
  df <- data.frame(shoes = seq(1:numberofshoes))
  # Assign random # of bucks for each shoe
  df$bucks <- rnorm(n = numberofshoes, mean = dm, sd = sdv)
  return (df)
}
</pre>
<p>The function sampleme samples from the dataframe that was created from the makedata function above:</p>
<pre class="literal-block">
sampleme <- function(dataframe, samplerate){
  # Generate a subsample of shoe numbers, then take the associated
  # bucks and stick them into sdf.
  sdf <- data.frame(shoes = sample(1:nrow(dataframe), size = samplerate*nrow(dataframe)))
  sdf <- merge(sdf, dataframe, all.x = TRUE)
  return (sdf)
}
</pre>
<p>Finally, a third function storesamples enables the outcome of each random sample to be stored and appended to prior samples for later use:</p>
<pre class="literal-block">
storesamples <- function(iteration, df, sr){
  for (iter in 1:iteration){
    sdf <- sampleme(dataframe = df, samplerate = sr)
    sdf$index <- iter
    ifelse(iter == 1, allsdf <- sdf, allsdf <- rbind(allsdf, sdf))
  }
  return(allsdf)
}
</pre>
<p>Note that the function storesamples calls function sampleme.</p>
<p>Now that I have my functions, let's figure out how much money I spend if I buy 1% of the store's inventory:</p>
<pre class="literal-block">
# make a dataframe
shoesinstore1 <- makedata(nshoe1, meanprice1, pricesd1)
# calculate how much $$ you spent by buying 1% of the inventory
moneyIspent <- storesamples(it,shoesinstore1,R)
</pre>
<p>Now let's make a summary of the money I just spent and print it out:</p>
<pre class="literal-block">
summarya <- ddply(moneyIspent, .(index), summarize, Totalbucks = floor(sum(bucks)))
summary(summarya$Totalbucks)
</pre>
<p>In my last run, here are my results:</p>
<pre class="literal-block">
Min. 1st Qu. Median Mean 3rd Qu. Max.
604.0 897.8 1009.0 1010.0 1120.0 1383.0
</pre>
<p>So I can expect my blind shopper to come back with a Visa/AmEx/Mastercard charge of around a thousand bucks, but it could be as low as $604 or as high as $1383 (still within my spending limit, whew!).
Now let's plot our results using a histogram:</p>
<pre class="literal-block">
(ggplot(summarya, aes(x=Totalbucks))
+ geom_histogram()
)
</pre>
<p>This gives you:</p>
<img alt="money_I_spent" class="align-center" src="/images/moneyIspend.png" style="width: 700.0px; height: 400.0px;" />
<p>Now for my second question. The following year I am <em>given the same amount of money I spent last year</em> as my budget. <em>What percentage of the store's inventory in year 2 can I buy given the amount of money I spent last year?</em></p>
<p>Here we have reversed the sampling question from year 1: instead of sampling at a fixed rate to generate a distribution of credit card debts, we now have a distribution of available spending limits, and are asked to generate a distribution of expected percentage of the store purchased.</p>
<p>To ensure we don't go over our budget, we can't create a single sample of a given number of shoes as above; we have to select a single pair of shoes, evaluate its cost against our remaining funds, and then repeat until we have no more money. In addition, of course, we need to count the number of shoes. We select each pair of shoes and conduct our evaluation with our shoesIcanbuy function:</p>
<pre class="literal-block">
shoesIcanbuy <- function(dataframe, mypurse){
  numofshoepairs <- 0
  while (mypurse > 0) {
    Shoe.pair <- dataframe[sample(nrow(dataframe), 1), ]  # Pick a random pair of shoes
    if (mypurse >= Shoe.pair$bucks){        # As long as I have enough money in my purse
      mypurse <- mypurse - Shoe.pair$bucks  # Buy a pair of shoes and subtract their price from my budget
      numofshoepairs <- numofshoepairs + 1  # Record the number of shoes I bought
    }
    else {
      break
    }
  }
  return(numofshoepairs)  # Return the number of shoes I bought
}
</pre>
<p>However, the above function only gets us so far; our real interest lies in the summary of multiple shoe-shopping extravaganzas, which (you guessed it) we will conduct with another function:</p>
<pre class="literal-block">
how_many_shoes_in_store_I_bought <- function(dataframe, summarya, it){
  numofshoepairs <- array()  # Declare an array
  for (i in 1:nrow(summarya)) {  # Use each row in summarya as my starting budget
    mypurse <- summarya[i, 2]
    for (j in 1:it){  # Figure out how many shoes I bought with each starting budget
      numofshoepairs[j] <- shoesIcanbuy(dataframe, mypurse)
    }
    numofshoepairs.df <- data.frame(Shoes = numofshoepairs)
    ifelse(i == 1, numofshoepairs.masterdf <- numofshoepairs.df,
           numofshoepairs.masterdf <- rbind(numofshoepairs.masterdf, numofshoepairs.df))
  }
  return(numofshoepairs.masterdf)
}
</pre>
<p>Now let's make this a little more realistic by making a completely different shoe line-up in the store for year 2 (nshoe2, meanprice2 and pricesd2 are the year-2 counterparts of the parameters we defined for year 1):</p>
<pre class="literal-block">
shoesinstore2 <- makedata(nshoe2, meanprice2, pricesd2)
</pre>
<p>Now collect information on how many shoes I bought, and the corresponding percentage of how many shoes I bought in the store:</p>
<pre class="literal-block">
numofshoepairs.masterdf <- how_many_shoes_in_store_I_bought(shoesinstore2,summarya,it)
</pre>
<p>Calculate a percent of the store by taking the number of shoes I bought and dividing it by the corresponding number of shoes in the store, and multiplying by 100:</p>
<pre class="literal-block">
numofshoepairs.masterdf$Percent<-(numofshoepairs.masterdf$Shoes/nrow(shoesinstore2))*100
</pre>
<p>OK, let's see how much of the store I bought out:</p>
<pre class="literal-block">
summary(numofshoepairs.masterdf$Percent)
</pre>
<p>which gives:</p>
<pre class="literal-block">
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2143 0.5000 0.5714 0.5736 0.6429 1.0710
</pre>
<p>and how many shoes I bought:</p>
<pre class="literal-block">
summary(numofshoepairs.masterdf$Shoes)
</pre>
<p>which gives:</p>
<pre class="literal-block">
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 7.000 8.000 8.031 9.000 15.000
</pre>
<p>So, I bought about 8 pairs of shoes.</p>
<p>Finally, let's plot a histogram of the percentage of shoes in the store I bought:</p>
<pre class="literal-block">
(ggplot(numofshoepairs.masterdf, aes(x=Percent))
+ geom_histogram(aes(y=..density..), fill="gray", color="black", binwidth = .1)
+ theme_bw()
+ geom_vline(xintercept = mean(numofshoepairs.masterdf$Percent), color="blue")
)
</pre>
<p>And you get:</p>
<img alt="percent_store_inventory" class="align-center" src="/images/percent_store_invent.png" style="width: 700.0px; height: 300.0px;" />
<p>And that's our shoe-shopping adventure: sampling with R's built-in sample function, where the sampling rate determined the size of each sample, and then with our own function, where we sampled individual elements of a population and evaluated each outcome against a set threshold. Sampling forwards and backwards. Have fun, and good shopping!</p>
<p>Interested in getting your hands on the code? Check it out in my <a class="reference external" href="https://github.com/ginaschmalzle/MyShoes">Github Repo</a>.</p>
</div>
SQLite3 Databases: Creating, Populating and Retrieving Data, Part 32014-07-27T13:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-07-27:ret_db.html<p>In <a class="reference external" href="http://geodesygina.com/make_db.html">Part 1 Creating a Database with SQLite3</a> we built a database. In <a class="reference external" href="http://geodesygina.com/pop_db.html">Part 2 Populating an SQLite Database using Python</a> we inserted values into TABLES within our database using Python 3.4 and SQLite3. Here we continue using the functionality of Python 3.4 to retrieve and visualize forecasts contained within our database. Again, I cannot thank <a class="reference external" href="https://github.com/brannerchinese">David Branner</a> enough for his efforts with this project!</p>
<p>Our desired end-product will be the map below, showing, for a specific calendar day, the differences between the day-of forecast and the forecasts made for that same day several days earlier.</p>
<img alt="differenced_forecasts" class="align-left" src="/images/weather_diff.png" style="width: 800.0px; height: 500.0px;" />
<p><em>Figure 1. Maps of forecasted differences (the difference between the day of forecast and the forecast for X days out).</em></p>
<div class="section" id="part-3-retrieving-data-from-an-sqlite-database-using-python">
<h2><strong>Part 3. Retrieving data from an SQLite Database using Python</strong></h2>
<p>First we need to retrieve the weather forecast data we collected in the previous posts. Our database contains the forecasted maximum temperature (maxt), minimum temperature (mint), rain and snow for a given day, with forecasts made from the day itself to fourteen days prior. So, we need to be able to extract this information from the database. This code uses the sqlite3 module in Python to extract the information:</p>
<pre class="literal-block">
import os
import sqlite3

def get_single_date_data_from_db(exact_date, db='weather_data_OWM.db'):
    """Retrieve forecasts for a single date."""
    # exact_date should be in the form YYYYMMDD
    connection = sqlite3.connect(db)
    with connection:
        cursor = connection.cursor()
        try:
            cursor_output = cursor.execute(  # This should all be old hat to you now...
                '''SELECT lat, lon, '''
                '''maxt_0, mint_0, rain_0, snow_0, '''
                '''maxt_1, mint_1, rain_1, snow_1, '''
                '''maxt_2, mint_2, rain_2, snow_2, '''
                '''maxt_3, mint_3, rain_3, snow_3, '''
                '''maxt_4, mint_4, rain_4, snow_4, '''
                '''maxt_5, mint_5, rain_5, snow_5, '''
                '''maxt_6, mint_6, rain_6, snow_6, '''
                '''maxt_7, mint_7, rain_7, snow_7, '''
                '''maxt_8, mint_8, rain_8, snow_8, '''
                '''maxt_9, mint_9, rain_9, snow_9, '''
                '''maxt_10, mint_10, rain_10, snow_10, '''
                '''maxt_11, mint_11, rain_11, snow_11, '''
                '''maxt_12, mint_12, rain_12, snow_12, '''
                '''maxt_13, mint_13, rain_13, snow_13, '''
                '''maxt_14, mint_14, rain_14, snow_14 '''
                '''FROM locations, owm_values '''
                '''ON owm_values.location_id=locations.id '''
                '''WHERE target_date=?''', (exact_date,))
        except Exception as e:  # What exceptions may we encounter here?
            print(e)
    retrieved_data = cursor_output.fetchall()  # We receive a list of simple tuples from the database.
    # Now we need some function that converts the retrieved data into a dictionary.
    composed_data = generate_dict_of_tuples(retrieved_data)
    return composed_data
</pre>
<p>Note the line:</p>
<pre class="literal-block">
composed_data = generate_dict_of_tuples(retrieved_data)
</pre>
<p>Here we need some way to make a usable form of the dataset. In this case the function generate_dict_of_tuples receives the raw data from the SQLite3 database and converts it into a more usable dictionary of tuples:</p>
<pre class="literal-block">
def generate_dict_of_tuples(retrieved_data):
    """Compose the data into a succinct dictionary of tuples."""
    # Our re-composed data type is a dictionary:
    #   key: tuple containing latitude and longitude (floats);
    #   value: list of 15 forecast tuples, each containing
    #          maxt, mint, rain, snow (floats).
    # For dates where the database contains no data, the forecast tuple
    # would be `(None, None, None, None)`, so it is replaced by `None`
    # using the `if-else` clause below.
    composed_data = {}
    for item in retrieved_data:
        lat_lon = item[0:2]
        forecasts = [subitem
                     if subitem[0] or subitem[1] or subitem[2] or subitem[3]
                     else None
                     for subitem in
                     zip(item[2::4], item[3::4], item[4::4], item[5::4])]
        composed_data[lat_lon] = forecasts
    return composed_data
</pre>
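<p>The zip of strided slices is what folds each flat database row into 4-tuples of maxt, mint, rain and snow; here is the trick on a shortened row (lat, lon, then two forecasts):</p>
<pre class="literal-block">
row = (38.576698, -92.173523, 18.71, 6.97, 0, 0, 21.03, 8.7, 0, 0)
forecasts = list(zip(row[2::4], row[3::4], row[4::4], row[5::4]))
print(forecasts)  # [(18.71, 6.97, 0, 0), (21.03, 8.7, 0, 0)]
</pre>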
<p>Now having both of these functions in place, if we run:</p>
<pre class="literal-block">
get_single_date_data_from_db(20140522)
</pre>
<p>We get a dictionary that looks like this:</p>
<pre class="literal-block">
{(38.576698, -92.173523): [(18.71, 6.97, 0, 0),
(21.03, 8.7, 0, 0),
(20.67, 9.72, 0, 0),
(19.01, 7.23, 0, 0),
(22.08, 9.07, 0, 0),
(21.68, 9.53, 0.34, 0),
(22.33, 10.22, 0, 0),
(16.18, 12.14, 1.23, 0),
(19.05, 12.02, 10.08, 0),
None,
None,
None,
None,
None,
None],
(34.154179, -117.344208): [(17.37, 6.16, 0, 0),
(19.66, 7.48, 0, 0),
(21.24, 6.27, 0, 0),
(21.71, 5.5, 0, 0),
(18.34, 8.88, 0, 0),
(20.78, 4.73, 0, 0),
(20.78, 4.73, 0, 0),
(22.96, 7.06, 0, 0),
(20.78, 4.73, 0, 0),
None,
None,
None,
None,
None,
None],
.
.
.}
</pre>
<p>The keys are the locations' latitudes and longitudes, and the values are the forecasts. In this example we have 9 forecasts: one for the day of and 8 days out (forecasts that are not present are marked as 'None').</p>
<p>Fabulous. In <em>Figure 1</em> we focus only on the maximum temperature (maxt) forecasts. We visualize the absolute differences between the maximum forecasted values for the day of and the forecasted value for that day at some time in the past. The differenced values are presented on a map of the United States using warm colors to reflect that the forecast the day of was warmer and cooler colors to reflect cooler temperatures (pun intended). With our data extracted, we need only to calculate the differences and we will plot the data using python's matplotlib with the basemap toolkit.</p>
<p>This visualization will include six subplots, one for each successive day leading up to our target date. Thinking about this another way, if our target date is April 22, 2014 (20140422), and we assign that the letter t, then we are making a subplot for the differences between t, the day-of forecast, and the forecast made at t-1, t-2, t-…n days.</p>
<p>To collect the data for our target date we run the function below, which makes lists containing the latitude, longitude and differences, and sends them off to be processed by our next function:</p>
<pre class="literal-block">
def make_map(target_date=20140422):
    '''Make a basic map of the United States'''
    # target_date is the day the forecasts were made for
    lat = []; lon = []; diff = []
    forecasts = get_single_date_data_from_db(target_date)  # Call earlier function to get dictionary
    for city in forecasts:
        # First collect the lats and lons of the cities
        lat.append(city[0])
        lon.append(city[1])
        # Collect differenced maxt values for days 1 through 8
        diff.append([forecasts[city][0][0] - forecasts[city][day][0]
                     for day in range(1, 9)])
    make_basemap(lon, lat, diff, target_date)  # Send this information to make_basemap --> our next function!
    plt.show()
</pre>
<p>The second function, which we have named "make_basemap", does the mapping work:</p>
<pre class="literal-block">
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap, cm

def make_basemap(lon, lat, diff, target_date):
    for day in range(1, 7):  # Run this for each forecasted difference
        subdiff = []
        for city in range(0, len(diff)):
            subdiff.append(diff[city][day])
        plt.subplot(3, 2, day)  # Define where the subplot will lie on the figure
        # Determine min and max differenced values
        mindiff = min(subdiff)
        maxdiff = max(subdiff)
        # Create Mercator Projection Basemap instance.
        m = Basemap(projection='merc',
                    llcrnrlat=25, urcrnrlat=50,
                    llcrnrlon=-130, urcrnrlon=-60,
                    rsphere=6371200., resolution='l', area_thresh=10000)
        # Draw coastlines, state and country boundaries, edge of map.
        m.drawcoastlines()
        m.drawstates()
        m.drawcountries()
        # Draw parallels.
        parallels = np.arange(0., 90, 10.)
        m.drawparallels(parallels, labels=[1, 0, 0, 0], fontsize=10)
        # Draw meridians.
        meridians = np.arange(180., 360., 10.)
        m.drawmeridians(meridians, labels=[0, 0, 0, 1], fontsize=10)
        # Draw circles on the map, colored by the differenced values.
        jet = plt.cm.get_cmap('jet')
        x, y = m(lon, lat)
        sc = plt.scatter(x, y, c=subdiff, vmin=mindiff, vmax=maxdiff,
                         cmap=jet, s=8, edgecolors='none')
        # Add colorbar.
        plt.colorbar(sc)
        # Add titles.
        plt.suptitle("Differenced Max Temperatures (degrees C) for day "
                     + str(target_date), fontsize=18)
        plt.title("Forecast Day 0 - Day " + str(day))
</pre>
<p>Executing make_map() we get Figure 1. Note that a subplot is created for each differenced forecast through a for loop, which also defines the subplot being created.</p>
<p>Like what you see? Stay tuned, the next step on my agenda is making an interactive website that will allow users to play with the data! Thanks for reading!</p>
</div>
SQLite3 Databases: Creating, Populating and Retrieving Data, Part 22014-07-09T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-07-09:pop_db.html<p>In <a class="reference external" href="/make_db.html">Part 1 Creating a Database with SQLite3</a> we built a database. Here we will use the functionality of Python 3.4 to help populate the database created in Part 1 with data and weather forecasts. This blog assumes you followed Part 1 and have some prior knowledge of Python. Much of this work was done with <a class="reference external" href="https://github.com/brannerchinese">David Branner</a>, who was incredibly patient in teaching me how to do this... Kudos, David!</p>
<div class="section" id="part-2-populating-an-sqlite-database-using-python">
<h2><strong>Part 2. Populating an SQLite Database using Python</strong></h2>
<p>Let's put some data into our database! First, let's fill up our <em>locations</em> TABLE. We collected and keep a list of the cities, their unique codes provided by Open Weather Map, their latitudes and longitudes and their country codes <a class="reference external" href="https://raw.githubusercontent.com/WeatherStudy/weather_study/master/data/city_lists/city_list_normalized_20140425-1923.txt">here</a>, in a file called city_list_normalized_20140425-1923.txt. This file contains information on the cities and looks like this:</p>
<pre class="literal-block">
id nm lat lon countryCode
819827 Razvilka 55.591667 37.740833 RU
524901 Moscow 55.752220 37.615555 RU
1271881 Firozpur Jhirka 27.799999 76.949997 IN
1283240 Kathmandu 27.716667 85.316666 NP
703448 Kiev 50.433334 30.516666 UA
1282898 Pokhara 28.233334 83.983330 NP
3632308 Merida 8.598333 -71.144997 VE
.
.
.
</pre>
<p>We need a way to grab this file and read the contents in python. Let's create a function that will do just that. If we are in the directory containing the file called city_list_normalized_20140425-1923.txt, we can call the file in python and read its contents:</p>
<pre class="literal-block">
def isolate_city_codes():
    filename = 'city_list_normalized_20140425-1923.txt'
    with open(filename, 'r') as f:
        contents = f.read()
    list_of_lines = [line.split('\t') for line in contents.split('\n')[1:]]
    # Latitude and longitude should be numbers
    for i in range(1, len(list_of_lines)-1):
        list_of_lines[i][2] = float(list_of_lines[i][2])
        list_of_lines[i][3] = float(list_of_lines[i][3])
    return list_of_lines
</pre>
<p>Let's break down what the function is doing. The first thing is that it defines a string called filename as 'city_list_normalized_20140425-1923.txt'. The next two lines of code are contained in a 'with statement'. A 'with statement' is a context manager, which provides a way to safely close the opened file and exit out of the python script in case of an error. The contents of the file are read and placed into the variable <em>contents</em> that looks something like this:</p>
<pre class="literal-block">
'id\tnm\tlat\tlon\tcountryCode\n819827\tRazvilka\t55.591667\t37.740833\tRU\n524901\tMoscow\t55.752220\t37.615555\tRU ...
...
\n895417\tBanket\t-17.383329\t30.400000\tZW\n'
</pre>
<p>Notice that the once tab-separated entries of the file now appear with '\t' between fields and '\n' between lines. The next line of our program defines list_of_lines, which loops through contents, splitting out each line (on '\n') and each tab-separated field (on '\t'). list_of_lines now looks like this:</p>
<pre class="literal-block">
[['819827', 'Razvilka', '55.591667', '37.740833', 'RU'],
['524901', 'Moscow', '55.752220', '37.615555', 'RU'],
['1271881', 'Firozpur Jhirka', '27.799999', '76.949997', 'IN'],
.
.
.
['895417', 'Banket', '-17.383329', '30.400000', 'ZW']
]
</pre>
<p>So, list_of_lines is a <em>list of lists</em>, where each list contains the contents within a set of square brackets. The problem with the current list_of_lines is that the latitudes and longitudes are strings and must be converted into floats, which is done with the for statement. Finally, the revised list_of_lines, with floats for the latitude and longitude, is returned.</p>
<p>Now, let's populate the TABLE <em>locations</em> in the sqlite3 database <em>weather_data_OWM.db</em>. We write another function that calls the previous function to grab the data, then it populates the <em>locations</em> TABLE with those values:</p>
<pre class="literal-block">
import sqlite3

def populate_db_w_city_codes(db='weather_data_OWM.db'):
    connection = sqlite3.connect(db)
    with connection:
        city_codes = isolate_city_codes()
        cursor = connection.cursor()
        for code in city_codes[1:-1]:
            if code == ['']:
                print('\n Empty tuple found; skipping.\n')
                continue
            cursor.execute(
                '''INSERT INTO locations VALUES''' +
                str(tuple(code)))
</pre>
<p>Note that we have to import the python module sqlite3. This module allows you to 'connect' with a specified database. Once you have a connection, you can create a cursor object that calls its execute() method to perform SQLite3 commands. In the function described above, we create a <em>connection</em> which connects to our database (db = 'weather_data_OWM.db'). Then we apply a context manager (the with statement) to:</p>
<blockquote>
<ol class="arabic simple">
<li>Collect the information contained within <em>city_list_normalized_20140425-1923.txt</em> by calling our previous function, <em>isolate_city_codes()</em>. The returned <em>list_of_lines</em> from <em>isolate_city_codes()</em> is now labeled as <em>city_codes</em>.</li>
<li>Open a <em>cursor</em> that will execute subsequent <em>SQLite3</em> commands.</li>
<li>Insert the values of <em>city_codes</em> into the <em>locations</em> TABLE.</li>
</ol>
</blockquote>
<p>Notice that the SQLite3 commands are embedded in cursor.execute. The lists within <em>city_codes</em> are already in the order we want them (the same order the database columns were set up in; see <a class="reference external" href="/make_db.html">Part 1</a>). They have been 'tuple-ized' and 'string-ified', since this is the format SQLite3 understands.</p>
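<p>String-ifying the tuple works here because the file contents are trusted, but sqlite3 also accepts ? placeholders, which sidestep quoting problems entirely. Here is a sketch of the same insert using executemany, with an in-memory database and a simplified CREATE TABLE standing in for the schema from Part 1:</p>
<pre class="literal-block">
import sqlite3

connection = sqlite3.connect(':memory:')  # In-memory stand-in for weather_data_OWM.db
with connection:
    cursor = connection.cursor()
    cursor.execute('CREATE TABLE locations (id, nm, lat, lon, countryCode)')
    city_codes = [
        ['819827', 'Razvilka', 55.591667, 37.740833, 'RU'],
        ['524901', 'Moscow', 55.752220, 37.615555, 'RU'],
    ]
    # One ? per column; sqlite3 handles the quoting and escaping.
    cursor.executemany('INSERT INTO locations VALUES (?, ?, ?, ?, ?)', city_codes)
</pre>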
<p>After executing, you can now check if they were inserted into your database by entering the sqlite3 repl:</p>
<pre class="literal-block">
sqlite3 weather_data_OWM.db
</pre>
<p>Once in the sqlite repl type:</p>
<pre class="literal-block">
SELECT * FROM locations;
</pre>
<p>The output should look something like this:</p>
<pre class="literal-block">
.
.
.
894413|Chakari|-18.062941|29.89246|ZW
894460|Centenary|-16.722891|31.11462|ZW
895057|Binga|-17.620279|27.341391|ZW
895417|Banket|-17.383329|30.4|ZW
</pre>
<p>Your new table should have data that include: <em>city id|city name|latitude|longitude|two letter country code</em>.</p>
<p>Splendid! One table down, one related table to go! The second table is a bit more complicated. It involves data downloaded through the <a class="reference external" href="http://openweathermap.org/">Open Weather Map</a> API, which gives easy access to their data products in XML and JSON formats. Since this blog focuses on building and populating databases, I assume that you already have the data downloaded in JSON format. I will not get into how to download the data here, but for more information, David developed a nifty little python script called <a class="reference external" href="https://raw.githubusercontent.com/WeatherStudy/weather_study/master/code/requests.py">requests.py</a> that lets you download data using an API key that is hidden from public access (important when allowing public access to your files on Github).</p>
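<p>I will not reproduce requests.py here, but the usual way to keep a key out of a public repo is to read it from the environment; a sketch, with OWM_API_KEY as an assumed variable name:</p>
<pre class="literal-block">
import os

def get_api_key(var='OWM_API_KEY'):
    """Read the API key from the environment rather than from source code."""
    key = os.environ.get(var)
    if key is None:
        raise RuntimeError('Set the {} environment variable first.'.format(var))
    return key
</pre>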
<p>We use the JSON formatted files to populate our database. JSON files are in the form of a dictionary, also known as an associative array. If you are not familiar with this data structure, I recommend you read <a class="reference external" href="/dict.html">this little ditty</a> before continuing. Otherwise, keep on reading!</p>
<p>Below is an example of a JSON file obtained using the Open Weather Map API that has been prettified using <a class="reference external" href="http://jsbeautifier.org/">http://jsbeautifier.org/</a>. The JSON file contains the forecasts and information for a single city:</p>
<pre class="literal-block">
{
'cod': '200',
'message': 0.005,
'city': {
'name': 'Bay Minette',
'id': 4046255,
'coord': {
'lat': 30.882959,
'lon': -87.773048
},
'population': 8044,
'country': 'US'
},
'list': [{
'weather': [{
'description': 'few clouds',
'icon': '02d',
'main': 'Clouds',
'id': 801
}],
'temp': {
'max': 27.32,
'min': 18.14,
'eve': 24.57,
'day': 27.22,
'night': 18.14,
'morn': 27.22
},
'deg': 199,
'clouds': 12,
'pressure': 1020.38,
'humidity': 42,
'dt': 1398186000,
'speed': 2.11
}, {
'weather': [{
.
.
.
}],
'cnt': 15
}
</pre>
<p>Now you see that it is just one giant dictionary, right? So if we import this into python, then we can call certain values by their keys. For example, if we call this dictionary x, then we can retrieve the latitude of the city by typing:</p>
<pre class="literal-block">
x['city']['coord']['lat']
</pre>
<p>In this JSON file the 'city' key contains the information about the city itself, and the 'list' key contains information on the weather forecasts, where the first value contains information on the weather forecasts for the day the file was downloaded. The second value in 'list' contains the forecast for the next day, etc.</p>
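<p>So, with a trimmed-down version of the prettified JSON above loaded as x, the day-of forecast values come straight out of the first element of 'list':</p>
<pre class="literal-block">
# Trimmed-down version of the sample JSON above.
x = {'city': {'name': 'Bay Minette', 'coord': {'lat': 30.882959, 'lon': -87.773048}},
     'list': [{'temp': {'max': 27.32, 'min': 18.14}, 'dt': 1398186000}]}

day_of = x['list'][0]  # First value: the forecast for the download day
print(day_of['temp']['max'], day_of['temp']['min'])  # 27.32 18.14
</pre>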
<p>You can see that the file contains the minimum and maximum temperature and, if they exist, snow and rain amounts. 'dt' is the day the forecast is for, in <a class="reference external" href="http://en.wikipedia.org/wiki/Unix_time">Unix Time</a>. The 'query date', the day the file was downloaded, is not included in these files, but it is important because it tells you which forecast is the day-of forecast. We dealt with this problem by downloading the JSON files for each city into a directory named with the download date.</p>
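<p>The 'dt' value converts to a calendar date with the standard library; for instance, the 'dt' from the sample file above (times are in UTC):</p>
<pre class="literal-block">
from datetime import datetime, timezone

dt = 1398186000  # 'dt' field from the sample JSON
target = datetime.fromtimestamp(dt, tz=timezone.utc)
print(target.strftime('%Y%m%d'))  # 20140422
</pre>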
<p>The first thing we will need to do here is extract the information we need from these JSON files. For the sake of simplicity, I assume you know the download date and specify it in the Python code (rather than extracting it from the directory name). Depending on what region you are collecting data for, you may have thousands of files for one download date, each corresponding to an individual location. We would like a function that:</p>
<blockquote>
<ol class="arabic simple">
<li>Ingests these JSON-formatted files and stores the contents as a dictionary</li>
<li>Creates a smaller dictionary called <em>forecast_dict</em> that contains only the information that we need for our database. The smaller dictionary should have keys that relate to each city_id, and values that contain the forecasted values.</li>
</ol>
</blockquote>
<p>I assume that you have the names of your files in a list called <em>files</em> that were collected on a specified <em>query_date</em>. I use an example query date of 20140422:</p>
<pre class="literal-block">
files = ['yourfile1.json', 'yourfile2.json', 'yourfile3.json']  # example files
import ast

def retrieve_data_vals(files, query_date='20140422'):
    forecast_dict = {'query_date': query_date}  # Assign query_date to dictionary
    files.sort()
    for file in files:
        forecast_list_pruned = []
        try:
            with open(file, 'r') as f:
                contents = f.read()  # Read in file as a string
        except Exception as e:
            print('Error {}\n in file {}'.format(e, file))
            continue
        if contents == '\n':
            print('File {} empty.'.format(file))
            continue
        content_dict = ast.literal_eval(contents)  # Convert to dictionary
        city_id = content_dict['city']['id']  # Assign city_id
        forecast_list_received = content_dict['list']  # Everything in 'list'
        for forecast in forecast_list_received:  # For each forecast
            if 'rain' in forecast:  # Assign rain, if it exists,
                rain = forecast['rain']  # otherwise make it 0
            else:
                rain = 0
            if 'snow' in forecast:  # Same with snow
                snow = forecast['snow']
            else:
                snow = 0
            forecast_tuple = (  # Tuple form that is SQLite3 readable (if stringified)
                forecast['dt'],
                float(forecast['temp']['max']),
                float(forecast['temp']['min']),
                float(rain),
                float(snow),
            )
            forecast_list_pruned.append(forecast_tuple)  # Collect all forecasts for that file
        forecast_dict[city_id] = forecast_list_pruned  # and assign to forecast_dict for each city
    return forecast_dict
</pre>
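<p>As an aside, the if/else blocks above that default missing 'rain' and 'snow' values to 0 could also be written with the dictionary get() method, which returns a default when a key is absent. A small sketch with a hypothetical forecast entry:</p>

```python
# Hypothetical forecast entry with no 'snow' key
forecast = {'rain': 3.25, 'temp': {'max': 27.32, 'min': 18.14}}

# dict.get(key, default) returns the default when the key is absent
rain = forecast.get('rain', 0)  # 3.25
snow = forecast.get('snow', 0)  # 0
```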
<p>Phew! Extracting the data from the JSON files and putting it into an SQLite3 friendly format is the toughest part. Now that we have forecast_dict, however, we can populate our database! Our next function will use some of the same techniques described above, which include using the sqlite3 module to make a connection with the sqlite database and execute <strong>SQLite3</strong> commands:</p>
<pre class="literal-block">
import datetime
import sqlite3

def populate_db_w_forecasts(db='weather_data_OWM.db'):
    forecast_dict = retrieve_data_vals(files)  # Run retrieve_data_vals above, which returns the forecast dictionary
    query_date = forecast_dict['query_date']  # Assign query_date
    connection = sqlite3.connect(db)  # Create the SQLite3 connection
    with connection:
        cursor = connection.cursor()
        for key in forecast_dict:
            if key == 'query_date':
                continue  # After here, "key" is a location_id.
            for i, item in enumerate(forecast_dict[key]):
                # Convert the Unix time to a human-readable string
                target_date = datetime.datetime.fromtimestamp(
                    int(item[0])).strftime('%Y%m%d')
                # forecast_dict contains dt, maxt, mint, rain and snow,
                # so we want everything past dt (hence item[1:])
                maxt, mint, rain, snow = item[1:]
                i = str(i)
                fields = ','.join([  # Question marks mark where values will be inserted later
                    'maxt_' + i + '=?',
                    'mint_' + i + '=?',
                    'rain_' + i + '=?',
                    'snow_' + i + '=?'
                ])
                try:
                    cursor.execute(  # Insert the location_id (key) and target_date
                        '''INSERT INTO owm_values '''
                        '''(location_id,target_date) '''
                        '''VALUES (?,?)''', (key, target_date))
                except sqlite3.IntegrityError:
                    pass  # Row already exists for this location and date
                cursor.execute(  # Insert forecast values
                    '''UPDATE owm_values SET ''' + fields +
                    ''' WHERE id='''
                    '''(SELECT id FROM owm_values '''
                    '''WHERE location_id=? AND target_date=?)''',
                    (maxt, mint, rain, snow, key, target_date))
</pre>
<p>Let's talk a little about <em>cursor.execute</em>. Here we use a little Python trick to insert values into the SQLite code. In cursor.execute, we state the SQLite3 commands, but we include question marks (?) as placeholders. After the SQLite commands we place a comma and then a tuple of values. These values are substituted for the question marks in the order the question marks appear. So, in the case of:</p>
<pre class="literal-block">
cursor.execute(
'''INSERT INTO owm_values '''
'''(location_id,target_date) '''
'''VALUES (?,?)''', (key, target_date))
</pre>
<p>The SQLite3 command is:</p>
<pre class="literal-block">
INSERT INTO owm_values (location_id, target_date) VALUES (key, target_date)
</pre>
<p>where 'key' is the location city id, and 'target_date' is the date the forecast is for. Note the <em>location_id</em> of the <em>owm_values</em> TABLE refers to the <em>id</em> column of the <em>locations</em> TABLE.</p>
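<p>To convince yourself that the INSERT/UPDATE pair behaved as expected, you can read the row back with a SELECT. Below is a minimal, self-contained sketch against a throwaway in-memory database, using a pared-down version of the owm_values table and hypothetical values:</p>

```python
import sqlite3

# Throwaway in-memory database with a pared-down owm_values table
connection = sqlite3.connect(':memory:')
with connection:
    cursor = connection.cursor()
    cursor.execute('''CREATE TABLE owm_values (
                          id INTEGER PRIMARY KEY AUTOINCREMENT,
                          location_id TEXT,
                          target_date INTEGER,
                          maxt_0 NUMBER,
                          UNIQUE (location_id, target_date))''')
    cursor.execute('''INSERT INTO owm_values (location_id,target_date) '''
                   '''VALUES (?,?)''', ('4046255', 20140422))
    cursor.execute('''UPDATE owm_values SET maxt_0=? '''
                   '''WHERE location_id=? AND target_date=?''',
                   (27.32, '4046255', 20140422))
    cursor.execute('SELECT location_id, target_date, maxt_0 FROM owm_values')
    rows = cursor.fetchall()
print(rows)  # [('4046255', 20140422, 27.32)]
```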
<p>There you go! You now have a relational database that has been populated with data! Now what to do with it... Stay tuned for Part 3 Retrieving data from an SQLite Database using Python.</p>
</div>
SQLite3 Databases: Creating, Populating and Retrieving Data, Part 12014-07-04T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-07-04:make_db.html<p>Structured Query Language (SQL) is a language that is used to design and manage data held in a relational database. A relational database is a database that contains multiple tables that contain related values. For example, one table may contain names of people and their ages, and another may contain names of people and their favorite color. The names of the people are the related values. SQL provides a relatively easy (and commonly used) way of extracting only the data you want from the database, which can later be analyzed or visualized.</p>
<p><a class="reference external" href="https://github.com/brannerchinese">David Branner</a>, a fabulous <strong>python</strong> coder who dabbles in creating and using <strong>SQLite</strong> databases, and knows a thing or two about the <a class="reference external" href="https://brannerchinese.com/">Chinese Language</a>, and I are working on <a class="reference external" href="https://github.com/WeatherStudy/weather_study">The Weather Project</a>, where we intend to examine the accuracy of weather forecasts. In order to do that, we need to collect weather forecasts that can later be analyzed. We decided to use weather forecasts from <a class="reference external" href="http://openweathermap.org/">Open Weather Map</a>, a website that gives open access to weather forecasts through an API key. Through the API, we are able to download JSON files that contain information on the weather forecasts at specific locations around the world. Our goal is, for each day, to collect weather forecasts for that day and from one day before out to about two weeks. We collect the maximum temperature (maxt), the minimum temperature (mint), and the snow and rain forecasts for each of the forecasts. Then we subtract the predicted value from the observation to estimate how much the forecast predicts warmer/cooler temperatures or more/less snow and rain. Hence, we need to collect a lot of information and organize it in a way that is relatively easy and consistent to retrieve. To do that, we created an SQLite3 database. This blog is the first of three, and focuses on <strong>creating a Database with SQLite3</strong>. The next blogs will cover <strong>Populating an SQLite Database with Data using Python</strong> and <strong>Retrieving data from an SQLite Database using Python</strong>.</p>
<div class="section" id="part-1-creating-a-database-with-sqlite3">
<h2><strong>Part 1: Creating a Database with SQLite3</strong></h2>
<p><strong>SQLite</strong> is a compact and self-contained relational database management system. We decided to use <strong>SQLite3</strong> (the version of SQLite bundled with Mac OS X) because</p>
<blockquote>
<ol class="arabic simple">
<li>It is included with the Mac OS X operating system (/usr/bin/sqlite3)</li>
<li>It does not require a server or an administrator</li>
<li>It requires no configuration files</li>
<li>No action is required after a system crash</li>
</ol>
</blockquote>
<p>Certainly, there are issues with <strong>SQLite</strong>, but for our humble little project <strong>SQLite</strong> provides all the functionality we wanted. If you are running Mac OS X you can use SQLite3. Be sure that /usr/bin/ is in your path (it should already be there). You can check that you have it by typing:</p>
<pre class="literal-block">
which sqlite3
</pre>
<p>Let's get started. First, a few things about sqlite3. You can enter the sqlite3 <a class="reference external" href="http://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop">repl</a> by simply typing sqlite3 at the command line. Or, you can type:</p>
<pre class="literal-block">
sqlite3 mydatabase.db
</pre>
<p>to ensure your creations/populations/extractions are all for the database mydatabase.db (or whatever you want it named). If you make a sqlite3 script that is applied to mydatabase.db called myscript.sql, you can run it at the command line by typing:</p>
<pre class="literal-block">
sqlite3 mydatabase.db < myscript.sql
</pre>
<p>Our <strong>SQLite</strong> database, which we named <em>weather_data_OWM.db</em>, is set up with multiple tables. The information within those tables is related, which is what makes this a <em>relational database</em>. As previously mentioned, a relational database is set up so that there is some common information between tables that helps link them. Our database tables are linked by city id. The city id is simply a unique number assigned to each location that has a weather forecast. In one table we keep the properties of each location, such as the latitude, longitude, city name, etc. In the other, we assign the forecasts to each city id. Let's take a closer look at how this works.</p>
<p>The first thing we did was create a TABLE called <em>locations</em> which contains the id, name, latitude, longitude and country:</p>
<pre class="literal-block">
DROP TABLE IF EXISTS locations;
CREATE TABLE locations (
id TEXT PRIMARY KEY UNIQUE,
name TEXT,
lat NUMBER,
lon NUMBER,
country TEXT
);
</pre>
<p>Eeeek! The "DROP TABLE" part of this code is a little scary -- here we are saying if there is already a table in our database called <em>locations</em> then remove it! The table <em>locations</em> will be completely removed and cannot be recovered. You may ask, <em>why would you want to do that???</em> Well, this code is simply meant to provide the bones for our database. The only reason we run this script is to make a database from scratch, and if one already exists, it should be removed first. This also prevents confusing the current data with older data sets if a table called <em>locations</em> already exists. So <strong>BE CAREFUL</strong> with this command.</p>
<p>The next commands create the table with columns that are defined to contain a certain type of field. The columns that we have are id, name, lat, lon and country, and are either TEXT (strings) or NUMBER (floats). The id column is special because it also carries the PRIMARY KEY constraint, which ensures that every row in that column is uniquely identifiable. To be extra certain of this (though it may be a little redundant), we also included UNIQUE, which ensures that all values in the column are different.</p>
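<p>You can see the PRIMARY KEY/UNIQUE constraint in action from Python: inserting a second row with the same id raises an IntegrityError. A small sketch using the sqlite3 module and a throwaway in-memory database (the city values are hypothetical):</p>

```python
import sqlite3

connection = sqlite3.connect(':memory:')  # throwaway in-memory database
cursor = connection.cursor()
cursor.execute('''CREATE TABLE locations (
                      id TEXT PRIMARY KEY UNIQUE,
                      name TEXT,
                      lat NUMBER,
                      lon NUMBER,
                      country TEXT)''')
cursor.execute('INSERT INTO locations VALUES (?,?,?,?,?)',
               ('4046255', 'Bay Minette', 30.882959, -87.773048, 'US'))
duplicate_rejected = False
try:
    # A second row with the same id violates the PRIMARY KEY constraint
    cursor.execute('INSERT INTO locations VALUES (?,?,?,?,?)',
                   ('4046255', 'Duplicate City', 0.0, 0.0, 'US'))
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```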
<p>How can we tell if the table was made properly? If you entered the commands above in the repl, then type:</p>
<pre class="literal-block">
SELECT * FROM sqlite_master WHERE type='table';
</pre>
<p>What should print out is information on your new table, including its structure:</p>
<pre class="literal-block">
table|locations|locations|2|CREATE TABLE locations (
id TEXT PRIMARY KEY UNIQUE,
name TEXT,
lat NUMBER,
lon NUMBER,
country TEXT
)
</pre>
<p>The line "table|locations|locations|2|CREATE TABLE locations" is simply output that states: <em>type|name|table name|root page #|sql command used to generate the table</em>. Then the table column names are printed.</p>
<p>Very good! Now we have a table that will contain some characteristics of each city. Now let's make a second TABLE that includes the weather forecasts and will be related to the first one by the city code. We are collecting forecasts for up to 14 days before a <em>target_date</em> which we define as the day being forecasted. We want to know the forecasts for rain and snow, as well as the minimum and maximum temperatures for the <em>target_date</em>. As before, we first need to DROP any existing tables, then we create the table:</p>
<pre class="literal-block">
DROP TABLE IF EXISTS owm_values;
CREATE TABLE owm_values (
id INTEGER PRIMARY KEY AUTOINCREMENT,
location_id TEXT,
target_date INTEGER,
maxt_0 NUMBER,
mint_0 NUMBER,
rain_0 NUMBER,
snow_0 NUMBER,
maxt_1 NUMBER,
mint_1 NUMBER,
rain_1 NUMBER,
snow_1 NUMBER,
maxt_2 NUMBER,
mint_2 NUMBER,
rain_2 NUMBER,
snow_2 NUMBER,
maxt_3 NUMBER,
mint_3 NUMBER,
rain_3 NUMBER,
snow_3 NUMBER,
maxt_4 NUMBER,
mint_4 NUMBER,
rain_4 NUMBER,
snow_4 NUMBER,
maxt_5 NUMBER,
mint_5 NUMBER,
rain_5 NUMBER,
snow_5 NUMBER,
maxt_6 NUMBER,
mint_6 NUMBER,
rain_6 NUMBER,
snow_6 NUMBER,
maxt_7 NUMBER,
mint_7 NUMBER,
rain_7 NUMBER,
snow_7 NUMBER,
maxt_8 NUMBER,
mint_8 NUMBER,
rain_8 NUMBER,
snow_8 NUMBER,
maxt_9 NUMBER,
mint_9 NUMBER,
rain_9 NUMBER,
snow_9 NUMBER,
maxt_10 NUMBER,
mint_10 NUMBER,
rain_10 NUMBER,
snow_10 NUMBER,
maxt_11 NUMBER,
mint_11 NUMBER,
rain_11 NUMBER,
snow_11 NUMBER,
maxt_12 NUMBER,
mint_12 NUMBER,
rain_12 NUMBER,
snow_12 NUMBER,
maxt_13 NUMBER,
mint_13 NUMBER,
rain_13 NUMBER,
snow_13 NUMBER,
maxt_14 NUMBER,
mint_14 NUMBER,
rain_14 NUMBER,
snow_14 NUMBER,
UNIQUE (location_id, target_date),
FOREIGN KEY (location_id) REFERENCES locations(id)
);
</pre>
<p>In this table, each forecast is given its own unique id (called id). In addition, it contains a location_id, which refers to <em>id</em> in our first TABLE, <em>locations</em>. These values 'link' the two tables, creating a relational database. The FOREIGN KEY statement defines this relationship, stating that the location_id of the TABLE <em>owm_values</em> is REFERENCED to the id of TABLE <em>locations</em>. We also created columns in our TABLE that will store forecasts from the day of (*_0) to 14 days out (*_14). UNIQUE ensures that each (location_id, target_date) pair appears only once in this table (i.e., every city has at most one row per target date).</p>
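<p>The link between the two tables can be exercised with a JOIN: given a forecast row, the matching city name comes from the <em>locations</em> table via location_id. A minimal sketch in Python against an in-memory database with hypothetical values:</p>

```python
import sqlite3

connection = sqlite3.connect(':memory:')  # throwaway in-memory database
cursor = connection.cursor()
cursor.execute('CREATE TABLE locations (id TEXT PRIMARY KEY, name TEXT)')
cursor.execute('''CREATE TABLE owm_values (
                      id INTEGER PRIMARY KEY AUTOINCREMENT,
                      location_id TEXT,
                      target_date INTEGER,
                      maxt_0 NUMBER,
                      FOREIGN KEY (location_id) REFERENCES locations(id))''')
cursor.execute('INSERT INTO locations VALUES (?,?)',
               ('4046255', 'Bay Minette'))
cursor.execute('''INSERT INTO owm_values (location_id, target_date, maxt_0) '''
               '''VALUES (?,?,?)''', ('4046255', 20140422, 27.32))
# JOIN the forecast row back to its city name through location_id
cursor.execute('''SELECT locations.name, owm_values.maxt_0
                  FROM owm_values
                  JOIN locations ON owm_values.location_id = locations.id''')
row = cursor.fetchone()
print(row)  # ('Bay Minette', 27.32)
```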
<p>Now if you type into the repl:</p>
<pre class="literal-block">
SELECT * FROM sqlite_master WHERE type='table';
</pre>
<p>Two tables should print out -- the first one being the <em>locations</em> table, the second your brand new <em>owm_values</em> table.</p>
<p>Congratulations! You have now set up a database in SQLite3 that contains two tables. Now for <a class="reference external" href="/pop_db.html">Part 2 Populating an SQLite Database using Python</a> coming soon...</p>
</div>
The Dictionary Data structure2014-07-01T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-07-01:dict.html<div class="section" id="dictionaries">
<h2><strong>Dictionaries</strong></h2>
<p>This page briefly reviews a <strong>dictionary</strong>, also known as an <strong>associative array</strong>, a <strong>map</strong> or a <strong>symbol table</strong>. A dictionary is composed of a collection of keys and values, and each key appears only once in a collection. JSON files, human-readable files that are commonly used for web applications, contain data objects in the form of a dictionary.</p>
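<p>That correspondence is easy to see with Python's built-in json module, which parses a JSON string directly into a dictionary. A minimal sketch (dictionary syntax is explained below):</p>

```python
import json

# JSON proper requires double quotes around keys and string values
text = '{"Hello": "How are you?"}'
a = json.loads(text)  # parse the JSON string into a Python dictionary
print(a['Hello'])     # How are you?
```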
<p>Let's demonstrate how a dictionary is defined in <strong>python</strong>. Open a <strong>python</strong> REPL by typing python at the command line. Once in your REPL, type:</p>
<pre class="literal-block">
a = { 'Hello': 'How are you?'}
</pre>
<p>'a' is a dictionary. Notice how it was defined with the curly brackets. This dictionary contains one <em>key</em> on the left of the colon, and one <em>value</em> on the right side of the colon. You can retrieve the <em>value</em> by typing in the repl:</p>
<pre class="literal-block">
a['Hello']
</pre>
<p>and pressing enter, which returns 'How are you?'. You can retrieve the list of keys by typing:</p>
<pre class="literal-block">
a.keys()
</pre>
<p>and pressing enter, which returns 'Hello'. Now let's make this dictionary a little more complicated:</p>
<pre class="literal-block">
a = { 'Hello': 'How are you?', 'Goodbye': 'See you Later!' }
</pre>
<p>Here you now have two keys that have their own associated values. Here you can still run:</p>
<pre class="literal-block">
a['Hello']
</pre>
<p>which will still return 'How are you?', but now you can type:</p>
<pre class="literal-block">
a['Goodbye']
</pre>
<p>which returns 'See you Later!' A list of keys can be obtained by typing:</p>
<pre class="literal-block">
a.keys()
</pre>
<p>which returns the keys in a list form: ['Hello', 'Goodbye'] (in Python 3, keys() returns a view object, so wrap it in list() to index it). These keys can be retrieved individually by specifying their element indices starting at 0. For example:</p>
<pre class="literal-block">
list(a.keys())[0]
</pre>
<p>returns 'Hello'. Making sense? OK, let's add one more layer of abstraction which is that the values (of the key/value pair) can be a dictionary. Changing our dictionary again, type:</p>
<pre class="literal-block">
a = { 'Hello': 'How are you?', 'Goodbye': {'a':'See you Later!','b':'Later Gator' }}
</pre>
<p>into the python repl. Notice that the key 'Goodbye' now points to a dictionary. If you type into the python repl:</p>
<pre class="literal-block">
a['Goodbye']
</pre>
<p>you now get the dictionary {'a': 'See you Later!', 'b': 'Later Gator'}. You can choose a particular value of this sub-dictionary by specifying:</p>
<pre class="literal-block">
a['Goodbye']['b']
</pre>
<p>which returns 'Later Gator'.</p>
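<p>One last tip: when a key might be absent from a (nested) dictionary -- like the optional 'rain' and 'snow' fields in the weather forecast JSON files discussed elsewhere on this site -- the get() method returns a default instead of raising a KeyError. A small sketch using the dictionary above:</p>

```python
a = {'Hello': 'How are you?',
     'Goodbye': {'a': 'See you Later!', 'b': 'Later Gator'}}

# Direct indexing raises a KeyError when a key is missing;
# get() returns a default value instead
print(a['Goodbye']['b'])                     # Later Gator
print(a.get('Goodbye', {}).get('c', 'n/a'))  # n/a
```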
<p><a class="reference external" href="/pop_db.html">Return to Populating a Database</a>!</p>
</div>
The Cascadia Subduction Zone gets Creepier and Creepier...2014-05-23T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-05-23:SSEs.html<div class="section" id="cascadia-subduction-zone-creep">
<h2><strong>Cascadia subduction zone creep</strong></h2>
<p>This blog is a continuation of the original <a class="reference external" href="http://geodesygina.com/Cascadia.html">'Why the Cascadia Subduction Zone is Creepy'</a> blog posted a few weeks ago, and many of the terms used in this post are defined there. My collaborators on this project are <a class="reference external" href="http://web.pdx.edu/~pdx07343/">Rob McCaffrey</a> at <a class="reference external" href="http://www.pdx.edu/">Portland State University</a> and <a class="reference external" href="http://www.ess.washington.edu/dwp/people/profile.php?name=creager--ken">Ken Creager</a> at the <a class="reference external" href="http://www.washington.edu/">University of Washington</a>. The paper this blog is based on is found <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">here</a>.</p>
<p>The Cascadia subduction zone is not just creepy, but it is creepy on many different levels (<em>Figure 1</em>).</p>
<img alt="cat" class="align-center" src="/images/touch_my_tail.jpg" style="width: 300.0px; height: 300.0px;" />
<p><em>Figure 1. I had to include this</em> <a class="reference external" href="http://cheezburger.com/1384231168">figure</a>. <em>So funny, I laughed for hours, and I'm not even a fan of pet photos.</em></p>
<p>No, I don't mean that kind of creepy. What I mean is that the tectonic plates that make up the Cascadia subduction zone between major earthquakes are in some places stuck together, but in others are partially slipping (aka, creeping). My previous <a class="reference external" href="http://geodesygina.com/Cascadia.html">blog</a> talks about regions on the subduction fault that are stuck (or 'locked'), and regions undergoing persistent fault creep between major earthquakes, where persistent fault creep means just that -- between earthquakes the plates are constantly, slowly slipping. However, there is yet another slip phenomenon that periodically occurs between major earthquakes, called slow slip, which is the topic of this blog.</p>
<p><em>Figure 2</em> is a cross section of the Cascadia subduction zone that shows the Juan de Fuca plate subducting beneath the North America plate. On the up-dip portion of the fault, the plates are stuck in between large earthquakes. This region is expected to be where the next megathrust earthquake (magnitude ~9) will occur. Much further down-dip, the plates slide freely past each other. Between these two regions, however, it was fairly recently discovered using continuously recording GPS that the two plates periodically slip over a period of weeks to months (<a class="reference external" href="http://www.sciencemag.org/content/292/5521/1525">Dragert et al., 2001</a>), and in doing so accumulate enough slip to be equivalent to moment magnitude 6-7 earthquakes! Interestingly, you can't physically feel these periodic slow slip events (or SSEs) because they happen so slowly compared to an earthquake, which can last for a few seconds to minutes. Major slow slip events happen every 10 to 24 months, depending on where you are observing along the Cascadia subduction zone. We will talk about how often these events occur a little later in the blog.</p>
<img alt="xsection" class="align-center" src="/images/xsection_w_sse.png" style="width: 700.0px; height: 500.0px;" />
<p><em>Figure 2. Cross-sectional view of the Cascadia Subduction Zone. Image from</em> <a class="reference external" href="http://ooi.washington.edu/rsn/jrd/">John Delaney</a>. <em>White oval indicates region that experiences slow slip and non-volcanic tremor. The</em> <a class="reference external" href="http://peartreedesigns.blogspot.com/2011/11/devil-wallpapers.html">little devil guy</a>, <em>that is courtesy of too much coffee. Thanks to Aaron Wech for giving me the idea (BTW -- that is not Aaron Wech in the photo, though maybe it should be...).</em></p>
<p>Not too much time after the discovery of SSEs, periodic bursts of noise were observed at nearly the same time among multiple local seismometers (<em>Figure 3</em>).</p>
<img alt="seismo_map" class="align-center" src="/images/seimograph_map.jpg" style="width: 500.0px; height: 300.0px;" />
<img alt="seismo_data" class="align-center" src="/images/seismo_tremor.jpg" style="width: 500.0px; height: 300.0px;" />
<p><em>Figure 3. Figures modified from the</em> <a class="reference external" href="http://www.earthquakescanada.nrcan.gc.ca/pprs-pprp/re/ETS-eng.php">Natural Resources Canada webpage</a>. <em>(A) Map of seismometer network. (B) Example seismic records for corresponding seismometers located in (A).</em></p>
<p>Soon scientists realized what they were seeing wasn't noise at all -- it was actually a seismic signal generated from these periodic SSEs that became known as non-volcanic tremor, sometimes referred to simply as tremor. <em>Figure 4</em> demonstrates how well the non-volcanic tremor correlates in time with GPS detection of periodic slow slip. The blue dots in <em>Figure 4</em> are the east component positions of a GPS site near Victoria, British Columbia. The time series produces a saw-tooth pattern. Each drop indicates that the motion of the station temporarily reverses (indicating an SSE). Non-volcanic tremor activity is also plotted in <em>Figure 4</em> and shows that the non-volcanic tremor peaks during these GPS detected SSEs.</p>
<img alt="ets" class="align-center" src="/images/ETS.jpg" style="width: 500.0px; height: 300.0px;" />
<p><em>Figure 4. Modified from the</em> <a class="reference external" href="http://www.earthquakescanada.nrcan.gc.ca/pprs-pprp/re/ETS-eng.php">Natural Resources Canada webpage</a>. <em>Blue circles are daily east position time series of a GPS site near Victoria. The green line is the long term eastward motion of the site (with respect to North America), and the red saw-tooth line shows the motion of the site between events is faster than the long-term motion. The bottom black line shows the number of hours of tremor activity observed on southern Vancouver Island.</em></p>
<p>The combination of periodic slow slip and non-volcanic tremor together was coined by the Geological Survey of Canada as 'Episodic Tremor and Slip (ETS)' (<a class="reference external" href="http://www.pnsn.org/tremor/rogers_ETS.pdf">Rogers and Dragert, 2003</a>). Intriguingly, non-volcanic tremor and SSEs are not observed together, or at all, at every subduction zone, but that is a topic for another blog.</p>
<p>Subsequent studies have shown that in Cascadia ETS recurrence varies along strike of the subduction zone. <em>Figure 5</em>, (from <a class="reference external" href="http://www.intl-geology.geoscienceworld.org/content/35/10/907.abstract">Brudzinski and Allen, 2007</a>) color codes select continuously recording GPS (squares) and broadband seismometers (triangles) by how often they detect periodic slow slip and tremor, respectively. Warmer colors indicate the site detected them more often. What <a class="reference external" href="http://www.intl-geology.geoscienceworld.org/content/35/10/907.abstract">Brudzinski and Allen, 2007</a> found is that ETS recurrence seems to be segmented along the margin, with ETS events happening every ~10 months in northern CA, ~24 months in central to northern Oregon, and about every 14 months in Washington (<em>Figure 5</em>).</p>
<img alt="Brudzinski_Allen" class="align-center" src="/images/Brudzinski_Allen_fig.png" style="width: 300.0px; height: 500.0px;" />
<p><em>Figure 5. Map of the Cascadia subduction zone modified from</em> <a class="reference external" href="http://www.intl-geology.geoscienceworld.org/content/35/10/907.abstract">Brudzinski and Allen, 2007</a>. <em>Squares and triangles represent locations of high precision GPS and broadband seismometers, respectively, and are colored by how often slip and tremor are detected.</em></p>
<p>So the recurrence of these events is not the same along the margin, but does that mean that the amount of tremor and slip along the margin also differs? First, let's look at the tremor. The <a class="reference external" href="http://www.pnsn.org/">Pacific Northwest Seismic Network</a>, operated out of the <a class="reference external" href="http://www.washington.edu/">University of Washington</a>, keeps a continuously updating catalog of tremor along the entire margin. For some interactive tremor fun, you might want to check out their <a class="reference external" href="http://www.pnsn.org/tremor">tremor mapping tool</a>. <em>Figure 6</em> is a tremor density map -- in other words, it takes how many tremors were detected over a specified region (the squares on the map) and applies that number to a color scale that is then used to color the region. Dark blue colors indicate regions where the tremor counts are higher. Consistent with how often tremor and SSEs are detected, tremor counts for the time period of August 2009-August 2013 (2009.6-2013.6) are elevated where the recurrence time is shorter and lower where the tremor and SSEs are detected less often (<em>Figures 5 and 6</em>).</p>
<img alt="Brudzinski_Allen" class="align-center" src="/images/tremor.png" style="width: 300.0px; height: 500.0px;" />
<p><em>Figure 6. Non-volcanic tremor density map of the Cascadia subduction zone. Tremors from August 2009-August 2013 are used. Tremor counts larger than 400 are colored blue. Tremor locations from the</em> <a class="reference external" href="http://www.pnsn.org/">Pacific Northwest Seismic Network</a> <em>tremor catalog.</em> Solid red line marks the 10 mgal gravity anomaly from <a class="reference external" href="http://courses.washington.edu/ess502/BlakelyGeology2005.pdf">Blakely et al., 2005</a>.</p>
<p>So what about periodic SSEs? The total amount of slip on a fault due to periodic SSEs over time is a little more difficult to estimate because our observations are on the surface of the earth, but we really want to know what is going on down on the fault. In order to figure that out, we will need to build a mechanical model, but we will get to that part in a minute. For now, let's take a look at the data. In <em>Figure 4</em> the east component GPS time series of a site near Victoria, Canada is shown. The GPS east position time series in this figure has a slope: the position starts at about -5 mm in 1996 and ends at about 28 mm in 2004, so the slope (calculated by eye) is (28 mm - -5 mm)/(2004 - 1996) = 33 mm/8 yr = 4.125 mm/yr. This slope marks the long-term velocity of the time series, which is illustrated in <em>Figure 4</em> as a green line. Notice that in between slow slip events the slope is larger (red line); this is the inter-SSE velocity, which in Cascadia seems to be pretty consistent between SSEs. To better visualize the GPS offsets from SSEs along the Cascadia subduction zone, the inter-SSE velocity is simply subtracted from the time series. <em>Figure 7</em> displays the time series from select sites from Canada down to northern California. Note that the SSEs (marked by jumps in the time series) are well defined and fairly frequent in the north, reduce in amplitude and recurrence as we enter Oregon, then pick up again as we move into southern Oregon and northern California. South of about 40 degrees latitude, SSEs are not detected with GPS.</p>
<img alt="deforming_plates" class="align-center" src="/images/time_series.png" style="width: 700.0px; height: 500.0px;" />
<p><em>Figure 7. Map of inter-SSE GPS velocities (black arrows) with select GPS monuments labeled (a). East component of detrended GPS position time series (red dots) with model fit (black line) for sites labeled on the map (b). The site name, latitude of the site (Lat), and the east and north velocity components (Ve and Vn, respectively) are given. Figure from</em> <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>.</p>
<p>Now, let's get into how we figure out what is going on at the fault during the SSEs. As mentioned previously, the GPS position time series are observations on the surface of the earth, but we would like to know how much periodic slow slip is occurring on the fault. Similar to my previous <a class="reference external" href="http://geodesygina.com/Cascadia.html">blog</a>, we use a mechanical model. To briefly review, a mechanical model mathematically mimics the behavior of the earth, and the math behind these models is based on what we think the earth is doing. In the previous <a class="reference external" href="http://geodesygina.com/Cascadia.html">blog</a>, we used a mechanical block model to explore how much the tectonic plates are stuck in between earthquakes, but in this blog we are interested in seeing how much and where the plates are slipping during SSEs. We again use the block modeling software TDEFNODE, which breaks up the region of interest into tectonic blocks (<em>Figure 8</em>). Instead of using the long-term pre-estimated GPS velocities with the model, we use the GPS time series directly. What I want you to take away here is that the model mimics how much the tectonic plates are stuck between the SSEs, and how much they slip during the SSEs. It estimates slip for 16 SSEs that occurred between 2005.5 and 2011 throughout the Cascadia subduction zone. The goal here is to add up all the SSE slip from that time period and see how it changes as we go from north to south. For details on the modeling, please refer to <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>.</p>
<img alt="deforming_plates" class="align-center" src="/images/block_model.png" style="width: 300.0px; height: 500.0px;" />
<p><em>Figure 8. Geometry of the three dimensional block model. Thick black lines mark block boundaries, dots the three dimensional subduction interface. Block names are labeled. Figure modified from</em> <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>.</p>
<p>Now let's look at the results! The black lines in the <em>Figure 7b</em> GPS position time series are the modeled east positions over time for points that colocate with the observed GPS monuments. <em>Figure 9</em> shows examples of model-estimated fault slip patterns for two SSEs in 2007, plotted next to tremor detected during the same time period.</p>
<img alt="deforming_plates" class="align-center" src="/images/ets_examples.png" style="width: 700.0px; height: 900.0px;" />
<p><em>Figure 9. Slip distributions for two SSEs in 2007 estimated using the block model. The left-hand images show slip patterns (colors) overlain with estimated GPS displacement vectors for that event (red arrows). The images to the right show non-volcanic tremor locations that occurred in the same time period as the SSEs. Blue dots are tremor from the</em> <a class="reference external" href="http://www.pnsn.org/">Pacific Northwest Seismic Network</a>, <em>and red dots are tremors from the</em> <a class="reference external" href="http://miamioh.edu/">Miami University</a> <em>catalog, courtesy of</em> <a class="reference external" href="http://www.units.miamioh.edu/geology/people/brudzinski.html">M. Brudzinski</a>. <em>Figure modified from the Supplementary material of</em> <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>.</p>
<p>As expected, <em>Figure 9</em> shows that the regions experiencing non-volcanic tremor are largely the same regions where the model detects slip for a given time period. Phew. So now, let's add up all the slip from all the slow slip events and see what we get. <em>Figure 10a and b</em> show cumulative GPS displacements and modeled cumulative slow slip on the fault, respectively, for the time period between 2005.5 and 2011. <em>Figure 10c</em> plots the cumulative tremor counts (blue line) and the sum of slow slip estimated at each node for each down-dip row of nodes as a function of latitude. The non-volcanic tremor data used in this plot span from 2009.8 to 2013.0, whereas the estimated slip is from all SSEs between 2005.5 and 2011, so some discrepancies are apparent. Note, however, that in both cases more tremor and slow slip occur in northern California and Washington. Both are suppressed between ~42-43 and 46 degrees latitude.</p>
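The latitude binning behind the cumulative tremor counts can be sketched in a few lines of NumPy. This is a toy illustration with made-up epicenter latitudes and an assumed 111 km-per-degree conversion; the real counts in <em>Figure 10c</em> come from the Pacific Northwest Seismic Network tremor catalog:

```python
import numpy as np

def binned_tremor_counts(latitudes, lat_min=40.0, lat_max=50.0, bin_km=50.0):
    """Count tremor epicenters in latitude bins roughly bin_km wide
    (one degree of latitude is about 111 km)."""
    bin_deg = bin_km / 111.0
    edges = np.arange(lat_min, lat_max + bin_deg, bin_deg)
    counts, _ = np.histogram(latitudes, bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])  # bin midpoints for plotting
    return centers, counts
```

Plotting `counts` against `centers` gives a tremor-count-versus-latitude curve like the blue line in the figure.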
<img alt="deforming_plates" class="align-center" src="/images/cumulative_SSEs.png" style="width: 750.0px; height: 500.0px;" />
<p><em>Figure 10. Figure from</em> <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>. <em>(a) Sum of all SSE displacements detected at GPS (red and black vectors) from 2005.5-2011. Red vectors indicate sites that were in operation for 90% of the study. (b) Summed plate interface slow-slip from 2005.5 to 2011. Black vectors are North America relative convergence rates and directions. Thick, solid black lines mark the 10 mgal gravity anomaly contour of</em> <a class="reference external" href="http://courses.washington.edu/ess502/BlakelyGeology2005.pdf">Blakely et al., 2005</a>. <em>(c) Cumulative node depth profile interface slow-slip from 2005.5 to 2011 (red line) and 50 km binned cumulative tremor counts from 2009.8 to 2013.0 acquired from the</em> <a class="reference external" href="http://www.pnsn.org/">Pacific Northwest Seismic Network tremor catalog</a> <em>(blue line). Thick black line represents latitudes with high gravity anomalies.</em></p>
<p>Let's recap our observations:</p>
<ol class="arabic simple">
<li><a class="reference external" href="http://www.intl-geology.geoscienceworld.org/content/35/10/907.abstract">Brudzinski and Allen, 2007</a> demonstrate using GPS and seismometers that SSEs and non-volcanic tremor detection times are segmented (<em>Figure 5</em>). In other words, ETS occurs about once every 10-11 months between 40 and about 43 degrees north, about every 24 months between 43 and 46-47 degrees north, and about every 14 months north of 47 degrees north.</li>
<li>Tectonic tremor counts are increased south of 43 degrees N and north of 47 degrees N (<em>Figure 6</em> and <em>Figure 10c</em>, blue line).</li>
<li>Slow slip peaks in northern California and Washington, but is suppressed in Oregon (<em>Figure 10</em>).</li>
</ol>
<p>These observations sound awfully reminiscent of the observations in my last <a class="reference external" href="http://geodesygina.com/Cascadia.html">blog post</a>. The observations there were:</p>
<ol class="arabic simple" start="4">
<li>Reduced uplift between major subduction zone earthquakes along the coast between 43 and 46 degrees latitude (<a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>).</li>
<li>Reduced paleoseismically derived subsidence for multiple Cascadia earthquakes between 43.5 and 46 degrees latitude (<a class="reference external" href="http://bulletin.geoscienceworld.org/content/122/11-12/2079.abstract">Leonard et al., 2010</a>).</li>
</ol>
<p>In the last <a class="reference external" href="http://geodesygina.com/Cascadia.html">blog post</a>, I talk about how observations 4 and 5 could be explained by persistent fault creep; in other words, if the fault is slipping in between large earthquakes, then it is slowly relieving stress that would have built up if the plates were stuck together. This results in less subsidence during an earthquake, and less coastal uplift between earthquakes. We take this idea one step further and suggest that the Siletzia terrane may be the culprit behind the persistent fault creep. The Siletzia terrane is a dense, accreted basalt that can be mapped with gravity surveys (<a class="reference external" href="http://courses.washington.edu/ess502/BlakelyGeology2005.pdf">Blakely et al., 2005</a>). The 10 mgal gravity anomaly is plotted as a thick red line in <em>Figure 6</em> and a thick black line in <em>Figure 10b</em>. We suggest something similar to the conceptual model presented by <a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S0012821X09001836">Reyners and Eberhart-Phillips, 2009</a>, where the Siletzia terrane, if impermeable (i.e., water cannot pass through it), increases pore fluid pressures at the fault by not allowing water to percolate into the overriding crust. High pore fluid pressures at or near the plate interface encourage creep, since these conditions are thought to promote fault slip (<a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1029/95JB02403/abstract">Segall and Rice, 1995</a>; <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1029/2005JB003872/abstract">Hillers and Miller, 2006</a>).</p>
<p><a class="reference external" href="http://www.intl-geology.geoscienceworld.org/content/35/10/907.abstract">Brudzinski and Allen, 2007</a> noted that the thickest accumulations of Siletzia terrane near the coast were also the regions that experience major slow slip and non-volcanic tremor events less often. In <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a> we see that the total amount of tremor and the total amount of slow slip are also reduced in the region. <em>But what does that mean???</em></p>
<p>Similar to the arguments posed for locking, <a class="reference external" href="http://seismo.berkeley.edu/~paudet/Downloads_files/AudetJGR-2010.pdf">Audet et al., 2010</a> suggest that fluids trapped beneath a seal at the plate boundary increase pore fluid pressures. Although the plates in the region of slow slip are stuck most of the time, they suggest that the increased pore fluid pressures allow the plates to slip with small changes in stress. They suggest that once the fault begins to slip, the pore fluid pressure decreases and the plates become stuck again, stopping the slow slip and reinforcing the new seal. You can imagine then, that variations in the permeability of the upper crust could influence the occurrence of periodic slow slip. If the Siletzia terrane is less permeable, then it may offer a stronger seal than surrounding regions, producing higher pore fluid pressures, which may encourage more of a partial fault creep environment than one that periodically slips.</p>
<img alt="deforming_plates" class="align-center" src="/images/locking_tremor.png" style="width: 300.0px; height: 500.0px;" />
<p><em>Figure 11. Map of the Cascadia subduction zone, with the Gamma-style locking model of the subduction fault presented in my previous</em> <a class="reference external" href="http://geodesygina.com/Cascadia.html">post</a>. <em>Red indicates areas that are completely stuck, blue areas that are freely slipping. Colors between red and blue indicate regions that are partially creeping. White line marks the region where 95% of non-volcanic tremor occurred between 2009-2012.</em></p>
<p><em>Figure 11</em> is a map that contains the results from the Gamma-style locking model described in my previous <a class="reference external" href="http://geodesygina.com/Cascadia.html">post</a>, where red represents areas that are estimated to be completely stuck, and blue represents areas that are freely slipping. The colors in between represent regions that are partially creeping. Also plotted is an outline of where 95% of the tremor occurred between 2009 and 2012. Together, it shows that partial fault creep is up-dip of the tremor. The conundrum here is: <em>if persistent partial fault creep is occurring up-dip of the zone of tremor and slow slip, then wouldn't this increase the stress on the region of tremor and periodic slow slip and foster more slow slip events?</em> If so, then why do we see the opposite -- less tremor and slow slip where we have more persistent fault creep! We suggest that the partial fault creep must extend into the zone of non-volcanic tremor and slow slip. Both the locking and the periodic slow slip are thought to be promoted by high pore fluid pressures. So, as fluid pressure increases due to a better seal (maybe the Siletzia?), perhaps persistent partial fault creep is the dominant mode of slip. If correct, then it is possible that the makeup of the overriding crust may determine if the fault slips as ETS or persistent fault creep (<a class="reference external" href="http://www.nature.com/ngeo/journal/v3/n9/abs/ngeo940.html">Peng and Gomberg, 2010</a>).</p>
<p>For a deeper discussion of the observations and hypotheses presented in this blog, please read <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>.</p>
<p>Thanks for reading and keep in touch! Contents of this blogsite are updated at <a class="reference external" href="http://geodesygina.com/">http://geodesygina.com/</a>. See other contact information below.</p>
<p>Acknowledgments:
This work was funded by the National Science Foundation (NSF) Postdoctoral Fellowship Program, award 0847985 (Schmalzle), NSF award EAR-1062251 (McCaffrey), and the USGS National Earthquake Hazards Reduction Program, award G12AP20033 (Schmalzle and Creager). Some of the figures I made myself using Generic Mapping Tools (GMT), but some figures I took from elsewhere on the web; for any of those images I note where the figure was taken from. Many thanks to Reed Burgette and an anonymous reviewer for their thoughtful comments and suggestions that greatly improved this research. Thanks to Mike Brudzinski and Aaron Wech for providing their tremor catalogs. Thanks to Rick Blakely for providing gravity data. Thanks to PBO and PANGA for providing access to GPS data products. Craig H. Faunce, Bruce Nelson, Steve Malone, Justin Sweet, David Schmidt, Aaron Wech, Tom Pratt, Brian Atwater, Sarah Minson, Lorraine Wolf, and Aimee Schmalzle provided useful comments and insight.</p>
</div>
Why the Cascadia Subduction Zone is Creepy2014-05-01T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-05-01:Cascadia.html<div class="section" id="cascadia-subduction-zone-creep">
<h2><strong>Cascadia subduction zone creep</strong></h2>
<p>On April 29, 2014, I presented a talk for <a class="reference external" href="http://www.meetup.com/Data-Rave/events/177359692/">Data Rave, NYC</a> at EBay. My collaborators on this project are <a class="reference external" href="http://web.pdx.edu/~pdx07343/">Rob McCaffrey</a> at <a class="reference external" href="http://www.pdx.edu/">Portland State University</a> and <a class="reference external" href="http://www.ess.washington.edu/dwp/people/profile.php?name=creager--ken">Ken Creager</a> at the <a class="reference external" href="http://www.washington.edu/">University of Washington</a>. This blog covers the talk, and the original slides can be found in my <a class="reference external" href="https://github.com/ginaschmalzle/Cascadia">github repo</a>. The paper for which the talk and this blog are based is found <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">here</a>. Like what you are reading? Check out my follow-up blog on <a class="reference external" href="http://geodesygina.com/SSEs.html">slow slip and tremor</a>!</p>
<p>A key question that geophysicists try to answer is “How much are tectonic plates ‘stuck’ together?” The answer to this question has major implications for seismic hazard, since it is thought that the more the plates are stuck, the larger the earthquake will be. This blog will discuss how much the North American plate is stuck (or not stuck) to the Explorer, Juan de Fuca and Gorda plates within the Cascadia subduction zone, which resides in the Pacific Northwest corner of the United States.</p>
<p>First, I want to share with you the story of <a class="reference external" href="http://www.nature.com/ki/journal/v62/n5/fig_tab/4493262f1.html">the blind men and the elephant</a>, because this story pretty much sums up my experience as a scientist, and my experience with this study (<em>Figure 1</em>). In the story, a group of blind men decided to figure out what an elephant was really like, having never encountered one before. They all approached the elephant from different angles and examined it. One blind man approached a tusk and, examining it, declared: hey, an elephant is like a spear! But then another came up to its wriggling trunk and decided it was more like a snake. In fact, each blind man drew his own conclusion about what an elephant was really like, but each of them had only observed a small piece of the elephant. And so the blind men all bickered about what an elephant really was, and in reality they were all right, because the elephant does have all of the properties the blind men said it had; but at the same time, they were all wrong, because none of them could see the full picture.</p>
<img alt="deforming_plates" class="align-center" src="/images/elephant.gif" style="width: 700.0px; height: 500.0px;" />
<p><em>Figure 1. Cartoon of the blind men and the elephant. G. Renee Guzlas, artist, source:</em> <a class="reference external" href="http://www.nature.com/ki/journal/v62/n5/fig_tab/4493262f1.html">http://www.nature.com/ki/journal/v62/n5/fig_tab/4493262f1.html</a></p>
<p>In this study I bring together interdisciplinary data sets, observed by myself and by other scientists in order to get a fuller picture of what is going on with the tectonic plates in the Cascadia Subduction zone. To my knowledge, this is the first study that brings together these datasets in a comprehensive way to better understand subduction zone mechanics.</p>
<p>Before diving into the research, I am going to quickly review some key concepts for this blog. <em>Figure 2</em> is a map of the major tectonic plates that cover the earth. Each color represents a major plate. The plates rotate and translate with respect to each other all the time. The red arrows show how these plates are moving with respect to each other. They can move apart, which is what is observed in Iceland and at mid-oceanic ridges. The plates can move laterally past each other, like along the San Andreas fault, or they can move toward each other. When an oceanic plate collides with a continental plate, the oceanic plate moves underneath ("subducts" beneath) the continental plate. These areas are known as subduction zones and they can produce some of the largest earthquakes in the world. These earthquakes can reach moment magnitude 9 or more and can produce tsunamis. The <a class="reference external" href="http://en.wikipedia.org/wiki/2011_T%C5%8Dhoku_earthquake_and_tsunami">2011 Japan earthquake</a> and the <a class="reference external" href="http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_and_tsunami">2004 Sumatra Earthquake</a> were examples of megathrust earthquakes (i.e., large, magnitude ~9 earthquakes) that occurred in subduction zones.</p>
<img alt="deforming_plates" class="align-center" src="/images/TectonicPlates.jpg" style="width: 700.0px; height: 500.0px;" />
<p><em>Figure 2. Major tectonic plates of the world. Image from</em> <a class="reference external" href="http://www.sanandreasfault.org/Tectonics.html">http://www.sanandreasfault.org/Tectonics.html</a></p>
<p>This blog focuses on the Cascadia subduction zone, found in the Pacific Northwest of the United States. <em>Figure 3</em> is a zoomed-in view of the Cascadia subduction zone. The Explorer, Juan de Fuca and Gorda plates are currently being subducted beneath the North America plate. The last major earthquake occurred in 1700, and we know precisely when because of records of a tsunami in Japan <a class="reference external" href="http://pubs.usgs.gov/pp/pp1707/">(Atwater et al., 2005)</a>! Scientists have estimated that this quake ruptured approximately 1000 km of the fault and the plates slipped about 20 m <a class="reference external" href="http://activetectonics.asu.edu/lipi/Lecture24_Tsunami/Satake_etal_2003JB002521.pdf">(Satake et al., 2003)</a>! Yikes! The estimated moment magnitude for this quake was approximately 9.</p>
<img alt="deforming_plates" class="align-center" src="/images/Cascadia.png" style="width: 300.0px; height: 500.0px;" />
<p><em>Figure 3. Close up map view of the Cascadia Subduction Zone. Topography data from</em> <a class="reference external" href="http://www.ngdc.noaa.gov/mgg/global/global.html">ETOPO1 Topography Model</a>. <em>Figure made with</em> <a class="reference external" href="http://geodesygina.com/GMT.html">GMT</a>. <em>Red circles outline oceanic plate names.</em></p>
<p>Let’s look a little deeper as to what is going on here. <em>Figure 4</em> is a cross-section of the Cascadia subduction zone. You can see the Olympic Peninsula and Puget Sound. Below is an artist’s rendition of the Juan de Fuca oceanic plate subducting beneath the North America plate. The shallow, up-dip area is where the plates are thought to be stuck, or locked together. Further down-dip, the plates transition from fully locked, to fully creeping, where creeping is a measurement of how much the plates are slipping between large earthquakes. So, in the regions where the plates are stuck, lots of stress is building up, and is where megathrust earthquakes are thought to occur.</p>
<img alt="deforming_plates" class="align-center" src="/images/csz_cross.png" style="width: 700.0px; height: 500.0px;" />
<p><em>Figure 4. Profile cross-sectional view of the Cascadia Subduction Zone. Image from</em> <a class="reference external" href="http://ooi.washington.edu/rsn/jrd/">John Delaney</a>.</p>
<p>So, what happens when the plates are stuck? The two plates are moving toward each other. In order to accommodate that motion, the two plates that are stuck together must begin to bend and deform. The continental crust begins to shorten and the ground near the coast begins to uplift.</p>
<p>When an earthquake happens, the two plates quickly slide past each other. The continental plate suddenly expands and subsides near the coast, and uplifts offshore. You can imagine the dire consequences of this – the uplifting crust shifts the entire water column up, possibly generating a massive wave which will eventually propagate to shore, but the shoreline has also gone down, allowing the tsunami wave, once it hits, to reach farther inland and be more destructive. As an example, Japan experienced about 0.5-1 meter of subsidence during the 2011 quake (<a class="reference external" href="http://blogs.agu.org/mountainbeltway/2011/03/15/new-gps-vectors/">http://blogs.agu.org/mountainbeltway/2011/03/15/new-gps-vectors/</a>), which also generated a tsunami that reached 33 ft high (<a class="reference external" href="http://en.wikipedia.org/wiki/2011_T%C5%8Dhoku_earthquake_and_tsunami">http://en.wikipedia.org/wiki/2011_T%C5%8Dhoku_earthquake_and_tsunami</a>). Yikes.</p>
<img alt="deforming_plates" class="align-center" src="/images/leonard.jpg" style="width: 500.0px; height: 500.0px;" />
<p><em>Figure 5. Cartoon of crustal deformation due to fault locking between earthquakes (top) and during an earthquake (bottom). Figure from</em> <a class="reference external" href="http://gsabulletin.gsapubs.org/content/116/5-6/655.abstract">Leonard et al., 2003</a>.</p>
<p>The punch line of this study is that the amount of locking changes along the Cascadia Subduction zone--the plates are more stuck off the coast of Washington, southern Oregon and California, and less stuck in northern and central Oregon. This conclusion was reached by bringing together observations from a variety of cross-disciplinary studies, and like the blind men mentioned earlier, I attempt to piece together these data sets to make a simple, consistent story that explains all of them.</p>
<p>Let’s dive into the first data set – high-precision Global Positioning System (GPS) data. A GPS satellite transmits signals on two wavelengths, along with other information that helps determine the distance from the satellite to a receiver on the ground (say, your smart phone). Using two different wavelengths is important because it allows corrections for distortions the signal picks up while passing through the ionosphere. In the most simplistic view of how distance is calculated, one can take the time difference between the emission of the signal from the satellite and the detection of the signal at the ground receiver, and multiply that time difference by the speed of light. That gives the line-of-sight distance to the satellite. To convert that to a three-dimensional position, one needs the calculated range from at least 4 different satellites. There are currently 32 healthy GPS satellites in orbit, which means that any place on the earth, except maybe at the poles, can see at least 4 satellites at any given point in time.</p>
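The travel-time arithmetic above can be sketched in Python. This is a toy Gauss-Newton trilateration, not real GPS processing (which must also handle ionospheric delay, satellite orbits, and much more); the satellite geometry in any usage is invented, and the fourth unknown solved for is the receiver clock bias, which is why 4 satellites are the minimum:

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def pseudorange(t_emit, t_receive):
    """Distance implied by signal travel time: (t_receive - t_emit) * c."""
    return C * (t_receive - t_emit)

def solve_position(sat_positions, ranges, n_iter=20):
    """Gauss-Newton solve for receiver position (x, y, z) and clock-bias
    distance b from >= 4 satellite ranges: rho_i = |sat_i - pos| + b."""
    x = np.zeros(4)  # initial guess: earth's center, zero clock bias
    for _ in range(n_iter):
        diffs = sat_positions - x[:3]
        dists = np.linalg.norm(diffs, axis=1)
        residuals = ranges - (dists + x[3])
        # Jacobian of predicted range w.r.t. (x, y, z, b)
        J = np.hstack([-diffs / dists[:, None], np.ones((len(ranges), 1))])
        x += np.linalg.lstsq(J, residuals, rcond=None)[0]
    return x[:3], x[3]
```

Given four or more satellite positions and measured pseudoranges, `solve_position` iteratively linearizes the range equations and converges on the receiver location.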
<img alt="deforming_plates" class="align-center" src="/images/GPS_sat.png" style="width: 400.0px; height: 500.0px;" />
<p><em>Figure 6. Horizontal arrow points to an image of a GPS satellite from</em> <a class="reference external" href="http://www.geosoft-gps.de/english/gps_infos/info_2_e.html">http://www.geosoft-gps.de/english/gps_infos/info_2_e.html</a>. <em>Vertical arrow points to a picture of the Death Star. GPS satellites and the Death Star should not be confused.</em></p>
<p>Back here on earth, we have permanently installed GPS monuments. These guys are usually installed in bedrock, if possible, or some other sturdy structure. You may have seen some of these monuments; the top left corner of <em>Figure 7</em> is an example of what one may look like. Below that is a Trimble 5700 GPS and a Zephyr Geodetic antenna – a little larger than your smart phone. The antenna is usually set up on top of a tripod that is centered over the monument. The right-hand photo of <em>Figure 7</em> shows the antenna on top of a tripod with a protective cover that helps keep snow off. The antenna detects the signals from the satellites, which are then sent to the connected receiver, which records that information.</p>
<img alt="deforming_plates" class="align-center" src="/images/GPS_stuff.png" style="width: 700.0px; height: 400.0px;" />
<p><em>Figure 7. Upper left: photo of Geodetic monument from</em> <a class="reference external" href="http://en.wikipedia.org/wiki/Survey_marker">http://en.wikipedia.org/wiki/Survey_marker</a>. <em>Lower left: photo of a Trimble 5700 GPS and a Zephyr Geodetic antenna from</em> <a class="reference external" href="http://facility.unavco.org/">http://facility.unavco.org/</a>. <em>Right: Picture of an operating GPS from</em> <a class="reference external" href="https://earthdata.nasa.gov/featured-stories/featured-research/looking-mud">https://earthdata.nasa.gov/featured-stories/featured-research/looking-mud</a>.</p>
<p>Daily positions of the GPS can be estimated. <em>Figure 8</em> is an example of a GPS position time series for its three components – north, east and vertical. The blue dots mark the daily position estimate, and the vertical black lines the uncertainties. Interestingly, at this particular site a small earthquake occurred nearby, which caused the jump in the position time series. But you can imagine that, ignoring the earthquake, we can calculate the rate at which this monument is moving by taking the slope of the time series for each component.</p>
<img alt="deforming_plates" class="align-center" src="/images/BEMT.png" style="width: 400.0px; height: 500.0px;" />
<p><em>Figure 8. GPS position time series for site BEMT, taken from</em> <a class="reference external" href="http://cws.unavco.org:8080/cws/modules/GPStimeseriesCA/">UNAVCO website</a>.</p>
<p>Focusing on the horizontal velocities, we can estimate by eye that this site moved about 30 mm over 6 years, or 5 mm/yr. Similarly, we can estimate by eye that the north component moves at about 8 mm/yr. By taking the square root of the sum of the squares of these velocities we can calculate a magnitude, and we can calculate the direction of motion by taking the arctangent of the ratio of the two components. This gives you an idea of how a velocity can be calculated by eye. Calculating the time series velocities for this study is a little more rigorous, however, since other signals, such as earthquakes and seasonal effects, convolute the velocity estimate. Using the least squares method, velocities in this study are calculated by fitting the time series to the linear equation:</p>
<img alt="deforming_plates" class="align-center" src="/images/equation.png" style="width: 800.0px; height: 50.0px;" />
<dl class="docutils">
<dt>where</dt>
<dd><em>p</em> = position,
<em>po</em> = initial position,
<em>v</em> = velocity,
<em>t</em> = time,
<em>H</em> = Heaviside function (step function) for earthquakes or equipment changes,
<em>A</em> = amplitude of offset, and
<em>U1-4</em> = constants for seasonal variations.</dd>
</dl>
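The by-eye velocity arithmetic and the least-squares fit above can both be sketched with NumPy. This is a minimal illustration of the design matrix (intercept, velocity, one Heaviside offset per event, and annual plus semi-annual seasonal terms), not the actual processing used in the study:

```python
import numpy as np

def velocity_vector(ve, vn):
    """Horizontal speed and azimuth (degrees clockwise from north)
    from east and north velocity components (mm/yr)."""
    speed = np.hypot(ve, vn)                      # sqrt(ve**2 + vn**2)
    azimuth = np.degrees(np.arctan2(ve, vn)) % 360.0
    return speed, azimuth

def fit_position_series(t, p, t_eq=()):
    """Least-squares fit of the equation above:
    p(t) = po + v*t + sum_k A_k * H(t - t_eq_k)
           + U1*sin(2*pi*t) + U2*cos(2*pi*t) + U3*sin(4*pi*t) + U4*cos(4*pi*t)
    with t in decimal years. Returns [po, v, A_1..A_k, U1, U2, U3, U4]."""
    cols = [np.ones_like(t), t]
    for te in t_eq:                               # Heaviside offset per event
        cols.append((t >= te).astype(float))
    w = 2.0 * np.pi * t
    cols += [np.sin(w), np.cos(w), np.sin(2 * w), np.cos(2 * w)]
    G = np.column_stack(cols)                     # design matrix
    m, *_ = np.linalg.lstsq(G, p, rcond=None)
    return m
```

For the by-eye example, `velocity_vector(5, 8)` gives a speed of about 9.4 mm/yr; feeding `fit_position_series` a daily position series plus the epochs of known earthquakes or equipment changes recovers the velocity <em>v</em> with the offsets and seasonal wiggles modeled out.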
<p>Another data set used was tide and leveling data from <a class="reference external" href="http://cascadiageo.org/documentation/literature/cascadia_papers/burgette_etal_2009_interseis_uplift_orygun.pdf">Burgette et al., 2009</a>. Remember, in between major earthquakes the region near the shoreline uplifts, which makes it look as though sea level is falling. This apparent change can be measured over time, yielding an estimate of how much vertical land movement occurred.</p>
<img alt="deforming_plates" class="align-center" src="/images/TideGauge.jpg" style="width: 200.0px; height: 300.0px;" />
<p><em>Figure 9. Photo of a tide station. Photo from</em> <a class="reference external" href="http://www.oco.noaa.gov/tideGauges.html">http://www.oco.noaa.gov/tideGauges.html</a>.</p>
<p>Let’s look at the data! In <em>Figure 10</em>, the map on the left has horizontal GPS velocities that are estimated from daily position time series from 1997 to 2013. These velocities are referenced to stable North America, so you could imagine standing in Nebraska, looking longingly to the west coast, and watching the plates move as indicated by these arrows. The arrows here originate at the GPS monument, are sized according to their magnitude, and point in the direction of motion. Note the reference scale arrow in black is 5 mm/yr. Now let’s look at the vertical data set. For better illustration, I’ve color coded them so that warm colors represent more uplift. The key thing to notice about this data set is that there is more uplift in the north and in the south, and a reduced amount of uplift in central and northern Oregon.</p>
<img alt="deforming_plates" class="align-center" src="/images/GPSvelos.png" style="width: 500.0px; height: 500.0px;" />
<p><em>Figure 10. Maps of GPS horizontal velocities (left) and the combined GPS vertical velocities with tide and leveling uplift rates (right). Vertical rates colored according to their magnitude. Warm colors indicate uplift.</em></p>
<p>Geophysicists try to figure out how the world works by applying geophysical data to a mechanical model. What I mean is, we think we know some basic concepts behind how the world works, so we build a mechanical model that will actually mimic what the earth is doing based on these concepts. One such model is called a block model. This type of model divides the corner of the earth you are working on into tectonic blocks that can move and rotate, strain and bend due to fault motions. We can use these models, along with the GPS data, to estimate how much the plates are stuck together. The modeling program that I use is called TDEFNODE and it is a massive, wonderful beast of a code, written in Fortran! Yes, Fortran. It is based on the models presented in <a class="reference external" href="http://www.web.pdx.edu/~mccaf/pubs/mccaffrey_pnw_gji_2007.pdf">McCaffrey et al., 2007</a>. <em>Figure 11</em> is a map of the Cascadia subduction zone with the block model geometry overlaid (solid black lines). The dots represent the interface between the subducting oceanic plate and the continental plate. It looks flat here, but really the fault interface is going down into the page.</p>
<img alt="deforming_plates" class="align-center" src="/images/block_model.png" style="width: 300.0px; height: 500.0px;" />
<p><em>Figure 11. Geometry of three dimensional block model. Thick black lines mark block boundaries, dots the three dimensional subduction interface. Block names are labeled.</em></p>
<p>OK -- We have our data, and we have our model. Only we have a big problem – the locking, which is what we are trying to solve for, is mostly offshore, where we don’t have any data to constrain the model! This means that the model is heavily reliant on the user's assumptions. Hence, I've described the locking in two ways -- the first I call the Gaussian model, which assumes that the locking is distributed down-dip in a Gaussian way, where it is minimal at the trench, crescendos to a maximum, and then tapers off down-dip. The second, which I call the Gamma model, assumes that the fault is completely locked from the trench to some distance down-dip before it begins to taper off.</p>
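To make the two shapes concrete, here is a toy sketch of the two down-dip profiles in Python. This is illustrative only: the peak location, width, and taper values are invented, the Gamma model stand-in uses a simple exponential taper, and none of this is TDEFNODE's actual parameterization:

```python
import numpy as np

def gaussian_locking(x, x_peak=0.4, width=0.2):
    """Toy Gaussian-style profile: small at the trench (x = 0), peaks
    mid-fault, tapers off down-dip. x is normalized down-dip distance."""
    return np.exp(-((x - x_peak) ** 2) / (2.0 * width ** 2))

def gamma_locking(x, x_lock=0.3, taper=0.2):
    """Toy Gamma-style stand-in: fully locked (fraction 1.0) from the
    trench to x_lock, then tapering off down-dip."""
    return np.where(x <= x_lock, 1.0, np.exp(-(x - x_lock) / taper))
```

Plotting both functions over `x = np.linspace(0, 1, 101)` shows the key difference: the Gaussian profile starts small at the trench, while the Gamma-style profile is fully locked there.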
<p><em>Figure 12</em> shows maps of locking distributions for the Gaussian (a) and Gamma (b) models. The green lines mark the modeled block geometry, and the colors are the locking fraction – where red indicates that the plates are stuck together more, and the cooler colors mean that the plates are less stuck. The residuals for each model are nearly identical in both cases, even though at first glance these models seem very different. But let’s take a closer look here. Both models show much more intense locking offshore in Washington and in northern California and southern Oregon. The other distinguishing feature is that there is a wide transition zone between about 43 and 46 degrees north in central and northern Oregon. So, for these models, locking must be reduced in order to fit the reduced GPS, tide and leveling uplift rates in this region.</p>
<img alt="deforming_plates" class="align-center" src="/images/locking.png" style="width: 500.0px; height: 500.0px;" />
<p><em>Figure 12. Locking distributions for the Gaussian (a) and Gamma (b) locking distribution models. Green lines mark block model boundaries, warm colors indicate regions that are more locked.</em></p>
<p>Up until now we have been talking about what happens in between major earthquakes. Let’s change gears a bit and think about what happens during an earthquake. Remember that during an earthquake, the continental crust uplifts offshore, potentially displacing the water column and producing a tsunami. Near the coast the ground subsides, allowing tsunami waters to inundate the shore line much further than in between earthquakes, bringing with it sediment and debris that would eventually settle out of the water column and form a geologic layer. These tsunami deposits can be seen in the geologic record. From these geologic layers, paleoseismologists can deduce how much subsidence occurred. Diatoms and other organic matter can help date when these layers were formed.</p>
<p><a class="reference external" href="http://bulletin.geoscienceworld.org/content/122/11-12/2079.abstract">Leonard et al., 2010</a> compiled subsidence records from a plethora of studies that include earthquakes from the past 6500 years in Cascadia. <em>Figure 13</em> shows a subset of subsidence data from some of these Cascadia earthquakes. The figure displays subsidence as a function of latitude, ranging from 50 degrees latitude (Canada) to 40 degrees latitude (northern California). What <a class="reference external" href="http://bulletin.geoscienceworld.org/content/122/11-12/2079.abstract">Leonard et al., 2010</a> observed is that for multiple past earthquakes, subsidence was reduced between 43.5 and 46 degrees north latitude. In their study, they state that reduced subsidence in central Cascadia is a persistent feature of Cascadia subduction earthquakes.</p>
<img alt="deforming_plates" class="align-center" src="/images/Leonard_eq.png" style="width: 500.0px; height: 700.0px;" />
<p><em>Figure 13. Subsidence records compiled in</em> <a class="reference external" href="http://bulletin.geoscienceworld.org/content/122/11-12/2079.abstract">Leonard et al., 2010</a>. Reduced subsidence is observed between ~43.5 and 46 degrees North.</p>
<p>Hmmph...</p>
<p>So let’s recap what we have so far – in the same region, at about 43-46 degrees north, we have both reduced inter-earthquake uplift as well as reduced subsidence due to earthquakes!</p>
<p>Now we come to <em>my elephant</em> -- that is, my interpretation of these observations. One way we can explain these observations is by fault creep in central Cascadia. In the locked scenario, the two plates push together, creating uplift, which we see in Washington and California. This builds up a lot of stress which is later released in a big earthquake (<em>Figure 14</em>). Where the plates are partially creeping, the two plates actually slide past each other in between major earthquakes and stress doesn’t accumulate to the same extent – this means that we are less likely to see as much uplift in between earthquakes, and when an earthquake does happen the slip is expected to be less, since much of it was already accommodated between earthquakes (<em>Figure 14</em>).</p>
<img alt="deforming_plates" class="align-center" src="/images/creep.png" style="width: 700.0px; height: 400.0px;" />
<p><em>Figure 14. Subduction zone locking and creep scenarios. The top row shows the expected deformation for a locked subduction zone -- the continental crust uplifts near the coast in between earthquakes, and subsides a lot during an earthquake. Alternatively (bottom row), if the subduction zone is creeping then the two plates release stress between earthquakes, so that when an earthquake happens less slip is expected.</em></p>
<p>So, now we have our theory, based on interdisciplinary research using GPS, tide gauge, leveling and paleoseismic datasets. The theory, however, doesn't explain <em>why</em> the plates are creeping in central Cascadia. <a class="reference external" href="http://cascadiageo.org/documentation/literature/cascadia_papers/burgette_etal_2009_interseis_uplift_orygun.pdf">Burgette et al., 2009</a> present a model similar to the Gamma model above, but they enforce a narrow locking transition width. In order to fit the reduced uplift rates in central Cascadia, their model shifted the locked region offshore. They note that the locking pattern in their model correlates well with the location of a dense, Eocene age (~50 Ma) accreted basalt known as the Siletzia terrane, and suggest it may influence the locking. The Siletzia terrane itself is pretty rigid -- it has few earthquakes within its body (<a class="reference external" href="http://pubs.usgs.gov/pp/pp1661d/">Parsons et al., 2005</a>). Some studies suggest that it may also be less permeable (<a class="reference external" href="http://stephanerondenay.com/Materials/pdf/Calkins_etal_JGR_2011.pdf">Calkins et al., 2011</a>). Seismic surveys indicate that the Siletzia terrane is thickest in coastal central Oregon, where it extends as much as ~35 km offshore. In Washington, the Siletzia terrane is not present in large quantities in the Olympics, but is observed further east in the Puget Sound region (<a class="reference external" href="http://pubs.usgs.gov/pp/pp1661d/">Parsons et al., 2005</a>).</p>
<p>Because the Siletzia terrane is dense, it can be mapped in gravity surveys. <a class="reference external" href="http://earthweb.ess.washington.edu/brown/downloads/ESS403/Cascadia/BlakelyGeology2005.pdf">Blakely et al., 2005</a> present gravity data sets that map out the extent of the Siletzia terrane. We use the 10 mgal contour line of this data set to map out the thickest accretions of the Siletzia terrane (<em>Figure 15</em>). The outlined gravity anomaly shows that the largest block extends from about 44 to 46 degrees north latitude, through our region of reduced interseismic uplift and reduced coseismic subsidence. The outline is mapped on top of the Gaussian locking distribution model in <em>Figure 15</em>, but please note that there is no preference between the model solutions.</p>
<img alt="deforming_plates" class="align-center" src="/images/lock_grav.png" style="width: 300.0px; height: 500.0px;" />
<p><em>Figure 15. Gaussian locking distribution model plotted with 10 mgal contour line (white line) from</em> <a class="reference external" href="http://earthweb.ess.washington.edu/brown/downloads/ESS403/Cascadia/BlakelyGeology2005.pdf">Blakely et al., 2005</a>.</p>
<p>This study builds on a conceptual model from <a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S0012821X09001836">Reyners and Eberhart-Phillips, 2009</a>. In their study of locking distributions in the Hikurangi Subduction Zone (HSZ) under the North Island of New Zealand, they note that the less permeable Rakaia terrane also seems to influence the locking. They suggest that the impermeable Rakaia terrane prevents water generated from the basalt-eclogite transition from percolating up into the overriding crust. Instead, pore fluid pressures increase at or near the plate interface and influence the locking. Intriguingly, their study suggests that the Rakaia terrane is <em>more</em> locked than its surroundings. We suggest a similar conceptual model for Cascadia, where the Siletzia terrane, if impermeable, increases pore fluid pressures by not allowing water to percolate into the overriding crust. We suggest, however, that the high pore fluid pressures at or near the plate interface encourage creep, since these conditions are thought to favor stable sliding (<a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1029/95JB02403/abstract">Segall and Rice, 1995</a>; <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1029/2005JB003872/abstract">Hillers and Miller, 2006</a>).</p>
<p>So, here is the recap of this work:</p>
<ol class="arabic simple">
<li>We observe reduced interseismic uplift rates in central Cascadia determined with GPS, tide and leveling data sets. Models of plate interface locking indicate that locking has to be reduced with a wide transition zone (this study) or shifted offshore in order to fit these data.</li>
<li>Paleoseismologists observe reduced coseismic subsidence from past great earthquakes in central Cascadia.</li>
<li>The above observations can be explained by creep in central Cascadia.</li>
</ol>
<p>For a deeper discussion of the observations and hypotheses presented in this blog, please read <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Schmalzle et al., 2014</a>.</p>
<p>Thanks for reading!</p>
<p>Like what you read? Check out my follow-up blog on <a class="reference external" href="http://geodesygina.com/SSEs.html">periodic slow slip and tremor</a>!</p>
<p>Acknowledgments:
This work was funded by the National Science Foundation (NSF) Postdoctoral Fellowship Program, award 0847985 (Schmalzle), NSF award EAR-1062251 (McCaffrey), and USGS National Earthquake Hazards Reduction Program, Award G12AP20033 (Schmalzle and Creager). Some of the figures I made myself using Generic Mapping Tools (GMT), but some figures I took from random places on the web. For those images I note where they were taken. Many thanks to Reed Burgette and an anonymous reviewer for their thoughtful comments and suggestions that greatly improved this research. Thanks to Rick Blakeley for providing gravity data. Craig H. Faunce, Bruce Nelson, Steve Malone, Justin Sweet, David Schmidt, Aaron Wech, Tom Pratt, Brian Atwater, Sarah Minson, Lorraine Wolf, and Aimee Schmalzle provided useful comments and insight. Thanks to PBO and PANGA for providing access to GPS data products. Special thanks to David Branner, Ruby Childs and Nick Collins for organizing <a class="reference external" href="http://www.meetup.com/Data-Rave/events/177359692/">Data Rave</a> and inviting me to give a talk.</p>
</div>
Movie of March 11, 2011 Japan Earthquake and its aftershocks2014-04-18T16:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-04-18:japan_eq.html<div class="section" id="movie-of-the-2011-japan-earthquake-and-its-aftershocks">
<h2><strong>Movie of the 2011 Japan earthquake and its aftershocks</strong></h2>
<p>On March 11, 2011 a magnitude 9 earthquake occurred off the coast of Japan, generating a massive tsunami that caused major damage to infrastructure and loss of life. The magnitude 9 earthquake was followed by thousands of smaller earthquakes, known as aftershocks, in the hours that followed. The magnitude of aftershocks is known to decrease exponentially after the main shock, a pattern that can be seen in the data set, as shown in <em>Figure 1</em>.</p>
<img alt="deforming_plates" class="align-center" src="/images/Japan_eq.png" style="width: 750.0px; height: 500.0px;" />
<p><em>Figure 1. Screen shot of the</em> <a class="reference external" href="http://geodesygina.com/JapanEarthquake/index.html">earthquake movie</a> <em>. The map on the left plots earthquakes at the end of March 11, 2011. Circles are sized by magnitude and colored by depth in kilometers. The plots on the right show the earthquake magnitudes plotted over time, colored by depth. The top figure spans the entire day, whereas the lower figure shows a zoomed-in view of the 5th through the 10th hour of the day.</em></p>
<p>Interestingly, the earthquake magnitudes fluctuate over time. <a class="reference external" href="http://www.nature.com/srep/2013/130717/srep02218/full/srep02218.html">Omi et al. [2013]</a> note this behavior and model it in their paper "Forecasting large aftershocks within one day after the main shock".</p>
<p>Using earthquake data provided by the <a class="reference external" href="http://quake.geo.berkeley.edu/anss/catalog-search.html">ANSS database</a>, Ville Juutilainen, Gayane Petrosyan, Sean Mathew Lawrence and I worked on the <a class="reference external" href="http://geodesygina.com/JapanEarthquake/index.html">Japan Earthquake Movie</a>. It allows the user to choose the plot they want to see while the movie plays. The default is the plot shown above, which displays Magnitude vs. Time. The other option, Depth vs. Profile Length, lets the user select two points to define a profile; earthquakes within a user-defined distance of that profile are then plotted on the map and on the plot. The code is written in Javascript, taking advantage of the D3 framework. You can access the code in my <a class="reference external" href="https://github.com/ginaschmalzle/tohoku_eq">github repo</a>. The code has been tested in Chrome, but has yet to be tested in other browsers.</p>
</div>
Elastic half space model of a vertical strike slip fault2014-04-18T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-04-18:ehalf.html<div class="section" id="id1">
<h2><strong>Elastic Half Space Model of a Vertical Strike Slip Fault</strong></h2>
<p>In between major earthquakes, the ground deforms due to movement of tectonic plates. For strike-slip faults, such as the San Andreas Fault, the ground deforms in an 'S' shape that can be modeled as an arctangent. To better understand what is observed at the surface, imagine a fence built perpendicularly across a strike-slip fault. When the fence is first built it is nice and straight, but over time it starts to deform and look kind of like an "S". When the earthquake occurs the ground (and the fence) will snap, and the two sides of the fence will become straight again some time after the earthquake, although displaced.</p>
<img alt="deforming_plates" class="align-center" src="/images/elastichs.jpg" style="width: 170.0px; height: 250.0px;" />
<p><em>Figure 1. Deformation of tectonic plates between major earthquakes. D is the locking depth (the depth at which the fault is stuck). Figure modified from http://geologycafe.com/california/pp1515/chapter7.html</em></p>
<p>A simple model of this movement for vertical strike slip faults in a homogeneous elastic half space [Weertman and Weertman, 1964; Savage and Burford, 1973] can be described as:</p>
<pre class="literal-block">
V(x) = (Vo/pi) * arctan(x/D)
</pre>
<p>where V(x) is the velocity of points estimated along a perpendicular profile across the fault, Vo is the far field velocity, x is the distance from the fault, and D is the locking depth.</p>
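<p>As a quick sanity check, the model can be evaluated in a few lines of NumPy. This is just a sketch; the Vo = 34 mm/yr and D = 15 km values match the parameter file used later in this post:</p>

```python
import numpy as np

def v_profile(x, vo, d):
    """Fault-parallel surface velocity (mm/yr) at distance x (km) from a vertical strike slip fault."""
    return (vo / np.pi) * np.arctan(x / d)

x = np.linspace(-150.0, 150.0, 301)
v = v_profile(x, vo=34.0, d=15.0)
# The profile is antisymmetric about the fault and flattens toward +/- vo/2 far from it
print(v[0], v[150], v[-1])
```

<p>Note the "S" shape: each side of the fault asymptotically recovers half of the far field velocity.</p>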
<p>High precision GPS can be used to observe the deformation around fault systems. The velocity, position and uncertainty estimates can be compared directly to the model to estimate the goodness of fit of the model. The chi2 statistic is given by:</p>
<pre class="literal-block">
chi2 = SUM ((dataR -vel)/(sig))**2
</pre>
<p>where chi2 = the chi2 statistic, dataR = the GPS estimated rate for a position along the profile, vel = the model calculated velocity, and sig = the GPS velocity uncertainty. SUM denotes the sum over all GPS data.</p>
<p>The reduced chi2 is also calculated and is given by:</p>
<pre class="literal-block">
reduced chi2 = chi2 / (N-v-1)
</pre>
<p>where N = number of data and v = number of variable parameters (in this case 2: the fault rate and the locking depth). An ideal reduced chi2 should equal 1.</p>
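<p>Both statistics reduce to a couple of NumPy expressions. Here is a sketch with made-up numbers (the arrays below are hypothetical, not the GPS data used in this post):</p>

```python
import numpy as np

# Hypothetical GPS rates, model velocities, and 1-sigma uncertainties (mm/yr)
dataR = np.array([15.1, 10.2, 0.8, -8.6, -15.3])
vel   = np.array([14.8,  9.9, 0.0, -8.0, -15.6])
sig   = np.array([ 3.6,  1.5, 1.6,  1.2,  1.2])

chi2 = np.sum(((dataR - vel) / sig) ** 2)

N = len(dataR)  # number of data
v = 2           # number of variable parameters (fault rate and locking depth)
redchi2 = chi2 / (N - v - 1)
print(chi2, redchi2)  # roughly 0.61 and 0.30 for these numbers
```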
<p>In this blog, I use high precision GPS velocities from Schmalzle et al. [2006], and compare them to the model described above. I use the chi2 statistic and the reduced chi2 to determine the goodness of fit of the model to the data. My <a class="reference external" href="https://github.com/ginaschmalzle/elastichalfspace">github repo</a> has versions of how to do this in <strong>Fortran</strong> with visualization using <strong>Generic Mapping Tools (GMT)</strong>, and how to do it in <strong>Python</strong>. This blog will cover only the techniques used in the Python version of the code.</p>
<p>I have a few external files that are read by the code and are also available in my <a class="reference external" href="https://github.com/ginaschmalzle/elastichalfspace">github repo</a>. The file param.py holds the profile range and increment, the far field velocity and locking depth for a given model run, and the parameter ranges for the gridsearch, and looks like this:</p>
<pre class="literal-block">
xmin=-150.
xmax=150.
int=1.
Vo=34.
d=15.
dmin=1.
dmax=100.
Vmin=0.
Vmax=100.
</pre>
<p>The file data.py contains the GPS data in the form of x (in km), Rate (in mm/yr) and uncertainty (in mm/yr; note the uncertainties are stored as negative values) and looks like this:</p>
<pre class="literal-block">
-92.60814 15.0918 -3.561012106
-90.65163 15.4416 -3.592941176
-92.60814 15.1681 -1.42072184
-71.08653 14.3386 -1.661513002
-69.13002 16.4146 -1.773867193
-47.60841 10.1976 -1.511861585
-60.65181 14.0536 -1.95124564
-70.43436 14.4541 -2.901815856
-2.282595 0.82942 -1.622786425
-27.39114 8.70825 -1.672086072
0.65217 0.18103 -1.550890689
7.499955 -6.89121 -1.799743676
19.5651 -11.9668 -1.632284566
-17.60859 4.82287 -1.62481939
12.39123 -8.5554 -1.223772558
33.91284 -15.2699 -1.190422007
65.86917 -15.6106 -1.351439143
122.60796 -15.6614 -6.776253439
3.26085 -5.95536 -1.524006894
2.60868 -2.39876 -1.179208687
6.5217 -5.81965 -1.62421975
-13.0434 5.08107 -1.23548631
</pre>
<p>Now for the coding! First, import the following modules:</p>
<pre class="literal-block">
import numpy as np
import math
import matplotlib.pyplot as plt
</pre>
<p>and import the param.py file:</p>
<pre class="literal-block">
import param
</pre>
<p>I collect the information from the param file and compute the surface velocities like this:</p>
<pre class="literal-block">
f = open('vel.txt', 'w')
listx = []
listVel = []
x = param.xmin
while (x <= param.xmax):
    Vel = -((param.Vo / np.pi) * math.atan(x / param.d))
    print(x, Vel, file=f)
    listx.append(x)
    listVel.append(Vel)
    x = x + param.int
f.close()
</pre>
<p>This calculates the predicted velocity for a defined increment along a profile of a strike slip fault.
I keep the x's and calculated velocities in lists that will be used later in the program for plotting purposes.</p>
<p>Now let's open the GPS file and read its contents:</p>
<pre class="literal-block">
g=np.loadtxt('data.py')
gx = g[:,0]
gVel = g[:,1]
gsig = g[:,2]
</pre>
<p>Now calculate the expected velocity at each GPS position:</p>
<pre class="literal-block">
VelC = -((param.Vo / np.pi) * np.arctan(gx / param.d))
</pre>
<p>and calculate the chi2 and reduced chi2</p>
<pre class="literal-block">
chi = ((gVel - VelC) / gsig)**2
chi2 = np.sum(chi)
redchi = chi2 / (len(gVel) - 3)
</pre>
<p>Now you have the model fit to the data for a modeled fault rate and locking depth! The model fit to the data looks like <em>Figure 2</em>.</p>
<img alt="gridsearch" class="align-right" src="/images/lineGPS.png" style="width: 800.0px; height: 700.0px;" />
<p><em>Figure 2. Modeled velocities across a vertical strike slip fault (solid lines) compared to GPS velocities (triangles) with velocity uncertainty error bars. Both gridsearch estimated and inversion estimated low misfit rates are shown for a locking depth of 15km. The reduced chi2 is given.</em></p>
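<p>The plotting code for a figure like <em>Figure 2</em> isn't shown above, so here is one way to sketch it with Matplotlib. This is a self-contained illustration, not the exact code in the repo: it hard-codes the param.py values, uses just a handful of rows from data.py, and writes to a hypothetical file name:</p>

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

Vo, d = 34.0, 15.0  # far field rate (mm/yr) and locking depth (km), as in param.py

# A few rows of data.py: distance (km), rate (mm/yr), uncertainty (mm/yr)
gps = np.array([[-92.60814, 15.0918, -3.561012106],
                [-47.60841, 10.1976, -1.511861585],
                [ -2.282595, 0.82942, -1.622786425],
                [ 33.91284, -15.2699, -1.190422007],
                [122.60796, -15.6614, -6.776253439]])
gx, gVel, gsig = gps[:, 0], gps[:, 1], gps[:, 2]

x = np.linspace(-150.0, 150.0, 301)
vel = -(Vo / np.pi) * np.arctan(x / d)  # same sign convention as the code above

fig, ax = plt.subplots()
ax.plot(x, vel, 'b-', label='model (Vo = 34 mm/yr, D = 15 km)')
# The uncertainties in data.py are stored as negative numbers; errorbar wants magnitudes
ax.errorbar(gx, gVel, yerr=np.abs(gsig), fmt='k^', label='GPS')
ax.set_xlabel('Distance from fault (km)')
ax.set_ylabel('Velocity (mm/yr)')
ax.legend()
fig.savefig('lineGPS_sketch.png')
```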
<p>But suppose you want to know which combination of modeled fault rate and locking depth gives you the best fit to the data. One way you can do this is by running a whole suite of models that include different combinations of fault rate and locking depth values. This is called a gridsearch approach, and is perhaps the simplest (although most time consuming) method. The param.py file contains user input values for a range of modeled parameters. Grabbing those values, we can then use nested while loops to step through those ranges:</p>
<pre class="literal-block">
dmin = param.dmin
dmax = param.dmax
Vmin = param.Vmin
Vmax = param.Vmax
gridredchi = np.empty((0, 3))
c = open('chi.py', 'w')
d = dmin
while (d <= dmax):
    V = Vmin
    while (V <= Vmax):
        gridVelC = -((V / np.pi) * np.arctan(gx / d))
        gridchi = ((gVel - gridVelC) / gsig)**2
        gridchisum = np.sum(gridchi)
        gridrchi = gridchisum / (len(gVel) - 3)
        newrow = [V, d, gridrchi]
        gridredchi = np.vstack([gridredchi, newrow])
        print(V, d, gridrchi, file=c)
        plt.scatter(V, d, c=[gridrchi], marker='s', lw=0, s=40, vmin=0, vmax=10)
        V = V + param.int
    d = d + param.int
c.close()
</pre>
<p>By performing the gridsearch, you can contour the estimated reduced chi2 values as a function of the modeled fault rate and locking depth, as shown in <em>Figure 3</em>.</p>
<img alt="gridsearch" class="align-right" src="/images/gridsearch.png" style="width: 800.0px; height: 700.0px;" />
<p><em>Figure 3. Contour plot of the chi2 statistic (colors, cooler colors indicate lower misfit) given modeled values of fault rate and locking depth. The white star marks the low misfit model.</em></p>
<p>Performing a gridsearch can take a long time, but it has the advantage of being a straightforward method to estimate the low misfit model. By imaging the chi2 distribution, as in <em>Figure 3</em>, we can also easily see if there are other minima that could provide an alternative model that fits the data just as well within our given parameter ranges. The downside is that this method is slow, especially for more complicated models that require longer computation times.</p>
<p>An alternative method is to linearly invert the data with a little bit of matrix algebra. A great book that clearly describes this technique is:</p>
<p>Aster, R., Borchers, B., Thurber, C., Parameter Estimation and Inverse Problems, 301 pp, Elsevier Academic Press, 2004.</p>
<p>I highly recommend this book for further reading on this subject. I am not going to go over these concepts in this blog, but these methods are used in the scripts in my <a class="reference external" href="https://github.com/ginaschmalzle/elastichalfspace">github repo</a>. I use a linear inverse approach which is only valid for linear parameters, hence I can use it to estimate the best fitting rate, but not the locking depth. Using this method, the model is run for a locking depth of 15 km to find the best fit model in <em>Figure 2</em>.</p>
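<p>To make the idea concrete, here is a minimal sketch of that inversion, assuming the locking depth is fixed at 15 km. With d fixed, the model is linear in the rate: V(x) = Vo * g(x), with g(x) = -arctan(x/d)/pi, so weighted least squares gives the best fitting Vo directly. The data rows below are a subset of data.py; this is my illustration, not the exact code in the repo:</p>

```python
import numpy as np

d = 15.0  # fixed locking depth (km); the nonlinear parameter is not inverted for

# A subset of data.py: distance (km), rate (mm/yr), uncertainty (mm/yr)
gps = np.array([[-92.60814, 15.0918, -3.561012106],
                [-47.60841, 10.1976, -1.511861585],
                [ 33.91284, -15.2699, -1.190422007],
                [122.60796, -15.6614, -6.776253439]])
gx, gVel, gsig = gps[:, 0], gps[:, 1], np.abs(gps[:, 2])

# One-column design matrix, since Vo is the only free parameter
G = (-np.arctan(gx / d) / np.pi).reshape(-1, 1)

# Weight each row by 1/sigma, then solve the weighted least squares problem
Gw = G / gsig[:, None]
dw = gVel / gsig
Vo_hat = np.linalg.lstsq(Gw, dw, rcond=None)[0][0]
print('best fitting far field rate: %.1f mm/yr' % Vo_hat)
```

<p>For these four stations the estimate lands in the mid-30s of mm/yr, close to the rate used in the forward model above, which is a nice consistency check.</p>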
</div>
Mapping and Plotting data with Generic Mapping Tools (GMT)2014-04-14T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-04-14:GMT.html<div class="section" id="generic-mapping-tools">
<h2><strong>Generic Mapping Tools</strong></h2>
<p><strong>Update, Feb. 7, 2015</strong>: <em>A comment made by Joseph below alerted me that I had been referring to GMT as General Mapping Tools, which was incorrect! The correct name is Generic Mapping Tools and has since been updated. Many thanks to Joseph for pointing this out, and my apologies for my gaffe! BTW -- I love the commentary -- it provides great feedback for improving my blog. Please don't be shy!</em></p>
<p><strong>Original (Revised) Text:</strong></p>
<p>Before my time here at <strong>Hacker School</strong> I put together a short, hands-on course on how to use <a class="reference external" href="http://gmt.soest.hawaii.edu/">Generic Mapping Tools (GMT)</a>. <strong>GMT</strong> is a very powerful <strong>mapping</strong> and <strong>data visualization</strong> package. This class was built around <strong>GMT</strong> 4. <strong>GMT</strong> 5 has slightly different syntax and functionality and will not work with the scripts presented here. I intend to update the scripts to <strong>GMT</strong> 5 in the near future.</p>
<p>The class was taught at the University of Washington in the Department of Earth and Space Sciences, so it has an earth science theme -- <strong>earthquakes</strong>! Here, students learn how to make a map with layers that include <strong>topography</strong> and <strong>earthquake locations</strong>. A subset of data is extracted along a profile line (red line in the map below) so that <strong>earthquake</strong> depths can be plotted as a function of distance. The end goal of the class is to produce the following map and transect:</p>
<img alt="Cascadia_earthquakes" class="align-right" src="/images/cascadia_seis.jpg" style="width: 800.0px; height: 1100.0px;" />
<p><em>Figure 1. Map of the Cascadia subduction zone in the northwestern United States overlain with the ETOPO1 topography model (relief map), and earthquake locations from the ANSS catalog (circles) colored according to earthquake depth. The red line marks the profile from which data are extracted and plotted in the depth vs. longitude graph below the map.</em></p>
<p>All files are contained in my <a class="reference external" href="https://github.com/ginaschmalzle/GMT_DataViz">Github repo</a> except for the ETOPO1 topography model dataset. Unfortunately, this is a gigantic file that I could not push to the github repo without errors. The grd file can be downloaded on the original website, however, at <a class="reference external" href="http://www.ngdc.noaa.gov/mgg/global/relief/ETOPO1/data/bedrock/grid_registered/netcdf/ETOPO1_Bed_g_gmt4.grd.gz">http://www.ngdc.noaa.gov/mgg/global/relief/ETOPO1/data/bedrock/grid_registered/netcdf/ETOPO1_Bed_g_gmt4.grd.gz</a>.</p>
<p>The file that made the above figure (cascadia_seis.com) is also in the repo, but I am also including a very verbose version below that goes through and explains each step, including some basics of <strong>bash scripting</strong>:</p>
<pre class="literal-block">
#!/bin/bash
# This file is for the first class on data visualization with GMT
# Written by Gina Schmalzle
# This code is a bash shell script (http://en.wikipedia.org/wiki/Bash_(Unix_shell)) that calls GMT files.
# awk is used to manipulate data files (http://en.wikipedia.org/wiki/AWK).
# A unix shell is a "command line interpreter" that allows definition of variables,
# and execution of commands that could be done via the command line. A nice shell script
# will make it easier to change variables, and easier to implement several lines of code.
# Shell scripts are commonly used with GMT programs.
# Most if not all of the commands in this document have associated 'man' (manual) pages. To access them type:
# man whatever_your_command_is
# If you cannot access your man pages through your command prompt, an alternative would be to type man command in
# google.
# To make this file executable, you will have to change the mode of the file (ie, read, write and/or execute)
# In your directory you will need to type:
# chmod u+x ./cascadia_seis.com
# The '#' marks to the left indicates a comment. Anything written after them is not read when the file is executed.
# This file will create a map of Cascadia that includes a grid of topography data from ETOPO1 (ETOPO1_Bed_g_gmt4.grd) and
# seismicity data. These data will be applied in "layers", very similar to how GIS packages have layers. The layers may be
# turned off or on by commenting/uncommenting lines.
# The grd file is already in GMT format. Generating and using grids is another class in itself, but here I will introduce
# you to using GMT formatted grid files.
# Also included on the map are earthquake locations color coded by depth from the ANSS catalog for 2000 to 2012 (anss_eq_2000_2012.dat)
#MAKE A MAP!
# Define the names of the input and output files
out=cascadia_seis.ps # This will be the name of your map generated by this file
seis_data=anss_eq_2000_2012.dat # ANSS earthquake catalog
topo=./ETOPO1_Bed_g_gmt4.grd # ETOPO1 topography grid
# Define map characteristics
# Define your area
north=50
south=40
east=-118
west=-132
# Define your map boundary annotation
# Here we define tick marks every 2 degrees and we print the degree on the West and South sides of the plot
# and keep the ticks (but don't label) on the east and north sides
tick='-B2/2WSen'
# Define Map Projection
# Here we define a Mercator Projection of size = 15
proj='-JM15'
#Start with GMT commands with embedded definitions....
# Help with any of these commands can be obtained by looking at the 'man' files. Simply type at the command line: man gmt_command
# If the man files are not properly installed you can also type in man gmt_command (e.g., man psbasemap) in google and it will come up.
#This line sets up the 'basemap' meaning here you will define the region, boundary annotations and projections.
#You can accomplish this also with other commands (including psxy, pscoast, etc...), but it is good many times to start with psbasemap.
psbasemap -R$west/$east/$south/$north $proj $tick -P -Y12 -K > $out
# This is your first line of GMT Code!!! Whoo-hoo! In long hand this line would look like this:
#
# psbasemap -R-132/-118/40/50 -JM15 -P -Y12 -K > cascadia_seis.ps
# What the options mean:
# psbasemap = plots postscript basemaps
# -R -- defines the area of your map (note that we defined north, south, east and west above and they are inserted into the -R option.
# The Projection (-JM) and tick marks (-B) were defined above.
# Note that when you call a defined variable, you must include a '$' before the variable name
# -P Sets the figure to "Portrait" mode. No -P is landscape.
# -Y Orients the figure vertically (-X orients it horizontally).
# -K means that there will be more 'stuff' appended to the postscript file.
# '>' means that the command output, which would normally print to screen will be directed into your new file (cascadia_seis.ps, shown here as $out)
# In addition it means that it believes cascadia_seis.ps is a new file. If it is not, it will erase all existing info in the file and re-write it with
# the new information.
#plot grid
# We would like the topography to be the map background, so it needs to be the first layer. Hence, we get started with a hard part...
# Helpful hint...
#
# use grdinfo your_grd_file.grd
# to find info about your grid file, such as the min and max values
#
# You will need to make some color palettes. These are files that tell GMT what colors to use to display certain properties.
# For example, your ETOPO grid has a latitude, longitude and an elevation, and you want to color code the topography
# by elevation. The following lines will tell you how to do that...
# First, make a color palette
# Typing: makecpt
# at the command line will give you information on pre-existing color schemes
# This will make a color palette of typical, pre-defined topography colors:
makecpt -Crelief -T-8000/8000/500 -Z > topo.cpt
#makecpt = makes GMT color palette tables
#-C tells GMT what pre-defined color palette to use
#-T defines the range and increment
#-Z states that the colors will change continuously (rather than discretely)
#topo.cpt is a new file containing your color palette information that will be used later.
#This next line is not necessary, but may be used to make the image appear sharper.
#grdgradient helps to illuminate ridges in the topography from a specified angle.
#grdgradient $topo -A135 -Ne0.8 -Gshadow.grd
#grdgradient=Makes illumination shadow
#-A is the angle from which the light is shown
#-N normalizes the shadow according to equations stated in man grdgradient
#-G lists the name of your output grid
# Overlay the grid onto your map
# Here you are adding the grid as a layer to your postscript file
# This command includes a shadow grid file:
# grdimage $topo -R -J -O -K -Ctopo.cpt -Ishadow.grd >> $out
# This command omits the shadow file:
grdimage $topo -R -J -O -K -Ctopo.cpt >> $out
#grdimage = creates an image from a 2D netcdf grid file
#-R = Sets the region. Notice here I don't have to state the min and max values again.
#-J = Sets the projection. Again the type and size don't have to be restated.
#-O = Overlay. The output for this line is being appended to a previous postscript code
# i.e., you are adding another layer
#-K = You will be appending another layer
#-C = You will be using the color palette topo.cpt
# Now, back to the easy stuff..
# Add coastlines
pscoast -R -J -O -K -W2 -Df -Na -Ia -Lf-130.8/46/10/200+lkm >> $out
#pscoast = adds coastlines
#-W = Sets the line width and color. Default color = black = 0 and does not have to be explicitly stated.
#-Df = What is the resolution of the coastline dataset? f = fine
#-Na = Draws political boundaries, a = draw all the boundaries, see man pscoast for more options
#-Ia = Draw Rivers, a = draw all rivers, see man pscoast for more options
#-Lf = Draw a fancy map scale, f = fancy, centered on -130.8, 46 degrees. +200 = length, +lkm = kilometers
# Add seismic locations and color code them by depth
# Make color palette
# Ahh, another color palette...
# This time, let's make it rainbow colored and call it seis.cpt
makecpt -Crainbow -T0/50/10 -Z > seis.cpt
# Columns 4, 3 and 5 of the data file are the longitude, latitude and depth, respectively. This is the order
# your data need to be in for psxy (see man file)
awk '{print($4,$3,$5)}' $seis_data | psxy -R -J -O -K -W.1 -Sc.1 -Cseis.cpt -H15 >> $out
# psxy = Plot 2D lines, polygons and symbols on a map. Fun fact -- psxyz plots in 3D.
# -W.1 = Draws the black outline of the circles.
# -Sc.1 = Defines the shape and size; c = circle, size = 0.1
# -H = Header. The first 15 lines of the file contain header information and will not be read.
# -C = defines the color palette used to color the circles by depth. We could also make all the
# circles one color. In that case, remove the -C option and use -G instead. -G defines the fill
# color in either gray-scale (0-255) or red/green/blue format. Example colors: -G0 (black);
# -G255 (white); -G255/0/0 (red). GMT also accepts some color names, e.g. -Gblack or -Gred,
# but only a limited number of colors can be specified that way.
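# Aside (not in the original script): you can sanity-check the awk column
# reordering used with psxy above without GMT. The sample line below is made
# up (not from $seis_data); fields 4, 3 and 5 stand in for lon, lat and depth:

```shell
echo "evt01 2015 47.10 -123.50 12.0" | awk '{print($4,$3,$5)}'
# prints: -123.50 47.10 12.0
```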
# Add a scale
psscale -D0/3.2/6/1 -B10:Depth:/:km: -Cseis.cpt -O -K >> $out
# psscale = Adds a scale bar to go with your color palette
# -D = set the position and dimensions of the scale
# -B = set and annotate the scale tick marks and labels.
# -C = specify your color palette
# Now, let's take a subset of seismic data and project them onto a line....
#First, let's view the transect line
#Plot transect line
psxy center.dat -R -J -O -K -W1 -Sc.3 -G255/0/0 >> $out
psxy center.dat -R -J -O -K -W5/255/0/0 >> $out
# You should know the options by now ;-)
#This ends the map-making part of this exercise; now we move on to making a scatter plot from the seismic data.
# PROJECT DATA
# Here we use the GMT command project to take all the data within a certain region and project them onto a line
awk '{print($4,$3,$5)}' $seis_data | project -C-124/47 -A90 -W-.2/.2 -L0/4 -H15 > projection.dat
# project = projects data onto a transect
# Note that the options are different for this command
# -C = defines the center of your transect
# -A = azimuth of transect (CW from N)
# -W = Width of the transect in degrees
# -L = length of transect in degrees
# -H = Header declaration
# projection.dat = new file with the original data and the projected locations
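# Aside (not in the original script): a flat-earth sketch of what project -A90
# does. For an east-west transect the along-transect coordinate is just the
# offset from the center longitude (GMT does the real calculation on the sphere).
# The sample point (lon lat depth) below is made up:

```shell
echo "-123.5 47.1 12.0" | awk -v c=-124 '{print $1-c, $3}'
# prints: 0.5 12.0
```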
# MAKE SCATTER PLOT
#We want the scatter plot to be on the same page as the map, but just below it, so we need to redefine our
#region, projection and tick marks...
east=-120
west=-124
dmin=0
dmax=50
proj=-JX15/-5
tick=-B1:Longitude:/10:Depth:WSen
awk '{print($6,$3)}' projection.dat | psxy -R$west/$east/$dmin/$dmax $proj $tick -W1 -Sc.2 -G200 -O -K -Y-8 -P >> $out
# Columns 6 and 3 are the projected longitude and the depth, respectively
# -Y = Shift the new plot down 8 units. You can designate if you want to shift in centimeters (c), inches (i),
# meters (m), or pixels (p). Otherwise it shifts by whatever is in your gmtdefaults.
# Last, but not least, image your map!
# Common postscript viewers: gs, gv, ggv, open, gimp
# What, you don't like postscript files? That's ok, uncomment this line:
# ps2pdf $out
open $out
</pre>
</div>
Iteration and Recursion in Python2014-04-10T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-04-10:Fibo.html<div class="section" id="iteration-vs-recursion-in-python">
<h2><strong>Iteration vs. Recursion in Python</strong></h2>
<p>For the past week at Hacker School, I took a step back from making cool and awesome projects like the <a class="reference external" href="http://geodesygina.com/vectorprojector/vectorprojector.html">Vector Projector</a> or the <a class="reference external" href="http://geodesygina.com/vectorprojector/vectorprojector.html">Japan Earthquake</a> projects and looked at some good, old-fashioned computer science concepts. Much of what I did was repetitive; I found a really simple problem, solved it using some method, and then tried solving it again using a variety of other methods. One project I worked on was programming the n'th <strong>Fibonacci</strong> number using <strong>Python</strong>. In this blog I will describe <strong>iterative</strong> and <strong>recursive</strong> methods for solving this problem in Python.</p>
<p>What are <strong>Fibonacci</strong> numbers (or series or sequence)? From <a class="reference external" href="http://en.wikipedia.org/wiki/Fibonacci_number">the Fibonacci Wiki Page</a>, the Fibonacci sequence starts with either 0 and 1 or 1 and 1, and each subsequent number in the sequence is simply the sum of the prior two. Hence the Fibonacci sequence looks like:</p>
<pre class="literal-block">
1, 1, 2, 3, 5, 8, 13, 21, 34, 55,...
</pre>
<p>or:</p>
<pre class="literal-block">
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55,...
</pre>
<p>It is mathematically defined as:</p>
<pre class="literal-block">
F(n) = F(n-1) + F(n-2), with F(0) = 0 and F(1) = 1
</pre>
<p>My first approach was an <strong>iterative</strong> method -- basically, hardwire the cases for F(0) and F(1), and then iteratively build up the values for larger n:</p>
<pre class="literal-block">
def F_iter(n):
    if (n == 0):
        return 0
    elif (n == 1):
        return 1
    elif (n > 1):
        fn1 = 0  # F(i-2), starting at F(0)
        fn2 = 1  # F(i-1), starting at F(1)
        for i in range(2, n + 1):
            fn = fn1 + fn2
            fn1 = fn2
            fn2 = fn
        return fn
    else:
        return -1
</pre>
<p>OK, great, this works just fine... Now, let's try writing this recursively. You may ask, what is recursion in computer science? -- I certainly did about a week ago. <strong>Recursion</strong>, according to <a class="reference external" href="http://en.wikipedia.org/wiki/Recursion_(computer_science)">the Recursion Wiki page</a>, is a method where the solution to a problem depends on solutions to smaller instances of the same problem. Computer programs support recursion by allowing a function to call itself (Whoa! -- this concept blew my mind).</p>
<p>A <strong>recursive</strong> solution to find the nth <strong>Fibonacci</strong> number is:</p>
<pre class="literal-block">
def F(n):
    if (n == 0):
        return 0
    elif (n == 1):
        return 1
    elif (n > 1):
        return (F(n-1) + F(n-2))
    else:
        return -1
</pre>
<p>Notice that the recursive approach still defines the F(0) and F(1) cases. For cases where n > 1, however, the function calls itself. Let's take a look at what is happening here. Suppose we call F(4). F(4) will forgo the n = 0 and n = 1 cases and go to the n > 1 case, where it calls the function twice with F(3) and F(2). F(3) and F(2) then each subsequently call the function again -- F(3) calls F(2) and F(1), and F(2) calls F(1) and F(0), as shown in the tree structure below (<strong>Figure 1</strong>). The F(1) and F(0) cases are the final, terminating cases in the tree and return the value of 1 or 0, respectively. So, just as Wikipedia said, the recursive case breaks the problem down into smaller instances, and it does that by allowing the user to define a function that calls itself:</p>
<!DOCTYPE html>
<meta charset="utf-8">
<style type="text/css">
.node {
cursor: pointer;
}
.overlay{
background-color:white;
}
.node circle {
fill: #fff;
stroke: steelblue;
stroke-width: 1.5px;
}
.node text {
font-size:10px;
font-family:sans-serif;
}
.link {
fill: none;
stroke: #ccc;
stroke-width: 10px;
}
.templink {
fill: none;
stroke: red;
stroke-width: 3px;
}
.ghostCircle.show{
display:block;
}
.ghostCircle, .activeDrag .ghostCircle{
display: none;
}
</style>
<script src="http://code.jquery.com/jquery-1.10.2.min.js"></script>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="dndTree.js"></script>
<body>
<div id="tree-container"></div>
</body>
</html>
<p><em>Figure 1. Tree structure demonstrating the recursion flow for a case where n = 4. Figure made using</em> <a class="reference external" href="http://d3js.org/">D3.js</a>.</p>
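The repeated calls in the tree can be counted directly by instrumenting the recursive function with a tally. This is a sketch, not part of the original post; the <tt>calls</tt> counter (built on the standard library's <tt>collections.Counter</tt>) is my own addition:

```python
from collections import Counter

calls = Counter()  # tallies how many times F(k) is invoked for each k

def F(n):
    calls[n] += 1
    if n == 0:
        return 0
    elif n == 1:
        return 1
    elif n > 1:
        return F(n - 1) + F(n - 2)
    else:
        return -1

F(4)
print(sorted(calls.items()))  # [(0, 2), (1, 3), (2, 2), (3, 1), (4, 1)]
```

For n = 4, F(2) is computed twice and F(1) three times, exactly as the tree suggests.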
<p>Fantastic! Now we have both the iterative and recursive versions of this code, and we can see how recursive coding works. Playing around with these codes, however, I noticed that the run time of the recursive version was MUCH slower than the iterative one for large n. I included a timer (time.time()) to measure how long each method takes for a range of n. What I got is a figure that looks like this:</p>
<html><head><title>mpld3 plot</title></head><body>
<style>
</style>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="/mpld3.v0.2git.min.js"></script>
<div id="fig91044974148646267950353"></div>
<script>
function mpld3_load_lib(url, callback){
var s = document.createElement('script');
s.src = url;
s.async = true;
s.onreadystatechange = s.onload = callback;
s.onerror = function(){console.warn("failed to load library " + url);};
document.getElementsByTagName("head")[0].appendChild(s);
}
function create_fig91044974148646267950353(){
mpld3.draw_figure("fig91044974148646267950353", {"width": 640.0, "axes": [{"xlim": [0.0, 0.035000000000000003], "yscale": "linear", "axesbg": "#FFFFFF", "texts": [{"v_baseline": "auto", "h_anchor": "middle", "color": "#000000", "text": "n", "coordinates": "axes", "zorder": 3, "alpha": 1, "fontsize": 12.0, "position": [-0.050655241935483875, 0.5], "rotation": -90.0, "id": "9104497489616"}], "zoomable": true, "images": [], "xdomain": [0.0, 0.035000000000000003], "ylim": [0.0, 25.0], "paths": [], "sharey": [], "sharex": [], "axesbgalpha": null, "axes": [{"grid": {"gridOn": false}, "position": "bottom", "nticks": 9, "tickvalues": null, "tickformat": null}, {"grid": {"gridOn": false}, "position": "left", "nticks": 6, "tickvalues": null, "tickformat": null}], "lines": [{"color": "#FF0000", "yindex": 1, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 0, "linewidth": 10, "data": "data01", "id": "9104548429520"}, {"color": "#0000FF", "yindex": 1, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 2, "linewidth": 10, "data": "data01", "id": "9104548589584"}], "markers": [], "id": "9104497469584", "ydomain": [0.0, 25.0], "collections": [], "xscale": "linear", "bbox": [0.125, 0.53636363636363638, 0.77500000000000002, 0.36363636363636365]}, {"xlim": [0.0, 0.035000000000000003], "yscale": "linear", "axesbg": "#FFFFFF", "texts": [{"v_baseline": "hanging", "h_anchor": "middle", "color": "#000000", "text": "Time(seconds)", "coordinates": "axes", "zorder": 3, "alpha": 1, "fontsize": 12.0, "position": [0.5, -0.13177083333333334], "rotation": -0.0, "id": "9104548443216"}, {"v_baseline": "auto", "h_anchor": "middle", "color": "#000000", "text": "Fibonacci Value", "coordinates": "axes", "zorder": 3, "alpha": 1, "fontsize": 12.0, "position": [-0.10282258064516128, 0.5], "rotation": -90.0, "id": "9104548467664"}], "zoomable": true, "images": [], "xdomain": [0.0, 0.035000000000000003], "ylim": [0.0, 50000.0], "paths": [], "sharey": 
[], "sharex": [], "axesbgalpha": null, "axes": [{"grid": {"gridOn": false}, "position": "bottom", "nticks": 9, "tickvalues": null, "tickformat": null}, {"grid": {"gridOn": false}, "position": "left", "nticks": 6, "tickvalues": null, "tickformat": null}], "lines": [{"color": "#FF0000", "yindex": 3, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 0, "linewidth": 10, "data": "data01", "id": "9104548588944"}, {"color": "#0000FF", "yindex": 4, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 2, "linewidth": 10, "data": "data01", "id": "9104548591312"}], "markers": [], "id": "9104548430160", "ydomain": [0.0, 50000.0], "collections": [], "xscale": "linear", "bbox": [0.125, 0.099999999999999978, 0.77500000000000002, 0.36363636363636365]}], "data": {"data01": [[9.5367431640625e-07, 0.0, 9.5367431640625e-07, 0.0, 0.0], [0.0, 1.0, 9.5367431640625e-07, 1.0, 1.0], [9.5367431640625e-07, 2.0, 1.9073486328125e-06, 1.0, 0.0], [9.5367431640625e-07, 3.0, 9.5367431640625e-07, 2.0, 0.0], [2.1457672119140625e-06, 4.0, 2.1457672119140625e-06, 3.0, 3.0], [3.0994415283203125e-06, 5.0, 9.5367431640625e-07, 5.0, 5.0], [5.0067901611328125e-06, 6.0, 1.9073486328125e-06, 8.0, 8.0], [9.059906005859375e-06, 7.0, 1.9073486328125e-06, 13.0, 13.0], [1.5020370483398438e-05, 8.0, 2.1457672119140625e-06, 21.0, 21.0], [2.288818359375e-05, 9.0, 2.1457672119140625e-06, 34.0, 34.0], [3.790855407714844e-05, 10.0, 1.9073486328125e-06, 55.0, 55.0], [6.198883056640625e-05, 11.0, 2.1457672119140625e-06, 89.0, 89.0], [9.894371032714844e-05, 12.0, 2.86102294921875e-06, 144.0, 144.0], [0.00015997886657714844, 13.0, 3.0994415283203125e-06, 233.0, 233.0], [0.0002570152282714844, 14.0, 2.86102294921875e-06, 377.0, 377.0], [0.00041413307189941406, 15.0, 3.0994415283203125e-06, 610.0, 610.0], [0.0006690025329589844, 16.0, 2.86102294921875e-06, 987.0, 987.0], [0.0010819435119628906, 17.0, 2.86102294921875e-06, 1597.0, 1597.0], [0.001809835433959961, 18.0, 
3.0994415283203125e-06, 2584.0, 2584.0], [0.0029108524322509766, 19.0, 2.1457672119140625e-06, 4181.0, 4181.0], [0.004680156707763672, 20.0, 2.86102294921875e-06, 6765.0, 6765.0], [0.008035898208618164, 21.0, 2.86102294921875e-06, 10946.0, 10946.0], [0.012122869491577148, 22.0, 4.0531158447265625e-06, 17711.0, 17711.0], [0.01978015899658203, 23.0, 3.0994415283203125e-06, 28657.0, 28657.0], [0.03191089630126953, 24.0, 2.86102294921875e-06, 46368.0, 46368.0]]}, "id": "9104497414864", "toolbar": ["reset", "move"], "height": 480.0});
}
if(typeof(mpld3) !== "undefined"){
// already loaded: just create the figure
create_fig91044974148646267950353();
}else if(typeof define === "function" && define.amd){
// require.js is available: use it to load d3/mpld3
require.config({paths: {d3: "/d3"}});
require(["d3"], function(d3){
window.d3 = d3;
mpld3_load_lib("/mpld3.v0.2git.min.js", create_fig91044974148646267950353);
});
}else{
// require.js not available: dynamically load d3 & mpld3
mpld3_load_lib("http://d3js.org/d3.v3.min.js", function(){
mpld3_load_lib("/mpld3.v0.2git.min.js", create_fig91044974148646267950353);})
}
</script></body></html>
<p><em>Figure 2. Run times for calculating a range of n (0-25) using the iterative (blue) and recursive (red) approaches. Figure made with</em> <a class="reference external" href="http://mpld3.github.io/">mpld3</a>.</p>
<p>The red line represents recursive run times and the blue line iterative run times. For small values of n the two methods take roughly the same time, but for large values of n the run time really starts to lengthen for the recursive case! So, why is this? Let's take another look at <strong>Figure 1</strong>. The tree structure shows that the F(4) and F(3) cases are only calculated once, but the smaller cases are calculated multiple times! For example, the F(1) case is calculated 3 times. Hence, the recursive method calculates the same thing over and over, wasting valuable run time.</p>
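A minimal version of that timing experiment can be sketched as follows (this uses time.time() as in the post, with slightly condensed versions of the two functions; exact timings will vary by machine):

```python
import time

def F_iter(n):
    # iterative version: constant work per step
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def F(n):
    # recursive version: recomputes the same subproblems repeatedly
    if n < 2:
        return n
    return F(n - 1) + F(n - 2)

for n in (10, 20, 25):
    t0 = time.time(); F_iter(n); t_iter = time.time() - t0
    t0 = time.time(); F(n); t_rec = time.time() - t0
    print(n, t_iter, t_rec)  # t_rec grows roughly exponentially with n
```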
<p>So why use recursion? Well, the recursive code is a lot easier to read. For the n > 1 case, the mathematical equation is explicitly written out, whereas in the iterative case the programmer has to step through the script to understand what is going on. But the elephant in the room is that naive recursion in Python is REALLY slow. <strong>Memoization</strong> (pronounced like Elmer Fudd trying to say memorization) is a technique used to deal with this problem. Memoization and memorization are kind of synonymous in this case -- we want to make the program 'memorize' the results from previous runs. These memorized results are then reused for subsequent, repeated calls. We do this by storing each value in a hash table (a Python dictionary). This involves a simple modification of the code:</p>
<pre class="literal-block">
memo = {}
def F_mem(n):
    if (n == 0):
        return 0
    elif (n == 1):
        return 1
    elif (n > 1):
        if n not in memo:
            memo[n] = (F_mem(n-1) + F_mem(n-2))
        return memo[n]
    else:
        return -1
</pre>
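As an aside, modern Python can do this memoization for you with functools.lru_cache from the standard library. This is a sketch of the same idea, not part of the original workshop code:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # remembers each F(n) after its first computation
def F_cached(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    elif n > 1:
        return F_cached(n - 1) + F_cached(n - 2)
    else:
        return -1

print(F_cached(50))  # 12586269025
```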
<p>Now, let's run our timer again, but this time use the memoized recursion:</p>
<html><head><title>mpld3 plot</title></head><body>
<style>
</style>
<div id="fig72644029038885532319697"></div>
<script>
function mpld3_load_lib(url, callback){
var s = document.createElement('script');
s.src = url;
s.async = true;
s.onreadystatechange = s.onload = callback;
s.onerror = function(){console.warn("failed to load library " + url);};
document.getElementsByTagName("head")[0].appendChild(s);
}
function create_fig72644029038885532319697(){
mpld3.draw_figure("fig72644029038885532319697", {"width": 640.0, "axes": [{"xlim": [0.0, 2.4999999999999998e-05], "yscale": "linear", "axesbg": "#FFFFFF", "texts": [{"v_baseline": "auto", "h_anchor": "middle", "color": "#000000", "text": "n", "coordinates": "axes", "zorder": 3, "alpha": 1, "fontsize": 12.0, "position": [-0.067792338709677422, 0.5], "rotation": -90.0, "id": "7264402978640"}], "zoomable": true, "images": [], "xdomain": [0.0, 2.4999999999999998e-05], "ylim": [0.0, 200.0], "paths": [], "sharey": [], "sharex": [], "axesbgalpha": null, "axes": [{"grid": {"gridOn": false}, "position": "bottom", "nticks": 6, "tickvalues": null, "tickformat": null}, {"grid": {"gridOn": false}, "position": "left", "nticks": 5, "tickvalues": null, "tickformat": null}], "lines": [{"color": "#FF0000", "yindex": 1, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 0, "linewidth": 10, "data": "data01", "id": "7264453795664"}, {"color": "#0000FF", "yindex": 1, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 2, "linewidth": 10, "data": "data01", "id": "7264453963920"}], "markers": [], "id": "7264402954512", "ydomain": [0.0, 200.0], "collections": [], "xscale": "linear", "bbox": [0.125, 0.53636363636363638, 0.77500000000000002, 0.36363636363636365]}, {"xlim": [0.0, 2.4999999999999998e-05], "yscale": "linear", "axesbg": "#FFFFFF", "texts": [{"v_baseline": "hanging", "h_anchor": "middle", "color": "#000000", "text": "Time(seconds)", "coordinates": "axes", "zorder": 3, "alpha": 1, "fontsize": 12.0, "position": [0.5, -0.13177083333333334], "rotation": -0.0, "id": "7264453809360"}, {"v_baseline": "auto", "h_anchor": "middle", "color": "#000000", "text": "Fibonacci Value", "coordinates": "axes", "zorder": 3, "alpha": 1, "fontsize": 12.0, "position": [-0.10434167786738351, 0.5], "rotation": -90.0, "id": "7264453837904"}], "zoomable": true, "images": [], "xdomain": [0.0, 2.4999999999999998e-05], "ylim": [0.0, 
1.8000000000000001e+41], "paths": [], "sharey": [], "sharex": [], "axesbgalpha": null, "axes": [{"grid": {"gridOn": false}, "position": "bottom", "nticks": 6, "tickvalues": null, "tickformat": null}, {"grid": {"gridOn": false}, "position": "left", "nticks": 10, "tickvalues": null, "tickformat": null}], "lines": [{"color": "#FF0000", "yindex": 3, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 0, "linewidth": 10, "data": "data01", "id": "7264453963280"}, {"color": "#0000FF", "yindex": 4, "coordinates": "data", "dasharray": "10,0", "zorder": 2, "alpha": 1, "xindex": 2, "linewidth": 10, "data": "data01", "id": "7264453965648"}], "markers": [], "id": "7264453796304", "ydomain": [0.0, 1.8000000000000001e+41], "collections": [], "xscale": "linear", "bbox": [0.125, 0.099999999999999978, 0.77500000000000002, 0.36363636363636365]}], "data": {"data01": [[1.1920928955078125e-06, 0.0, 1.1920928955078125e-06, 0.0, 0.0], [9.5367431640625e-07, 1.0, 0.0, 1.0, 1.0], [1.1920928955078125e-06, 2.0, 2.1457672119140625e-06, 1.0, 0.0], [9.5367431640625e-07, 3.0, 9.5367431640625e-07, 2.0, 0.0], [9.5367431640625e-07, 4.0, 9.5367431640625e-07, 3.0, 3.0], [1.9073486328125e-06, 5.0, 2.1457672119140625e-06, 5.0, 5.0], [2.1457672119140625e-06, 6.0, 1.9073486328125e-06, 8.0, 8.0], [1.9073486328125e-06, 7.0, 1.1920928955078125e-06, 13.0, 13.0], [1.9073486328125e-06, 8.0, 1.9073486328125e-06, 21.0, 21.0], [1.9073486328125e-06, 9.0, 2.1457672119140625e-06, 34.0, 34.0], [9.5367431640625e-07, 10.0, 1.9073486328125e-06, 55.0, 55.0], [2.1457672119140625e-06, 11.0, 2.1457672119140625e-06, 89.0, 89.0], [1.9073486328125e-06, 12.0, 1.9073486328125e-06, 144.0, 144.0], [9.5367431640625e-07, 13.0, 1.9073486328125e-06, 233.0, 233.0], [1.1920928955078125e-06, 14.0, 2.1457672119140625e-06, 377.0, 377.0], [9.5367431640625e-07, 15.0, 1.9073486328125e-06, 610.0, 610.0], [9.5367431640625e-07, 16.0, 3.0994415283203125e-06, 987.0, 987.0], [9.5367431640625e-07, 17.0, 
3.0994415283203125e-06, 1597.0, 1597.0], [9.5367431640625e-07, 18.0, 2.86102294921875e-06, 2584.0, 2584.0], [9.5367431640625e-07, 19.0, 6.198883056640625e-06, 4181.0, 4181.0], [9.5367431640625e-07, 20.0, 2.86102294921875e-06, 6765.0, 6765.0], [2.1457672119140625e-06, 21.0, 2.86102294921875e-06, 10946.0, 10946.0], [1.9073486328125e-06, 22.0, 3.0994415283203125e-06, 17711.0, 17711.0], [1.9073486328125e-06, 23.0, 3.0994415283203125e-06, 28657.0, 28657.0], [9.5367431640625e-07, 24.0, 3.814697265625e-06, 46368.0, 46368.0], [9.5367431640625e-07, 25.0, 3.0994415283203125e-06, 75025.0, 75025.0], [9.5367431640625e-07, 26.0, 4.0531158447265625e-06, 121393.0, 121393.0], [9.5367431640625e-07, 27.0, 4.0531158447265625e-06, 196418.0, 196418.0], [1.1920928955078125e-06, 28.0, 4.0531158447265625e-06, 317811.0, 317811.0], [9.5367431640625e-07, 29.0, 4.0531158447265625e-06, 514229.0, 514229.0], [9.5367431640625e-07, 30.0, 4.0531158447265625e-06, 832040.0, 832040.0], [1.1920928955078125e-06, 31.0, 4.0531158447265625e-06, 1346269.0, 1346269.0], [9.5367431640625e-07, 32.0, 3.814697265625e-06, 2178309.0, 2178309.0], [9.5367431640625e-07, 33.0, 4.0531158447265625e-06, 3524578.0, 3524578.0], [2.1457672119140625e-06, 34.0, 4.0531158447265625e-06, 5702887.0, 5702887.0], [1.9073486328125e-06, 35.0, 5.0067901611328125e-06, 9227465.0, 9227465.0], [2.1457672119140625e-06, 36.0, 5.0067901611328125e-06, 14930352.0, 14930352.0], [1.9073486328125e-06, 37.0, 3.814697265625e-06, 24157817.0, 24157817.0], [2.1457672119140625e-06, 38.0, 3.814697265625e-06, 39088169.0, 39088169.0], [1.9073486328125e-06, 39.0, 4.76837158203125e-06, 63245986.0, 63245986.0], [9.5367431640625e-07, 40.0, 5.0067901611328125e-06, 102334155.0, 102334155.0], [2.1457672119140625e-06, 41.0, 5.0067901611328125e-06, 165580141.0, 165580141.0], [1.9073486328125e-06, 42.0, 5.0067901611328125e-06, 267914296.0, 267914296.0], [9.5367431640625e-07, 43.0, 5.0067901611328125e-06, 433494437.0, 433494437.0], [9.5367431640625e-07, 44.0, 
5.9604644775390625e-06, 701408733.0, 701408733.0], [9.5367431640625e-07, 45.0, 5.9604644775390625e-06, 1134903170.0, 1134903170.0], [9.5367431640625e-07, 46.0, 5.9604644775390625e-06, 1836311903.0, 1836311903.0], [1.9073486328125e-06, 47.0, 5.9604644775390625e-06, 2971215073.0, 2971215073.0], [1.9073486328125e-06, 48.0, 5.9604644775390625e-06, 4807526976.0, 4807526976.0], [2.1457672119140625e-06, 49.0, 5.9604644775390625e-06, 7778742049.0, 7778742049.0], [9.5367431640625e-07, 50.0, 5.0067901611328125e-06, 12586269025.0, 12586269025.0], [9.5367431640625e-07, 51.0, 5.9604644775390625e-06, 20365011074.0, 20365011074.0], [9.5367431640625e-07, 52.0, 5.9604644775390625e-06, 32951280099.0, 32951280099.0], [9.5367431640625e-07, 53.0, 5.9604644775390625e-06, 53316291173.0, 53316291173.0], [1.1920928955078125e-06, 54.0, 6.198883056640625e-06, 86267571272.0, 86267571272.0], [9.5367431640625e-07, 55.0, 5.9604644775390625e-06, 139583862445.0, 139583862445.0], [9.5367431640625e-07, 56.0, 6.9141387939453125e-06, 225851433717.0, 225851433717.0], [9.5367431640625e-07, 57.0, 7.152557373046875e-06, 365435296162.0, 365435296162.0], [9.5367431640625e-07, 58.0, 6.9141387939453125e-06, 591286729879.0, 591286729879.0], [2.1457672119140625e-06, 59.0, 6.9141387939453125e-06, 956722026041.0, 956722026041.0], [1.9073486328125e-06, 60.0, 6.9141387939453125e-06, 1548008755920.0, 1548008755920.0], [9.5367431640625e-07, 61.0, 7.152557373046875e-06, 2504730781961.0, 2504730781961.0], [9.5367431640625e-07, 62.0, 6.9141387939453125e-06, 4052739537881.0, 4052739537881.0], [9.5367431640625e-07, 63.0, 6.198883056640625e-06, 6557470319842.0, 6557470319842.0], [9.5367431640625e-07, 64.0, 6.9141387939453125e-06, 10610209857723.0, 10610209857723.0], [9.5367431640625e-07, 65.0, 7.152557373046875e-06, 17167680177565.0, 17167680177565.0], [9.5367431640625e-07, 66.0, 6.9141387939453125e-06, 27777890035288.0, 27777890035288.0], [2.1457672119140625e-06, 67.0, 7.152557373046875e-06, 44945570212853.0, 
44945570212853.0], [1.9073486328125e-06, 68.0, 6.9141387939453125e-06, 72723460248141.0, 72723460248141.0], [1.9073486328125e-06, 69.0, 7.152557373046875e-06, 117669030460994.0, 117669030460994.0], [2.1457672119140625e-06, 70.0, 6.9141387939453125e-06, 190392490709135.0, 190392490709135.0], [9.5367431640625e-07, 71.0, 7.152557373046875e-06, 308061521170129.0, 308061521170129.0], [9.5367431640625e-07, 72.0, 8.106231689453125e-06, 498454011879264.0, 498454011879264.0], [9.5367431640625e-07, 73.0, 7.867813110351562e-06, 806515533049393.0, 806515533049393.0], [9.5367431640625e-07, 74.0, 8.106231689453125e-06, 1304969544928657.0, 1304969544928657.0], [9.5367431640625e-07, 75.0, 8.106231689453125e-06, 2111485077978050.0, 2111485077978050.0], [9.5367431640625e-07, 76.0, 7.867813110351562e-06, 3416454622906707.0, 3416454622906707.0], [1.1920928955078125e-06, 77.0, 8.106231689453125e-06, 5527939700884757.0, 5527939700884757.0], [9.5367431640625e-07, 78.0, 8.106231689453125e-06, 8944394323791464.0, 8944394323791464.0], [1.9073486328125e-06, 79.0, 8.821487426757812e-06, 1.447233402467622e+16, 1.447233402467622e+16], [9.5367431640625e-07, 80.0, 9.059906005859375e-06, 2.3416728348467684e+16, 2.3416728348467684e+16], [9.5367431640625e-07, 81.0, 9.059906005859375e-06, 3.78890623731439e+16, 3.78890623731439e+16], [1.9073486328125e-06, 82.0, 9.059906005859375e-06, 6.130579072161159e+16, 6.130579072161159e+16], [2.1457672119140625e-06, 83.0, 7.867813110351562e-06, 9.91948530947555e+16, 9.91948530947555e+16], [9.5367431640625e-07, 84.0, 8.821487426757812e-06, 1.605006438163671e+17, 1.605006438163671e+17], [9.5367431640625e-07, 85.0, 9.059906005859375e-06, 2.596954969111226e+17, 2.596954969111226e+17], [9.5367431640625e-07, 86.0, 1.0013580322265625e-05, 4.2019614072748966e+17, 4.2019614072748966e+17], [3.0994415283203125e-06, 87.0, 9.059906005859375e-06, 6.798916376386122e+17, 6.798916376386122e+17], [9.5367431640625e-07, 88.0, 9.059906005859375e-06, 1.1000877783661019e+18, 
1.1000877783661019e+18], [9.5367431640625e-07, 89.0, 1.0013580322265625e-05, 1.7799794160047142e+18, 1.7799794160047142e+18], [9.5367431640625e-07, 90.0, 1.0013580322265625e-05, 2.880067194370816e+18, 2.880067194370816e+18], [9.5367431640625e-07, 91.0, 1.0013580322265625e-05, 4.66004661037553e+18, 4.66004661037553e+18], [2.1457672119140625e-06, 92.0, 1.0013580322265625e-05, 7.540113804746346e+18, 7.540113804746346e+18], [9.5367431640625e-07, 93.0, 1.0967254638671875e-05, 1.2200160415121877e+19, 1.2200160415121877e+19], [9.5367431640625e-07, 94.0, 1.0013580322265625e-05, 1.974027421986822e+19, 1.974027421986822e+19], [1.9073486328125e-06, 95.0, 1.0013580322265625e-05, 3.19404346349901e+19, 3.19404346349901e+19], [1.9073486328125e-06, 96.0, 9.775161743164062e-06, 5.168070885485833e+19, 5.168070885485833e+19], [2.1457672119140625e-06, 97.0, 1.0013580322265625e-05, 8.362114348984843e+19, 8.362114348984843e+19], [9.5367431640625e-07, 98.0, 1.0013580322265625e-05, 1.3530185234470674e+20, 1.3530185234470674e+20], [9.5367431640625e-07, 99.0, 1.0013580322265625e-05, 2.1892299583455517e+20, 2.1892299583455517e+20], [1.1920928955078125e-06, 100.0, 1.0967254638671875e-05, 3.542248481792619e+20, 3.542248481792619e+20], [9.5367431640625e-07, 101.0, 1.0967254638671875e-05, 5.731478440138171e+20, 5.731478440138171e+20], [9.5367431640625e-07, 102.0, 1.0967254638671875e-05, 9.27372692193079e+20, 9.27372692193079e+20], [9.5367431640625e-07, 103.0, 1.1920928955078125e-05, 1.500520536206896e+21, 1.500520536206896e+21], [9.5367431640625e-07, 104.0, 1.0967254638671875e-05, 2.427893228399975e+21, 2.427893228399975e+21], [1.1920928955078125e-06, 105.0, 1.0967254638671875e-05, 3.928413764606871e+21, 3.928413764606871e+21], [9.5367431640625e-07, 106.0, 1.1920928955078125e-05, 6.356306993006847e+21, 6.356306993006847e+21], [1.9073486328125e-06, 107.0, 1.2159347534179688e-05, 1.0284720757613718e+22, 1.0284720757613718e+22], [2.1457672119140625e-06, 108.0, 1.1920928955078125e-05, 
1.6641027750620564e+22, 1.6641027750620564e+22], [9.5367431640625e-07, 109.0, 1.1920928955078125e-05, 2.692574850823428e+22, 2.692574850823428e+22], [2.1457672119140625e-06, 110.0, 1.1920928955078125e-05, 4.356677625885484e+22, 4.356677625885484e+22], [9.5367431640625e-07, 111.0, 1.1920928955078125e-05, 7.049252476708912e+22, 7.049252476708912e+22], [9.5367431640625e-07, 112.0, 1.3113021850585938e-05, 1.1405930102594397e+23, 1.1405930102594397e+23], [1.1920928955078125e-06, 113.0, 1.2874603271484375e-05, 1.8455182579303308e+23, 1.8455182579303308e+23], [9.5367431640625e-07, 114.0, 1.3113021850585938e-05, 2.9861112681897705e+23, 2.9861112681897705e+23], [9.5367431640625e-07, 115.0, 1.2159347534179688e-05, 4.831629526120102e+23, 4.831629526120102e+23], [1.1920928955078125e-06, 116.0, 1.2874603271484375e-05, 7.817740794309872e+23, 7.817740794309872e+23], [9.5367431640625e-07, 117.0, 1.3113021850585938e-05, 1.2649370320429975e+24, 1.2649370320429975e+24], [9.5367431640625e-07, 118.0, 1.2874603271484375e-05, 2.0467111114739846e+24, 2.0467111114739846e+24], [9.5367431640625e-07, 119.0, 1.2874603271484375e-05, 3.311648143516982e+24, 3.311648143516982e+24], [9.5367431640625e-07, 120.0, 1.4066696166992188e-05, 5.358359254990966e+24, 5.358359254990966e+24], [2.1457672119140625e-06, 121.0, 1.4066696166992188e-05, 8.670007398507949e+24, 8.670007398507949e+24], [1.9073486328125e-06, 122.0, 1.3828277587890625e-05, 1.4028366653498915e+25, 1.4028366653498915e+25], [2.1457672119140625e-06, 123.0, 1.4066696166992188e-05, 2.2698374052006864e+25, 2.2698374052006864e+25], [9.5367431640625e-07, 124.0, 1.4066696166992188e-05, 3.672674070550578e+25, 3.672674070550578e+25], [9.5367431640625e-07, 125.0, 1.4066696166992188e-05, 5.9425114757512645e+25, 5.9425114757512645e+25], [1.1920928955078125e-06, 126.0, 1.4066696166992188e-05, 9.615185546301842e+25, 9.615185546301842e+25], [9.5367431640625e-07, 127.0, 1.4066696166992188e-05, 1.5557697022053105e+26, 1.5557697022053105e+26], 
[2.1457672119140625e-06, 128.0, 1.4066696166992188e-05, 2.517288256835495e+26, 2.517288256835495e+26], [9.5367431640625e-07, 129.0, 1.5020370483398438e-05, 4.073057959040806e+26, 4.073057959040806e+26], [2.1457672119140625e-06, 130.0, 1.5020370483398438e-05, 6.590346215876301e+26, 6.590346215876301e+26], [9.5367431640625e-07, 131.0, 1.4066696166992188e-05, 1.0663404174917106e+27, 1.0663404174917106e+27], [9.5367431640625e-07, 132.0, 1.5020370483398438e-05, 1.7253750390793406e+27, 1.7253750390793406e+27], [9.5367431640625e-07, 133.0, 1.5020370483398438e-05, 2.7917154565710513e+27, 2.7917154565710513e+27], [9.5367431640625e-07, 134.0, 1.5974044799804688e-05, 4.517090495650392e+27, 4.517090495650392e+27], [9.5367431640625e-07, 135.0, 1.5974044799804688e-05, 7.308805952221443e+27, 7.308805952221443e+27], [1.1920928955078125e-06, 136.0, 1.5974044799804688e-05, 1.1825896447871835e+28, 1.1825896447871835e+28], [1.9073486328125e-06, 137.0, 1.5974044799804688e-05, 1.9134702400093278e+28, 1.9134702400093278e+28], [1.9073486328125e-06, 138.0, 1.5020370483398438e-05, 3.0960598847965113e+28, 3.0960598847965113e+28], [2.1457672119140625e-06, 139.0, 1.5974044799804688e-05, 5.009530124805839e+28, 5.009530124805839e+28], [1.9073486328125e-06, 140.0, 1.6927719116210938e-05, 8.105590009602351e+28, 8.105590009602351e+28], [2.1457672119140625e-06, 141.0, 1.5974044799804688e-05, 1.3115120134408189e+29, 1.3115120134408189e+29], [1.9073486328125e-06, 142.0, 1.5974044799804688e-05, 2.122071014401054e+29, 2.122071014401054e+29], [1.9073486328125e-06, 143.0, 1.5974044799804688e-05, 3.433583027841873e+29, 3.433583027841873e+29], [1.1920928955078125e-06, 144.0, 1.5974044799804688e-05, 5.555654042242927e+29, 5.555654042242927e+29], [1.9073486328125e-06, 145.0, 1.6927719116210938e-05, 8.9892370700848e+29, 8.9892370700848e+29], [9.5367431640625e-07, 146.0, 1.71661376953125e-05, 1.4544891112327727e+30, 1.4544891112327727e+30], [9.5367431640625e-07, 147.0, 1.6927719116210938e-05, 
2.3534128182412526e+30, 2.3534128182412526e+30], [9.5367431640625e-07, 148.0, 1.71661376953125e-05, 3.8079019294740253e+30, 3.8079019294740253e+30], [9.5367431640625e-07, 149.0, 1.6927719116210938e-05, 6.161314747715278e+30, 6.161314747715278e+30], [1.9073486328125e-06, 150.0, 1.71661376953125e-05, 9.969216677189303e+30, 9.969216677189303e+30], [9.5367431640625e-07, 151.0, 1.6927719116210938e-05, 1.613053142490458e+31, 1.613053142490458e+31], [1.1920928955078125e-06, 152.0, 1.71661376953125e-05, 2.6099748102093883e+31, 2.6099748102093883e+31], [9.5367431640625e-07, 153.0, 1.7881393432617188e-05, 4.223027952699846e+31, 4.223027952699846e+31], [9.5367431640625e-07, 154.0, 1.811981201171875e-05, 6.833002762909235e+31, 6.833002762909235e+31], [9.5367431640625e-07, 155.0, 1.7881393432617188e-05, 1.1056030715609081e+32, 1.1056030715609081e+32], [9.5367431640625e-07, 156.0, 1.811981201171875e-05, 1.7889033478518318e+32, 1.7889033478518318e+32], [1.1920928955078125e-06, 157.0, 1.811981201171875e-05, 2.89450641941274e+32, 2.89450641941274e+32], [9.5367431640625e-07, 158.0, 1.7881393432617188e-05, 4.683409767264571e+32, 4.683409767264571e+32], [9.5367431640625e-07, 159.0, 1.8835067749023438e-05, 7.577916186677312e+32, 7.577916186677312e+32], [9.5367431640625e-07, 160.0, 1.9073486328125e-05, 1.2261325953941882e+33, 1.2261325953941882e+33], [9.5367431640625e-07, 161.0, 1.9073486328125e-05, 1.9839242140619194e+33, 1.9839242140619194e+33], [1.1920928955078125e-06, 162.0, 1.8835067749023438e-05, 3.2100568094561075e+33, 3.2100568094561075e+33], [9.5367431640625e-07, 163.0, 2.002716064453125e-05, 5.193981023518027e+33, 5.193981023518027e+33], [9.5367431640625e-07, 164.0, 2.002716064453125e-05, 8.404037832974135e+33, 8.404037832974135e+33], [9.5367431640625e-07, 165.0, 2.002716064453125e-05, 1.3598018856492163e+34, 1.3598018856492163e+34], [9.5367431640625e-07, 166.0, 2.002716064453125e-05, 2.2002056689466297e+34, 2.2002056689466297e+34], [9.5367431640625e-07, 167.0, 
1.9073486328125e-05, 3.5600075545958458e+34, 3.5600075545958458e+34], [2.1457672119140625e-06, 168.0, 1.9073486328125e-05, 5.760213223542476e+34, 5.760213223542476e+34], [9.5367431640625e-07, 169.0, 2.002716064453125e-05, 9.32022077813832e+34, 9.32022077813832e+34], [2.1457672119140625e-06, 170.0, 2.002716064453125e-05, 1.5080434001680798e+35, 1.5080434001680798e+35], [1.9073486328125e-06, 171.0, 2.09808349609375e-05, 2.440065477981912e+35, 2.440065477981912e+35], [1.9073486328125e-06, 172.0, 2.002716064453125e-05, 3.948108878149992e+35, 3.948108878149992e+35], [2.1457672119140625e-06, 173.0, 2.002716064453125e-05, 6.3881743561319036e+35, 6.3881743561319036e+35], [1.9073486328125e-06, 174.0, 2.09808349609375e-05, 1.0336283234281895e+36, 1.0336283234281895e+36], [2.1457672119140625e-06, 175.0, 2.002716064453125e-05, 1.67244575904138e+36, 1.67244575904138e+36], [1.9073486328125e-06, 176.0, 2.09808349609375e-05, 2.7060740824695694e+36, 2.7060740824695694e+36], [2.1457672119140625e-06, 177.0, 2.09808349609375e-05, 4.3785198415109494e+36, 4.3785198415109494e+36], [1.9073486328125e-06, 178.0, 2.193450927734375e-05, 7.084593923980518e+36, 7.084593923980518e+36], [1.9073486328125e-06, 179.0, 2.09808349609375e-05, 1.1463113765491467e+37, 1.1463113765491467e+37], [2.1457672119140625e-06, 180.0, 2.193450927734375e-05, 1.8547707689471987e+37, 1.8547707689471987e+37], [1.9073486328125e-06, 181.0, 2.193450927734375e-05, 3.0010821454963454e+37, 3.0010821454963454e+37], [2.1457672119140625e-06, 182.0, 2.193450927734375e-05, 4.855852914443544e+37, 4.855852914443544e+37], [1.9073486328125e-06, 183.0, 2.2172927856445312e-05, 7.856935059939889e+37, 7.856935059939889e+37], [1.9073486328125e-06, 184.0, 2.193450927734375e-05, 1.2712787974383434e+38, 1.2712787974383434e+38], [2.1457672119140625e-06, 185.0, 2.2172927856445312e-05, 2.0569723034323324e+38, 2.0569723034323324e+38], [1.9073486328125e-06, 186.0, 2.288818359375e-05, 3.3282511008706755e+38, 3.3282511008706755e+38], 
[1.9073486328125e-06, 187.0, 2.3126602172851562e-05, 5.385223404303008e+38, 5.385223404303008e+38], [2.1457672119140625e-06, 188.0, 2.288818359375e-05, 8.713474505173684e+38, 8.713474505173684e+38], [9.5367431640625e-07, 189.0, 2.2172927856445312e-05, 1.409869790947669e+39, 1.409869790947669e+39], [9.5367431640625e-07, 190.0, 2.288818359375e-05, 2.2812172414650375e+39, 2.2812172414650375e+39], [9.5367431640625e-07, 191.0, 2.4080276489257812e-05, 3.691087032412707e+39, 3.691087032412707e+39], [9.5367431640625e-07, 192.0, 2.4080276489257812e-05, 5.972304273877745e+39, 5.972304273877745e+39], [1.1920928955078125e-06, 193.0, 2.4080276489257812e-05, 9.66339130629045e+39, 9.66339130629045e+39], [9.5367431640625e-07, 194.0, 2.288818359375e-05, 1.5635695580168194e+40, 1.5635695580168194e+40], [9.5367431640625e-07, 195.0, 2.4080276489257812e-05, 2.5299086886458645e+40, 2.5299086886458645e+40], [9.5367431640625e-07, 196.0, 2.4080276489257812e-05, 4.093478246662684e+40, 4.093478246662684e+40], [9.5367431640625e-07, 197.0, 2.384185791015625e-05, 6.623386935308548e+40, 6.623386935308548e+40], [1.1920928955078125e-06, 198.0, 2.384185791015625e-05, 1.0716865181971233e+41, 1.0716865181971233e+41], [9.5367431640625e-07, 199.0, 2.384185791015625e-05, 1.734025211727978e+41, 1.734025211727978e+41]]}, "id": "7264402903888", "toolbar": ["reset", "move"], "height": 480.0});
}
if(typeof(mpld3) !== "undefined"){
// already loaded: just create the figure
create_fig72644029038885532319697();
}else if(typeof define === "function" && define.amd){
// require.js is available: use it to load d3/mpld3
require.config({paths: {d3: "/d3"}});
require(["d3"], function(d3){
window.d3 = d3;
mpld3_load_lib("/mpld3.js", create_fig72644029038885532319697);
});
}else{
// require.js not available: dynamically load d3 & mpld3
mpld3_load_lib("/d3.js", function(){
mpld3_load_lib("/mpld3.js", create_fig72644029038885532319697);})
}
</script></body></html>
<p><em>Figure 3. Run times for calculating a range of n (0-200) using the iterative (blue) and memoized recursive (red) approaches. Figure made with</em> <a class="reference external" href="http://mpld3.github.io/">mpld3</a>.</p>
<p>Whoa! As <a class="reference external" href="http://maryrosecook.com/">Mary Rose Cook</a> would say, "Now we're cookin' with gas." Run times for the memoized recursive method now outpace the iterative method! So, by memoizing your recursive function you can make your code run REALLY fast! The downside, however, is that you use extra memory by storing results in the hash table.</p>
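<p>For concreteness, here is a minimal sketch of the two approaches being timed, using the Fibonacci sequence as the example computation (function and variable names here are illustrative, not the notebook's actual code):</p>

```python
import time

def fib_iterative(n):
    """nth Fibonacci number, computed with a simple loop."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

_cache = {}  # the hash table that stores already-computed values

def fib_memoized(n):
    """nth Fibonacci number, computed recursively with memoization."""
    if n not in _cache:
        _cache[n] = n if n < 2 else fib_memoized(n - 1) + fib_memoized(n - 2)
    return _cache[n]

# Time both approaches for a single n
for fib in (fib_iterative, fib_memoized):
    start = time.time()
    result = fib(199)
    print(fib.__name__, time.time() - start, result)
```

<p>Each subproblem is computed once and then looked up, so the recursion does O(n) work instead of the exponential blow-up of the naive recursive version; the price is the memory held by <tt>_cache</tt>.</p>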
<p>Hope you liked my blog! 'Till next time!</p>
</div>
Projecting GPS velocity vectors onto a profile2014-03-19T14:56:00-04:00Gina Schmalzletag:geodesygina.com,2014-03-19:VectorProj.html<div class="section" id="global-positioning-systems">
<h2><strong>Global Positioning Systems</strong></h2>
<p><strong>Global Positioning Systems</strong> (<strong>GPS</strong>) are used to measure the three dimensional position of a point over time. High precision GPS is used to measure tectonic plate motion by tracking the position of a permanently installed geodetic monument over time. The GPS instruments are either permanently installed over the monument, continuously recording its position, or the GPS monuments are periodically measured. With either method, three dimensional position estimates are made over time.</p>
<p>This is an image of a high precision GPS antenna, whose image I took from the <a class="reference external" href="http://www.unavco.org/projects/major-projects/pbo/pbo.html">UNAVCO website</a>:</p>
<img alt="UNAVCO GPS antenna" class="align-right" src="/images/gps_site.jpg" style="width: 400.0px; height: 200.0px;" />
<p>And this is an example of the GPS site BEMT position 3D time series taken from UNAVCO:</p>
<img alt="UNAVCO GPS antenna" class="align-right" src="http://cws.unavco.org:8080/cws/modules/GPStimeseriesCA/versions/version2011may/BEMT_2011.png" style="width: 400.0px; height: 400.0px;" />
<p>Blue dots are daily position estimates in the north (top), east (middle) and vertical (bottom) components. This site experienced an offset due to an earthquake in 2010. For more information on how GPS works, please visit: <a class="reference external" href="http://www.unavco.org/edu_outreach/teachers/teachers.html">http://www.unavco.org/edu_outreach/teachers/teachers.html</a></p>
</div>
<div class="section" id="gps-velocities">
<h2><strong>GPS Velocities</strong></h2>
<p>The rate at which a geodetic monument moves can be estimated by taking the slope of the time series for each component. Take for example the time series shown above. This is going to be a rough estimate, but between 2004 and 2010 the monument's north component moved from -15 mm to +15 mm, totalling 30 mm of displacement over 6 yrs, indicating that it is moving north at 5 mm/yr.</p>
<img alt="North Component" class="align-right" src="/images/NGPS.png" style="width: 200.0px; height: 200.0px;" />
<p>Using the same logic, the east component moved about -50 mm over the same 6 years, giving the east component a rate of about 8.3 mm/yr to the west.</p>
<img alt="East Component" class="align-right" src="/images/EGPS.png" style="width: 200.0px; height: 200.0px;" />
<p>The magnitude and direction of the horizontal GPS velocity vector can now be calculated:</p>
<img alt="Horizontal Components" class="align-right" src="/images/ENGPS.png" style="width: 400.0px; height: 400.0px;" />
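<p>The back-of-the-envelope arithmetic above is easy to check in a few lines of Python (the displacements are the rough values read off the BEMT time series, so treat this as a sketch):</p>

```python
import math

# Rough displacements read off the BEMT time series (mm over ~6 yr)
v_north = 30.0 / 6.0   # +30 mm north over 6 yr  ->  5.0 mm/yr
v_east = -50.0 / 6.0   # -50 mm east over 6 yr   -> ~8.3 mm/yr to the west

magnitude = math.hypot(v_east, v_north)              # mm/yr
azimuth = math.degrees(math.atan2(v_east, v_north))  # degrees clockwise from north

print("magnitude: %.1f mm/yr, azimuth: %.1f deg" % (magnitude, azimuth))
```

<p>With these numbers the horizontal velocity comes out to roughly 9.7 mm/yr toward the northwest (a negative azimuth means west of north).</p>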
</div>
<div class="section" id="vector-projection">
<h2><strong>Vector Projection</strong></h2>
<p>In map view small variations in GPS velocities may be difficult to see, hence it is sometimes useful to plot GPS velocities along a profile. The profile line can follow a fault line, and, if it does, one can calculate the fault parallel and perpendicular components of motion. Fault parallel motion will give you an idea of lateral motion across the fault (as in strike-slip fault systems), and fault perpendicular motion will tell you if the two sides of the fault are separating or converging. In this section, we will talk about deriving the profile parallel and perpendicular components of the GPS vectors using <strong>vector projection</strong>.</p>
<img alt="Horizontal Components" class="align-right" src="/images/vector_projection.png" style="width: 400.0px; height: 200.0px;" />
<p>Here the fault perpendicular velocity is R<sub>perp</sub> = R sin(t),
and the fault parallel velocity is R<sub>par</sub> = R cos(t), where t is the angle between the velocity vector R and the profile line.</p>
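<p>A sketch of that projection in Python (the function name is illustrative; t is the angle between the GPS velocity vector and the profile line):</p>

```python
import math

def project_onto_profile(r, t_deg):
    """Split a velocity of magnitude r into profile-parallel and
    profile-perpendicular components, where t_deg is the angle (degrees)
    between the velocity vector and the profile line."""
    t = math.radians(t_deg)
    return r * math.cos(t), r * math.sin(t)  # (R par, R perp)

# Example: a 10 mm/yr velocity oriented 30 degrees from the profile
r_par, r_perp = project_onto_profile(10.0, 30.0)
print("parallel: %.2f mm/yr, perpendicular: %.2f mm/yr" % (r_par, r_perp))
```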
</div>
<div class="section" id="the-vector-projector">
<h2><strong>The Vector Projector</strong></h2>
<p><strong>Stuart Sandine</strong>, <strong>Andrea Fey</strong>, <strong>Thomas Ballinger</strong> and I created a web app called <strong>The Vector Projector</strong> that calculates the magnitude and the transect parallel and transect perpendicular components of GPS velocities along a profile. In this app, you can choose from several GPS velocity fields, all calculated with respect to stable North America. You can set your profile width and filter the data by their uncertainties (i.e., data with uncertainties larger than the value you specify are not used). This beta version does not plot uncertainties, which we plan to change in the future. Give it a try! <a class="reference external" href="http://geodesygina.com/vectorprojector/vectorprojector.html">Go to the Vector Projector</a>.</p>
</div>
About Me2014-03-13T12:40:00-04:00Gina Schmalzletag:geodesygina.com,2014-03-13:MyBio.html<img alt="Gina Schmalzle" class="align-right" src="/images/Gina.jpg" style="width: 200.0px; height: 200.0px;" />
<div class="section" id="hello-i-m-gina">
<h2>Hello! I'm Gina.</h2>
<p>I am a geodesist who received her PhD from the <a class="reference external" href="http://www.rsmas.miami.edu/">University of Miami Rosenstiel School of Marine and Atmospheric Sciences</a> in Miami, FL. Since then I have been a postdoctoral scholar and research scientist at the <a class="reference external" href="http://www.washington.edu/">University of Washington</a> in Seattle, WA, and, along with Scott Baker and Batuhan Osmanoglu, started a geodetic services company called <a class="reference external" href="http://bostechnologies.com/">BOS Technologies LLC</a>. We specialize in high precision Global Positioning Systems (GPS) and Interferometric Synthetic Aperture Radar (InSAR). I am currently an Assistant Scientist at the <a class="reference external" href="http://www.miami.edu/">University of Miami</a> working remotely from Seattle, WA. I have an extensive background studying the tectonics of the west coast of the United States. I started my geophysics career studying the San Andreas fault, and I am currently studying the <a class="reference external" href="http://geodesygina.com/Cascadia.html">Cascadia Subduction Zone</a> and Southern California.</p>
<p>I love data. I love visualizing and analyzing it, and just getting my hands dirty with it. Recently I have been working on developing interactive websites for geophysical datasets. Many geophysical datasets are publicly available, but it is difficult for the general public to use and play with these data. Presented on this website are two interactive websites that I built with a little help from my friends at <a class="reference external" href="http://www.hackerschool.com">Hacker School, NYC</a>. The <a class="reference external" href="http://geodesygina.com/vectorprojector/vectorprojector.html">Vector Projector</a> is a nifty little tool we made that visualizes high precision GPS velocity fields that measure how much tectonic plates move over time. The <a class="reference external" href="http://geodesygina.com/JapanEarthquake/index.html">Japan Earthquake Movie</a> is an animation showing the locations of the main shock and aftershocks of the 2011 Japan earthquake alongside synchronized charts of magnitude versus time.</p>
<p>Thanks for reading my website! I'd love to hear your thoughts, so <a class="reference external" href="mailto:ginaschmalzle@gmail.com">keep in touch</a>!</p>
</div>
Setting up Custom Domain Names with Github Pages2014-03-11T13:40:00-04:00Gina Schmalzletag:geodesygina.com,2014-03-11:GeodesyGina.html<p>Hello World!</p>
<p>This is my first blog post generated by pelican and hosted by <strong>github</strong> with my snazzy new domain name <strong>geodesygina.com</strong>! That is pronounced:
GEE-ODD-ESS-Y-GEE-NA. Ha! I'm such a geek.</p>
<p>Anyway, thanks to Amy Hanlon who posted directions on how to set up a blog with <strong>Pelican</strong> at <a class="reference external" href="http://mathamy.com/">http://mathamy.com/</a>.</p>
<p>I purchased my <strong>domain name</strong> (<strong>geodesygina.com</strong>) with <strong>godaddy.com</strong>, for $12.19 for 2 years. Getting the <strong>domain name</strong> to point to <strong>github</strong> was a little tricky. <strong>Github</strong> posted a "Setting up a custom domain with Pages" site (<a class="reference external" href="https://help.github.com/articles/setting-up-a-custom-domain-with-pages">https://help.github.com/articles/setting-up-a-custom-domain-with-pages</a>) that goes through how to set up your <strong>domain name</strong> in your repository. This part is pretty clear until you need to set up your DNS. I can only speak for my experience at godaddy.com, but this is how you set up your DNS through them:</p>
<ol class="arabic simple">
<li>Log into your godaddy account.</li>
<li>Go to the My Account page and click on the green launch button to the right of where it says domains.</li>
<li>On the Domains page, click on the domain name. On the page it brings you to, click the tab that says DNS Zone File, then click Edit.</li>
<li>Go to the A(host) section. There should be one record with @ as the host. Github uses two different IP addresses, so you need two records. Edit the current one: keep @ as the host and replace the IP address with 192.30.252.153. Once that is in, click "Quick add" right underneath and, for the new record, enter @ as the host again and the IP address 192.30.252.154.</li>
</ol>
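<p>After step 4, the A(host) section of the zone file should end up with two records, one per Github IP address (sketched below in zone-file style; the exact layout of godaddy's editor may differ):</p>

```
@   A   192.30.252.153
@   A   192.30.252.154
```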
<p>That should be it! It may take a little time for your website to update.</p>
My Publications2014-03-10T02:45:00-04:00Gina Schmalzletag:geodesygina.com,2014-03-10:MyPubs.html<p><strong>Schmalzle, G. M.</strong>, McCaffrey, R., Creager, K., <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1002/2013GC005172/abstract">Central Cascadia Subduction Zone Creep</a>, Geochemistry, Geophysics, Geosystems, doi: 10.1002/2013GC005172, 2014.</p>
<p>Karimzadeh, S., Cakir, Z., Osmanoglu, B., <strong>Schmalzle, G. M.</strong>, Miyajima, M., Amiraslanzadeh, R., and Djamour, Y., <a class="reference external" href="https://www.researchgate.net/publication/235926502_Interseismic_strain_accumulation_across_the_North_Tabriz_Fault_(NW_Iran)_deduced_from_InSAR_time_series">Interseismic strain accumulation across the North Tabriz Fault (NW Iran) deduced from InSAR time series</a>, Journal of Geodynamics, 66, doi: 10.1016/j.jog.2013.02.003, 2013.</p>
<p>Gourmelen, N., Dixon, T.H., Amelung, F., <strong>Schmalzle, G. M.</strong>, <a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S0012821X10007119">Acceleration and Evolution of Faults: An Example from the Hunter Mountain-Panamint Valley Fault Zone, Eastern California</a>, Earth and Planetary Science Letters, 10.1016/j.epsl.2010.11.016, 2010.</p>
<p>Fulton, P., <strong>Schmalzle, G. M.</strong>, Harris, R., Dixon, T. H., <a class="reference external" href="http://www.sciencedirect.com/science/article/pii/S0012821X10006679">Reconciling patterns of interseismic strain accumulation with thermal observations across the Carrizo segment of the San Andreas Fault</a>, Earth and Planetary Science Letters, 10.1016/j.epsl.2010.10.024, 2010.</p>
<p><strong>Schmalzle, G. M.</strong>, <a class="reference external" href="http://scholarlyrepository.miami.edu/oa_dissertations/177/">The Earthquake Cycle of Strike-Slip Faults</a>, PhD Thesis, University of Miami, Miami, FL, 211 pp., 2008.</p>
<p>Biggs, J., Burgmann, R., Freymueller, J., Lu, Z., Parsons, B., Ryder, I., <strong>Schmalzle, G. M.</strong>, Wright, T., <a class="reference external" href="http://gji.oxfordjournals.org/content/176/2/353.abstract?sid=034a1429-fe9e-464a-8593-2616ee43ac4a">The postseismic response to the 2002 M7.9 Denali Fault earthquake: constraints from InSAR</a>, Geophysical Journal International, 175 (3), 10.1111/j.1365-246X.2008.03932.x, 2008.</p>
<p><strong>Schmalzle, G. M.</strong>, Dixon, T.H., Malservisi, R., Govers, R., <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1029/2005JB003843/full">Strain accumulation across the Carrizo segment of the San Andreas Fault, California: Impact of laterally varying crustal properties</a>, Journal of Geophysical Research, B, Solid Earth and Planets, 111, doi:10.1029/2005JB003843, 2006.</p>
<p>Sabburg, J., Kimlin, M. G., Rives, J. E., Meltzer, R. S., Taylor, T. E., <strong>Schmalzle, G. M.</strong>, Zheng, S., Huang, N., Wilson, A. R., Udelhofen, P. M., <a class="reference external" href="http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=892841">Comparisons of corrected daily integrated erythemal UVR from the U.S. EPA/UGA network of Brewer spectroradiometers with model and satellite data</a>, Proceedings of SPIE, 4482, doi:10.1117/12.452955, 2002.</p>
<p>Sabburg, J., Rives, J. E., Meltzer, R. S., Taylor, T. E., <strong>Schmalzle, G. M.</strong>, Zheng, S., Huang, N., Wilson, A. R., Udelhofen, P. M., <a class="reference external" href="http://onlinelibrary.wiley.com/doi/10.1029/2001JD001565/abstract">Comparisons of corrected daily integrated erythemal UVR data from the U.S. EPA/UGA network of Brewer spectroradiometers with model and TOMS-inferred data</a>, Journal of Geophysical Research, 107, doi:10.1029/2001JD001565, 2002.</p>
<p>REFERENCE MANUALS</p>
<p><strong>Schmalzle, G. M.</strong> (2005), <a class="reference external" href="/papers/Survival_Guide_Schmalzle.pdf">Survival Guide for the Geodesy Lab</a>, edited, p. 30, University of Miami, Rosenstiel School of Marine and Atmospheric Sciences Miami, FL.</p>
<p>Thomas, T., and <strong>Schmalzle, G. M.</strong> (2000), <a class="reference external" href="http://www.esrl.noaa.gov/gmd/grad/neubrew/docs/uga/Site_Operator_Procedure34100.pdf">The Site Operator's Standard Operating Procedure for the Brewer Spectrophotometer</a>, edited, NOAA.</p>