Visualizing Geospatial Data in Python Using Folium
Posted by Aly Sivji in Data Analysis
Summary
- Use Python to display location data on an interactive map
- Cluster points in a heatmap to visualize usage
One of my favorite things about the Python programming language is that I can always import a library to do the heavy lifting and focus my attention on application logic. It's a testament to the community that so many people are willing to put in the time to create a package and make it easier for other people to get started.
As of this writing, there are around 100,000 packages on PyPI. In this series, which I call "Exploring PyPI", we will highlight some of the best packages and hidden gems of PyPI.
This post will focus on Folium, the Python interface to the Leaflet JavaScript mapping library. I will explore some of the features of Folium by analyzing data shared by the City of Chicago's bike share system, Divvy.
Divvy Bikes came to Chicago in 2013 and celebrated their ten millionth trip in early 2017. There are stations located all over the city, and you can get a one-day pass for $10 or a yearly membership for $100. Each time you check out a bike, you have 30 minutes to return it to any station before late fees apply. You can check out as many 30-minute blocks as your pass permits.
What You Need to Follow Along
Development Tools (Stack)
- Python 3
- PyData stack (Pandas, matplotlib, seaborn)
- Folium
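If you need to install any of these, they are all available from PyPI (this assumes a pip-based setup; adjust accordingly if you use conda):
!pip install folium pandas matplotlib seaborn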
Code
Setting Up Environment
import folium
from folium import plugins
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Downloading Data
!wget "https://s3.amazonaws.com/divvy-data/tripdata/Divvy_Trips_2016_Q3Q4.zip"
!mkdir data
!mv Divvy_Trips_2016_Q3Q4.zip data/Divvy_Trips_2016_Q3Q4.zip
!unzip data/Divvy_Trips_2016_Q3Q4.zip -d data/Divvy_Trips_2016_Q3Q4
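If wget and unzip aren't available on your machine, a standard-library sketch (same URL and paths as above) does the same job:
import os
import urllib.request
import zipfile

url = 'https://s3.amazonaws.com/divvy-data/tripdata/Divvy_Trips_2016_Q3Q4.zip'
zip_path = 'data/Divvy_Trips_2016_Q3Q4.zip'

# download the archive into data/ and extract it
os.makedirs('data', exist_ok=True)
urllib.request.urlretrieve(url, zip_path)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall('data/Divvy_Trips_2016_Q3Q4')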
The directory structure should look like this:
.
├── 1-create-divvy-station-heatmap.ipynb
├── README.md
└── data
├── Divvy_Trips_2016_Q3Q4
│ ├── Divvy_Stations_2016_Q3.csv
│ ├── Divvy_Stations_2016_Q4.csv
│ ├── Divvy_Trips_2016_Q3.csv
│ ├── Divvy_Trips_2016_Q4.csv
│ ├── README_2016_Q3.txt
│ └── README_2016_Q4.txt
└── Divvy_Trips_2016_Q3Q4.zip
Loading Data into Pandas
divvyStations_q3 = pd.read_csv('data/Divvy_Trips_2016_Q3Q4/Divvy_Stations_2016_Q3.csv')
divvyStations_q4 = pd.read_csv('data/Divvy_Trips_2016_Q3Q4/Divvy_Stations_2016_Q4.csv')
# combine and keep the first instance of id
divvyStations = pd.concat([divvyStations_q3, divvyStations_q4], axis=0).drop_duplicates(subset=['id'])
divvyStations.head()
|   | id | name | latitude | longitude | dpcapacity | online_date |
|---|---|---|---|---|---|---|
| 0 | 456 | 2112 W Peterson Ave | 41.991178 | -87.683593 | 15 | 5/12/2015 |
| 1 | 101 | 63rd St Beach | 41.781016 | -87.576120 | 23 | 4/20/2015 |
| 2 | 109 | 900 W Harrison St | 41.874675 | -87.650019 | 19 | 8/6/2013 |
| 3 | 21 | Aberdeen St & Jackson Blvd | 41.877726 | -87.654787 | 15 | 6/21/2013 |
| 4 | 80 | Aberdeen St & Monroe St | 41.880420 | -87.655599 | 19 | 6/26/2013 |
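A quick sanity check confirms that the concatenation plus drop_duplicates left us with exactly one row per station id:
# every station id should appear exactly once after drop_duplicates
assert divvyStations['id'].is_unique
len(divvyStations)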
Using Folium to Visualize Station Data
m = folium.Map([41.8781, -87.6298], zoom_start=11)
m
# mark each station as a point
for index, row in divvyStations.iterrows():
folium.CircleMarker([row['latitude'], row['longitude']],
radius=15,
popup=row['name'],
fill_color="#3db7e4", # divvy color
).add_to(m)
# convert to (n, 2) nd-array format for heatmap
stationArr = divvyStations[['latitude', 'longitude']].values
# plot heatmap
m.add_child(plugins.HeatMap(stationArr, radius=15))
m
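If you want to share the result outside of a notebook, the map object can be written out as a standalone HTML file (the filename here is just an example):
# save the interactive map as a self-contained HTML page
m.save('divvy_stations_heatmap.html')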
Load Divvy Activity Files and Analyze Same Station Rides
The Divvy activity files are split by quarter. We want to concatenate the results of multiple calls to pd.read_csv(), so let's be Pythonic and use a generator expression. I adapted a solution I found on Stack Overflow. (Praise Be)
divvyTripFiles = ['data/Divvy_Trips_2016_Q3Q4/Divvy_Trips_2016_Q3.csv',
'data/Divvy_Trips_2016_Q3Q4/Divvy_Trips_2016_Q4.csv']
divvyTrips = (pd.read_csv(f) for f in divvyTripFiles)
divvyTripsCombined = pd.concat(divvyTrips, ignore_index=True)
divvyTripsCombined.describe(include='all')
|   | trip_id | starttime | stoptime | bikeid | tripduration | from_station_id | from_station_name | to_station_id | to_station_name | usertype | gender | birthyear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.125643e+06 | 2125643 | 2125643 | 2.125643e+06 | 2.125643e+06 | 2.125643e+06 | 2125643 | 2.125643e+06 | 2125643 | 2125643 | 1590189 | 1.590343e+06 |
| unique | NaN | 1846706 | 1779098 | NaN | NaN | NaN | 584 | NaN | 584 | 3 | 2 | NaN |
| top | NaN | 8/28/2016 13:34:22 | 7/23/2016 12:54:37 | NaN | NaN | NaN | Streeter Dr & Grand Ave | NaN | Streeter Dr & Grand Ave | Subscriber | Male | NaN |
| freq | NaN | 7 | 17 | NaN | NaN | NaN | 59247 | NaN | 64948 | 1589954 | 1180122 | NaN |
| mean | 1.169926e+07 | NaN | NaN | 3.251980e+03 | 1.008547e+03 | 1.799161e+02 | NaN | 1.803518e+02 | NaN | NaN | NaN | 1.980787e+03 |
| std | 7.313875e+05 | NaN | NaN | 1.730437e+03 | 1.816103e+03 | 1.305240e+02 | NaN | 1.304878e+02 | NaN | NaN | NaN | 1.075399e+01 |
| min | 1.042666e+07 | NaN | NaN | 1.000000e+00 | 6.000000e+01 | 2.000000e+00 | NaN | 2.000000e+00 | NaN | NaN | NaN | 1.899000e+03 |
| 25% | 1.106635e+07 | NaN | NaN | 1.755000e+03 | 4.160000e+02 | 7.500000e+01 | NaN | 7.500000e+01 | NaN | NaN | NaN | 1.975000e+03 |
| 50% | 1.170150e+07 | NaN | NaN | 3.446000e+03 | 7.160000e+02 | 1.570000e+02 | NaN | 1.570000e+02 | NaN | NaN | NaN | 1.984000e+03 |
| 75% | 1.233129e+07 | NaN | NaN | 4.802000e+03 | 1.195000e+03 | 2.680000e+02 | NaN | 2.720000e+02 | NaN | NaN | NaN | 1.989000e+03 |
| max | 1.297923e+07 | NaN | NaN | 5.920000e+03 | 8.636500e+04 | 6.200000e+02 | NaN | 6.200000e+02 | NaN | NaN | NaN | 2.000000e+03 |
# Three usertypes? This wasn't mentioned in the readme. Let's investigate
divvyTripsCombined.groupby('usertype')['trip_id'].agg(len)
usertype
Customer 535649
Dependent 40
Subscriber 1589954
Name: trip_id, dtype: int64
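As an aside, value_counts gives the same breakdown in a single call:
# equivalent, slightly more idiomatic count by usertype
divvyTripsCombined['usertype'].value_counts()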
# There is a 'Dependent' usertype with only 40 entries; let's drop it
divvyTripsCombined = divvyTripsCombined[divvyTripsCombined.usertype != 'Dependent']
# How many trips went from Station A all the way back to Station A?
totalTrips = len(divvyTripsCombined)
# take an explicit copy so we can add columns later without a SettingWithCopyWarning
sameStationTrips = divvyTripsCombined[divvyTripsCombined.from_station_id == divvyTripsCombined.to_station_id].copy()
print('Total number of trips: {:,}'.format(totalTrips))
print('Trips from A -> A: {:,}'.format(len(sameStationTrips)))
print('Trips from A -> A: {:.1%}'.format(len(sameStationTrips) / totalTrips))
Total number of trips: 2,125,603
Trips from A -> A: 73,730
Trips from A -> A: 3.5%
# % of same station trips by usertype
num_trips_by_usertype = divvyTripsCombined.groupby('usertype')['trip_id'].agg(len)
num_same_station_trips_by_usertype = sameStationTrips.groupby('usertype')['trip_id'].agg(len)
num_same_station_trips_by_usertype / num_trips_by_usertype
usertype
Customer 0.090636
Subscriber 0.015838
Name: trip_id, dtype: float64
This seems like a pretty high percentage of rides to me. Let's look into this a bit more:
- How long are these trips? Accidental checkouts?
- Who is taking them?
- Where are they taking them to?
- Tourists who don't know where the next station is? Or subscribers running errands around the neighborhood?
- Can we draw conclusions?
sameStationTrips['tripInMinutes'] = sameStationTrips['tripduration'] / 60
sameStationTrips.describe()
|   | trip_id | bikeid | tripduration | from_station_id | to_station_id | birthyear | tripInMinutes |
|---|---|---|---|---|---|---|---|
| count | 7.373000e+04 | 73730.000000 | 73730.000000 | 73730.000000 | 73730.000000 | 25204.000000 | 73730.000000 |
| mean | 1.151064e+07 | 3219.399769 | 2360.523261 | 192.533473 | 192.533473 | 1980.622798 | 39.342054 |
| std | 6.886428e+05 | 1731.021515 | 4078.805054 | 153.248962 | 153.248962 | 11.690790 | 67.980084 |
| min | 1.042666e+07 | 1.000000 | 60.000000 | 2.000000 | 2.000000 | 1900.000000 | 1.000000 |
| 25% | 1.091862e+07 | 1716.000000 | 504.000000 | 69.000000 | 69.000000 | 1974.000000 | 8.400000 |
| 50% | 1.144692e+07 | 3395.000000 | 1383.000000 | 164.000000 | 164.000000 | 1985.000000 | 23.050000 |
| 75% | 1.205419e+07 | 4764.000000 | 2888.000000 | 291.000000 | 291.000000 | 1989.000000 | 48.133333 |
| max | 1.297922e+07 | 5919.000000 | 86365.000000 | 620.000000 | 620.000000 | 2000.000000 | 1439.416667 |
# check out the trip duration distribution (in minutes)
tripDuration = sameStationTrips['tripInMinutes']
tripDuration[tripDuration < tripDuration.quantile(.95)].plot(kind='hist')
# distribution grouped by subscriber type
sameStationTrips[tripDuration < tripDuration.quantile(.95)] \
.groupby('usertype')['tripInMinutes'].plot(kind='hist', stacked=True, legend=True)
# Divvy charges more for rides longer than 30 minutes, so let's look at the
# distribution of usage for rides under 30 minutes
sameStationTrips[sameStationTrips['tripInMinutes'] < 30] \
.groupby('usertype')['tripInMinutes'] \
.plot(kind='hist', stacked=True, legend=True, bins=20)
There seem to be a lot of same-station trips under 3 minutes. A three-minute station-to-station trip probably implies a mistaken bicycle checkout. Let's go ahead and exclude rides shorter than 3 minutes and see how the distribution looks.
old_count = len(sameStationTrips)
sameStationTrips = sameStationTrips[sameStationTrips.tripInMinutes > 3]
new_count = len(sameStationTrips)
print('Num records dropped: {}'.format(old_count - new_count))
Num records dropped: 9398
# examine the <30 minute ride distribution again
sameStationTrips[sameStationTrips['tripInMinutes'] < 30] \
.groupby('usertype')['tripInMinutes'] \
.plot(kind='hist', stacked=True, legend=True, bins=20)
# According to the Folium docs, we can pass the HeatMap plugin an (n, 3) ndarray
# with weights in the third column
sameStationCountsByStationID = sameStationTrips.groupby('from_station_id')['trip_id'].agg('count')
sameStationCountsByStationID.name = 'numTrips'
sameStationCountsDF = pd.merge(left=divvyStations, right=sameStationCountsByStationID.to_frame(), how='outer',
left_on='id', right_index=True)
sameStationCountsDF['numTrips'] = sameStationCountsDF['numTrips'].fillna(value=0)
sameStationCountsDF.head(2)
|   | id | name | latitude | longitude | dpcapacity | online_date | numTrips |
|---|---|---|---|---|---|---|---|
| 0 | 456 | 2112 W Peterson Ave | 41.991178 | -87.683593 | 15 | 5/12/2015 | 6.0 |
| 1 | 101 | 63rd St Beach | 41.781016 | -87.576120 | 23 | 4/20/2015 | 137.0 |
Write a Helper Function to Simplify Function Calls
We want to map the data we put together; even though Folium makes this relatively easy, we still spend a lot of time wrangling data.
In his book Effective Python, Brett Slatkin makes a case for creating functions with default keyword arguments specified in the definition:
The first advantage is that keyword arguments make the function call clearer to new readers of the code. With the call remainder(20, 7), it’s not evident which argument is the number and which is the divisor without looking at the implementation of the remainder method. In the call with keyword arguments, number=20 and divisor=7 make it immediately obvious which parameter is being used for each purpose.
The second impact of keyword arguments is that they can have default values specified in the function definition. This allows a function to provide additional capabilities when you need them but lets you accept the default behavior most of the time. This can eliminate repetitive code and reduce noise.
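To make that concrete, here is a minimal sketch of the remainder example from the quote:
def remainder(number, divisor=7):
    """Return number modulo divisor."""
    return number % divisor

remainder(20, 7)                 # positional: which argument is the divisor?
remainder(number=20, divisor=7)  # keyword: intent is obvious at the call site
remainder(20)                    # default value for divisor kicks in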
We will create a helper function with default keyword arguments to abstract away Folium's complexity. This leaves us with a simple API we can use going forward.
def map_points(df, lat_col='latitude', lon_col='longitude', zoom_start=11,
               plot_points=False, pt_radius=15,
               draw_heatmap=False, heat_map_weights_col=None,
               heat_map_weights_normalize=True, heat_map_radius=15):
    """Creates a map given a dataframe of points. Can also produce a heatmap overlay.

    Args:
        df: dataframe containing points to map
        lat_col: Column containing latitude (string)
        lon_col: Column containing longitude (string)
        zoom_start: Integer representing the initial zoom of the map
        plot_points: Add points to map (boolean)
        pt_radius: Size of each point
        draw_heatmap: Add heatmap to map (boolean)
        heat_map_weights_col: Column containing heatmap weights
        heat_map_weights_normalize: Normalize heatmap weights (boolean)
        heat_map_radius: Size of heatmap point

    Returns:
        folium map object
    """

    # center the map on the median of the points
    middle_lat = df[lat_col].median()
    middle_lon = df[lon_col].median()

    curr_map = folium.Map(location=[middle_lat, middle_lon],
                          zoom_start=zoom_start)

    # add points to map
    if plot_points:
        for _, row in df.iterrows():
            folium.CircleMarker([row[lat_col], row[lon_col]],
                                radius=pt_radius,
                                popup=row['name'],
                                fill_color="#3db7e4", # divvy color
                               ).add_to(curr_map)

    # add heatmap
    if draw_heatmap:
        # convert to (n, 2) or (n, 3) matrix format
        if heat_map_weights_col is None:
            cols_to_pull = [lat_col, lon_col]
        else:
            # if we have to normalize (note: rescales the weight column in place)
            if heat_map_weights_normalize:
                df[heat_map_weights_col] = \
                    df[heat_map_weights_col] / df[heat_map_weights_col].sum()

            cols_to_pull = [lat_col, lon_col, heat_map_weights_col]

        stations = df[cols_to_pull].values
        curr_map.add_child(plugins.HeatMap(stations, radius=heat_map_radius))

    return curr_map
Let's test it out:
map_points(sameStationCountsDF, plot_points=True, draw_heatmap=True, heat_map_weights_normalize=True,
           heat_map_weights_col='numTrips')
Looks good!
Subscriber vs Customer Heatmaps
sameStationCountsByUserTypeStationID = \
    sameStationTrips[['from_station_id', 'usertype', 'trip_id']] \
    .groupby(['usertype', 'from_station_id']) \
    .agg('count')
selected = 'Customer'
sameStationCountsDF = pd.merge(left=divvyStations, right=sameStationCountsByUserTypeStationID.loc[selected],
how='outer', left_on='id', right_index=True)
sameStationCountsDF['numTrips'] = sameStationCountsDF['trip_id'].fillna(value=0)
map_points(sameStationCountsDF, plot_points=False, draw_heatmap=True,
heat_map_weights_normalize=False, heat_map_weights_col='numTrips', heat_map_radius=9)
selected = 'Subscriber'
sameStationCountsDF = pd.merge(left=divvyStations, right=sameStationCountsByUserTypeStationID.loc[selected],
how='outer', left_on='id', right_index=True)
sameStationCountsDF['numTrips'] = sameStationCountsDF['trip_id'].fillna(value=0)
map_points(sameStationCountsDF, plot_points=False, draw_heatmap=True,
heat_map_weights_normalize=False, heat_map_weights_col='numTrips', heat_map_radius=9)
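Since the two cells above differ only in the selected usertype, the same comparison can be collapsed into a short loop (a sketch; usertype_maps and countsDF are just convenient names for holding the intermediate results):
# build one heatmap per usertype without duplicating the merge code
usertype_maps = {}
for selected in ['Customer', 'Subscriber']:
    countsDF = pd.merge(left=divvyStations,
                        right=sameStationCountsByUserTypeStationID.loc[selected],
                        how='outer', left_on='id', right_index=True)
    countsDF['numTrips'] = countsDF['trip_id'].fillna(value=0)
    usertype_maps[selected] = map_points(countsDF, plot_points=False, draw_heatmap=True,
                                         heat_map_weights_normalize=False,
                                         heat_map_weights_col='numTrips', heat_map_radius=9)

usertype_maps['Customer']  # display either map in the notebook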
Conclusion
From the maps above, we can see that day pass customers tend to use stations around Lake Michigan and The Loop (Chicago's downtown). This matches up with what I was expecting: that tourists get on a bike to experience Chicago in a different way before continuing with other parts of their trip.
The annual members, on the other hand, have a usage map that is spread out a bit more into the neighborhoods. This could suggest they are running errands before returning their bike to the station where they picked it up. I've used Divvy in this manner to pick up takeout from a restaurant beyond walking distance.
Next Steps
In a future post, I will demonstrate how to create choropleth maps to quickly visualize state/county data.