Siv Scripts

Solving Problems Using Code

Sat 23 December 2017

Fastest Way to Load Data from MongoDB into Pandas

Posted by Aly Sivji in Data Analysis   

In this post, I will describe how to use BSON-NumPy to pull data out of Mongo and into pandas. While this library is still in the prototype stage, it's hard to to ignore the 10x speed improvement that comes from reading BSON documents directly into NumPy.

For this example, I will be analyzing a collection of my tweets stored in a MongoDB instance running on MLab.

Note: BSON-Numpy is under active development; API is subject to change.


What You Need to Follow Along

Development Tools (Stack)

Code


Setting up Environment

In [1]:
# standard library
from collections import namedtuple
import os

# pydata
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# bson-numpy libraries
import numpy as np
from bson import CodecOptions
from bson.raw_bson import RawBSONDocument
from pymongo import MongoClient
import bsonnumpy

# other libraries
import maya
In [2]:
# get mongodb params from environment
mlab_uri = os.getenv('MLAB_URI')
mlab_collection = os.getenv('MLAB_COLLECTION')

Overview of Data & Schema

Twitter is a microblogging platform that allows users to post content, aka tweets, that are limited to 280 characters. There are two types of tweets: tweets you write ("original tweets") and tweets that other people write & you repost ("retweets").

Every 12 hours, an AWS Lambda function goes out and downloads my latest tweets and stores them in a Mongo collection. The script leverages the Tweepy API wrapper, which does introduce some quirks to the process.

Let's start off by looking at the data we have inside Mongo.

Original Tweets

Looks like we can capture the full text of the tweet from the appropriately named full_text field:

Schema for original tweet

Retweets

The full_text field is cut off, but it appears that the nested retweeted_status.full_text field has everything we are looking for.

Schema for rewtweet

There is currently no way of handling varied documents using BSON-NumPy; we'll have to pull our data separately and combine it with pandas.

We can use the $exists operator in our query as follows:

  1. .find({"retweeted_status": {"$exists": False}})
  2. .find({"retweeted_status": {"$exists": True}})

Downloading Data

Set up the connection to Mongo:

In [3]:
codec_options=CodecOptions(document_class=RawBSONDocument)
client = MongoClient(mlab_uri)
db = client.get_default_database()
collection = db.get_collection(
    mlab_collection,
    codec_options=codec_options
)

We'll have to define a np.dtype variable which specifies:

  • the fields to extract from Mongo
  • the NumPy types to convert to

This variable has to match the structure of the MongoDB document. If we want to pull a nested field, we will need to define a variable with a nested structure.

Original Tweets

In [4]:
dtype_original = np.dtype(
    [
        ('full_text', 'S400'),
        ('created_at', 'S64'),
    ]
)
In [5]:
# query Mongo (using a generator expression)
ndarray_original = bsonnumpy.sequence_to_ndarray(
    (doc.raw for doc in collection.find({"retweeted_status": {"$exists": False}})), 
    dtype_original, 
    collection.count(),
)
In [6]:
print(ndarray_original[:2])
[ (b'Still. API keys in ipython notebooks! Yikes', b'Sat Dec 16 21:23:32 +0000 2017')
 (b'Fixing up a project I worked on in April 2016. Good job me for documenting sorta well', b'Sat Dec 16 21:22:47 +0000 2017')]
In [7]:
tweets_original = pd.DataFrame(ndarray_original)
In [8]:
# convert bytes to string!
tweets_original['full_text'] = tweets_original['full_text'].str.decode('utf-8')
tweets_original['created_at'] = tweets_original['created_at'].str.decode('utf-8')
In [9]:
tweets_original.head()
Out[9]:
full_text created_at
0 Still. API keys in ipython notebooks! Yikes Sat Dec 16 21:23:32 +0000 2017
1 Fixing up a project I worked on in April 2016.... Sat Dec 16 21:22:47 +0000 2017
2 https://t.co/6Ubtx6S1O8\n#bookmark Sat Dec 16 20:45:00 +0000 2017
3 VSCode November release \nhttps://t.co/U4nzPLE8QI Sat Dec 16 20:23:18 +0000 2017
4 @TooManyNickLees It does for this project I'm ... Sat Dec 16 20:08:26 +0000 2017

Retweets

In [10]:
dtype_rt = np.dtype(
    [
        ('retweeted_status', np.dtype([('full_text', 'S400')])),
        ('created_at', 'S64'),
    ]
)
In [11]:
ndarray_rt = bsonnumpy.sequence_to_ndarray(
    (doc.raw for doc in collection.find({"retweeted_status": {"$exists": True}})), 
    dtype_rt, 
    collection.count(),
)
In [12]:
tweets_rt = pd.DataFrame(ndarray_rt)
In [13]:
tweets_rt['full_text'] = tweets_rt['retweeted_status'].map(lambda value: value[0].decode('utf-8'))
tweets_rt['created_at'] = tweets_rt['created_at'].str.decode('utf-8')
In [14]:
tweets_rt.head()
Out[14]:
retweeted_status created_at full_text
0 (b'A list of things that might be useful when ... Sat Dec 16 20:29:52 +0000 2017 A list of things that might be useful when ans...
1 (b'OUT: import pdb; pdb.set_trace()\nIN: break... Sat Dec 16 20:08:52 +0000 2017 OUT: import pdb; pdb.set_trace()\nIN: breakpoi...
2 (b'Great news, Guido has pronounced dicts to r... Sat Dec 16 16:46:55 +0000 2017 Great news, Guido has pronounced dicts to reta...
3 (b'First ever PyCon happening in Pakistan http... Sat Dec 16 13:46:57 +0000 2017 First ever PyCon happening in Pakistan https:/...
4 (b'#python news: \xf0\x9f\x98\x80 @gvanrossum... Sat Dec 16 05:14:45 +0000 2017 #python news: 😀 @gvanrossum just pronounced t...

Combine Datasets

In [15]:
len(tweets_original) + len(tweets_rt)
Out[15]:
2611
In [16]:
tweets = pd.concat([tweets_rt, tweets_original])
In [17]:
len(tweets)
Out[17]:
2611
In [18]:
tweets['retweeted'] = tweets['retweeted_status'].notna()
tweets = tweets.drop(columns=['retweeted_status'])
In [19]:
tweets.describe(include='all')
Out[19]:
created_at full_text retweeted
count 2611 2611 2611
unique 2611 2604 2
top Mon May 22 01:46:23 +0000 2017 How I Git\n{author: @amention }\nhttps://t.co/... False
freq 1 2 1595

Add Additional Information to Data

  • Parse datatime and convert to correct timezone (assumption: most of my tweets were made around Chicago)
  • Add tweet character count
In [20]:
ConvertTZArgs = namedtuple("ConvertTZArgs", ["dt_col", "to_timezone"])

def convert_timezone(row, *args):
    # get datetime
    dt_col = args[0].dt_col
    to_timezone = args[0].to_timezone
    dt = row[dt_col]
    
    dt = maya.parse(dt).datetime(to_timezone=to_timezone)
    return dt
In [21]:
tweets['created_at'] = tweets.apply(
    convert_timezone,
    axis=1, 
    args=(ConvertTZArgs('created_at', 'US/Central'),)
)
In [22]:
tweets['chars'] = tweets['full_text'].str.len()
In [23]:
tweets.head()
Out[23]:
created_at full_text retweeted chars
0 2017-12-16 14:29:52-06:00 A list of things that might be useful when ans... True 166
1 2017-12-16 14:08:52-06:00 OUT: import pdb; pdb.set_trace()\nIN: breakpoi... True 90
2 2017-12-16 10:46:55-06:00 Great news, Guido has pronounced dicts to reta... True 246
3 2017-12-16 07:46:57-06:00 First ever PyCon happening in Pakistan https:/... True 62
4 2017-12-15 23:14:45-06:00 #python news: 😀 @gvanrossum just pronounced t... True 137

Explore and Analyze Data

In [24]:
tweets['retweeted'].value_counts()
Out[24]:
False    1595
True     1016
Name: retweeted, dtype: int64

Investigating Number of Characters

In [25]:
g = sns.FacetGrid(
    tweets,
    col="retweeted",
    margin_titles=True,
    size=6,
)

g.map(
    plt.hist,
    "chars",
    bins=20,
    color="steelblue",
    lw=0,
)
Out[25]:
<seaborn.axisgrid.FacetGrid at 0x107bdd5f8>
In [26]:
tweets[tweets['retweeted'] == 0]['chars'].describe()
Out[26]:
count    1595.000000
mean       93.228840
std        35.649305
min         2.000000
25%        67.000000
50%        95.000000
75%       122.000000
max       280.000000
Name: chars, dtype: float64
In [27]:
tweets[tweets['retweeted'] == 1]['chars'].describe()
Out[27]:
count    1016.000000
mean      112.819882
std        38.045216
min        19.000000
25%        85.000000
50%       115.000000
75%       137.000000
max       299.000000
Name: chars, dtype: float64

Examining Time Between Posts

In [28]:
# creating function
def time_between(times):
    sorted_times = times.sort_values()
    time_between = sorted_times - sorted_times.shift(1)
    return time_between
In [29]:
def to_seconds(time):
    return time / np.timedelta64(1, 's')

Original Tweets

In [30]:
between_my_posts = time_between(tweets[tweets['retweeted'] == 0]['created_at'])
In [31]:
between_my_posts.describe()
Out[31]:
count                      1594
mean     1 days 03:26:02.309912
std      2 days 10:13:51.045480
min             0 days 00:00:08
25%      0 days 01:50:48.500000
50%      0 days 08:17:50.500000
75%      1 days 00:33:25.500000
max            31 days 21:24:50
Name: created_at, dtype: object
In [32]:
to_seconds(between_my_posts).plot(
    kind='hist',
    range=(0, 100_000)
)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x107c63470>

All Tweets (Original Tweets + Retweets)

In [33]:
between_all_posts = time_between(tweets['created_at'])
In [34]:
between_all_posts.describe()
Out[34]:
count                      2610
mean     0 days 16:45:16.904980
std      1 days 17:19:51.559515
min             0 days 00:00:03
25%      0 days 00:43:58.750000
50%             0 days 03:09:13
75%             0 days 13:47:39
max            26 days 03:23:07
Name: created_at, dtype: object
In [35]:
to_seconds(between_all_posts).plot(
    kind='hist',
    range=(0, 100_000)
)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x108277390>

Twitter Activity Heatmap

What time of the week am I most activate on Twitter? We'll use seaborn.heatmap to explore this in more detail.

In [36]:
tweets['hour'] = tweets['created_at'].dt.hour
tweets['day_of_week'] = tweets['created_at'].dt.dayofweek
days_of_week = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

All Tweets (Original Tweets + Retweets)

In [37]:
tweets_timeline = tweets.groupby(by=['hour', 'day_of_week'])['full_text'].count()
In [38]:
plt.figure(figsize=(15, 15))
ax = (sns.heatmap(tweets_timeline.unstack(),
                  cmap='Blues',
                  xticklabels=days_of_week))

for label in (ax.get_xticklabels() + ax.get_yticklabels()):
    label.set_fontsize(20)

Original Tweets

In [39]:
tweets_timeline = tweets[tweets['retweeted'] == 0].groupby(by=['hour', 'day_of_week'])['full_text'].count()
In [40]:
plt.figure(figsize=(15, 15))
ax = (sns.heatmap(tweets_timeline.unstack(),
                  cmap='Blues',
                  xticklabels=days_of_week))

for label in (ax.get_xticklabels() + ax.get_yticklabels()):
    label.set_fontsize(20)

Retweets

In [41]:
tweets_timeline = tweets[tweets['retweeted'] == 1].groupby(by=['hour', 'day_of_week'])['full_text'].count()
In [42]:
plt.figure(figsize=(15, 15))
ax = (sns.heatmap(tweets_timeline.unstack(),
                  cmap='Blues',
                  xticklabels=days_of_week))

for label in (ax.get_xticklabels() + ax.get_yticklabels()):
    label.set_fontsize(20)

Conclusion

I'm more active on Twitter in the mornings and if you want me to Retweet something, best to catch me on a Sunday.

BSON-NumPy is a library that allows us to convert BSON documents into Numpy arrays. While I haven't run any benchmarks comparing it to PyMongo, I much prefer the direct conversion versus having to cobble together a DataFrame from returned documents.

Looking forward to seeing what Mongo does with this library.



 
    
 
 

Comments