Siv Scripts

Solving Problems Using Code

Sun 02 April 2017

Generating HTML Pages from MongoDB with MongoEngine and Jinja2 (Flask Part 1)

Posted by Aly Sivji in Tutorials   

(Note: This post is part of my reddit-scraper series)

Summary

  • Overview of MongoDB
  • Discussion of Object-Relational Mapping (ORM)
  • Use MongoEngine to get items out of MongoDB
  • Render HTML pages using Jinja2
  • Interact with REST API to send emails with Requests

Previously on Siv Scripts, we implemented a web scraping pipeline to store Top Posts scraped from Reddit into a MongoDB collection. The information we collected will become useful once it's out of the database so let's explore different ways of getting and using the data.

The obvious solution is to use a Python web framework to create a website that displays posts from various subreddits and allows users to mark items they have already seen. Based on our needs, Flask is the best tool for the job. Creating a Flask-based site requires that we familiarize ourselves with Flask, its recommended design patterns, and the various extensions that enable us to create full-featured user experiences.

This will take a few posts to cover in depth so let's start at the beginning and explore how to generate HTML pages from MongoDB documents using the MongoEngine ORM and Jinja2 Templating Engine. We will then leverage the Requests library to send emails using MailGun's REST API; this will provide us with a temporary workaround to view scraped data until our Flask website is complete.


What You Need to Follow Along

Development Tools (Stack)

Code


MongoDB

In this section we will explore MongoDB, discuss best practices, and examine how it fits into our project.

Overview

"MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling" (Mongo Docs).

What does this mean? In comparison to a relational database which focuses on linking data across tables with keys, a document database stores all the data elements together in one location.

Comparing Relational Databases to MongoDB

There are many reasons to use document-oriented databases over their relational counterparts. As our database will be the main data source for a variety of projects, it makes sense to use Mongo and take advantage of its flexible schemaless structure: our database can grow with the needs of our project.

Other reasons to use MongoDB:

Technical Details

From the Mongo Docs:

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents
MongoDB record information

MongoDB stores records as BSON (Binary-JSON) documents in collections and collections inside of databases (additional details).
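
To make the field-and-value structure concrete, here is a sketch of one scraped post written as a Python dict, which is roughly how a driver hands a document back to us. The field names mirror the schema used later in this post; the values, the `tags` array, and the nested `author` document are made up for illustration, and the `YYYY-MM-DD` format of `date_str` is an assumption.

```python
import json
from datetime import datetime

# Hypothetical document: field names mirror the collection used in this
# series, values are invented for illustration.
post = {
    "title": "Chrome 56 Will Aggressively Throttle Background Tabs",
    "url": "https://example.com/chrome-56",
    "commentsUrl": "https://example.com/comments/1",
    "sub": "programming",
    "score": 1234,
    "date": datetime(2017, 3, 9, 15, 14),
    "date_str": "2017-03-09",  # assumed format
    # Values can be arrays or nested documents, not just scalars.
    "tags": ["browser", "performance"],
    "author": {"name": "someone", "karma": 42},
}

# JSON has no date type, so datetimes are converted to ISO strings here.
# BSON, by contrast, stores dates natively.
as_json = json.dumps(post, default=lambda d: d.isoformat(), indent=2)
print(as_json)
```

The date handling is one place where BSON and JSON differ in practice: BSON adds types such as dates and binary data that plain JSON lacks.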

When we send a query into Mongo and no index supports it, Mongo performs a collection scan, i.e. it scans every document in the queried collection. This means that performance will become an issue as our data grows. To alleviate this, we can create indexes (single field, compound, multikey, or text) so that our common queries stay quick.
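
The difference between a collection scan and an index lookup can be sketched in plain Python. This is only an analogy: the list stands in for a collection and the dict for a single-field index on `sub`. MongoDB's real indexes are B-tree structures, and the sample data and helper names here are made up.

```python
from collections import defaultdict

# A toy "collection": a list of documents (made-up data).
collection = [
    {"sub": "python", "title": "Post A", "score": 50},
    {"sub": "programming", "title": "Post B", "score": 75},
    {"sub": "python", "title": "Post C", "score": 20},
]


def collection_scan(docs, sub):
    """Look at every document, keeping the ones that match."""
    return [doc for doc in docs if doc["sub"] == sub]


def build_index(docs, field):
    """Map each value of `field` to the positions of matching documents."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        index[doc[field]].append(position)
    return index


def index_lookup(docs, index, value):
    """Jump straight to matching documents without touching the rest."""
    return [docs[position] for position in index.get(value, [])]


sub_index = build_index(collection, "sub")
scan_result = collection_scan(collection, "python")
lookup_result = index_lookup(collection, sub_index, "python")
print(scan_result == lookup_result)
```

Both approaches return the same documents; the index simply avoids visiting the ones that cannot match, at the cost of extra storage and some work on every write.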

We're glossing over a lot of the details here since database design and optimization is a field unto itself. While we are setting up our database, we will implement the following index strategy recommendations (from the MongoDB documentation):

  • Create Indexes to Support Your Queries
  • Use Indexes to Sort Query Results
  • Ensure Indexes Fit in RAM
  • Create Queries that Ensure Selectivity

MongoDB in Our Project

Loading Data

If you did not follow along with the previous post, you can download the 20170309 data extract and import it into your instance of Mongo using the following command:

$ mongoimport --db sivji-sandbox --collection top_reddit_posts --type json --file 20170309-reddit-posts.json
2017-03-09T15:14:51.021-0700    connected to: localhost
2017-03-09T15:14:51.043-0700    imported 551 documents

Document Schema

Let's use MongoDB Compass and take a look at a sample document to understand the fields we can pull.

Sample Document

The date_str field is a string version of the date field. Having a string type versus an ISODate type will make our queries run faster.
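
One practical property of a string date is worth spelling out. Assuming the value is stored zero-padded in year-month-day order (the exact format of date_str is not shown here, so `YYYY-MM-DD` is an assumption), plain string comparison agrees with chronological order, so "latest day" and "on or after this day" need no date parsing. A minimal sketch with made-up dates:

```python
from datetime import date

# Assumed format: zero-padded YYYY-MM-DD (an assumption, not confirmed above).
date_strings = ["2017-03-09", "2017-02-28", "2016-12-31", "2017-03-01"]

# Sorting the strings gives the same order as sorting real dates.
by_string = sorted(date_strings)
by_date = sorted(date_strings, key=date.fromisoformat)
print(by_string == by_date)

# The most recent day is simply the largest string.
latest = max(date_strings)

# "On or after" becomes an ordinary string comparison.
recent = [d for d in date_strings if d >= "2017-03-01"]
print(latest, recent)
```

This property would not hold for formats such as `DD-MM-YYYY` or for dates without zero padding.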

Creating Indexes

Following the best practices mentioned above, we should take some time to think about the queries we will need to run. Taking a step back, we need to consider the kinds of data we will want our website to display.

Indexes to Support Views Required

Use MongoDB Compass to create the index. It should look as follows:

Creating an Index in MongoDB

Our database is all set up and optimized. In the next section we will explore how to get data out of Mongo and into our program.

Note
  • Adding score to the index will make our queries run faster, but the index will also take up disk space. This example is a bit trivial since it's a toy project, but we should get into the habit of thinking about tradeoffs.

Object-Relational Mapping (ORM)

Overview

Object-Relational Mapping (ORM) is a "technique that lets [us] query and manipulate data from a database using an object-oriented paradigm" (Source).

What does this mean? ORM libraries let us work with databases in the language of our choice. No more fumbling around with database connectors and SQL: we can treat objects in the database as objects in our program. More details can be found in this StackOverflow (Praise Be) discussion.

Like with all things in programming, there are people who consider ORMs to be anti-patterns. As long as we understand the limitations of using an ORM library (i.e. not a full replacement for querying languages), we can use them to get our projects off the ground quickly. As we scale up, we should revisit the use of an ORM.
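
The core idea can be sketched without any library: a class describes the shape of a record, and a small mapping step turns raw records into objects. This is a toy illustration, not how any real ORM is implemented; the `PostRecord` class, its `from_document` helper, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PostRecord:
    """Hypothetical object-side view of one stored post."""
    title: str
    sub: str
    score: int

    @classmethod
    def from_document(cls, document):
        # Keep only the fields the class knows about.
        return cls(
            title=document["title"],
            sub=document["sub"],
            score=document["score"],
        )


# Raw records as a driver might return them (made-up data).
raw_documents = [
    {"_id": 1, "title": "Post A", "sub": "python", "score": 50},
    {"_id": 2, "title": "Post B", "sub": "python", "score": 75},
]

posts = [PostRecord.from_document(doc) for doc in raw_documents]

# Attribute access and ordinary Python replace hand-written query strings.
top = max(posts, key=lambda post: post.score)
print(top.title)
```

A real library adds validation, query building, and connection handling on top of this mapping step, which is where both the convenience and the limitations come from.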

ORM Libraries in Python

For relational databases, SQLAlchemy reigns supreme. PonyORM uses generators and lambdas (Author's Note: (☞゚ヮ゚)☞) to write its queries.

Since MongoDB is a document database, Object-Relational Mapping becomes Document-Object Mapping (DOM). MongoEngine and MongoKit are two popular DOM libraries. In the next section, we will use the MongoEngine library to pull data into our program from our instance of MongoDB.


MongoEngine

Using the tutorial and User Guide as a template, let's create a class to specify our schema.

# top_post_emailer/data_model.py

from mongoengine.document import Document
from mongoengine.fields import DateTimeField, IntField, StringField, URLField


class Post(Document):
    ''' Class for defining structure of the top_reddit_posts collection
    '''
    url = URLField(required=True)
    date = DateTimeField(required=True)
    date_str = StringField(max_length=10, required=True)
    commentsUrl = URLField(required=True)
    sub = StringField(max_length=20, required=True) # subreddit can be 20 chars
    title = StringField(max_length=300, required=True) # title can be 300 chars
    score = IntField(required=True)

    meta = {
        'collection': 'top_reddit_posts', # collection name
        'ordering': ['-score'], # default ordering
        'auto_create_index': False, # MongoEngine will not create index
        }

Let's make sure this works by testing in the Python REPL.

$ python
Python 3.6.0 |Continuum Analytics, Inc.| (default, Dec 23 2016, 13:19:00)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from mongoengine.connection import connect
>>> from top_post_emailer.data_model import Post
>>> MONGO_URI = 'mongodb://localhost:27017'
>>> connect('sivji-sandbox', host=MONGO_URI)
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True, read_preference=Primary())
>>> Post.objects()[0].title
'Chrome 56 Will Aggressively Throttle Background Tabs'

Looks good!

Notes

Jinja2

Jinja2 is a "modern and designer-friendly templating language for Python" that is the default templating engine bundled with Flask (additional info can be found in the Jinja docs).

What does this mean? Jinja2 lets us create templates with programming logic (control structures, inheritance) that can be rendered into HTML as the template code is evaluated. This will allow us to build dynamic, database-driven websites using Python! Real Python has a great primer on Jinja Templating.
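
As a small taste of that programming logic, here is a minimal sketch that renders a template from a string using a loop and a condition. The post data and the `min_score` threshold are made up, and `jinja2.Template` is used directly only to keep the example short.

```python
from jinja2 import Template

# Template source with a loop and a condition (illustrative only).
template = Template(
    "<ul>\n"
    "{% for post in posts %}"
    "{% if post.score >= min_score %}"
    "<li>{{ post.title }} ({{ post.score }})</li>\n"
    "{% endif %}"
    "{% endfor %}"
    "</ul>"
)

# Made-up data; in this series it would come from MongoDB.
posts = [
    {"title": "Post A", "score": 50},
    {"title": "Post B", "score": 5},
    {"title": "Post C", "score": 75},
]

html = template.render(posts=posts, min_score=10)
print(html)
```

Only posts that pass the condition end up in the rendered list; the template decides what to show, while the Python side decides what data to pass in.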

Let's start with a basic template that will grab all the Top Posts from the last Reddit scrape.

<!-- top_post_emailer/template.html -->

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Reddit Top </title>
</head>
<body>
    <ul id="posts">
    {% for selected_sub in Post.objects(date_str__gte=day_to_pull).distinct('sub') %}
        <h3>{{ selected_sub }}</h3>

        {% for post in Post.objects(date_str__gte=day_to_pull, sub=selected_sub) %}
            <li><a href="{{ post.url }}">{{ post.title }}</a> (Score: {{ post.score }} | <a href="{{ post.commentsUrl }}">Comments</a>)</li>
        {% endfor %}

    {% endfor %}
    </ul>

</body>
</html>
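
The template above runs database queries while it renders. An alternative sketch, not the approach taken in this post, is to group the posts in Python first and hand the template plain data. The `group_by_sub` helper and the sample posts are hypothetical.

```python
from collections import defaultdict


def group_by_sub(posts):
    """Return {subreddit: [posts sorted by score, highest first]}."""
    grouped = defaultdict(list)
    for post in posts:
        grouped[post["sub"]].append(post)
    return {
        sub: sorted(items, key=lambda post: post["score"], reverse=True)
        for sub, items in sorted(grouped.items())
    }


# Made-up posts standing in for the result of a single database query.
posts = [
    {"sub": "python", "title": "Post A", "score": 50},
    {"sub": "programming", "title": "Post B", "score": 75},
    {"sub": "python", "title": "Post C", "score": 90},
]

posts_by_sub = group_by_sub(posts)
for sub, items in posts_by_sub.items():
    print(sub, [post["title"] for post in items])
```

With this shape, a template could loop over `posts_by_sub.items()` and would not need the `Post` class in its context at all.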

We will need a way to render our Jinja2 template. Let's create a function:

# top_post_emailer/render_template.py

import os
import jinja2

def render(filename, context):
    ''' Given jinja2 template, generate HTML
    Adapted from http://matthiaseisen.com/pp/patterns/p0198/

    Args:
        * filename - jinja2 template
        * context - dict of variables to pass in

    Returns:
        * rendered HTML from jinja2 templating engine
    '''
    path = os.path.dirname(os.path.abspath(__file__))
    return jinja2.Environment(
        loader=jinja2.FileSystemLoader(path or './')
    ).get_template(filename).render(context)

Let's go back to our Python REPL and test to see if this works:

>>> from top_post_emailer.render_template import render
>>> ## get the last date the webscraper was run
... for post in Post.objects().fields(date_str=1).order_by('-date_str').limit(1):
...     day_to_pull = post.date_str
...
>>> ## pass in variables, render template, and send
... context = {
...     'day_to_pull': day_to_pull,
...     'Post': Post,
... }
>>> print(render("template.html", context))
<!DOCTYPE html>
<html lang="en">
<head>
    <title>Reddit Top </title>
</head>
<body>
... additional rows omitted ...

Great! In the next section we will email ourselves this information.

Note
  • Once we start building out our Flask website, we can create complex Jinja templates with tables that have alternating row styles

Requests and MailGun API

Let's finish off this post by creating a minimum viable product (MVP): a program that emails us a list of all Top Posts scraped from our last Scrapy run. We will use the Requests library to interact with MailGun's REST API to send HTML emails.
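
To see what such an API call looks like before anything is sent, Requests can build a request without transmitting it. The sketch below uses placeholder values throughout (`example.com` as the domain, `key-placeholder` as the API key, made-up addresses); it only prepares the request and inspects it.

```python
import requests

# Placeholder values, not real credentials or addresses.
domain = "example.com"
api_key = "key-placeholder"

request = requests.Request(
    "POST",
    "https://api.mailgun.net/v3/{0}/messages".format(domain),
    auth=("api", api_key),  # HTTP Basic auth: username "api", key as password
    data={
        "from": "sender@example.com",
        "to": "recipient@example.com",
        "subject": "Reddit Top Post Digest",
        "html": "<p>Hello</p>",
    },
)

# prepare() builds headers and body but does not touch the network.
prepared = request.prepare()

print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])
print(prepared.headers["Authorization"].split()[0])
```

The form fields travel in the request body as URL-encoded data, and the API key is carried in the `Authorization` header rather than in the URL.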

What can we say about Kenneth Reitz's Requests: HTTP for Humans that hasn't already been said? Nothing. Check the links for more info.

After a little Python, we come up with the following:

# top_post_emailer/mailgun_emailer.py

import os
import configparser
import requests
from requests.exceptions import HTTPError

def send_email(html):
    '''Given HTML template, sends Reddit Top Post Digest email using MailGun's API

    Arg:
        html - HTML to send via email

    Returns:
        None
    '''
    ## api params (using configparser)
    config = configparser.ConfigParser()
    config.read(os.path.join(os.path.abspath(os.path.dirname(__file__)), 'settings.cfg'))
    key = config.get('MailGun', 'api')
    domain = config.get('MailGun', 'domain')

    ## set requests params
    request_url = 'https://api.mailgun.net/v3/{0}/messages'.format(domain)
    payload = {
        'from': 'alysivji@gmail.com',
        'to': 'alysivji@gmail.com',
        'subject': 'Reddit Top Post Digest',
        'html': html,
    }

    try:
        r = requests.post(request_url, auth=('api', key), data=payload)
        r.raise_for_status()
        print('Success!')
    except HTTPError as e:
        print('Error {}'.format(e.response.status_code))

# top_post_emailer/settings.cfg

[MailGun]
api = [Your MailGun API key here]
domain = [Your MailGun domain here]
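
For reference, here is a minimal sketch of how configparser reads a file shaped like this one. It parses from a string instead of a file so that it runs on its own; the values are placeholders, and the `region` option is hypothetical, included only to show the `fallback` argument.

```python
import configparser

# Same layout as settings.cfg, with placeholder values.
SETTINGS_TEXT = """
[MailGun]
api = key-placeholder
domain = example.com
"""

config = configparser.ConfigParser()
config.read_string(SETTINGS_TEXT)

key = config.get("MailGun", "api")
domain = config.get("MailGun", "domain")

# A missing option raises configparser.NoOptionError unless a fallback is given.
region = config.get("MailGun", "region", fallback="not set")

print(domain, region)
```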

Notes

Putting it all Together

Now that we have all the pieces in place, we can finally write our script to get data out of Mongo and into a Jinja2 template. We can then use MailGun's REST API to send emails.

# top_post_emailer/__init__.py

from mongoengine.connection import connect
from .data_model import Post
from .render_template import render
from .mailgun_emailer import send_email

def email_last_scraped_date():
    # connect to db
    MONGO_URI = 'mongodb://localhost:27017'
    connect('sivji-sandbox', host=MONGO_URI)

    ## get the last date the webscraper was run
    for post in Post.objects().fields(date_str=1).order_by('-date_str').limit(1):
        day_to_pull = post.date_str

    ## pass in variables, render template, and send
    context = {
        'day_to_pull': day_to_pull,
        'Post': Post,
    }
    html = render("template.html", context)
    send_email(html)

We have structured our app as a package and we need to create a script to run the application. First we will ensure that our directory structure looks as follows:

.
├── README.md
├── app.py
└── top_post_emailer
    ├── __init__.py
    ├── data_model.py
    ├── mailgun_emailer.py
    ├── render_template.py
    ├── settings.cfg
    └── template.html

Now let's create our script:

# app.py

"""Script to pull and email last Reddit scrape from MongoDB
"""

from top_post_emailer import email_last_scraped_date

if __name__ == '__main__':
    email_last_scraped_date()

Run the app from the terminal with the following command:

$ python app.py
Success!

Did we receive an email?
Email Digest in Inbox

YIPPEE!

Notes

Conclusion

In this post, we started to look into ways of getting data out of MongoDB and into the hands of our users. We decided to use the Flask framework and started getting our feet wet by generating HTML pages from items stored in our MongoDB instance using MongoEngine and Jinja2. Lastly, we wrote some code to email ourselves the HTML page we created using Requests and the MailGun REST API. This gave us a minimum viable product we can use until we get our Flask website up.

In a future post, we will build upon what we learned and deploy a basic Flask website.