AV's Blog

Creating a Flask Web App that Scrapes Data from Austin's Open Data Portal

November 06, 2018

Before my foray into JavaScript and React, I used to code in Python. Along the way, I learned about Flask, a microframework for Python created by Armin Ronacher. That is how I discovered I enjoyed web development, and why I dove into it after spending some time on Python data analysis and machine learning. I still like those, but I enjoy web development a lot. 🦊

One of the first web dev projects I did was a Flask app that gets data from Austin’s Open Data Portal. I got interested in the water quality sampling dataset under the environment category because I understood it a little better than the other datasets, and it was the most interesting to me the first time I browsed the data portal for the city I currently live in.

Before using this dataset, I first had to understand it, and pandas let me do that. I zeroed in on the E. coli data, which turned out to be measured using different methods and units. I picked the most commonly used one, MPN/100mL (most probable number per 100 milliliters of sample, a fecal coliform bacteria count). I’ve blogged about the data analysis here.

The open data portal has its own visualization tool, but configuring it is a little confusing. Having used Flask in a small project before, I thought about using it to scrape data off the portal and show the E. coli data on a map. The resulting app is deployed on Heroku here. I learned the practical side from Miguel Grinberg, who I think has the most comprehensive tutorial on Flask, and I used Bootstrap for styling. In that version, the year of the data was hardcoded, and the Google map no longer renders correctly because Google recently changed its policy on usage of its Maps API.

I went back to this project and refactored the code. I now kind of understand the file structure and have modified it to mostly conform to what experts consider good practice. For example, I’ve hidden the API keys for Google Maps and for querying the Austin Data Portal. I managed to deploy the site to Heroku again, just like last time, without leaking the API keys. I did away with Bootstrap for now; the site only uses basic CSS and some flexbox.
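
For anyone wondering what hiding the keys can look like, here is a minimal sketch of one common pattern: reading the token from an environment variable (Heroku exposes its config vars to the app this way). The helper name read_secret and the variable name SOCRATA_APP_TOKEN are made up for illustration and are not necessarily what the repository uses.

```python
# secret.py (hypothetical): keeps the real token out of version control
import os


def read_secret(name, environ=os.environ):
    """Return the named setting, failing loudly if it is not configured."""
    value = environ.get(name)
    if not value:
        raise RuntimeError("Missing required setting: {}".format(name))
    return value


# In the real module this would read from os.environ:
#   TOKEN = read_secret("SOCRATA_APP_TOKEN")
# A fake environment is used here so the sketch runs anywhere.
fake_env = {"SOCRATA_APP_TOKEN": "not-a-real-token"}
TOKEN = read_secret("SOCRATA_APP_TOKEN", environ=fake_env)
print(TOKEN)  # not-a-real-token
```

Failing at import time when the variable is missing is a deliberate choice in this sketch: a misconfigured deploy breaks immediately instead of sending requests without a token.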

Going through this project helped me appreciate the good parts of React.js, but there are also good parts of Flask I like. One question I would like answered is: why doesn’t asynchronicity matter in Flask, while in JavaScript it does? Or maybe I’m missing something here.

The code for the new version of the app is up on GitHub. Scraping involves using the Socrata Open Data API. The portal points to the API endpoint where you can get the data. I then had to request my own app token; how to do this in my case is discussed here.

Using a browser or even Postman, I can test the endpoint. I went ahead and used the ‘unit’ column being equal to ‘MPN/100ML’ as the query parameter to get the E. coli data. The code then keeps only the readings for the current year.

import datetime
import requests
from secret import TOKEN


def get_data(params):
  """Scrapes water quality sampling data from data.austintexas.gov"""

  url = 'https://data.austintexas.gov/resource/v7et-4fvp.json?$$app_token={}'.format(TOKEN)

  r = requests.get(url, params=params)
  # Raise an HTTPError on a 4xx/5xx response instead of silently returning None
  r.raise_for_status()
  data = r.json()

  # Extract only current year's readings
  current_year = str(datetime.datetime.now().year)
  results = [line for line in data if line['sample_date'][:4] == current_year]

  # Add a 'sample_datetime' field holding the parsed timestamp
  # (the [:-4] slice drops the trailing milliseconds, e.g. '.000')
  for record in results:
    record['sample_datetime'] = datetime.datetime.strptime(record['sample_date'][:-4], "%Y-%m-%dT%H:%M:%S")

  return results
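
To see what the slicing and the format string in get_data are doing, here is a small standalone check. The timestamp is a made-up value in the shape the function expects (an ISO-style string ending in '.000'); it is not an actual record from the dataset.

```python
import datetime

# Hypothetical 'sample_date' value, shaped like the strings get_data() parses
raw = "2018-11-06T09:30:00.000"

# Same approach as get_data(): drop the last four characters ('.000'), then parse
parsed = datetime.datetime.strptime(raw[:-4], "%Y-%m-%dT%H:%M:%S")

# Alternative: let %f consume the fractional seconds, so no slicing is needed
parsed_alt = datetime.datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f")

print(parsed.isoformat())    # 2018-11-06T09:30:00
print(parsed == parsed_alt)  # True
```

The %f variant is a little more forgiving, since it does not assume the fractional part is exactly three digits long.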

Filtering by year outside the API request, rather than in the query itself, seems hacky. Now that I’m writing about it, I realize I could probably have done it in the query, but filtering in Python was easier at the time I was writing the code. So I did some datetime conversion just to get the current year’s readings.
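
For what it's worth, here is a rough sketch of how the year filter could move into the request using the Socrata API's $where parameter. It assumes sample_date is stored as a timestamp column that can be compared against ISO-style strings; the helper name year_filter is made up for this example.

```python
import datetime


def year_filter(year=None, column="sample_date"):
    """Build a SoQL $where clause covering one calendar year."""
    if year is None:
        year = datetime.datetime.now().year
    start = "{}-01-01T00:00:00".format(year)
    end = "{}-01-01T00:00:00".format(year + 1)
    # Half-open range: from the start of the year up to, but not including, the next year
    clause = "{0} >= '{1}' AND {0} < '{2}'".format(column, start, end)
    return {"$where": clause}


# Combine with the existing 'unit' filter, then pass the dict to get_data()
params = {"unit": "MPN/100ML"}
params.update(year_filter(2018))
print(params["$where"])
# sample_date >= '2018-01-01T00:00:00' AND sample_date < '2019-01-01T00:00:00'
```

Filtering on the server would also matter if the endpoint caps the number of rows per request (Socrata endpoints typically return 1000 rows by default unless $limit is set), because a client-side filter only ever sees the rows that came back.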

Once I have the function that does the API request, I can use it in my Flask app. I now find this framework a little clunky compared to React.js. Still, I like that I was able to make it work.
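
As a rough illustration of wiring such a function into Flask, here is a minimal sketch. It is not the app's actual code: the real app renders a page with a map, while this version just returns JSON. The route path, the stand-in get_data, and the record fields are all made up so the example runs without network access or a token.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def get_data(params):
    """Stand-in for the real get_data(); returns one made-up record."""
    return [{"site_name": "Example Creek", "result": "120", "unit": params["unit"]}]


@app.route("/api/readings")
def readings():
    # Same filter the post describes: rows where the 'unit' column matches
    results = get_data({"unit": "MPN/100ML"})
    return jsonify(results)


if __name__ == "__main__":
    # Starts Flask's development server for local testing
    app.run(debug=True)
```

With the real get_data swapped in, visiting /api/readings in a browser would show the current year's readings as JSON, which is a handy way to check the data before building the map on top of it.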

First, I had to create a Python virtual environment in the folder where I chose to have my project live. I am using Python 3.7, which I invoke as python3 in my console.

$ mkdir waterATX2018
$ cd waterATX2018
waterATX2018 $ python3 -m venv venv  # (I can use any name to call the virtual env)
waterATX2018 $ source venv/bin/activate
(venv) waterATX2018 $

Inside the virtual environment, I installed Flask using pip.

(venv) waterATX2018 $ pip install Flask

The rest follows Miguel Grinberg’s Flask tutorial where applicable. For instance, the app doesn’t have a database, and presently, it doesn’t use a form, so I skipped those parts of the tutorial. The closest Miguel Grinberg tutorial for this app is the one he gave at PyCon 2015, which is still up on YouTube.