One of the projects I did for the Udacity Data Analyst Nanodegree is the data wrangling of an OpenStreetMap dataset. The project demonstrates the data wrangling process from XML files which is the type of data OpenStreetMap is. It is human-edited, and hence, is inherently “dirty”. The process involves parsing, cleaning, and converting the data into a database. The database was then queried for exploration.
I chose the Austin, TX area for this project. An extract for this area already exists. This file is at least 50 MB and just right for a simple data wrangling exercise. Instructions on how to find and download a dataset from OpenStreetMap is illustrated in the Udacity video here.
The OSM XML
To understand how to parse the OSM XML, we need to understand its structure. It contains various elements, and an element is composed on a block of nodes, a block of ways, and a block of relations, aside from the character encoding. An example of an XML code is shown in the OSM XML wiki.
Looking at the wiki and as can be seen above, a node block contains information such as “latitude”, “longitude”, and “node id”, among other attributes. A ways block “is an ordered list of nodes, which has at least one tag, or is included within a relation”.
Problems Encountered in the Map
Prior to cleaning the whole map, a small part of it was parsed first and explored. The code used to do this is shown here. The code was highly dependent on
regex modules of Python. After exploring the sample file, some problems that needed attention were:
- Street names, some of which were abbreviated
- Inconsistent abbreviation of street names (for example, ‘IH35’, ‘I H 35’, ‘I-35’, and ‘I35’ all mean the same thing)
- Phone numbers also didn’t have the same format (ex. ‘(512) 782-5659’, ‘+1 512-472-1666’, ‘51224990093’, ‘512 466-3937’, etc.)
- Postcodes didn’t have a consistent format as well
- City names were also not uniform
Cleaning of Street Names
Here are some examples of data cleaning done in the project:
North IH 35 => North Interstate Highway 35 Calhoun Ln => Calhoun Lane FM 685 => Farm-to-Market Road 685 W. University Avenue,Ste 320 => West University Avenue Suite 320
The “clean function” I created to fix the formats was able to distinguish between some abbreviations:
Avenue H => Avenue H (H stayed the same) N I H 35 Bldg 7 => North Interstate Highway 35 Building 7 (H was converted to Highway)
This was made possible using the
regex module (see code on how I did this). However, after the updates, there were still some problems remaining after the clean up, such as certain streets have different names. For example, Ranch Road 620 is also referred to as Farm-to-Market Road 620, US Highway 290 is also Country Road 290. These were not addressed in the project although it could be easily added to the “mapping_street” dictionary used by the clean function.
Parsing of OSM Data to CSV Files
Data were parsed from the OSM file using “data.py”, a program provided by the course instructor. In short, this program “parses the OSM XML file and transforms them form document format to tabular format, making it possible to write to csv files”. It uses
iterparse to step through each top level element in the XML, shape each element into several data structures using
shape_element function (which uses the
clean function mentioned above), use a schema and validation library to ensure the transformed data is in the correct format, and write each data structure to the appropriate csv file (see docstring in the “data.py” program). Three csv files were created, all corresponding to three XML attributes discussed under “OSM XML” above.
Creating the SQL Database (SQLite)
Creation of the SQLite database was straightforward using the
sqlite module. The code for this task is stored here.
Querrying the SQL Database
Checking for effectiveness of the clean function
Querrying the database created can be a way to determine if cleaning was effective. For example, querrying for cities showed that the cities names were found to be fixed.
cities = cur.execute("""SELECT tags.value, COUNT(*) as count FROM (SELECT * FROM nodes_tags UNION ALL SELECT * FROM ways_tags) tags WHERE tags.key = 'city' GROUP BY tags.value ORDER By count DESC""").fetchall() print cities Out: [(u'Austin', 3068), (u'Round Rock', 113), (u'Kyle', 64), (u'Cedar Park', 43), (u'Pflugerville', 37), .....
However, looking at the postcodes (querried using a similar code as that used for querrying for cities), there were three “None” values. To figure out what these should be, I querried for the accompanying information with these values.
missing_postcodes = cur.execute("""SELECT * FROM (SELECT * FROM nodes_tags UNION ALL SELECT * FROM ways_tags) tags WHERE tags.key = 'postcode' AND tags.value = 'None'""").fetchall() print missing_postcodes Out: [(2152207067, u'postcode', u'None', u'addr'), (247506590, u'postcode', u'None', u'addr'), (383791236, u'postcode', u'None', u'addr')]
To determine what info is accompanying id 2152207067, the following query was done:
cur.execute("""SELECT * FROM (SELECT * FROM nodes_tags UNION ALL SELECT * FROM ways_tags) tags WHERE tags.id = 2152207067""") cur.fetchall() Out: [(2152207067, u'name', u'Nyle Maxwell - Taylor', u'regular'), (2152207067, u'shop', u'car', u'regular'), (2152207067, u'website', u'www.nylemaxwellcjd.com', u'regular'), (2152207067, u'street', u'United States Highway 79', u'addr'), (2152207067, u'postcode', u'None', u'addr')]
From this result and accessing the provided website, it can be seen that the postcode should be 76574. The other missing postcodes were determined in the same way.
Locations of restaurants
Exploring the locations of all restaurants recorded in the OSM XML file used involved using the following code in Python:
cuisine_loc = cur.execute("""SELECT b.id, b.value, nodes.lat, nodes.lon FROM (SELECT * FROM nodes_tags UNION ALL SELECT * FROM ways_tags) b JOIN nodes ON b.id = nodes.id WHERE b.key = 'cuisine'""").fetchall()
Overview of (Meta)Data
The following are some information about the dataset:
austin_texas.osm 1.41 GB atx_osm.db 820.4 MB nodes.csv 604.3 MB nodes_tags.csv 11.7 MB ways.csv 48.6 MB ways_tags.csv 70.6 MB ways_nodes.csv 175.6 MB
Number of Nodes
In : cur.execute("SELECT COUNT(*) FROM nodes") nodes = cur.fetchall() nodes Out: [(6356394,)]
Number of Ways
In : cur.execute("SELECT COUNT(*) FROM ways") ways = cur.fetchall() ways Out: [(666390,)]
Number of Users/Contributors
In : cur.execute("""SELECT COUNT(DISTINCT(e.uid)) FROM (SELECT uid from nodes UNION ALL SELECT uid FROM ways) e""") users = cur.fetchall() users Out: [(1146,)]
Top 10 Contributing Users
cur.execute("""SELECT e.user, COUNT(*) as num FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e GROUP BY e.user ORDER BY num DESC LIMIT 10""").fetchall() Output: [(u'patisilva_atxbuildings', 2743705), (u'ccjjmartin_atxbuildings', 1300514), (u'ccjjmartin__atxbuildings', 940070), (u'wilsaj_atxbuildings', 359124), (u'jseppi_atxbuildings', 300983), (u'woodpeck_fixbot', 223425), (u'kkt_atxbuildings', 157847), (u'lyzidiamond_atxbuildings', 156383), (u'richlv', 50212), (u'johnclary_axtbuildings', 48232)]
Suggestions for Improvement of Data
One aspect that always crop up during clean up of my data was loss of data, such as in the case where one of two phone numbers provided gets discarded. This may be remedied by using a list as value for the field. However, the validation check will flag this and create an error. A non-SQL database might be more applicable in handling this case.
Another problem with the data itself is the presence of more than one field names for one type of data. When the values of attribute ‘k’ was explored, there were at least two “fix me”’s as values. There were also more than one for phone numbers and postal codes. A standardization of the k values should be instituted by OpenStreetMap. Anything that does not fit the list of these k values should create an error upon data entry for contributors. Also, the format for the values might also be standardized. A disadvantage of such rules however, might discourage contributors causing a slow development of OSM. However, if an automated cleaning program is instituted, it might be ok.
Information from an xml file can be extracted for data by Python through the xml.eTree.ElementTree module. This can be converted to a csv file which can be converted to an sql database (or to a pandas dataframe, which is not shown here, but was explored in another unit of the course). SQL databases can be converted to a pandas dataframe.
Cleaning of data takes a while. Knowledge of the nature of data also is very important so the best decisions on what to do with it can be done.