Reverse geocoding with Tableau Prep, Python and Google Geocoding API
A couple years ago I created a Tableau Web Data Connector (WDC) for Swedish House prices. A number of my customers have been using it since then on a daily basis. It allows one to retrieve prices of properties sold anywhere in Sweden since several years back, as well as the houses currently for sale on the market.
One customer has used this data with Machine Learning to predict the prices these houses might end up selling for. Now they want to generalise their algorithm to provide a price estimate for a wider area, e.g. the postal code, rather that at the individual location (lat/long) level. The challenge is that the data provider does not include postcodes, but rather just the street address, city, as well as latitude and longitude of the property.
If I had access to a Shapefile that contains all the Swedish postal codes, I could very easily combine this file with my original WDC data to bring the post codes into Tableau, thanks to an awesome feature released in 2018.2: Spacial Joins.
Since I couldn’t find any such, I’ll show you how I leveraged Tableau Prep’s new Scripting capability (currently under v2019.3 beta testing), combined with Google Geocoding API to solve this fun challenge.
You can apply the same technique to enrich your datasets in Tableau Prep with data coming from Python.
This is what I did:
- The first step was to get an API key from GCP — Google Cloud Platform, as per these instructions.
2. I made sure I had an instance of Tabpy available on a computer on my network (Tabpy installation instructions here).
3. On that computer running Tabpy, I installed this handy Python library that wraps Google Geocoding API: pygeocoder
pip3 install pygeocoder
4. With Tableau Prep, I then connected to the output of my Booli WDC, a Tableau Server Published Data Source — also a new feature of 2019.3 :) — with 100,000 house prices. I created a simple branch on my prep flow that only contained the 2 fields I needed as an input for the Google API to reverse geocode the house location and return the postcode. These were the Latitude and Longitude.
I also included the Booli Id field, which is a unique identifier for each row, so that I could “know” which lat/long corresponds to which postcode. This was needed at a later step in my flow, so that I could join back the output of my Python script with the original dataset.
5) I then added a Script step in Tableau Prep,
which received the data from my previous step, in the form of a Pandas Dataframe. If you are not familiar with Pandas, do read more about it, as it’s an incredibly powerful library, super easy to understand for people that typically deal with datasets.
I configured this script to execute a Python function called getPostalCode.
6) Finally I connected Tableau Prep’s Script step to the python file containing my code as follows:
7) When the script above runs, it returns a dataset back to Tableau Prep including one additional column (“Postcode”). I then just had to join this data back into the original dataset, based on the linking field “Booli Id” and create an output to write that data back to Tableau Server, now including a new column.
That’s it! Here’s a screenshot of this output data in Tableau, now including the postcodes:
And here’s a screenshot of my completed Tableau Prep Flow:
Important takeaway: While working on this project I learned an easy trick I could use whenever I need to add a new field to my dataset in Prep. It’s basically an easy way to add a new new field/column to a Pandas dataframe and populate it with new data:
a) Create a empty list that will hold the new data points, for example:
list_new_data = 
b) Loop through each row in the original dataframe and perform the needed operations in each row, in order to generate your new data. In my case, that is to send lat/long to the API endpoint and get back the location object. Then the key here is to append this data point into the list:
for row in df.index:
< do some stuff in this row, e.g. create your new datapoint >
... #add this new data point to the list
c) Finally, create a new Column in the dataframe by giving it a new name then and assign it to the list populated by your for loop:
df['My_New_Column'] = list_new_data
That’s it. Following this tip you will hopefully be able to create sophisticated data pipelines in Python and augment your Tableau Prep flows :)
Let me know what suggestions you have to improve this article or solution, as well as the Tableau Prep Python/R Scripts you have created!