Graphing data from Mixpanel

Implementing a low-cost solution to keep your Mixpanel data available for queries.

Product, Growth and Engineering.

Most startups already collect data in their own backend databases. However, this gives only a partial view of how users interact with the application: databases are typically designed to store the latest state and a history of transactions. All other interactions (views, queries, searches) and every status change before the last one are usually not stored in the main database and are only seen by the front end (the mobile or web app).

However, this data is already being collected by third-party analytics tools like Mixpanel, Amplitude, etc. But those services tend to store it in a “silo” and only make it available in aggregated form through their own visualization tools, which is often not enough for data scientists, who want the actual “raw” event-level data: each screen view, each state change, and so on.

In practice, teams often resort to Excel dashboards where they gather multiple sources of data (Mixpanel among them).

Solution

Since we’re talking about a startup with a small team trying to burn as little money as possible, the solution needs to fulfill the following criteria:

1. Extract raw event data periodically using the Mixpanel export API, without adding an EC2 instance or any intermediate storage requirements.
2. Store the raw events in an Amazon S3 bucket (our “data lake”).
3. (Optionally) Make this data available for SQL queries via Amazon Redshift.
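As a sketch of the first two steps, a small script (runnable on a schedule, e.g. as an AWS Lambda function) could stream one day of raw events straight into S3 without any intermediate disk. The bucket name, key layout, and helper names below are assumptions for illustration, not part of the original setup:

```python
import datetime


def s3_key_for_day(day, prefix="mixpanel/raw"):
    """Build a date-partitioned S3 key like mixpanel/raw/2019/01/31.json.
    The prefix and layout are arbitrary choices; adjust to taste."""
    return "{}/{:04d}/{:02d}/{:02d}.json".format(
        prefix, day.year, day.month, day.day
    )


def export_day_to_s3(day, api_secret, bucket):
    """Fetch one day of raw events from Mixpanel's raw export API and
    upload the response body (JSON Lines) to S3. Requires the
    `requests` and `boto3` packages."""
    import requests
    import boto3

    resp = requests.get(
        "https://data.mixpanel.com/api/2.0/export/",
        params={"from_date": day.isoformat(), "to_date": day.isoformat()},
        auth=(api_secret, ""),  # API secret as username, empty password
        timeout=120,
    )
    resp.raise_for_status()
    boto3.client("s3").put_object(
        Bucket=bucket, Key=s3_key_for_day(day), Body=resp.content
    )
```

Scheduling this daily (e.g. with a CloudWatch Events rule triggering the Lambda) covers criteria 1 and 2 with no EC2 instance involved.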

So… let’s do it.

The Problem

One of the most popular tools among startups is Mixpanel. It helps you track and analyze users’ behavior as they interact with your mobile or web product, and also provides tools to engage users.

Once your mobile or web application has integrated Mixpanel’s SDK and you have done the work to properly instrument (code) the events and properties you want to track, Mixpanel provides a set of visualization tools for analyzing all sorts of metrics around user acquisition, engagement, retention, usage, etc. However, all that data you’re collecting is stored on Mixpanel’s servers, and you can’t easily run custom queries on it or combine it with your own data for more powerful analysis and insights.

Data scientists would love to run their Python or R scripts directly on the raw Mixpanel events, but that is not possible unless you export the Mixpanel data in an automated way, store it in cheap storage (like Amazon S3), and make it available for ad-hoc SQL queries (e.g. via Amazon Redshift).

Solution - Data Export API

You can consume the Mixpanel analytics data using the Data Export API.

An example of receiving data using Python can be found here:

https://mixpanel.com/site_media/api/v2/mixpanel.py

The response will be JSON formatted.
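For context, the raw export returns one JSON object per line, each with an event name and a properties dict. The exact property names below are typical of Mixpanel exports but should be treated as an illustration:

```python
import json

# One line of a raw export: event name plus a properties dict.
line = ('{"event": "Visited Page", "properties": '
        '{"time": 1546300800, "distinct_id": "user-42", "page": "/pricing"}}')

record = json.loads(line)
print(record["event"])               # -> Visited Page
print(record["properties"]["page"])  # -> /pricing
```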


import base64
import ssl
import urllib.parse
import urllib.request

import pandas as pd

try:
    import json
except ImportError:
    import simplejson as json


class Mixpanel(object):
    ENDPOINT = 'https://mixpanel.com/api'
    VERSION = '2.0'

    def __init__(self, api_secret):
        self.api_secret = api_secret

    def request(self, methods, params, http_method='GET', format='json'):
        """
        methods - List of methods to be joined, e.g. ['events', 'properties', 'values']
                  will give us http://mixpanel.com/api/2.0/events/properties/values/
        params - Extra parameters associated with method
        """
        params['format'] = format
        request_url = '/'.join([self.ENDPOINT, str(self.VERSION)] + methods)
        if http_method == 'GET':
            data = None
            request_url = request_url + '/?' + self.unicode_urlencode(params)
        else:
            # urllib expects the POST body as bytes
            data = self.unicode_urlencode(params).encode('utf8')
        # Basic auth: the API secret is the username, the password is empty
        auth = base64.b64encode(self.api_secret + b':').decode("ascii")
        headers = {'Authorization': 'Basic {encoded_secret}'.format(encoded_secret=auth)}
        request = urllib.request.Request(request_url, data, headers)
        # Note: this disables TLS certificate verification; prefer a
        # default SSL context in production
        context = ssl._create_unverified_context()
        response = urllib.request.urlopen(request, context=context, timeout=120)
        str_response = response.read().decode('utf8')
        # Export endpoints return one JSON object per line
        records = []
        for line in str_response.splitlines(True):
            records.append(json.loads(line))
        return records

    def unicode_urlencode(self, params):
        """
        Convert lists to JSON encoded strings, and correctly handle any
        unicode URL parameters.
        """
        if isinstance(params, dict):
            params = list(params.items())
        # Build a new list instead of mutating while iterating
        encoded = []
        for key, value in params:
            if isinstance(value, list):
                encoded.append((key, json.dumps(value)))
            else:
                encoded.append((key, value))
        return urllib.parse.urlencode(encoded)


if __name__ == '__main__':
    # Example secret; replace with your own project's API secret (as bytes)
    api_secret = b'8fdd0a4f3eac0e234956d78e7bc345c5'
    api = Mixpanel(api_secret=api_secret)
    data = api.request(['events'], {
        'event': ['Visited Page'],
        'unit': 'hour',
        'interval': 24,
        'type': 'general'
    })
    # print(json.dumps(data, indent=4))
    # Build a DataFrame from the 'data'/'values' dict. If you have nested
    # columns, normalize first with pd.json_normalize.
    df = pd.DataFrame.from_dict(data[0]['data']['values'], orient='columns')

Converting JSON to Pandas

Here is an example of how to convert a JSON dump into a Pandas DataFrame:

https://stackoverflow.com/questions/41168558/python-how-to-convert-json-file-to-dataframe/41168691

You can then use Pandas as you normally would, for analysis/visualization/modeling.
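As a concrete version of the Stack Overflow approach above, a flat list of event dicts (the sample records below mimic the shape of a Mixpanel export but are made up) can be flattened into a DataFrame with `pd.json_normalize`:

```python
import pandas as pd

# Made-up records in the shape of a raw Mixpanel export
events = [
    {"event": "Visited Page", "properties": {"distinct_id": "a", "time": 1546300800}},
    {"event": "Signed Up",    "properties": {"distinct_id": "b", "time": 1546304400}},
]

# Expand the nested 'properties' dict into dotted columns,
# e.g. properties.distinct_id
df = pd.json_normalize(events)
print(sorted(df.columns.tolist()))
# -> ['event', 'properties.distinct_id', 'properties.time']
print(len(df))  # -> 2
```

From there the usual Pandas workflow (grouping, joining with your own data, plotting) applies directly.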