Useful Big Data:  The fastest way


These days, even small to mid-sized companies can start getting valuable insights from Big Data within a few weeks or even days. There are several flavors of Big Data, most of which fall into two categories: machine-generated data and customer journey data. There are also many ways to capture and use it. In this post, I’ll focus on the online customer journey data generated by visits to websites. It’s a type of data that is relevant to almost all companies, corporations, and governments, because pretty much everyone has a significant online presence these days.

This online customer journey (big) data represents a gold mine of customer insights and provides the fuel for conversion rate optimization, marketing insights, customer segmentation, production optimization, and much more.

What exactly is this customer journey Big Data? If you run a website, the web server is continuously receiving requests from web browsers and mobile devices and sending information back to them. Thousands of times per second, your digital guests are entering text, checking boxes, clicking on links, scrolling through pages, and eventually exiting, with or without having made a transaction. Web servers typically record much of this activity in web logs, which end up holding a lot of rich data.

These logs hold the key to understanding millions of digital customers (whom you’ll never get to meet in person). If fully leveraged, these logs would provide deep insights into your customers’ intents and preferences. However, they are typically left untouched by the business and are automatically discarded by the IT department after a few days or weeks.


Typical first efforts towards customer journey Big Data

Adventurous techies occasionally do become inspired to dig into these web logs.  This is often a painful, short-lived effort.  Those logs contain a fair amount of garbage, including non-human bot traffic, making it challenging to filter out the relevant content.  In addition, the server logs won’t include all the guest data that you could collect using other methods.  There are more problems with this approach, which I won’t go into now, but suffice it to say that the road is hard and the chances of harvesting long-term business value from this approach are relatively slim.
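To give a feel for what that digging involves, here is a minimal sketch of the filtering step. It assumes the server writes the common "combined" log format, and it uses a deliberately short, hypothetical list of user-agent markers to spot bots. Real bot detection needs far more than this, which is part of why the road is hard.

```python
import re

# Assumed "combined" log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical and far from complete: many bots do not identify themselves.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it is garbage."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_probable_bot(entry):
    """Flag entries whose user agent contains a known bot marker."""
    agent = entry["agent"].lower()
    return any(marker in agent for marker in BOT_MARKERS)


def human_requests(lines):
    """Yield parsed entries, skipping unparseable lines and probable bots."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and not is_probable_bot(entry):
            yield entry


# Made-up sample lines for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [10/Oct/2016:13:55:36 +0000] "GET /products/42 HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '198.51.100.9 - - [10/Oct/2016:13:55:37 +0000] "GET /robots.txt HTTP/1.1" '
    '200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    'not a log line',
]

for entry in human_requests(SAMPLE_LINES):
    print(entry["ip"], entry["request"])
```

Even in this toy version, two of the three sample lines are thrown away, and the one that survives still says nothing about what the guest did inside the page.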

Another solution is to use client-side tagging to send the customers’ browsing actions to a JSON store. Some companies use customized JavaScript for this, resulting in additional coding effort and an additional set of tags to fire on page load. There are numerous products available that will do this, many of them open-source by this point. This can be a good approach if you’re ready to learn and integrate a new tool.

Most web platforms have already implemented a web analytics tool (Google Analytics, SiteCatalyst, Webtrends, CoreMetrics, etc., many of which have recently changed names and owners). At the relatively low cost of adding JavaScript tags on the front end, you get a clean, (hopefully) user-friendly analytics tool that answers the most commonly asked customer journey questions: number of visitors, visit frequency, traffic source, geography, conversion funnels, etc.

But we’re not yet at the Big Data. The web analytics tools give you aggregated data, but they don’t let you ask arbitrary, specific customer journey questions. For example, how many customers viewed a specific combination of products in a certain order? What percentage of guests checked and then un-checked a certain product filter before a purchase? There are countless such questions you might ask, and they are typically inspired either by intuition or by an attempt to diagnose a newly recognized problem. Unless we know in advance exactly what questions we will be asking three months from now, we really need a Big Data solution in place for our online customer journey data.
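To make the first of those questions concrete, here is a small sketch of how it could be answered once you hold the raw, hit-level data. The event layout (the `visitor`, `hit_number`, `action` and `product` fields) and the sample events are made up for illustration; your own data will look different.

```python
from collections import defaultdict

# Hypothetical hit-level events; every field name here is an assumption.
EVENTS = [
    {"visitor": "a", "hit_number": 1, "action": "view", "product": "tent"},
    {"visitor": "a", "hit_number": 2, "action": "view", "product": "stove"},
    {"visitor": "a", "hit_number": 3, "action": "view", "product": "bag"},
    {"visitor": "b", "hit_number": 1, "action": "view", "product": "bag"},
    {"visitor": "b", "hit_number": 2, "action": "view", "product": "tent"},
    {"visitor": "c", "hit_number": 1, "action": "view", "product": "tent"},
    {"visitor": "c", "hit_number": 2, "action": "purchase", "product": "bag"},
]


def visitors_who_viewed_in_order(events, products):
    """Return the visitors who viewed `products` in the given order.

    Other views may occur in between; only the relative order matters.
    """
    views = defaultdict(list)
    ordered = sorted(events, key=lambda e: (e["visitor"], e["hit_number"]))
    for event in ordered:
        if event["action"] == "view":
            views[event["visitor"]].append(event["product"])

    matching = set()
    for visitor, viewed in views.items():
        remaining = iter(viewed)
        # Each `in` test consumes the iterator up to the match, so the
        # next product can only be found later in the visit.
        if all(product in remaining for product in products):
            matching.add(visitor)
    return matching


print(visitors_who_viewed_in_order(EVENTS, ["tent", "bag"]))
```

The point is that nothing about this question had to be decided when the data was collected. With only aggregated reports, there would be nothing to run it against.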

Worried about privacy? After all, storing the online journey of each customer sounds invasive. It’s actually not the personally identifiable data we need, but the insights into general preferences, execution patterns and intentions, so you can do most of this analysis just as well without storing the customers’ personal data.


The Quickest Path

A few years ago, while I was overseeing web analytics and big data for the global classifieds group at eBay, I started discussing with my contacts at Google the possibility of accessing the raw data that we were sending them via our existing Google Analytics tags. The tagging was already in place, and we were accessing the resulting web analytics data via a nice GA interface, but we wanted access to the Big Data behind that web analytics. It was data that we ourselves were generating but not yet storing in our own systems. Over the course of several months, through a series of discussions and meetings at Google’s London office, we developed a product whereby Google would export the unsampled (anonymous) web traffic data into their BigQuery product. From BigQuery, my team at eBay could export the data to our own Hadoop cluster, and our analysts could also run SQL-like queries directly against the data in BigQuery. (Note: BigQuery now also accepts ANSI-standard SQL.) Google soon opened this product to all of their GA premium clients, and my colleague Duncan Manhattan and I presented the new product at the annual Google I/O conference in San Francisco.

What we developed, and what Google continues to make available today with their premium GA product (now named ‘GA 360’), is an extremely low-threshold path to a working Big Data solution.

What is involved?

First, the site must be tagged with Google Analytics (GA).  GA is one of the most widely used web analytics tools in the market today, so, even if your platform is not already tagged with GA, there is a widespread knowledge base and pool of expertise on which to draw.  Use of standard GA is free of charge, and the upgrade to GA 360 (premium) does not require new tagging.

Second, the company will need to subscribe to GA 360. This product has a number of perks, so you may want to upgrade to premium regardless of whether you use it for Big Data applications.

Lastly, a few clicks in the Google console will set up the export from GA to BigQuery. With GA 360, you’ll typically get $500/month of billing credit on BigQuery, which is more than enough to get started with your Big Data analysis. For reference, an analyst running occasional queries against a few hundred GB of data on BigQuery will probably incur only a few cents in charges per month, so, in my experience, the $500 more than covers the normal activity of a small team of analysts.

The Big Data that you now have available via BigQuery can be accessed in several ways. Analysts can run queries directly against BigQuery using a SQL-like syntax. Some common BI tools (such as Tableau) can also link directly to BigQuery. Alternatively, you can download parts of the data and process them locally (e.g. Python has a package for this) or download the whole enchilada into your own Hadoop cluster.
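As a rough sketch of the Python route, the snippet below builds a standard SQL query for the most viewed pages in a date range and runs it through the google-cloud-bigquery client library. Treat the details as assumptions to check against your own export: the project and dataset names are placeholders, and the query assumes the export's usual layout of daily `ga_sessions_YYYYMMDD` tables with a repeated `hits` field.

```python
import re


def build_top_pages_query(dataset, start_date, end_date, limit=10):
    """Build a standard SQL query for the most viewed pages in a date range.

    `dataset` is "project.dataset"; dates are YYYYMMDD strings matching
    the suffix of the daily export tables (assumed naming).
    """
    for value in (start_date, end_date):
        if not re.fullmatch(r"\d{8}", value):
            raise ValueError("dates must be YYYYMMDD, got %r" % value)
    return (
        "SELECT h.page.pagePath AS page_path, COUNT(*) AS pageviews\n"
        "FROM `{dataset}.ga_sessions_*`, UNNEST(hits) AS h\n"
        "WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'\n"
        "  AND h.type = 'PAGE'\n"
        "GROUP BY page_path\n"
        "ORDER BY pageviews DESC\n"
        "LIMIT {limit}"
    ).format(dataset=dataset, start=start_date, end=end_date, limit=int(limit))


def run_query(sql):
    """Run `sql` on BigQuery and return the rows as a list of dicts.

    Needs the google-cloud-bigquery package and configured credentials,
    so the import is kept local to this function.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


# "my-project.my_ga_export" is a placeholder, not a real dataset.
sql = build_top_pages_query("my-project.my_ga_export", "20170101", "20170107")
print(sql)
# rows = run_query(sql)  # uncomment once credentials are set up
```

Restricting the table suffix to the dates you need keeps the amount of data scanned down, and with it the bill.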

In summary, the solution we’ve described provides a very low-threshold entry into Big Data for the online customer journey. For web platforms with existing GA 360 subscriptions, it’s a matter of minutes. For web platforms without GA, it’s a simple matter of adding GA tags (for which there is no lack of skills in the market) and subscribing to a standard Google service. Either way, it’s probably the quickest and easiest way to get set up to start harvesting the value of Big Data.

In a future post, we’ll talk about how this setup can be further developed into a streaming data solution.  We recently implemented such a solution on a European vacation booking portal to provide real-time recommendations to online customers.


I’d love to hear people’s thoughts and experiences in the comments below. Perhaps there is an even easier way that you’ve implemented?