
Netflix Teaches Us the Grammar of Big Data

On January 2nd, 2014, Alexis Madrigal of The Atlantic wrote one of the best pieces on Big Data that we will see this year: How Netflix Reverse Engineered Hollywood. You should read the nearly 5,000-word piece in its entirety if you have any interest in Big Data, retail, or marketing, since it offers many insights on how to apply Big Data. We’ll just wait here while you read the story.

OK, now that we’ve all read this great article, what does it mean?

There are a few key lessons that DataHive took from Netflix’s 76,897 genres to keep in mind for 2014.

1. The key to Big Data is not processing; it’s metadata. It’s easy to think of metadata as an evil word in light of the recent NSA surveillance revelations. Those who are relatively new to the concepts of Big Data and metadata may quickly assume that metadata is a bad thing and wonder how personalized metadata collection could possibly be helpful. The Netflix example shows the other side of metadata: it is fundamental to the personalized recommendations that we are starting to expect from our consumer services. Before companies can make these suggestions, they need the appropriate metadata.

2. Big Data requires a grammar and human interaction to be parsable.

For Netflix to make sense of the truly Big Data associated with millions of movies, it was not enough to simply use what already existed: two-plus hours of video, cast and crew information, and other data that typically add up to three or four gigabytes per movie. Instead, Netflix needed to create a dictionary and grammar of relevant phrases, then farm out the metadata tagging and contextualization to people who could analyze the video at a high, human level.

This is the process that intrigues DataHive the most: how to translate Big Data into actual insight.

One of the understated advantages that Netflix has had through its electronic collation of movies is the ability to create microgenres based on basic traits and personal preferences. Many of us have seen a Netflix category like “Oscar winning thrillers starring Meryl Streep” or something along those lines, but the logic involved is quite interesting. The role that Todd Yellin, Netflix’s VP of Product, played in translating movies into categories was vital in differentiating Netflix.

There is a very logical structure for these genres that the article describes:

Region + Adjectives + Noun Genre + Based On… + Set In… + From the… + About… + For Age X to Y

This Big Data “grammar” ends up being the secret sauce that Netflix uses to create these categories from a few hundred basic descriptors. However, Netflix also had to actually assign descriptors to each movie, which was an important additional process to implement.
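To make the idea of a genre “grammar” concrete, here is a minimal sketch of how a fixed slot order could turn a handful of descriptors into a readable category title. The slot names, the connecting phrases, and the rule that the noun genre is required are all our own assumptions for illustration; this is not Netflix’s actual implementation.

```python
# Hypothetical slot names; the order follows the structure quoted above.
SLOT_ORDER = [
    "region", "adjectives", "noun_genre", "based_on",
    "set_in", "from_the", "about", "for_age",
]

# Assumed connecting phrases for the slots that read as a clause.
PREFIXES = {
    "based_on": "Based on ",
    "set_in": "Set in ",
    "from_the": "from the ",
    "about": "About ",
    "for_age": "For Ages ",
}


def compose_genre(**slots):
    """Build a genre title by filling the grammar's slots in order."""
    unknown = set(slots) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    if not slots.get("noun_genre"):
        # Assumption: every title needs a noun genre to anchor it.
        raise ValueError("noun_genre is required")

    parts = []
    for slot in SLOT_ORDER:
        value = slots.get(slot)
        if not value:
            continue  # every other slot is optional
        if slot == "adjectives":
            # Accept a single adjective or a list of them.
            words = [value] if isinstance(value, str) else list(value)
            parts.append(" ".join(words))
        else:
            parts.append(PREFIXES.get(slot, "") + value)
    return " ".join(parts)


title = compose_genre(
    region="British",
    adjectives=["Gritty"],
    noun_genre="Crime Dramas",
    set_in="Europe",
    from_the="1980s",
)
print(title)
```

Because every slot except the noun genre is optional, even a short list of values per slot multiplies into a very large number of possible titles, which is the point of the grammar.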

Netflix actually rates all of its movies on a 1 to 5 scale in a number of different areas based on a human’s perspective. An actual viewer creates this initial rating, which is subject to personal bias and perspective no matter how many rules are provided to the viewer. This human interaction ends up being vital to the processing and analysis of Big Data.
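As a rough illustration of how human-assigned ratings could feed the descriptors, the sketch below stores 1 to 5 scores per area and turns high scores into adjectives. The area names, the adjective mapping, and the threshold of 4 are hypothetical choices of ours; the article does not spell out how Netflix converts ratings into descriptors.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaggedMovie:
    """A movie plus the 1-5 ratings a human tagger assigned to it."""
    title: str
    ratings: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self):
        # Reject anything outside the 1 to 5 scale.
        for area, score in self.ratings.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{area} rating {score} is outside 1-5")


# Hypothetical mapping from rating areas to genre adjectives.
ADJECTIVE_FOR_AREA = {
    "romance": "Romantic",
    "suspense": "Suspenseful",
    "darkness": "Dark",
}


def adjectives_for(movie: TaggedMovie, threshold: int = 4) -> List[str]:
    """Return the adjectives whose area was rated at or above the threshold."""
    return [
        adjective
        for area, adjective in ADJECTIVE_FOR_AREA.items()
        if movie.ratings.get(area, 0) >= threshold
    ]


movie = TaggedMovie("Example Film", {"romance": 2, "suspense": 5, "darkness": 4})
print(adjectives_for(movie))
```

The threshold is where the tagger’s personal bias shows up most directly: a tagger who scores suspense one point lower would move this film out of the “Suspenseful” bucket entirely.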

(This Big Data perspective is not unique to movies. In baseball, the raw data associated with player performance had been available for decades, but it was not until a volunteer SABR (Society for American Baseball Research) army set upon the data that baseball became the flagbearer of analytical thought and Big Data usage that it is today.)

The categorization of our daily lives is often based on a few hundred characteristics analyzed through Big Data. It is not the complexity of taxonomy or ontology that ends up providing the greatest insight. No matter who the data scientist is, Big Data insight needs to be clear to actual end users or else all that brilliance is wasted.

3. Video is the next big thing in Big Data. Big Data is a lazy phrase used to describe several different trends: the exponential growth of structured data, the increased need to curate and collate unstructured data, the exponential growth of data sources, and the challenge of gaining insight from a data ecosystem that is literally too large to comprehend as we get up to petabyte scale.

However, a fundamental component of Big Data is that it represents new data tools and challenges that have not previously existed in the enterprise analytics and data management worlds. In that context, video is going to be the next great challenge in Big Data. Consider how difficult text already is as a Big Data problem. It is not enough to store every character and create specific keyword relationships, as hard as that is. True text analysis needs to include the close reading and sentiment analysis associated with literature and poetry.

Video takes this analysis to another level. First, instead of analyzing characters, video analytics would ideally analyze each pixel on a frame-by-frame basis. It would also understand the language being used in context. And it would recognize visual cues and outliers, as the algorithms used for video security do. But it is safe to say that no vendor or company has currently reached this level of sophistication. Even a Big Data pioneer like Netflix finds that it needs a certain level of human interaction simply to estimate the degree to which a movie is a “Thriller” or a “Drama” or a “Comedy.”

Because of this, video analytics is still in its infancy. DataHive has a gut feeling that the phrase “Big Data” will go away in the next couple of years as Hadoop, sentiment analysis, and cloud-based storage and analytics continue to become more commonly used. However, we will need to dust the phrase back off, or perhaps find a new name, when video analytics is finally ready for its day in the sun. As interesting as companies such as Ooyala already are with their current video analytics capabilities, they have just scratched the surface of the insight that they will eventually provide to the world.

So, the big takeaways for DataHive are threefold:

1) Big Data requires metadata to be useful. If the right metadata isn’t already in place, create it from scratch. It’s better to create a metadata layer that makes sense than to simply spin your wheels with existing Big Data that may lack the context needed to get from inputs to insights. Whether we’re talking about movies, baseball, or other favorite pastimes, Big Data only makes a big impact when it is filtered correctly.

2) Make sure that your metadata outputs are easily comprehensible. Netflix’s genre titles are easy to understand, regardless of whether an 8-year-old or an 8th degree Big Data Master is reading the title. By intentionally simplifying the categorization of films, Netflix provides much greater context. Netflix could easily have automated a list that states “Jennifer Lawrence has 34 minutes and 12 seconds of screen time in The Hunger Games” or whatever the number actually is, but this measurement is trivial compared to the traits and categories that customers actually want. Ultimately, Big Data outputs need to be tailored to users regardless of the complexity or elegance of the analysis.

3) Video is ready for a quantum leap in analytics and Big Data. No vendor is currently able to bring the combination of video identification, video viewing analytics, video sentiment, and video content analysis together, meaning that this Big Data problem will be around for the foreseeable future. However, big challenges are also big opportunities. DataHive Consulting looks forward to playing a part in bringing these disparate video functions together into an integrated video analytics solution.