A Tale of Two Domains – Lessons from Strata-Hadoop

A couple weeks ago, we spent some time at Strata-Hadoop NYC. It was hyped as the Lollapalooza of Big Data, though I suspect whoever made that claim has not spent a lot of time in Chicago during the summer. Nonetheless, Strata-Hadoop is one of the biggest, most exciting conferences in the industry. It’s packed with new technologies, a vibrant expo hall, brilliant speakers from both industry and academia, and the ideas that will shape Big Data for years to come. Now that we’ve had a couple weeks to think through what we’ve seen, follow up on leads, and scan through some of the presentations we missed, it’s time to discuss what we learned at Strata.


1. The Open-source ETL Space is Maturing

One of the key arguments for avoiding the open-source space has always been how turbulent it is. For exceptionally agile companies, this can be a good thing – there’s always a new and better option emerging. For most companies, this is a danger zone.  They’ve seen the rise of MapReduce and the subsequent technologies built on top of it. Many of them have built impressive data architectures around these tools, and they’ve seen those architectures lose ground to new technologies employed by competitors.

Now it seems we’re in the age of Spark. According to surveys from Databricks, more than 1000 organizations are using Spark in production, which is probably a low-end estimate given that Spark is an open-source product and everyone hates answering surveys. Its uses range from business intelligence to recommendation engines to processing IoT streaming data, and it does all of this at speeds that leave MapReduce-based technologies in the dust.

So, are you safe on Spark? Probably not, and that’s one of the reasons why many companies opt to go with products from Informatica or Talend over open-source. Enterprise vendors provide stable, reliable products that will ultimately be upgraded to use the latest technologies with minimal technical debt involved. For many of these vendors, implementations from a decade ago run better than ever today. That’s not something you’ve been able to count on with custom implementations of the Apache projects of the month. 

Enter Apache Beam.

Apache Beam is the open-source world’s response to the abstraction argument. Beam is young (it was donated to the Apache Software Foundation in early 2016), but its aims are ambitious. Beam looks to provide a unifying API for both batch and streaming jobs that can run on any number of back-ends. It’s trying to limit the Big Data space’s nasty habit of coming out with a better platform as soon as you finish implementing the last one. Maybe right now you use Flink as the back-end. Once the Spark runner is more developed, you might switch over to that. Ten years from now, you might be using Apache Bearded Seal as your back-end. The idea is that your front-end code won’t require major changes to deal with whatever comes down the pipeline.
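To make the "same front-end, swappable back-end" idea concrete, here is a toy sketch in plain Python. To be clear, this is not Beam's API: every name below (build_pipeline, batch_runner, streaming_runner) is hypothetical and exists only to illustrate the separation between describing a job and executing it.

```python
from typing import Callable, Iterable, Iterator, List

# A "pipeline" here is just an ordered list of per-record transforms.
# These names are hypothetical and are NOT part of Apache Beam.
Transform = Callable[[str], str]


def build_pipeline() -> List[Transform]:
    """Front-end code: describes WHAT to do, not how or where to run it."""
    return [str.strip, str.lower]


def apply_all(pipeline: List[Transform], record: str) -> str:
    """Run one record through every transform in order."""
    for transform in pipeline:
        record = transform(record)
    return record


def batch_runner(pipeline: List[Transform], records: List[str]) -> List[str]:
    """Back-end 1: process a complete, bounded data set in one pass."""
    return [apply_all(pipeline, r) for r in records]


def streaming_runner(pipeline: List[Transform], records: Iterable[str]) -> Iterator[str]:
    """Back-end 2: process records lazily, one at a time, as they arrive."""
    for r in records:
        yield apply_all(pipeline, r)


pipeline = build_pipeline()
events = ["  Spark ", "FLINK", " Beam"]

# The pipeline definition is untouched; only the runner changes.
batch_result = batch_runner(pipeline, events)
stream_result = list(streaming_runner(pipeline, iter(events)))
```

Both runners produce the same output from the same pipeline definition, which is the property that matters here. A real runner also has to handle windowing, state, and distribution, which is exactly where the compatibility questions get hard.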

While still early in the incubation process, Beam isn’t a typical early-stage Apache project. Beam was originally developed by Google as the programming model for their Cloud Dataflow product, which has been in production usage at Google for a few years now. The number of back-ends available is still fairly small: Flink is the strongest back-end apart from Google’s hosted Cloud Dataflow solution, and the Spark runner is usable but still under development. The Beam development team has said that there’s been interest in Storm, Hadoop, and Gearpump runners as well.

However, I wouldn’t start tolling the bell for enterprise ETL products quite yet. There are a lot of questions yet to be answered for a project like Beam. One of the larger questions is what happens when a new runner is fundamentally incompatible with the Beam model? Will we start seeing a large number of runners that all support a limited subset of the Beam API or will we see the Beam API expanding to allow for different implementations of similar features?  If developers have to monitor a host of different potential runners and their limitations in order to safely upgrade the backend, have we really reached the point where open-source offers the same stability as the big ETL vendors?

It will be interesting to see how these projects continue to evolve. Nonetheless, it’s clear that the open-source big data ecosystem isn’t quite the wild west it used to be.


2. Commercial Data Science Tools and Responsible Data Science

Data Science is always an exciting topic at these conferences. While Data Science and Big Data are not strictly the same field, Big Data has allowed Data Science to address problems that were out of reach until very recently, so you tend to find a significant amount of overlap between practitioners. While the O’Reilly AI conference was the place to be on Monday and Tuesday for news on cutting-edge research, Strata-Hadoop is the place where you can see how innovators are putting that research into action.

There were a ton of Data Science products scattered throughout the conference. Many of these were emerging products just starting to build steam. Others came from older powerhouses pushing to secure their footprint in the increasingly open-source driven world. Nearly all of these products fundamentally fit into one of two categories:

  • Data Scientist Assisters
  • Data Scientist Replacers

The assisters are the productivity suites like SAS, Turi, Microsoft’s Cortana, IBM’s DataWorks, or RapidMiner that fundamentally exist to make your Data Scientists’ jobs easier. Some of them have data preparation components. Others rely on data wrangling earlier in the data lifecycle. These tools allow for easy, flow-based development of maintainable data science pipelines. Most of them make it easy to deploy trained models to production. It’s a developing space, but there are a lot of good products out there that can make your Data Science team more effective.

It’s the second group that worries me. I’m not saying this as a curmudgeonly old data scientist who’s worried about robots taking my job. I’ve long since come to terms with the fact that most common data science tasks will be automated in the next ten years. I’m saying this as an advocate for responsible Data Science who doesn’t see evidence that the technology is ready to support that automation quite yet.

As punishment for my aggressive accumulation of t-shirts and pens in the expo hall, I’ve been getting a lot of marketing emails. One in particular exemplified this problem to me. “Are you interested in predictive analytics, but the statistics are holding you back?” it asked. As both a climber and a data enthusiast, that sounded a lot to me like “Are you interested in climbing mountains, but tying knots is holding you back?”

By this point, virtually everyone in the industry has seen Drew Conway’s Data Science Venn Diagram. As much as we hear about the democratization of data science, the simple fact is that data science is expensive, and part of the reason for this is that data scientists are expensive people. Finding someone with domain knowledge, high level technical skills, and powerful statistical expertise can be very difficult, and these people are in high demand.

This is where the data scientist replacement tools we saw at Strata enter the picture. These products come in and say, “We’ll do the machine learning for you. All you need to do is provide us with data and we’ll output a model that you can use for predictive insights.” With the machine learning covered, now you only need the substantive expertise and you have Data Science! That sounds too good to be true. Of course, the reason it sounds too good to be true is because it is. There are two very real problems with that approach.

The first is strictly a business problem. What value does this type of product really bring to a business? In theory, these products are trying to reduce your dependency on full-fledged Data Scientists by eliminating hacking skills from Conway’s diagram. The question is: Are non-technical statisticians who know your business area and can competently present to an executive team significantly less expensive than those same statisticians with developer skills?

Beyond that, is a web-based service that trains machine learning models really even eliminating the hacking skills component of the Data Science skill set? How are these teams of non-programming statisticians acquiring and prepping their data? You could have a separate team of data engineers who not only manage your production environment but are also on the beck and call of your team of non-technical statisticians, but that sounds like a recipe for gridlock.  On the other hand, there are some very powerful data blending and preparation tools on the market like Alteryx or Paxata that allow non-technical users to easily create machine learning ready data sets. Yet a full-fledged data scientist can derive just as much value from these types of tools without needing a machine learning service platform to move to the next step.

In reality, most of these tools are providing a service that data scientists don’t need. The emphasis is on making it easy to run machine learning models, but running machine learning models has never been the hard part of data science. If all a product is going to do is provide a cloud-based hyperparameter search across a predefined model space, you’re not going to find many data scientists who can’t code up a similar process fairly quickly whether they’re using one of the aforementioned productivity suites or hand-coding in one of the many open-source machine learning libraries.
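To give a sense of how little code that kind of search takes, here is a minimal sketch in plain Python. The model (a one-dimensional k-nearest-neighbor regressor), the candidate values, and the synthetic data are all assumptions picked to keep the example self-contained. In practice you would reach for a library rather than hand-roll any of this.

```python
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, holdout, k):
    """Hold-out error of the k-NN model for one hyperparameter value."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in holdout]
    return sum(errors) / len(errors)


def grid_search(train, holdout, candidate_ks):
    """Score every candidate on the hold-out set and keep the best one."""
    scores = {k: mean_squared_error(train, holdout, k) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic data, purely for illustration: a noisy quadratic.
rng = random.Random(0)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(100)]
rng.shuffle(data)
train, holdout = data[:70], data[70:]

best_k, scores = grid_search(train, holdout, candidate_ks=[1, 3, 5, 10, 30])
```

The search itself is three lines. Everything that makes the result trustworthy, such as how the hold-out set was built and whether the features mean what you think they mean, lives outside of it.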

The second problem arises from the belief that Data Science is nothing but applying machine learning algorithms to data. It’s a topic that’s been gaining a lot of steam over the past few years as machine learning models have infiltrated our daily lives. Brett Goldstein from the University of Chicago gave an excellent presentation on the subject at Strata. Dr. Cathy O’Neil, a Harvard-educated Wall Street quant, writes about it in her book Weapons of Math Destruction. What I’m referring to is the risk of designing machine learning models that incorporate implicit discrimination.


3. The Big Data Industry and Gender

One of the major themes of this year’s Strata-Hadoop NYC was that our industry has some serious gender inequality issues. The majority of developers, engineers, and data scientists from entry-level to senior management are men. It’s a well-acknowledged problem, but it’s also one that’s quite difficult to solve because the causes aren’t readily apparent. There are scholarships, company outreach programs, and organizations specifically targeted at bringing more women into tech. At Strata, registration fees raised money for Women in Machine Learning and Data Science. Yet the imbalance persists. One of the keynote speakers, Susan Woodward from Sand Hill Econometrics, has studied start-ups for over a decade. In her presentation, she revealed some of her findings on the financial outcomes of female-led start-ups. Her data shows that, even controlling for confounding factors like product and industry, female-led start-ups still tend to have exit multiples around half that of their male-led counterparts. That was pretty sobering to hear.

Given this information, let’s picture a case where we’re a quickly growing company and we want to address problems with gender bias by eliminating humans from the hiring process. Studies have shown that hiring managers tend to select people similar to themselves (Rivera), so we’re going to replace human screeners with a machine learning model. We take our existing employee data and the data we’ve collected on failed applicants and we feed it into a machine learning model. To be safe, we won’t even give the model any demographic information. We do some validation, find that our model has excellent performance on our hold-out data set, and then move it to production.

Six months later, while preparing a press release to announce the success of our innovative new hiring process, our BI team gathers some descriptive data on the impact of the new model. They find virtually no change in our hiring. Digging a bit deeper, they find that some strange features are being weighted highly by the model. Factors like height, name length, vowel-frequency in names, and preferred room temperature have oddly high importance to the model. A strong Data Science team likely would have seen these results coming. A black box Data Scientist replacement product will not.
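A small simulation shows how this can happen even with the demographic column removed. Everything here is made up for illustration: the data is synthetic, the bias in the historical labels is something I put there on purpose, and the "model" is just a lookup table of past hire rates per feature bucket. Height plays the role of the proxy.

```python
import random
from collections import defaultdict

rng = random.Random(42)


def make_applicant():
    """Synthetic applicant. Skill is independent of gender; height is not."""
    is_male = rng.random() < 0.5
    height_cm = rng.gauss(178 if is_male else 165, 7)
    skill = rng.random()
    # Biased history (an assumption of this sketch): women needed a
    # higher skill score than men to be hired.
    hired = skill > (0.5 if is_male else 0.8)
    return {"is_male": is_male, "height_cm": height_cm, "skill": skill, "hired": hired}


applicants = [make_applicant() for _ in range(5000)]


def features(a):
    """What the model sees: no gender column, only height and skill buckets."""
    return (a["height_cm"] >= 171.5, int(a["skill"] * 10))


# "Training": record the historical hire rate for each feature bucket.
counts = defaultdict(lambda: [0, 0])  # bucket -> [hired, total]
for a in applicants:
    bucket = counts[features(a)]
    bucket[0] += a["hired"]
    bucket[1] += 1


def predict(a):
    """Recommend hiring when most past applicants in the same bucket were hired."""
    hired, total = counts[features(a)]
    return hired / total >= 0.5


def predicted_hire_rate(group):
    return sum(predict(a) for a in group) / len(group)


men_rate = predicted_hire_rate([a for a in applicants if a["is_male"]])
women_rate = predicted_hire_rate([a for a in applicants if not a["is_male"]])
```

The model never sees gender, yet men_rate comes out well above women_rate, because height carries enough information to reconstruct the old pattern for mid-range skill scores. Dropping a column does not drop the signal.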

Why did this happen in our fictional scenario? Despite what the hype machine promises, machine learning is not the type of AI we all grew up reading about in science fiction nor is it a tool that crunches numbers until it has inferred the root contributors of any predictive target variable. These are some goals of machine learning researchers, but they’re not tools that are available today. For all the talk about AI as a disruptor, the truth is that machine learning models are inherently among the most conservative tools available.  

What most of the supervised learning models we tend to see in use actually do is fit some type of structure, whether it be a function or a series of trees, to a given data set. This structure is ultimately a description of the data as it is today. Most common tree-based regression models cannot extrapolate to values outside the range of the training data. Classifiers cannot assign classes that do not appear in the training set. Inferring causality is the realm of controlled experiments, and observational data rarely gives us that luxury. A model predicting shark attacks won’t be able to discount the influence of ice cream sales. If there is a high level of mutual information between the distribution of ice cream sales and the distribution of shark attacks, then there’s a good chance the model will take advantage of that information.
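The extrapolation point is easy to see with a toy regression tree. This is a minimal sketch, not how any particular library builds its trees: it splits at the median rather than searching for the best split, but the leaves are still averages of training targets, which is all that matters for the argument.

```python
def fit_tree(points, min_leaf=2):
    """Fit a tiny 1-D regression tree by recursively splitting at the median x."""
    points = sorted(points)
    if len(points) < 2 * min_leaf:
        # Leaf: predict the mean target of the training points that landed here.
        return sum(y for _, y in points) / len(points)
    mid = len(points) // 2
    threshold = points[mid][0]
    return (threshold, fit_tree(points[:mid], min_leaf), fit_tree(points[mid:], min_leaf))


def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's stored mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


# Training data follows y = 2x exactly, for x = 0..10.
train = [(x, 2.0 * x) for x in range(11)]
tree = fit_tree(train)

max_seen = max(y for _, y in train)   # 20.0
far_outside = predict(tree, 1000)     # the true relationship would give 2000.0
```

No matter how far x moves past the training range, the prediction stays pinned to the last leaf and can never exceed max_seen. The tree has described the data it was given, nothing more.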

Taking theoretical knowledge out of Data Science exposes teams to risk and fails to live up to the tenets of responsible Data Science. Many of these ML-as-a-Service products allow companies to ignore theory. What they should be doing instead is making theory more accessible: implementing features that make models easier to understand and exposing some of the workings of the algorithms, so that end users can make informed decisions about the implications of putting these algorithms into production systems, based on more than just a cross-validation score.

One of the most interesting presentations I attended at Strata was by Prof. Carlos Guestrin from the University of Washington. Prof. Guestrin spoke about LIME (Local Interpretable Model-agnostic Explanations), a technique created by a student of his in order to address the issue of increasing model complexity (Ribeiro 2016). The idea behind LIME is that the global structure of modern models is too complicated for anyone to understand. Deep neural networks often have hundreds of thousands or even millions of parameters. Even for experts, this type of model becomes a black box. However, the local structure often approximates simpler models. You don’t need to be an expert to understand the coefficients of a linear regression or a single decision tree.

LIME takes an individual prediction, fits a simple approximation of the model in the neighborhood of that prediction, and presents the key features of the approximation to the end user. To test the effectiveness of these explanations, the research team gave non-expert subjects a classification task and a pair of models to choose from and asked them to determine which model would perform better on the given task. Using the output of LIME, 89% of their test subjects were able to determine the better classifier. In a similar test, they found that their subjects were often able to improve the result with manual tuning despite having no machine learning background. It’s an exciting result, and I’d encourage anyone interested in model interpretability to read either the paper itself or Carlos Guestrin’s summary on the O’Reilly blog.
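The local approximation idea can be sketched in a few dozen lines. To be clear, this is my own simplified illustration and not the authors' implementation or their library: the black-box function, the Gaussian sampling, and the proximity kernel are all assumptions. It perturbs the input around one point, weights each sample by how close it is, and fits a weighted linear model whose coefficients serve as the explanation.

```python
import math
import random


def black_box(x1, x2):
    """Stand-in for an opaque model; nonlinear globally."""
    return x1 ** 2 + math.sin(x2)


def solve(a, b):
    """Solve the small linear system a @ w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def explain_locally(point, n_samples=2000, scale=0.3, seed=0):
    """Fit a proximity-weighted linear surrogate around `point`."""
    rng = random.Random(seed)
    xtwx = [[0.0] * 3 for _ in range(3)]  # accumulates X^T W X
    xtwy = [0.0] * 3                      # accumulates X^T W y
    for _ in range(n_samples):
        d1, d2 = rng.gauss(0, scale), rng.gauss(0, scale)
        # Samples closer to the point being explained count for more.
        weight = math.exp(-(d1 ** 2 + d2 ** 2) / (2 * scale ** 2))
        row = [1.0, d1, d2]
        y = black_box(point[0] + d1, point[1] + d2)
        for i in range(3):
            xtwy[i] += weight * row[i] * y
            for j in range(3):
                xtwx[i][j] += weight * row[i] * row[j]
    _intercept, w1, w2 = solve(xtwx, xtwy)
    return {"x1": w1, "x2": w2}


explanation = explain_locally(point=(3.0, 0.0))
```

Around the point (3, 0), the surrogate's coefficient for x1 lands near 6 and the one for x2 near 1, so a reader can see at a glance that x1 dominates this particular prediction without knowing anything about the function's global shape. Explain a different point and the coefficients change, which is the whole idea.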

Ultimately, what the current spate of ML-as-a-service products really does is hide complexity. Hiding complexity is not the same as eliminating it. If these products really want to reduce the need for Data Scientists, they need to ensure that they’re accounting for all of the responsibilities of the Data Scientist, not just training machine learning models. Whether it’s through LIME, linear probes (Bengio 2016), or some other technique that has yet to be developed, incorporating modern model interpretation techniques would be a big step toward a more productive and more responsible Data Science toolkit.


Michael McCabe, Senior Consultant




Linear probes -

Rivera -

Guestrin -