The Inexplicable Fragility of Analytic Systems

One of the themes that I periodically encounter is a concern that the user, if empowered to add data to the system(s) they use or engage in non-standard / ad hoc analyses, will somehow “break” the system.

To hear this gives me pause because it suggests that the analytic systems being used are brittle in ways that probably should be a concern for the organization . . . or that the organization has implicitly decided to limit opportunities for innovative analyses.

These are deliberate provocations, of course, but worth exploring as there is a pervasive drive to use and derive value from larger volumes of more diverse data—a drive that likely means pushing more of the power and authority that has historically been the purview of system engineers and database administrators to users.

As an analyst, I never worried about “breaking” the system I used: the question that I was trying to answer was always more important. If anything, the limits of the system would result in me setting an interesting line of inquiry to the side.

When I hear people talking about “breaking” something, their concerns seem to coalesce around two issues:

Data Integrity. The concern is that analysts, in adding the data they suspect might give them new insights, might somehow corrupt the integrity of corporate data holdings.

A second, less common formulation of the challenge is that the analyst might create a data breach by adding information in ways that weaken corporate access controls.

I understand the overarching concern but, in light of the diversity of data available to anyone connected to the Web, it should come as no surprise that analysts will want to experiment with and add complementary or contextual data to company data holdings. At the National Retail Federation’s Big Show 2016, one of 1010data’s customers explained that of the 21,000 tables they had on our analytics platform, 19,000 were user generated. I don’t think our customer’s experience is an outlier; if anything, the proliferation of user-created tables is likely to be the norm if analysts are unleashed.

The question for engineers and developers is how they might create the conditions for user-driven analytic innovation. Permissioning (i.e., access control) and integration are key.

To explain 1010data’s approach, I ventured down the hall and talked with Owen Simpson, one of 1010data’s Senior Directors for Analytics and Product Development. “The key is differentiating between ‘read’ access to production tables and reserving ‘write’ access to only authorized systems and users. This affords analysts the flexibility they have come to expect in our analytics platform, while limiting the potential impact they might have on corporate data or other users.”
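The read/write split Owen describes can be illustrated with a minimal sketch. This is not 1010data's API; the names `Catalog`, `Table`, `PermissionDenied`, and `etl_service` are hypothetical and exist only to show the idea: everyone can read production tables, only authorized writers can change them, and analysts can freely write to tables they create themselves.

```python
from dataclasses import dataclass, field


class PermissionDenied(Exception):
    """Raised when a user attempts a write they are not allowed to perform."""


@dataclass
class Table:
    name: str
    owner: str                  # user or system that created the table
    production: bool = False    # True for shared corporate tables
    rows: list = field(default_factory=list)


@dataclass
class Catalog:
    # Hypothetical set of systems/users authorized to write production tables.
    production_writers: set = field(default_factory=set)
    tables: dict = field(default_factory=dict)

    def create_user_table(self, user: str, name: str) -> Table:
        """Any analyst may create a table in their own space."""
        table = Table(name=name, owner=user)
        self.tables[name] = table
        return table

    def read(self, user: str, name: str) -> list:
        """Every user may read in this sketch; a shallow copy of the rows is returned."""
        return list(self.tables[name].rows)

    def write(self, user: str, name: str, row: dict) -> None:
        """Production tables accept writes only from authorized writers;
        user tables accept writes only from their owner."""
        table = self.tables[name]
        if table.production:
            allowed = user in self.production_writers
        else:
            allowed = user == table.owner
        if not allowed:
            raise PermissionDenied(f"{user} may not write to {name}")
        table.rows.append(row)


# Usage: a production table fed by an authorized system, plus an analyst's own table.
catalog = Catalog(production_writers={"etl_service"})
catalog.tables["sales"] = Table("sales", owner="etl_service", production=True)
catalog.write("etl_service", "sales", {"sku": "A1", "units": 3})

catalog.create_user_table("analyst", "weather")
catalog.write("analyst", "weather", {"date": "2016-01-17", "temp_f": 31})
```

In this sketch the analyst can read `sales` and combine it with `weather`, but an attempt to write to `sales` raises `PermissionDenied`, so experimentation never touches the corporate copy.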

“Additionally, with 1010data, each user has their own accumulator,” adds Jon Katzur, 1010data’s Vice President for Sales Engineering. “This means that each session has its own thread manager. Users who do complicated queries don’t harm other users because there is no central queue. Unlike some other systems, you don’t need to wait for a central thread manager to get around to starting your query.”
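To make the "no central queue" point concrete, here is a rough sketch using Python's standard library. It is an analogy, not a description of how 1010data's accumulators are built: the `Session` class and the two query functions are hypothetical, and each session simply owns a private worker pool so that one user's long-running query cannot delay another's.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Session:
    """One analyst session with its own private worker pool (hypothetical)."""

    def __init__(self, user: str, workers: int = 1):
        self.user = user
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, query, *args):
        # Work is queued only behind this session's own queries.
        return self._pool.submit(query, *args)

    def close(self):
        self._pool.shutdown(wait=True)


def slow_query(release: threading.Event) -> str:
    release.wait()  # stands in for a long-running analysis
    return "slow done"


def quick_query() -> str:
    return "quick done"


release = threading.Event()
heavy = Session("heavy_user")
light = Session("light_user")

slow_future = heavy.submit(slow_query, release)
quick_future = light.submit(quick_query)

# The quick query finishes while the slow one is still blocked,
# because the two sessions do not share a queue.
quick_result = quick_future.result(timeout=5)
still_running = not slow_future.done()

release.set()  # let the slow query finish
slow_result = slow_future.result(timeout=5)
heavy.close()
light.close()
```

Had both queries been submitted to a single shared pool with one worker, the quick query would have waited behind the slow one; giving each session its own pool is the design choice this sketch is meant to highlight.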

The second issue is Analytic Methodology. The concern here is that as analysts are empowered to add data to the system, they might “break” the system by incorrectly interpreting the results of their analyses (ref. Tyler Vigen’s brilliant “Spurious Correlations”) or by failing to properly caveat their methodology (e.g., the confidence they have in the data, the strengths and weaknesses of the methodology in light of the data, etc.). This is a real issue . . . and is a matter of training and quality control. Companies like Skytree have put a lot of thought into explaining the meaning of a methodology and my former employer long grappled with the language of “estimative probability.” Is either approach wholly satisfactory? I suspect not . . . but they are illustrative of approaches an organization might take to make sure that their customers understand the meaning of and around the insights that they are trying to convey through their branded analyses.

The ongoing evolution of analysis in the face of massive amounts of diverse data necessitates that control be decentralized and pushed to the user. This represents a shift in how one designs, implements, operates, maintains, or shops for and evaluates corporate systems . . . but it also represents an opportunity for companies to empower their users to find new opportunities, or warn of un- or under-appreciated challenges. The system needs to be flexible enough to accommodate a user’s curiosity, not so rigid that it constrains it.