One of the things I do as 1010data’s Chief Evangelist (every employee of every company should act as an evangelist in some way, shape, or form) is think about how to present and speak to the potential of “big data” in the context of 1010data’s analytic platform, the Trillion-Row Spreadsheet. Suffice it to say, a trillion rows gives a person a fair amount of room to play with in terms of critical and creative thinking. Developing compelling demonstrations for an advanced analytics platform that democratizes analysis has brought an observation and a complaint about data science into focus for me, and has gotten me thinking about what we might do about them.
The observation and complaint that keep coming to mind are both based on the reality that there aren’t enough skilled people working on “big data”:
There are not, nor will there be in the near-term, enough data scientists to satisfy market demand.
McKinsey & Company, in its report “Big data: The next frontier for innovation, competition, and productivity,” noted: “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”
Data scientists spend an inordinate amount of time “cleaning” data to make it more analytically friendly.
Given the expense of and expectations for data scientists, cleaning data is seen as a low-value activity . . . even though it is absolutely necessary to run meaningful analytic operations and to glean interesting and useful analytic insights. I spend most of my time trying to structure data for maximum flexibility.
So how might we rectify this?
We start with the McKinsey & Company findings: the report talks about deep analytic talent, not data science. Re-framing the requirement creates opportunity: for as much as we might speak of data scientists in reverential tones, I am consistently surprised at how developers and data scientists speak of people who have mastered spreadsheets in equally reverential tones. If we see “deep analytic talent” as a spectrum, spreadsheet users (basic, intermediate, and advanced) likely represent the first couple of tiers of analytic talent, overlapping with qualitative / domain expertise to varying degrees. Separating the analysts who manipulate data from those who perform sophisticated analysis helps mitigate the daunting problem of having sufficient data analytic resources.
From there, we turn to the data (n.b., for the purposes of this piece, I am going to set aside data created by corporate systems). The World Wide Web and the Deep Web are great repositories of data. The problem is that the data is often presented for human consumption in a manner that is aesthetically pleasing (but analytically unfriendly) on the computer screen, or is formatted with an implicit assumption that the content will be printed. As an example of the former, consider the “store locators” so common to many retail chains: as a consumer, I might care about a store in the context of geographic proximity; as an analyst, I likely care about all stores and a lot of the metadata (geospatial or otherwise) around them.
Speaking of data, there can be—at times—a belief that the potential value of data precludes sharing it. This always strikes me as odd given how much data is available to anyone with a browser, access to the World Wide Web, and the patience to clean interesting—albeit less-than-ideally formatted—data.
One could argue that the proliferation of APIs gives more people more access to more data . . . and that is likely to be true for developers; it might not be true for people whose deep analytic talent is expressed through mastery of a spreadsheet. The implications of this are likely to be more profound for those companies that have passionate customers and fans: given the long tail of the Internet, there might be someone with deep analytic talent and passion-driven intrinsic motivation (ref. Daniel Pink’s Drive) who could use the data to identify and explore new opportunities.
I frequently think of pro-am collaboration in astronomy as a model that more professionals could benefit from considering and implementing.
Conceptually, here are three of my initial thoughts:
Companies should have more CSV or other industry-standard file types related to their business available for download; even basic information can be incredibly useful. For example, (structured) database feeds often sit behind Web pages. How much of that data could be simply presented as a file for download? There are undoubtedly discussions to be had about what data should be made available…but those discussions should start with the data that is already available to anyone with a browser and connection to the Web, even if that data is currently in an analytically unfriendly form.
Differentiating data (i.e., creating multiple facets) is a good thing. The most common spreadsheet application, Microsoft Excel, gives users just over a million rows and, perhaps more importantly, over 16,000 columns of space. Blending two or more data sets adds complexity, and that makes it easy to push beyond what might have once seemed like a generous amount of space or a reasonable amount of computational power. For as much as one might think about “their” data, single sources are rarely wholly satisfying. The question people should ask is, “How might we differentiate our data to enable more innovative analyses?” This is where platforms like the Trillion-Row Spreadsheet shine.
Lastly, create a source description for the data. This makes it easy for the data creator to find the results of outside analysts’ work (which might have the added benefit of enabling recruiting or spurring new internal analytic efforts). In all likelihood, the data also will need to be covered by some form of protection like a licensing agreement. If the goal is to empower others to do analytic work, though, source descriptions need to be clear and the licensing agreements permissive (e.g., a Creative Commons licensing agreement).
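To put the second of these thoughts in numbers, here is a rough Python sketch of how quickly a blend can outgrow a single Excel worksheet (1,048,576 rows by 16,384 columns). The store and weather shapes are invented for illustration, and the blend is assumed to be a worst-case pairing of every row in one set with every row in the other; a real join on shared keys would usually be smaller.

```python
EXCEL_MAX_ROWS = 2 ** 20  # 1,048,576 rows per worksheet
EXCEL_MAX_COLS = 2 ** 14  # 16,384 columns per worksheet


def fits_in_worksheet(rows, cols):
    """True if a table of this shape fits on one Excel worksheet."""
    return rows <= EXCEL_MAX_ROWS and cols <= EXCEL_MAX_COLS


def cross_blend_shape(left, right):
    """Shape (rows, cols) when every left row is paired with every right row."""
    (l_rows, l_cols), (r_rows, r_cols) = left, right
    return l_rows * r_rows, l_cols + r_cols


# Hypothetical data sets, given as (rows, columns).
stores = (2_000, 12)      # one row per store
one_year = (365, 8)       # one row per day of weather
three_years = (1_095, 8)  # the same feed over three years

for label, days in (("one year", one_year), ("three years", three_years)):
    rows, cols = cross_blend_shape(stores, days)
    print(f"{label}: {rows:,} rows x {cols} columns, "
          f"fits: {fits_in_worksheet(rows, cols)}")
```

Under these assumptions, a single year of daily data across 2,000 stores still fits; three years does not.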
While the current narrative seems to be that it is the data that is valuable, there are massive amounts of data that are already freely available to the casual user and unused by the majority of users. I suspect that this is in part because of how painful it is to make the data analytically friendly. The question we should ask is, “How might we package and present publicly available data to more effectively leverage deep analytic talent that exists outside the organization?” If the data is formatted in ways that are use-case agnostic, then data analysis becomes much less daunting.
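As one illustration of how little work such packaging can take, here is a minimal Python sketch that writes a database table out as a CSV file, using only the standard library. The `stores` table, its columns, and the helper name `export_table_to_csv` are hypothetical, loosely modeled on the store-locator example above; a real export would also need the discussions about licensing and scope already mentioned.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV, header row first."""
    # Table names cannot be bound as SQL parameters, so accept only
    # plain identifiers before building the query string.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(col[0] for col in cursor.description)  # column names
    writer.writerows(cursor)                               # data rows
    return out


# Hypothetical store-locator data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stores "
    "(store_id INTEGER, city TEXT, state TEXT, lat REAL, lon REAL)"
)
conn.executemany(
    "INSERT INTO stores VALUES (?, ?, ?, ?, ?)",
    [(1, "Albany", "NY", 42.65, -73.75), (2, "Austin", "TX", 30.27, -97.74)],
)

csv_text = export_table_to_csv(conn, "stores", io.StringIO()).getvalue()
print(csv_text)
```

The same function could write to a file opened with `open(path, "w", newline="")` instead of an in-memory buffer, which is the form a spreadsheet user would actually download.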