7 2008
Around 2008 a decisive shift in the meanings of “data science” and “data scientist” took place. After nearly a half century of development, in which two broad and consistent usages had emerged—the classical and the statistical—the term was applied in a context that launched it into the public eye and increased its circulation by orders of magnitude.1 This context was the social media corporation of the Web 2.0 era, itself the inheritor of two crowning achievements of data processing engineering—the database and the Internet—catalyzed by Berners-Lee’s invention of a global hypertext system.
7.1 Hammerbacher
According to what is perhaps the most popular article on the topic, Harvard Business Review’s “Data Scientist: The Sexiest Job of the 21st Century,” the term “data scientist” was “coined in 2008 by ... D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook” (Davenport and Patil 2012).2 Although on its face this claim is obviously false, it appears to be correct in having identified the first usages of the term in this new context. Supporting this claim, in 2009 Hammerbacher published an essay recounting his experience as a data scientist at Facebook entitled “Information Platforms and the Rise of the Data Scientist” (Hammerbacher 2009), where he explained the motivation for adopting the term:
At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist” (84).
Although the work in this description appears evenly divided between engineering and statistical tasks, Hammerbacher’s narrative actually focused entirely on efforts to manage the data impedance problem that the social media company (and others like it) faced at the time he was hired in 2006, just after it opened its doors to people other than college students. It is a story of how a small group within the company moved away from a MySQL data warehouse, which literally ceased to function under the volume of the company’s data, and eventually to a new platform based on Hadoop and Hive, in order to perform the standard tasks of extracting, transforming, and loading data (ETL) to build an information retrieval platform for analysts in the company. In addition to their massive scale, these data were also textual and social in nature, putting them in the same category as the complex data types the Air Force had faced years earlier.
Significantly, Hammerbacher emphasized Facebook’s adoption of Google’s machine learning based approach, captured in the phrase “the unreasonable effectiveness of data” and explained in an essay with that title (Halevy, Norvig, and Pereira 2009). In this approach, the size of data is considered more important than the sophistication of models—“invariably, simple models and a lot of data trump more elaborate models based on less data” (9; also quoted by Hammerbacher).3 In essence, the new role of data scientist at Facebook was close to that of the data scientist for the British Antarctic Survey, i.e. the classical, not the statistical definition, with the added focus on machine learning. Indeed, Hammerbacher acknowledged the provenance of the term:
Outside of industry, I’ve found that grad students in many scientific domains are playing the role of the Data Scientist. One of our hires for the Facebook Data team came from a bioinformatics lab where he was building data pipelines and performing offline data analysis of a similar kind. The well-known Large Hadron Collider at CERN generates reams of data that are collected and pored over by graduate students looking for breakthroughs (Hammerbacher 2009: 84).
It seems likely that the phrase data scientist, as it was understood then by the scientific community (and documented above), was transferred to this new domain by Hammerbacher’s contact with some of its members.
7.2 2009: Yau Interprets Varian
At around the same time that Hammerbacher and Patil are said to have coined the phrase data scientist, in October 2008 Hal Varian, chief economist at Google, gave an interview to McKinsey’s James Manyika in which he uttered the famous phrase, “I keep saying the sexy job in the next ten years will be statisticians [sic]” (McKinsey & Company 2009). The interview was published on New Year’s Day of 2009 and was immediately processed by the blogosphere. Nathan Yau, who has a Ph.D. in statistics from UCLA and runs the blog FlowingData, devoted to data visualization, was quick to qualify Varian’s use of the term “statisticians”:
… if you went on to read the rest of Varian’s interview, you’d know that by statisticians, he actually meant it as a general title for someone who is able to extract information from large datasets and then present something of use to non-data experts (Yau 2009).
Here’s what Varian actually said:
I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills—of being able to access, understand, and communicate the insights you get from data analysis—are going to be extremely important. Managers need to be able to access and understand the data themselves [McKinsey & Company (2009); emphases added].
In a follow-up post, “Rise of the Data Scientist,” Yau expands on his revision to Varian’s remarks by incorporating the comments of fellow blogger, Michael Driscoll of Dataspora, who also responded to the McKinsey piece in a post entitled “The Three Sexy Skills of Data Geeks.” Echoing Yau, Driscoll wrote:
I believe that the folks to whom Hal Varian is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics, data munging, and data visualization (Driscoll 2009).
Yau connects Driscoll’s insight to Ben Fry’s fertile concept of “computational information design” (Fry 2004), which maps the fields of computer science, mathematics, statistics, data mining, graphic design, and human-computer interaction onto a data processing pipeline—acquire, parse, filter, mine, represent, refine, and interact. Whereas Driscoll called the role that integrates these fields “statisticians or data geeks,” tellingly equivocating, Yau used the term data scientist:
And after two years of highlighting visualization on FlowingData, it seems collaborations between the fields are growing more common, but more importantly, computational information design edges closer to reality. We’re seeing data scientists—people who can do it all—emerge from the rest of the pack.
. . .
They have a combination of skills that not just makes independent work easier and quicker; it makes collaboration more exciting and opens up possibilities in what can be done. Oftentimes, visualization projects are disjoint processes and involve a lot of waiting. Maybe a statistician is waiting for data from a computer scientist; or a graphic designer is waiting for results from an analyst; or an HCI specialist is waiting for layouts from a graphic designer.
Let’s say you have several data scientists working together though. There’s going to be less waiting and the communication gaps between the fields are tightened.
How often have we seen a visualization tool that held an excellent concept and looked great on paper but lacked the touch of HCI, which made it hard to use and in turn no one gave it a chance? How many important (and interesting) analyses have we missed because certain ideas could not be communicated clearly? The data scientist can solve your troubles (Yau 2009: emphasis added).
Yau’s definition of data scientist is consistent with that given in the HBR article written four years later. There the authors described data scientists as those who “make discoveries while swimming in [the deluge of] data,” “bring structure to large quantities of formless data and make analysis possible,” “identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set,” “help decision makers shift from ad hoc analysis to an ongoing conversation with data,” “are creative in displaying information visually and making the patterns they find clear and compelling,” “advise executives and product managers on the implications of the data for products, processes, and decisions,” and so on. Replace the roles of decision maker, executive, and product manager with scientist and engineer, and the definition is remarkably consistent with the classical definition as it had been developed in the years leading up to this shift.
Yau’s and Driscoll’s responses to Varian are notable because they demonstrate how new the terms data science and data scientist were to the general public at the time, and the manner in which these terms transitioned from a narrow community of discourse to a much larger one. Varian and Driscoll used the term statistician, but both had to qualify it significantly. That “data scientist” was not yet mainstream in 2009 can be seen in a New York Times article also written in response to the McKinsey interview, entitled “For Today’s Graduate, Just One Word: Statistics” (Lohr 2009). Here, the author was compelled to use the expression “Internet-age statistician” to name the role described by Varian, and to qualify this usage in a manner similar to Varian:
Though at the fore, statisticians are only a small part of an army of experts using modern statistical techniques for data analysis. Computing and numerical skills, experts say, matter far more than degrees. So the new data sleuths come from backgrounds like economics, computer science and mathematics.
We may note here an important difference between how the two bloggers, closer to the reality being described, and the established-media journalist represented the new development heralded by Varian: whereas Lohr saw an expanded division of labor, Yau and Driscoll envisioned an entirely new role, something akin to the “whole statistician” described above, although combining different elements. In any case, we see that it is at this precise point that the term “data scientist” begins to be used in the new context later described by the HBR.
7.3 2010: O’Reilly Chimes In
In September 2010, two short but highly influential blog posts appeared that sought to codify this conception of data science, which had by then reached the status of buzzword among participants of the technology conference circuit. The first was Drew Conway’s “The Data Science Venn Diagram,” which defined the field as the intersection of three areas: “hacking skills, math and stats knowledge, and substantive expertise” (Conway 2010). Conway’s post followed his attending an “unconference to help O’Reily [sic] organize their upcoming Strata conference,” where he detected “the utter lack of agreement on what a curriculum on this subject would look like.” His rationale for the three areas was the following:
… we spent a lot of time talking about “where” a course on data science might exist at a university. The conversation was largely rhetorical, as everyone was well aware of the inherent interdisciplinary nature of the [sic] these skills; but then, why have I highlighted these three? First, none is discipline specific, but more importantly, each of these skills are on their own very valuable, but when combined with only one other are at best simply not data science, or at worst downright dangerous.
Of interest here is, first, the need to define a curriculum for what was perceived to be a new field, which echoed previous efforts and presaged the academic response that would eventually follow, and second, that Conway’s was essentially the classical definition applied to the context of what we might call information capitalism, the target audience of O’Reilly’s conference. Although the role of statistics is emphasized, Conway reduces its importance to “knowing what an ordinary least squares regression is” and goes on to assert that “data plus math and statistics only gets you machine learning.” In other words, Conway’s definition is closer to Breiman’s culture of algorithmic modeling than it is to that of data modeling. This is corroborated by the fact that by data Conway meant “a commodity traded electronically,” i.e. that which is found in databases and shared over networks, as opposed to that which is collected intentionally by designed experiments (A/B testing notwithstanding).
The second post was Mason and Wiggins’ “A Taxonomy of Data Science,” which was motivated by the need to make sense of the newly circulated term:
Both within the academy and within tech startups, we’ve been hearing some similar questions lately: Where can I find a good data scientist? What do I need to learn to become a data scientist? Or more succinctly: What is data science? (Mason and Wiggins 2010)
In contrast to Conway’s structural model, Mason and Wiggins proposed a processual one, based on “what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret (or, if you like, OSEMN, which rhymes with possum).” In this model, most of the activities normally associated with the classical definition of data science—as listed in the HBR piece, for example—find a place. A distinguishing feature of this definition is the modeling phase, which they characterized as follows:
Whether in the natural sciences, in engineering, or in data-rich startups, often the ‘best’ model is the most predictive model. E.g., is it ‘better’ to fit one’s data to a straight line or a fifth-order polynomial? Should one combine a weighted sum of 10 rules or 10,000? One way of framing such questions of model selection is to remember why we build models in the first place: to predict and to interpret. While the latter is difficult to quantify, the former can be framed not only quantitatively but empirically. That is, armed with a corpus of data, one can leave out a fraction of the data (the “validation” data or “test set”), learn/optimize a model using the remaining data (the “learning” data or “training set”) by minimizing a chosen loss function (e.g., squared loss, hinge loss, or exponential loss), and evaluate this or another loss function on the validation data. Comparing the value of this loss function for models of differing complexity yields the model complexity which minimizes generalization error. The above process is sometimes called “empirical estimation of generalization error” but typically goes by its nickname: “cross validation.” Validation does not necessarily mean the model is “right.” As Box warned us, “all models are wrong, but some are useful”. Here, we are choosing from among a set of allowed models (the `hypothesis space’, e.g., the set of 3rd, 4th, and 5th order polynomials) which model complexity maximizes predictive power and is thus the least bad among our choices.
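The procedure Mason and Wiggins describe can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the authors’ own code: it uses NumPy, synthetic data invented for the example, a single held-out validation split (rather than repeated folds), squared loss, and a hypothesis space of polynomial degrees. The function names `holdout_loss` and `select_degree` are hypothetical.

```python
import numpy as np


def holdout_loss(x_train, y_train, x_val, y_val, degree):
    """Fit a polynomial of the given degree on the training set and
    return its mean squared loss on the validation set."""
    coeffs = np.polyfit(x_train, y_train, degree)
    predictions = np.polyval(coeffs, x_val)
    return float(np.mean((predictions - y_val) ** 2))


def select_degree(x, y, degrees, val_fraction=0.3, seed=0):
    """Hold out a random fraction of the data, fit every candidate degree
    on the remainder, and return the degree with the lowest validation loss
    together with the loss of each candidate."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = round(len(x) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    losses = {
        d: holdout_loss(x[train_idx], y[train_idx], x[val_idx], y[val_idx], d)
        for d in degrees
    }
    best = min(losses, key=losses.get)
    return best, losses


# Synthetic data (an assumption for illustration): a noisy straight line.
data_rng = np.random.default_rng(42)
x = np.linspace(-1.0, 1.0, 60)
y = 2.0 * x + 0.5 + data_rng.normal(scale=0.2, size=x.size)

# Compare a straight line with a fifth-order polynomial, as in the quotation.
best_degree, losses = select_degree(x, y, degrees=[1, 5])
print("validation loss by degree:", losses)
print("selected degree:", best_degree)
```

The selected degree is simply the one whose validation loss is smallest among the candidates tried, which is the sense in which the chosen model is “the least bad among our choices.”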
Clearly, the authors sided with algorithmic modeling here; they argued for prediction over interpretation by citing methods that have become commonplace in the field. We also find here, for the first time, a clearly articulated pipeline of activity, echoing the partial sequences that appear in previous definitions. Again, it is worth noting what data meant in this context: data are to be obtained from preexisting sources, sometimes by scraping, and not produced. The skills required are far from those of the design-oriented data scientist of the Tokyo school:
Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from multiple sources, and possibly from sites which require specific query syntax. At a minimum, a data scientist should know how to do this from the command line. e.g., in a UN*X environment. Shell scripting does suffice for many tasks, but we recommend learning a programming or scripting language which can support automating the retrieval of data and add the ability to make calls asynchronously and manage the resulting data. Python is a current favorite at time of writing.
Note that the idea of a data analyst looking for “usable data” as a first resort is anathema to that approach, at least in principle.
In 2011, O’Reilly, whose role in the promotion of data science is worth its own investigation, produced a series of influential blog posts and reports that sought to codify and amplify the definition developed by Hammerbacher, Yau, Conway, Mason, and Wiggins. The definition produced was consistent with the classical version, but strongly inflected by the new business context. For example, Loukides’ “What is Data Science?” described the field in terms consistent with what we have seen, focusing on scale, new database technologies, and machine learning in the pattern set by Google. However, in this discourse these elements are combined in the new concept of the “data product,” a good or service that integrates surplus data to provide value to users:
The web is full of “data-driven apps.” Almost any e-commerce application is a data-driven application. There’s a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products [Loukides (2011); emphasis added].
This emphasis on data products was echoed in Patil’s essay, “Building data science teams,” where the focus on data applications became essential to his definition of data scientist. Here he addressed the question, “What makes a data scientist?”:
When Jeff Hammerbacher and I talked about our data science teams, we realized that as our organizations grew, we both had to figure out what to call the people on our teams. “Business analyst” seemed too limiting. “Data analyst” was a contender, but we felt that title might limit what people could do. After all, many of the people on our teams had deep engineering expertise. “Research scientist” was a reasonable job title used by companies like Sun, HP, Xerox, Yahoo, and IBM. However, we felt that most research scientists worked on projects that were futuristic and abstract, and the work was done in labs that were isolated from the product development teams. It might take years for lab research to affect key products, if it ever did. Instead, the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new [Patil (2011); emphasis added].
The focus on data products at this point in history may be understood in light of Zuboff’s thesis that Google invented the business model of “surveillance capitalism,” based on the extraction of “behavioral surplus,” around 2003, and that this model was then exported to Facebook by Sheryl Sandberg in 2008 and became widespread after that (Zuboff 2019). Zuboff makes sense of the fact that in the O’Reilly papers Google’s services were frequently presented as exemplary data products, as well as the fact that the role of data scientist emerged at Facebook during the year of Sandberg’s arrival there. Her thesis also sheds light on the nature of the “immediate and massive impact” of data products described by Patil: the prototypical data product is Google’s advertising auctioning platform, which, as a result of applying its massive amounts of behavioral data to predict user behavior, “produced a stunning 3,590 percent increase in revenue in less than four years” (Zuboff 2019: Ch. 3, Part VI). More generally, Zuboff sheds light on the practical context within which this new iteration of data science emerged: in the heart of the system of computer-mediated economic transactions described by Varian (Varian 2010). Whereas in the previous period data science was imagined to be located at the center of the infrastructure of data-driven science (as in the NSF report cited above), this setting was now transferred to the domain of global, Internet-mediated commerce. Thus, just as the phrase “data scientist” leapt from one context to another at this time, so did the infrastructural framework within which it made sense. Again, the meaning of data science remains relatively unchanged from the classical definition; what changes is the context.
By 2012, the terms data science and data scientist had trended widely in the media, in part due to the amplifying effects of the HBR article, which employed Varian’s catchy quip but adopted Yau’s characterization of its referent. Outlets such as the New York Times and Forbes regularly published stories on the demand for data scientists, profiles of data scientists and data-driven companies, and opinion pieces on the field’s merits. In 2012, Forbes published a series of eight articles on “What is a Data Scientist,” which featured interviews with self-identified data scientists from IBM, Tableau, LinkedIn, Amazon, and other places, demonstrating the currency of the term in industry at that time. In addition to media coverage, consulting firms such as Booz Allen and Gartner produced documents targeted at C-suite executives that provided an overview of the field, including the definition of a data scientist (Herman et al. 2013; Laney 2012). Throughout these writings, the definitions provided were consistent with the newer meaning, roughly the combination of computer competency, data mining, statistical knowledge, communication and visualization skills, and business acumen. In addition, the terms big data and data science were highly correlated; sentences like “[d]ata scientists are the magicians of the Big Data era” were frequent (Miller 2013).
7.4 2011: Data Science as Big Data Science
An important feature of the definition of data science in this period is its co-occurrence and close semantic association with the often-capitalized phrase “big data.” The phrase was used to refer both to large amounts of data, retroactively identified with Laney’s concept of “3D data” (data with high “volume, velocity, and variety”; Laney 2001), and to the assemblage of technologies and methods associated with these data. The following definition from ZDNet is typical:
“Big Data” is a catch phrase that has been bubbling up from the high performance computing niche of the IT market. Increasingly suppliers of processing virtualization and storage virtualization software have begun to flog “Big Data” in their presentations. What, exactly, does this phrase mean?
. . .
In simplest terms, the phrase refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities. Does this mean terabytes, petabytes or even larger collections of data? The answer offered by these suppliers is “yes” (Kusnetzky 2010).
Indeed, “big data” and “data science” were often used interchangeably, the former as a metonym for the latter. For example, consider this usage from a New York Times piece:
The field known as “big data” offers a contemporary case study. The catchphrase stands for the modern abundance of digital data from many sources—the web, sensors, smartphones and corporate databases—that can be mined with clever software for discoveries and insights. Its promise is smarter, data-driven decision-making in every field. That is why data scientist is the economy’s hot new job [Lohr (2014); emphasis added].
Although in use since the 1990s, the term big data was launched into the public sphere (i.e. went viral) at nearly the same time as the terms data science and data scientist: around 2008, when the British weekly scientific journal Nature published a special issue entitled “Big Data: Science in the Petabyte Era” on the tenth anniversary of Google’s incorporation (“Community Cleverness Required” 2008). By this time, Google’s enormous success as a company founded on data mining had caught the world’s attention, including that of the scientific community, to the point where the company had become something of a paradigm for science. The issue was devoted to exploring how science ought to manage and exploit big data by following Google’s lead through various data processing methods, from data mining to visualization to library science. This connection between Google and science was also made at this time by WIRED’s Chris Anderson, in an issue also devoted to “the Petabyte Age,” who wrote:
Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn’t just more. More is different (Anderson 2008).
Anderson argued that Google’s successful application of model-free algorithms, as in its wildly successful ad auctioning system, showed that the scientific method was obsolete; or, more accurately, that science might “learn from Google” and by-pass the concern for theory building and focus on prediction. The parallel to Breiman’s characterization of the algorithmic modeling culture is clear here.
The rise of the term big data is indicative of an important shift in how the problem of data impedance was conceptualized. Since at least the 1960s, when the trope “data deluge” was invented or adopted by NASA, the problem of data surplus was always framed as a kind of disaster, as is evident from the image of a flood, and the semantically close and more popular phrase “information explosion,” implicitly likened to nuclear weapons by the frequent use of the image of the mushroom cloud to signify exponential growth. With the success of the data-driven corporation on the model of Google, these negative terms began to be replaced by the more positive, or at least neutral, expression big data.
This transition is evident in the simultaneous publication of the Nature and WIRED issues on the topic (cited above): the former introduced the new term while the latter used the old, and both are linked by the metonym of the “petabyte” era or age. Since then, the term big data has been used to signify the context and opportunity within which data science operates. For example, Davenport and Patil’s 2012 article defined a data scientist as “a high-ranking professional with the training and curiosity to make discoveries in the world of big data.”
This connection became a commonplace. In 2011, the US government chartered the Big Data Senior Steering Group, which resulted in the NSF’s “Big Data R&D Initiative” (Wactlar 2012). This initiative explicitly connected big data, defined in terms of a paradigm shift from hypothesis-driven to data-driven discovery, and “Big Data Science.” It focused on the application of big data to scientific discovery, environmental and biomedical research, education, and national security. The NSF also incentivized research universities to develop interdisciplinary graduate programs to prepare the next generation of data scientists and engineers. It is roughly at this point that many universities in the US began to offer master’s degrees in data science. In 2013, Communications of the ACM published “Data Science and Prediction,” which also directly linked big data to data science, while providing some flesh to the former:
… data science is different from statistics and other existing disciplines in several important ways. To start, the raw material, the “data” part of data science, is increasingly heterogeneous and unstructured text, images, video often emanating from networks with complex relationships between their entities. ... Analysis, including the combination of the two types of data, requires integration, interpretation, and sense making that is increasingly derived through tools from computer science, linguistics, econometrics, sociology, and other disciplines. The proliferation of markup languages and tags is designed to let computers interpret data automatically, making them active agents in the process of decision making. Unlike early markup languages (such as HTML) that emphasized the display of information for human consumption, most data generated by humans and computers today is for consumption by computers; that is, computers increasingly do background work for each other and make decisions automatically. This scalability in decision making has become possible because of big data that serves as the raw material for the creation of new knowledge; Watson, IBM’s “Jeopardy!” champion, is a prime illustration of an emerging machine intelligence fueled by data and state-of-the-art analytics (Dhar 2013: 64).
Here, big data is closely linked to data science, and both are tied to the kinds of data that have been associated with the field since the AFCRL, in addition to the kinds of textual data specific to the Internet and the Web. Note also the connection between this usage and the NSF’s earlier focus on data science in the context of scientific research.
According to the Google Books Ngram Viewer, using the English (2019) corpus with a smoothing factor of 3, the corpus frequency of the bigram “data scientist” (case insensitive) goes from 0.0000000613% in 2008 to 0.0000035018% in 2012 and 0.0000118327% in 2016. These are increases from 2008 by factors of around 57 and 193 for 2012 and 2016 respectively. By comparison, frequency actually decreases by about 30 percent from 2000 (0.0000000876%) to 2008, although the difference here is probably not significant statistically; the frequency is essentially flat. The trend is similar for “data science,” with increases of 11 and 56 times from 2008 to 2012 and 2016 respectively.
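The ratios in this note can be checked directly from the quoted frequencies. The short sketch below uses only the four percentages given above for “data scientist”; the names `FREQ` and `ratio` are introduced here for illustration.

```python
# Corpus frequencies (in percent) for "data scientist", as quoted in the note.
FREQ = {
    2000: 0.0000000876,
    2008: 0.0000000613,
    2012: 0.0000035018,
    2016: 0.0000118327,
}


def ratio(year, base_year=2008):
    """Return how many times larger the frequency in `year` is than in `base_year`."""
    return FREQ[year] / FREQ[base_year]


for year in (2012, 2016):
    print(f"{year} relative to 2008: about {ratio(year):.0f} times")

# A value below 1 means the frequency fell between 2000 and 2008.
print(f"2008 relative to 2000: about {ratio(2008, base_year=2000):.2f} times")
```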
The title of this article was adapted from a phrase used in 2008 by Hal Varian, chief economist at Google. In an interview with McKinsey’s James Manyika, Varian quipped: “I keep saying the sexy job in the next ten years will be statisticians” (McKinsey & Company 2009).
By “simple,” the authors meant few assumptions about the nature of the data, such as are required by data models that posit parametric distributions or causal relations among features. The usage is ironic, since one of the differences between the inferential models preferred by traditional statisticians and predictive models is that the former are chosen for their parsimony and interpretability, whereas the latter notoriously have too many terms to interpret.