9  Synopsis

Historically, there are two primary usages of “data science”: the classical and the statistical.

9.1 History

9.1.1 Classical Data Science

The classical usage of data science can be traced to the 1960s when it emerged in the US military-industrial sector. Among the several governmental and commercial organizations that adopted the term at the time, the Data Sciences Lab (DSL) of the Air Force Cambridge Research Laboratories (AFCRL) stands out (AFCRL 1963). Formed in 1963, the DSL established a paradigm that grew out of the need to dynamically process vast amounts of sensor data in real-time to support decision-making and modeling of complex phenomena. Central to this paradigm was the explicit use of artificial intelligence and computer visualization to extract patterns from “data deluge”—what today we would call “big data.” Originally focused on real-time command-and-control systems for missile defense, over time this usage evolved to include working with large, complex, and dynamic data sets in a variety of scientific fields, including physics, environmental science, and biology. This usage continues to this day in that area and in the data-driven sciences. It is represented by organizations like JISC, the NSF, and CODATA. Closely related to this tradition of data science is the growth of so-called fourth-paradigm science and the rise of database-driven methods of extracting value from data, specifically data mining and knowledge discovery in databases (KDD).

A secondary but hugely influential variant of the classical usage appeared around 2008 when the paradigm of classical data science jumped from the military-industrial and scientific sectors to the commercial sector of so-called surveillance capitalism (Zuboff 2019; Hammerbacher 2009; Yau 2009). By 2012 the usage was famously promoted by Harvard Business Review when it dubbed data scientist the “sexiest job of the 21st century” (Davenport and Patil 2012). In this sector, data science became wildly popular, achieving buzzword status and prompting a high demand for data scientists in industry as well as for authoritative explanations of the nature of the role. This in turn motivated the formation of numerous academic degree programs in the US. It also elicited a strong reaction from academic statisticians alarmed by the perceived disconnect between data science and their own field (Davidian 2013).

9.1.2 Statistical Data Science

The statistical usage of data science began in Japan in the early 1990s among academic statisticians concerned with the threats and opportunities associated with computational statistics, data mining, and other developments relating to the rise of available computing and surplus data. Chikio Hayashi appears to have coined this usage in 1992(Ohsumi 2002), after which it was used by his student Noboru Ohsumi in 1994 (Ohsumi 1994). By 1996 the International Federation of Classification Societies (IFCS) adopted it as part of a conference title (Hayashi 1998). This began a long engagement between the IFCS and data science that continues to this day in Europe and Japan. A subsequent but shorter-lived variant of this usage emerged in the US between 1996 and 2001 when a small group of academic statisticians unsuccessfully exhorted their field to brand itself as data science and embrace new developments in computational statistics and machine learning (Kettenring 1997; Wu 1997; Cleveland 2001). A third variant was developed by traditional statisticians who viewed the rise of second-wave classical data science as a threat to their field, prompting calls to “own” data science(Rodriguez 2012; Davidian 2013; Yu 2014; “ASA Statement on the Role of Statistics in Data Science” 2015; Donoho 2017).

9.2 Context

These origins shed light on the specific characteristics of each tradition and why they sometimes are at odds with each other over fundamental questions. Many of these characteristics may be traced to the social and technical contexts that motivated the usages associated with each tradition.

The classical usage was motivated by an historically unique assemblage of networked data-generating and computational machinery that characterized post-war military and scientific endeavors that has only grown since that time. The primary problem faced in this context was what we can call data impedance—the endemic disproportion between the production of data by signal-generating instruments—from satellites and particle accelerators to, nowadays, smart phones and credit cards—capturing a wide range of natural and behavioral phenomena, and the ability to process and interpret these data, often rapidly, by means of computational machinery. In this context, data science emerged as an eclectic form of expertise associated with the pipeline of activities that begins with the acquisition of data, often from non-experimental sources, to its reduction and modeling, to its visualization and interpretation. In the early days, classical data science was heavily invested in pushing the envelope of computational technology to handle the demands of data impedance, leading to advances in the design of chips, networks, and databases. Indeed, the Internet connects directly the first and second variants of classical data science. The skills and values associated with this pipeline have been remarkably consistent from the 1960s to the present time—for example the desire and ability to work with large and dynamic sets of unstructured data, the focus on computational methods in place of analytic ones, and the premium placed on the visualization and communication of results to drive decision-making.

By contrast, the statistical usage developed in the context of an established academic field reacting to external developments—the technological developments within which classical data science emerged. Each of the three variants of the statistical usage was motivated by the perceived threats and opportunities prompted by the rise of computational methods and the growing surplus of data being captured by databases and shared over networks. A recurrent theme in this usage is an attraction to machine learning, developed primarily by computer scientists, and a repulsion to data mining, perceived as an unprincipled collection of methods lacking in experimental design and mathematical grounding. In this context, data science emerged as a vision for new form of statistics that would encompass and enhance the good parts of new computational methods while throwing away the bad.