1  Introduction

The interests of data scientists—the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection—lie in having their creativity and intellectual contributions fully recognized.

National Science Board, 2005, “Long-Lived Digital Collections: Enabling Research and Education in the 21st Century” (Simberloff et al. 2005: 27).

Data and its management is still the essence of data science.

Jeffrey D. Ullman, 2020, “The Battle for Data Science” (Ullman 2020: 13).

Data science today is characterized by a paradox. The large number and rapid growth of job opportunities and academic programs associated with the field over the past decade suggest that it has matured into an established field with a recognizable body of knowledge. Yet consensus on the definition of data science remains low. Members and observers of the field possess widely variant understandings of data science, resulting in divergent expectations of the knowledge, skill sets, and abilities required by data scientists. Definitions, when they are not laundry lists, range from an expanded version of statistics, to data-driven science, to the science of data, to simply the application of machine learning to so-called big data to solve real-world problems. These differences cannot be reduced to so-called semantics; they reflect a range of deep-seated institutional commitments and values, as well as variant understandings about the nature of knowledge and science. The lack of shared understanding poses a significant problem for academic programs in data science: it inhibits the development of standards and a professional community, confounds the allocation of resources, and threatens to undermine the authority and long-term prospects of these programs.

This essay approaches the problem of defining the field of data science by describing how the collocation “data science,” and its grammatical variants “data sciences” and “data scientist,” have been used historically.1 The primary methods employed are close reading and precise seriation of textual evidence drawn from a representative collection of primary sources, including organizational reports, academic articles, news stories, advertisements, and other contemporary forms of evidence. These are used to infer the history of the term’s social and institutional contexts of use as well as its denotative and connotative meanings. Extensive extracts are often presented, rather than paraphrased, as these provide the reader with direct and illuminating evidence for the meanings in question.2

This historiography is presented as a series of periods delimited by milestone years in which the term takes on a new or variant meaning, from its initial usage in the 1960s through 2012 and the five or so years following, when the phrase becomes a commonplace. It is shown that the phrase has a surprisingly continuous and consistent usage over this period. As usage of the phrase evolved, its meanings were always additions to and inflections on prior meanings; in no case did newer usages contradict what preceded them, nor did they appear as cases of random independent invention.

The result is a picture of the transformation of a semantic complex that indexes a consistent set of technical, social, and cultural realities that constitute what may be called the situation of data science. This situation has been described repeatedly by data scientists of all stripes as a kind of data processing pipeline, a sequence of operations that begins with the consumption of data and ends with the production of data products, ranging from research results and visualizations to software services employed by various sectors of society.


  1. In this essay, a collocation is defined as a combination of two or more words that function as a lexical unit. In contrast to a mere n-gram, a collocation’s usage is historically given and cannot be inferred a priori by combining the definitions of its constituent words. Throughout this essay, the collocation “data science” and its variants are referred to as a term.

  2. The arguments and observations made by the authors in each case are presented in the historical tense rather than in the textual present that is customary in writing about the history of ideas. For example, instead of saying that “Tukey argues P” in an essay from the 1960s, the evidence is presented as “Tukey argued P.” This is done in order to ground the evidence in its social and historical setting.