5  The 1990s: Statistical Data Science

Interestingly, in 1977 a prefixed variant of the term appeared in the title of the technical report, “Non-Parametric Statistical Data Science: A Unified Approach Based on Density Estimation and Testing for ‘White Noise’” (Parzen 1977). However, Parzen later published a version of this work as “Nonparametric Statistical Data Modeling” (Parzen 1979), which indicates that his original word choice was mistaken. Yet the original choice may not have been entirely unmotivated: Parzen’s work attempted to unify parametric and non-parametric methods under one umbrella. Given the natural inclination of mathematical statistics for the former and data analysis for the latter, his choice of the term data science may have signaled an attempt to encompass both approaches to data. It is also worth noting that he later used to the term to introduce a “new culture of statistical science called LP MIXED DATA SCIENCE” (Parzen and Mukhopadhyay 2013), after the term became popular.1 Whether or not this unifying goal was his motivation, statisticians later became quite interested in the term for precisely this reason.

5.1 1994: The Tokyo School

5.1.1 Ohsumi and Meta-Stat

In the early 1990s, the term resurfaced in the context of statistics. It appeared in the title of a 1994 essay by the Japanese statistician Noburu Ohsumi on the application of hypermedia to the problem of organizing data, “New Data and New Tools: A Hypermedia Environment for Navigating Statistical Knowledge in Data Science,” an elaboration of an essay published two years earlier (Ohsumi 1992, 1994).2 In these essays, Ohsumi described the by now familiar litany of problems associated with data impedance, although this time the focus was on the production of data resulting from its analysis and storage, not its consumption in so-called raw form:

In research organizations handling statistical information, the volume of stored information resources, including research results, materials, and software, is increasing to the point that conventional separate databases and information management systems have become insufficient to deal with the amount. Increasing diversification in the media used these days interferes with the rapid retrieval and use of the information needed by users. A new system that realizes a presentation environment based on new concepts is needed to inform potential users of the value and effectiveness of using the vast amount of diverse data (Ohsumi 1992: 375).

For research facilities around the world, the products of classical data science—the database and data processing software—had become a sorcerer’s apprentice, creating new problems with each solution. Organizations were drowning in the data sets they produced or acquired, the software used to process them, the print and digital libraries of reports and articles resulting from their analyses, and a host of other materials. The requirements, approach, and design goals of Ohsumi’s proposed system, the Meta-Stat Navigator, are strikingly similar to those of a contemporary system designed to solve the information problems of another scientific organization: Berners-Lee’s World Wide Web, famously developed at CERN in 1989 (Berners-Lee and Fischetti 2008). Of course, the latter quickly obviated Ohsumi’s proposal and become synonymous with the Internet, invented decades earlier.

The significance of Meta-Stat for our purposes is that this kind of work was understood clearly as data science at this point in history. Data science continued to be connected with the processing and representation of data, and was distinct from data analysis, but with this important development: statisticians had become embedded in these technologies, and their work had changed significantly as a result. And, as a result of this change in working conditions, the connection between data analysis and data science became closer.

Here we may locate with some precision a crucial transformation in the meaning of the term, associated with its adoption by a new set of users. One clue to this change is the opportunity Ohsumi observed amid the challenges posed by data deluge:

… the information handled by the statistical sciences lies on the boundaries of various other sciences and clarifies the relationships and nature of information that joins these sciences. Development of a system that fully organizes and integrates strategic information is essential (Ohsumi 1992: 375).

The Meta-Stat “system” was designed to realize the opportunity opened up by the central position statisticians had come to occupy among the prolifically data-generating sciences and the computational environment in which these data were made available. Data science, in this view, is meta-statistics, an encompassing concern for understanding data, understood as a universal medium, and its relationship to knowledge.

This perspective was shared by Ohsumi’s senior compatriot and fellow statistician, Chikio Hayashi, whom Ohsumi described as “the pioneer and founder of data science” (Ohsumi 2004: 1).3

5.1.2 Hayashi’s Usage

5.1.2.1 1993

In 1993, at a round table discussion during the fourth conference of the International Federation of Classification Societies held in Paris (IFCS-93), Hayashi uttered the phrase “Data Science” and was then asked to explain it. At the next conference (IFCS-96), he presented an answer, in addition to having the conference named to emphasize the importance of the term—“Data Science, Classification, and Related Methods.” His definition is as follows:

Data science is not only a synthetic concept to unify [mathematical] statistics, data analysis and their related methods but also comprises their results. It includes three phases, design for data, collection of data, and analysis on data. Data science intends to analyze and understand actual phenomena with “data.” In other words, the aim of data science is to reveal the features of the hidden structure of complicated natural, human and social phenomena with data from a different point of view from the established or traditional theory and method. This point of view implies multidimensional, dynamic and flexible ways of thinking (Hayashi 1998a: 41).

Hayashi went on to describe the sequence of design, collection, and analysis as a primary and iterative “structure finding” process in which data are transformed from a state of “diversification,” given the inherent “multifariousness” of the phenomena they represent, to one of “conceptualization or simplification” (41). The discovery of structure is accomplished with what we would recognize today as the methods of exploratory data analysis and unsupervised learning. In effect, Hayashi’s definition abstracted the design goals of Ohsumi’s Meta-Stat system and presented them as “a new paradigm” of science, one that would encompass statistics, data analysis, and their vast output of data within in a unified, process-oriented framework—data science (40).

In addition to Hayashi’s own definition, it helpful also to see how the field was defined by the editors (who included Hayashi) of the proceedings of IFCS-96:

The volume covers a wide range of topics and perspectives in the growing field of data science, including theoretical and methodological advances in domains relating to data gathering, classification and clustering, exploratory and multivariate data analysis, and knowledge discovery and seeking.

It gives a broad view of the state of the art and is intended for those in the scientific community who either develop new data analysis methods or gather data and use search tools for analyzing and interpreting large and complex data sets. Presenting a wide field of applications, this book is of interest not only to data analysts, mathematicians, and statisticians but also to scientists from many areas and disciplines concerned with complex data: medicine, biology, space science, geoscience, environmental science, information science, image and pattern analysis, economics, statistics, social sciences, psychology, cognitive science, behavioral science, marketing and survey research, data mining, and knowledge organization [Hayashi (1998b): v; emphasis added].

Of interest here is use of “data science” as a big tent, an inclusive rubric under which to group a series of domains (which match roughly to a process) as well as a broad range of disciplines and levels, from tool builders to scientists and practice to theory. This passage is also significant for including within the scope of data science the methods of machine learning as well as data mining among the list of sciences concerned with “complex data,” suggesting the prominence of these approaches at that time. We will see that not all definitions proceeding from this community were as inclusive.

Hayashi assigned a revolutionary and almost messianic role to data science. In his vision, the statistical sciences had lost their way. Mathematical statisticians had come to overvalue abstract inference and precision, and by choosing to work with the artificial data required to pursue these goals were “prone to be removed from reality” (40). Data analysts, although working with real data, had “come to manipulate or handle only existing data without taking into consideration both the quality of data and the meaning of data … to make efforts only for the refinement of convenient and serviceable computer software and to imitate popular ideas of mathematical statistics without considering the essential meaning” (40). As a result of these divergent attitudes toward data, and the disregard of both for the scientist’s engagement with the primary, existential relationship between data and phenomena, the field had become stagnant and lacking in innovation. Data science emerged as a savior, unifying a divided people, showing their way out of the wilderness, and restoring prosperity and prestige to their community.

If Hayashi’s criticisms of data analysis sound familiar to those leveled today against data scientists, it is because the issues data science was meant to resolve are recurring and systemic. So too is the separation between data analysis and mathematical statistics, which was recognized by Box, and later Tukey, in the 1970s. In his response to Parzen—who, we noted, sought to overcome a methodological split between the two subfields—Tukey wrote:

I concur with the general sentiments expressed by George Box in his Presidential Address … that we have great need for the whole statistician in one body—for the analyst of data as well as for the probability model maker—and the inferential theorist/practitioner. One cannot, however, make a whole man by claiming that one can subsume one important class of mental activity under another class whose style and purposes are not only different but incompatible. To be “whole statisticians” as Box might put it, or to be “whole statistician-data analysts” as I might, means to be single persons who can take quite different views and adopt quite different styles as the needs change. As the title of my paper of yesterday put it, “we need both exploratory and confirmatory”! The twain can—and should—meet, but they need to remain a pair (or two distinct parts of a larger team) if they are to do what they should and can (Tukey 1979: 122).

Tukey implies a solution to the schism, later observed by Hayashi, in better organization, not in a utopian “new man” or in a synthetic science per se, recalling the division of labor proposed by Naur, but here focusing on different roles within that division. Implicit in this approach is the view that the problem with statistics was not epistemic but organizational.

Here it is helpful to recall a property of Kuhn’s concept of paradigm—an obvious lens through which to observe our topic—which is often overlooked by those who use the term: it refers no to an abstract body of ideas that succeed on the basis of their intrinsic rationality or truth value, but to the successful practical application of ideas by means of novel methods and tools in a way that they may be imitated. The concept has both epistemic and social dimensions. Viewed in this light, the question of whether data science is in fact a science—our main question—becomes a matter of determining whether it solves important problems in new ways, by means of an assemblage of ideas, methods, and tools that may be grasped and imitated by others. Hence, although Tukey and Hayashi may appear to be divergent in their approaches to overcoming the problems, they represent the two aspects of a scientific paradigm, the one conceptual, the other practical. This should not be viewed as contradictory.

5.1.3 A Canary the has Forgotten to Sing

Following the IFCS meetings, as well as two meetings of the Japan Statistical Society that held “special sessions on data science,’’ Ohsumi developed Hayashi’s definition as well as its rationale (Ohsumi 2000: 331). We call this the Tokyo school of data science, given the association of both Hayashi and Ohsumi with the Institute of Statistical Mathematics in Tokyo, Japan. In a paper that explicitly addressed the relationship between data analysis and data science, and which is perhaps the first of several to claim the flag of data science for statistics, Ohsumi declared that because of its privileging of”mathematical methodologies” over an engagement with data acquisition, data analysis had become “a canary that has forgotten to sing,” referring to a Japanese children’s song that contemplates a silent bird’s fate (332).4 Amplifying Hayashi, he asserted that “[h]ow data are gathered is the key to defining the relevant information and making it easy to understand and analyze” (331). In making this point, Ohsumi referred to a new figure on the scene, one that contradicted the principles he proposed:

In my opinion, this viewpoint on the meaning of data science is fundamentally different from data mining (DM) and knowledge discovery (KD). These concepts are not of practical use because they neglect the problems of “data acquisition” and its practice (332).

It is significant that Ohsumi excludes these new fields—or field, since the two so frequently co-occur, along with the variant KDD, “knowledge discovery in databases”—from his definition of data science, since many today would consider the two synonymous.

5.2 Data Mining and Knowledge Discovery

The paradox is instructive: the name “data mining,” as used here, made its appearance in the late 1980s and early 1990s as a rubric that included a set of practices motivated by precisely the same conditions that led the Tokyo school to propose the field of data science in the first place. Among these conditions was the relatively sudden appearance of vast amounts of data stored in databases—one of the fruits of classical data science—owing to the rise of relational databases and personal computing in the 1980s, and a suite of tools to work with data, from spreadsheets to programming languages to statistical software packages. Whereas many statisticians viewed these developments with alarm, being acutely aware of the epistemic disruptions they produced for the received workflow of data analysis, the data mining community embraced them as an opportunity to convert data into value. Coming mainly from the field of computer science, data miners developed a set of methods that included the application of machine learning algorithms to the data found in databases in various contexts, from science to industry (such as point-of-sale records generated as by-product of computerized cash registers and credit card use). The relationship between machine learning and data mining was also mutually beneficial—data miners supplied machine learning projects with the large sets of data required for this class of algorithms to perform well. This relationship was greatly reinforced with the rise and development of the Web and social media platforms, which generated enormous amounts of behavioral data.

Although the two fields—for simplicity, let’s call them data analysis and data mining—were responding to the same conditions of data surplus and impedance, their philosophical orientations could not have been more opposed. This difference is clearest in their respective evaluations of data provenance, the source and conditions under which data are produced. For the data miner, data provenance is largely irrelevant to the possibility of converting data into value. Data are data, regardless of how they are generated, and the same methods may be applied to them regardless of source, so long as their structure is understood. Indeed, for the data miner data exists much as natural resources do, as a given part of the environment, which helps explain the success of the metaphor of mining over competing variants, such as harvesting, which implies intentional creation. For the data analyst, as Hayashi and Ohsumi took such pains to emphasize, provenance is, or should be, everything, echoing the statistician’s orthodox preference for experimental over observational data.

This difference was expressed by Ohsumi in his definition of data science in opposition to data mining:

Owing to the qualitative and quantitative changes in data [produced by the conditions described above], it is, indeed, becoming increasingly difficulty to grasp all aspects of a dataset in explaining various phenomena. Therefore, new techniques, such as DM, KD, complexity, and neural networks, are being proposed. However, the potential of these methods to solve any of these problems is questionable (332).

Ohsumi went on to characterize the way data has changed by listing the new kinds of data with which the statistician is confronted. These include prominently data sets found in databases as by-products of various processes, such as passive accumulation (e.g. from point-of-sale devices), unstructured data (included in text fields), and aggregated data generated “spontaneously and accumulating automatically in the electronic data collection environment” (332-333). He explained his concern with data mining:

When it comes to analyzing these datasets, people discuss DM and related techniques. However, the important questions to answer are: what dataset is necessary to explicate a certain phenomenon, why is it necessary, how to design its acquisition, and how difficult the whole process is. This is more important than the dataset itself. Books on DM do contain terms such as “data preparation”, “getting the data”, “sampling procedures”, and “data auditing”, but there is an assumption that the dataset is given and the procedure may start with analysis. Fiddling with a dataset once it is collected is merely a self-contained play of data handling [Ohsumi (2000): 333; emphasis added].

Although his evaluation of data mining seems to be woefully off base—a great deal of Google’s success, to take one example, was founded on their embrace of data mining at the time of Ohsumi’s essay—in fact his concern is not with the success of predictive analytics per se, but with solving what he considered to be the central problem of data science, that of understanding how data are generated in the first place. Given some of the issues that classifiers have encountered with respect to racial bias, for example, he cannot be said to have been wrong.

5.3 A Note on the IFCS

5.3.1 Highlights

  • Founded in 1985.
    • Grew out of the German Classification Society, formed in 1977 — see (Bock 2001).
  • In 1996 “first scientific society to use the now trendy term Data Science as a title of a conference” IFCS Newsletter 52, October 2015, p.2
    • Adopted the term “data science” from Hayashi.
  • In 2015 “GfKl decided to add Data Science Society to the name of the Society” IFCS Newsletter 52: October 2015 p. 7
  • Usage persists into current era (Rizzi, Vichi, and Bock 2013; Windham 2001; Baier and Wernecke 2006; Schader, Gaul, and Vichi 2012)
    • Programs at the University of GÖTTINGEN, host of German Classification Society Meetings.
      • Summer School “Data Science”https://www.uni-goettingen.de/en/data+science+2019/611949.html
      • Mathematical Data Science (B.Sc.)https://www.uni-goettingen.de/en/582230.html — Fit for the digital era: In the degree programme “Mathematical Data Science”, students learn modern mathematical and statistical methods of data analysis and recognition of structures, and acquire tools from computer science to transform these methods into algorithms. Graduates will be able to deal scientifically with large quantities of data and thus gain access to attractive career opportunities.
      • Applied Data Science (B.Sc.)https://www.uni-goettingen.de/en/640719.html —The field of data science includes aspects of mathematics, computer science and statistics. Data science deals with data analysis and the knowledge gained from data, as well as the techniques required for processing large and partially unstructured data volume. In the Bachelor’s degree programme “Applied Data Science”, detailed knowledge of data analysis is taught on the basis of computer science and mathematics. In an application subject, students are also introduced to the practical application of the data analysis methods they have learned. The data analysis courses include aspects of machine learning, statistics, pattern recognition and the infrastructures required for effective analyses. Students may choose economics, biology, digital humanities, medical computer science, breeding informatics or physical modeling and data analysis as application subjects. … Anyone choosing Applied Data Science as their study subject should be interested in formal mathematics as well as application-related practical work. The ability to work in a team is a vital prerequisite for daily professional work later on. English language skills are required. Special subject-related knowledge, especially in programming, is not required.
  • Conferences refer to “statistics under one umbrella,” echoing the theme of unification seen in Parzen 1977, the Tokyo School, and the US.

5.3.2 Narrative

In the January 1999 issue of the IFCS Newsletter, the Japanese Classification Society (JCS) reports that “a special seminar about ‘Data mining in Data Science’ will be held at March 1999” at the JCS Annual Meeting (DeBoeck 1999: 2).

Hayashi, then president of the IFCS, wrote in the Sept 1999 Newsletter column “From the President”:

The potential energy of the IFCS is increasing over the years, and it keeps increasing since the last conference. We can say with firm confidence that the results of data science (including statistics, data analysis, classification and related methods in different respects) really contribute to the development of science and more in general to our civilization and global environment.

Further, I expect that new approaches, and new methodologies and methods in data science will be developed for the coming 21st century, and that the presentations of IFCS meetings will reflect these new ideas. I believe that signs of new movements will be found in the Namur conference in 2000.

IFCS Newsletter Number 23 June 2002. News from the JCS 2002:

(4) Statistics, Data Analysis, Classification and Data Science Chikio Hayashi (The Institute of Statistical Mathematics)

Data design, data collection and data quality evaluation are crucial to data analysis if we are to draw out useful relevant information. Analysis of low-information data never bears fruit; however, data analytic methods can be refined. In spite of the importance of this issue in actual data mining and data analysis, I am forced to ask why these problems cannot be discussed at its most essential level. Perhaps it is a matter of the laborious practical work involved or the otherwise plodding pace of research. Indeed, these problems are rarely addressed because in academic circles it is regarded as unsophisticated. In the present talk, I dare to touch these problems with the fundamental concept of data science.

2002, the GfKl

The German Classification Society (GfKl) will hold its 26th Annual Conference at the University of Mannheim, July 22–24, 2002 under the title “Between Data Science an Everyday Web Practice”. This meeting will take place immediately after the 8th IFCS-Conference at Krakow.

In 2004, the president wrote:

… the society should foremost be a forum on what has rightly been dubbed “Data Science”, dealing with all aspects of collecting and analyzing data. This not only covers classical mainstream statistics, but also statistically informal approaches such as Cluster Analysis, Multidimensional Scaling and Data Mining. The IFCS is an outstanding platform for dealing with data analysis techniques that are traditionally ignored in mainstream statistics. With its broad view on Data Science, all approaches to data analysis are welcome, and hence new approaches and classic approaches can be confronted but also synthesized, in an effort to create the best of both worlds. The IFCS could and should play a leading role in these new developments in Data Science. An important aspect here is that IFCS should foster the development of techniques with a keen mind on the applications for which the techniques are mentioned. In particular, techniques must be made usable for the researcher who collects the data (not only for the statistician or classification expert). Therefore, techniques should be user-friendly and transparent, ensuring that the researcher who analyzes the data keeps a clear idea on what the analysis does and what the results display, and the interpretation of results should be crystal clear. This implies, last but by all means not least, that IFCS should pay attention to how to teach our techniques to students and other prospective users. Thus, IFCS should promote an integrative view on developments in research and teaching of Data Science. One step in this direction is the round table session on this topic, organized by Helena Bacelar at IFCS2004. Other steps will be the organization of further IFCS sponsored courses on classification and data analysis.

Note the definition embraced “data mining” and the emphases on practices that go beyond what is perceived to be the more narrow purview of “mainstream statistics.”

5.3.2.1 Conferences / Newsletters

  • IFCS-2006: Data Science and Classification. July 25-29, 2006. University of Ljubljana, Faculty of Social Sciences, Ljubljana, Slovenia. 10th meeting.
  • 32nd GfKl Annual Conference, “Advances in Data Analysis, Data Handling, and Business Intelligence” at the Helmut-Schmidt-University of Hamburg (Germany) from July 16 – 18, 2008. Section on Data Science and Innovative Tools.
  • The 2013 Joint Conference of the German Classification Society (GfKl) and the French Classification Society (SFC) will take place in July 10-13, in Luxembourg. The title of the conference will be “European Conference on Data Analysis”. Section on Data Science, including Data Pre-Processing, Text and Web Mining, Information Extraction and Retrieval, Personalization and Intelligent Agents.
  • IFCS Newsletter 29
    • Issue 29 (2006) describes on a conference in Germany … “30th GfKl Annual Conference The German Classification Society GfKl (Gesellschaft für Klassifikation) will hold its 30th Annual Conference at the Free University of Berlin (Germany), March 8 – 10, 2006. The conference is titled ‘Advances in Data Analysis’ and focuses on data analysis, learning of latent structures in datasets, and unscrambling of knowledge.”

    • Session on Data Science and Innovative Tools:

      Data Science and Innovative Tools:
      Data Cleaning and Pre-Processing; Data, Text and Web Mining; Information Extraction and Retrieval; Personalization and Intelligent Agents; Tools for Intelligent Data Analysis; New Challenges in Data Science.
      Applications: Subject Indexing and Library Science; Marketing and Management Science; e-commerce, Recommender Systems and Business Intelligence; Banking and Finance; Production, Controlling and OR; Biostatistics and Bioinformatics; Genome and DNA Analysis; Medical and Health Sciences; Archaeology and Geography; Engineering and Environment; Administrative Record Census; Linguistics and Statistical Musicology; Image and Signal Processing.

    • The earmarks: broad definition to include the gamut of processing and an ambivalent attitude toward data mining

  • IFCS Newsletter 49
    • Paolo Giudici is (full) Professor of Statistics in the Department of Economics and Management at the University of Pavia in Italy. He currently teaches courses in statistics (undergraduate level), financial risk management (Master’s level), and data science (PhD level). .
    • The next CLADAG meeting will be held at the Flamingo Hotel in Santa Margherita di Pula, Cagliari, on October 8-10, 2015. The meeting will take place under the auspices of the IFCS and of the Italian Statistical Society (SIS). The chosen location will facilitate exchange of ideas and networking, as well as an easier access for young researchers. Includes the theme Data Science under the category Multivariate Data Analysis.
    • News from GfKl. Brief report on 50th Anniversary Celebration Meeting Classification Society, Wednesday, July 9, 2014. FirstSite, Colchester, hosted by British Classification Society (BCS) and Department of Mathematical Sciences, University of Essex in cooperation with: Big Data and Science Week hosted by Faculty of Science & Health, University of Essex; Gesellschaft für Klassifikation (GfKl); and European Conference on Data Analysis (ECDA2014). Scientific programme included:
      • Fionn Murtagh, De Montfort University, Leicester, UK: “Data science in psychoanalysis: a short review of Matte Blanco’s bi-logic, based on metric space and ultrametric or hierarchical topology.”
      • David Wishart, Hans-Hermann Bock, David Hand, Maurizio Vichi, Peter Flach, Sabine Krolak-Schwerdt, and Berthold Lausen: “50th Anniversary resume, outlook and discussion: Classification society and data science.”
  • IFCS Newsletter 52: October 2015
    • “The Federation, in Kobe in 1996 and, again, Rome in 1998, was the first scientific society to use the now trendy term Data Science as a title of a conference.” (2)
    • “The European Conference on Data Analysis 2015 (ECDA2015), co-hosted by the British Classification Society (BCS), Gesellschaft für Klassifikation e.V. (GfKl) and Sekcja Klasyfikacji i Analizy Danych (SKAD) and held at the University of Essex, Colchester, UK from 1st to 4th September 2015 (conference webpage www.ecda2015.eu).
      • “Data Science Society GfKl”, “Gesellschaft für Klassifikation (GfKl) Data Science Society”
      • “Sponsors of the ECDA2015 were the data science company Profusion Ltd, London and the Institute for Analytics and Data Science, University of Essex.” (7)
      • “At the general annual meeting the members of GfKl decided to add Data Science Society to the name of the Society.”
    • “GfKl Data Science Society will celebrate its 40th Annual Meeting at the Fourth Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik ‘Statistics under one Umbrella’ 14 - 18 March 2016 at the Georg-August University Göttingen (DAGStat2016) in Göttingen, DAGStat 2016 includes the 40th Annual Meeting of GfKl, 62. Kolloquium of the German Region of the International Biometric Society and the Pfingsttagung of the German Statistical Society.” (7)
    • “In spring 2017 we will celebrate 40 years Gesellschaft für Klassifikation Data Science Society (founded February 1977 in Frankfurt am Main).
    • “GfKl Data Science Society 40th Annual Meeting, March 14-18, at the Fourth Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik ‘Statistics under one Umbrella’, Georg-August University Göttingen (DAGStat2016). http://www.unigoettingen.de/de/485701.html”
  • IFCS Newsletter 53: March 2016
    • Foundation of the “European Association for Data Science (EuADS)”
  • IFCS-2022: Classification and Data Science in the Digital Age

5.3.3 The German Classification Society

  • Founded in Frankfurt on 12 February 1977 by “by scientists and practitioners stemming from various fields of knowledge”

This event, even if realized on a national basis, has to be seen in the context of a quite general and worldwide development: In fact, during the sixties of the 20th century there was a great interest and intensive research in problems of classifi cation, data analysis, and systems for ordering knowledge. Not only in statistics, but also in many other domains of science and practice, e.g., 

  • in library science with topics such as subject indexing and cataloguing, indexing languages, and broad systems of ordering; 
  • in documentation with the development of data bases, thesauri construction, and information retrieval systems; 
  • in linguistics and terminology with language analysis, concept formation, automatic translation, dialectometry; 
  • in biology with problems in taxonomy, systematics, phylogeny, and bacteria classi cation; 
  • in technology with pattern recognition systems, network optimization, and cybernetics; 
  • in industry with product and patent classification, systematics of commodities, and the segmentation of transport networks; and finally 
  • in economy and marketing when searching for market segments and consumer groups, product positioning strategies, and corresponding data evaluation techniques.

The work of classification is broadened to include ordering of knowledge … [DESIGN]

5.4 Tanaka and the JSCS

See Tanaka (2018)

  • JJSCS –> JJSD in 2018
    • Official journal of JFSSA
  • JSCS est 1986

Tanaka describes the role of editor of the JJSCS between 1986 and 1990:

The major purpose of establishing JSCS was to make a progress in computational statistics in Japan by providing a forum for people working in various aspects of computational statistics, e.g., statisticians engaged in the research and application of statistical theory and methods, computer scientists/engineers engaged in the development of statistical software, and statisticians/data analysts engaged in the analysis of data obtained with surveys, experiments or other means, and provide them opportunities for exchanging information and ideas and, if possible, for finding seeds for joint works among them (75; emphases added).

Tanaka points out that the term is not new:

By the way, the opportunities to meet the term “Data Science” increased rapidly in recent years, but, it was not a new term for statisticians. It was used at the 2nd Japanese-French Scientific Seminar (organized by Yves Escoufier, Chikio Hayashi, and others) held in Montpellier. The title of the Seminar and the Proceedings published later by Academic Press in 1995 was “Data Science and Its Applications”. This term was used for “new science with data at its core”. The meaning is essentially the same to that currently used, though types of data to be analyzed have changed in a quarter century (76).

5.5 1997: The US

In the late 1990s and early 2000s, a series of papers and presentations by academic statisticians in the US echoed the theme of malaise expressed by the Tokyo school: the field of statistics was suffering from an image problem and needed to redefine itself in order to meet the challenges of a variety of existential threats. These threats included the rise of computational methods and large amounts of data, the emergence of non-traditional predictive methods and areas of research, such as data mining, that were taking the limelight from statistics, and an unflattering public image. Interestingly, given the lack of reference to the Tokyo school in these sources, many of the proposed responses included expanding the scope of statistics to encompass these new methods and to rebrand the field as “data science.” Frequently associated with this suggestion was a concern for updating university curricula for teaching statistics and, apparently for the first time, the creation of a new kind of statistician, the “data scientist”—recalling the mythical figure of the “whole statistician” discussed by Tukey in the 1970s. In these exhortations to the community of statisticians, the data scientist emerged as the “new man” of a reborn statistical science, one that who would overcome the field’s crisis of recognition.

The first of these exhortations appears to have come from Jon R. Kettenring in his Presidential Address to the American Statistical Association on 12 August 1997, where he expressed concern for image of statistics:

Looking ahead, image reconstruction must be one of our top priorities. It must be understood that statistics is the data science of the 21st century—essential for the proper running of government, central to decision making in industry, and a core component of modern curricula at all levels of education. I would like to see ASA make image reconstruction a part of its strategic plan. And I suspect we may need some professional help if we are to succeed (Kettenring 1997: 1230, emphasis added).

In this talk, Kettenring clearly expressed the ambivalence of the statistics community toward the rise of computational methods that we saw in the Tokyo school. On one hand, computer science was welcomed and it’s contributions were viewed as opportunities:

I believe the time has come for us to acknowledge that computer science needs to take its place alongside mathematics (and probability) as fundamental linch pins of statistics and as disciplines that undergird our research and instruction. Examples of relevant topics in computer (and computing) science include databases and database management, algorithm design, computational statistics, artificial intelligence and machine learning (Kettenring 1997: 1232).

On the other hand, data mining—the most notable development coming from the computational field—was excluded from this vision. Not only was it absent from the list of relevant topics above, it was singled out for criticism, echoing the stance of the Tokyo school. In highlighting software development and data mining as “two opportunities that sit at the intersection of computer science and statistics,” Kettenring wrote:

The second area is the current hot topic of data mining, which statisticians might reasonably think of as data analysis of very large databases. In fact, if you dig down and look at what is involved in data mining, you will find a variety of statistical components, such as statistical graphics and cluster analysis. Moreover, there is a great opportunity to bring a variety of statistical concepts to bear—modeling, sampling, robust estimation, outlier detection, dimensionality reduction, etc. Nevertheless, there are new opportunities as well, and we would be wise to pay very close attention and to become seriously involved with these developments.

In fact, if we don’t, there is risk that we will be blown away by the momentum that is flowing in this direction. As Jerry Friedman pointed out in his keynote address at Interface ’97, we are no longer the only game in town. Many other data oriented sciences are competing with us for customers and students.

Also it should not go unnoticed-speaking of image reconstruction!-that the very term, data mining, has captured the fancy of many people, especially in the business community. It is grabbing headlines that statisticians would kill for. The image of data mining is that something powerful is going on there. The reality? Well that may be rather different. It may even turn out that the phrase of the day in the 21st century is

lies, damned lies, and data mining.

But for now I believe we should take advantage of the momentum before it fades into another missed opportunity for statistics (Kettenring 1997: 1232).

As an aside, it is worth noting that Friedman in the keynote referenced above, expressed the same ambivalence toward data mining. It is at once lauded as “having a major impact in business, industry, and science” affording “enormous research opportunities for new methodological developments” and derided as a “device to sell computer hardware and software” (Friedman 1997).

Kettenring’s message was echoed three months later by the academic statistician C. F. Jeff Wu in a lecture delivered at the University of Michigan entitled “Statistics = Data Science?” (D. Wu 1997). Note that although Wu was not the first to suggest that statistics be renamed to, and redefined as, data science, he is often regarded as the first in the US to do so (Donoho 2017). Regarding image, Wu pointed out that statisticians were perceived as either accountants or involved with simple descriptive statistics—and prone to lying with these statistics, as the saying goes—when in fact their work comprised everything from data collection to modeling and analysis to solving problems and making decisions. As a remedy, he implored his colleagues to “think big” and embrace the changes and challenges that we have seen already—the rise of large and complex data sets, the use of neural networks and data mining methods, and the emergence of new fields such as computer vision. In addition to suggesting a name change for the field, he appears to have been the first to suggest a name change of the role of statistician to data scientist, along with a college curriculum to that would embrace all phases of data science and be profoundly interdisciplinary.

Wu’s proposal is stronger than Kettenring’s. Kettenring, whose language suggests that he was aware of the prior existence of data science as a field, believed that statistics should embrace and encompass data science as the rightful heir to its fruits and labors. In contrast, Wu asserted that statistics should expand its scope and rename itself “data science,” because it “is likely the remaining good name reserved for us” (C. F. J. Wu 1997: slide 12). In both cases, however, the usage of the term data science included much of computer science, which is consistent with the classical definition that developed in the 1960s and ’70s.

Practical plans for revamped curricula to train this new kind of statistician—the data scientist—appear after these appeals. The most well-known is found in Cleveland’s essay, “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics” (Cleveland 2001), even though he used the term “data analyst” as his target student. Consistent with previous definitions of data science as statistics augmented by computational methods and tools, he proposed a curriculum comprising six areas, along with percentages denoting the amount of time and resources that should be devoted to each: Multidisciplinary Investigations (25%), Models and Method for Data (20%), Computing with Data (15%), Pedagogy (15%), Tool Evaluation (5%), and Theory (20%). This distribution is more or less consistent with the definitions we have seen proposed by other statisticians, including the Tokyo school. As a measure of how radical this suggestion was, consider Donoho’s remarks, made over a decade and a half later:

Several academic statistics departments that I know well could, at the time of Cleveland’s publication, fit 100% of their activity into the 20% Cleveland allowed for theory. Cleveland’s article was republished in 2014. I cannot think of an academic department that devotes today 15% of its effort on pedagogy, or 15% on computing with data. I can think of several academic statistics departments that continue to fit essentially all their activity into the last category, theory (Donoho 2017: 750).

Radical as it may have appeared, the curriculum was fairly conservative. The area of “models and methods,” for example, focused exclusively on statistical data models, as opposed to algorithmic ones (to use Breiman’s terms, discussed below), while the “computing with data” area focused narrowly on the infrastructure to support these models. Cleveland’s bias toward traditional statistical modeling is made clear in his explanation of it:

The data analyst faces two critical tasks that employ statistical models and methods: (1) specification—the building of a model for the data; (2) estimation and distribution—formal, mathematical-probabilistic inferences, conditional on the model, in which quantities of a model are estimated, and uncertainty is characterized by probability distributions (Cleveland 2001: 22).

That the one is subordinate to the other is evident in Cleveland’s remark: “A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use” (23). He did make a passing reference to algorithmic models, but his example came from their use to support stochastic modeling:

Historically, the field of data science has concerned itself only with one corner of this large domain [i.e. computing with data]—computational algorithms. Here, even though effort has been small compared with that for other areas, the impact has been large. One example is Bayesian methods, where breakthroughs in computational methods took a promising intellectual current and turned it into a highly practical, widely used general approach to statistical inference (23).

If it is not made clear from this passage that he did not have the wider field of data mining and knowledge discovery in mind, the following passage does:

Computer scientists, waking up to the value of the information stored, processed, and transmitted by today’s computing environments, have attempted to fill the void. Once current of work is data mining. But the benefit to the data analyst has been limited, because knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited (23; emphasis added).

So, Cleveland continued the practice of the Tokyo school to put at arms length the contributions of data miners, for their lack of training in traditional statistics or their lack of attention to data design, and to confine the role of computational expertise to knowledge of “environments.”

Regarding the argument being made here, that the term “data science” has been in continuous circulation since the 1960s, and not independently coined in various contexts, Cleveland’s remarks, similar to those of Kettenring, indicate that the term was already known to statisticians, even as they sought to appropriate it for their own purposes. The sentence—“Historically, the field of data science has concerned itself only with one corner of this large domain …”—indicates awareness of prior usage, as well as a definition that aligns with what we have called classical data science. In any case, as we have seen, CODATA, representing classical data science, began publishing the Data Science Journal in 2001, the time of Cleveland’s proposal.

The task of educating data scientists was also addressed at this time at a workshop sponsored by the American Statistical Association in 2000 on the topic of undergraduate education in statistics. In a report of the proceedings published in the American Statistician, the following understanding of data science was expressed:

… what is needed is a broader conception of statistics, a conception that includes data management and computer skills that assist in managing, exploring, and describing data. The terms “data scientist” or “data specialist” were suggested as perhaps more accurate descriptions of what should be desired in an undergraduate statistics degree. Data specialists would be concerned with the “front end” of a data analysis project: designing and managing data collection, designing and managing databases, manipulating and transforming data, performing exploratory and “basic” analysis (Higgins 1999). Data scientists (or specialists) a might share some course work with computer science majors, but where a computer scientist studies compilers and assembly language, a data scientist studies data analysis and statistics [Bryce et al. (2001): 12; citation in original].

The reference to Higgins is from a paper where he asserted the need for statisticians to pay more attention to “the non-mathematical part of statistics,” so that undergraduate programs may respond to the “explosion in the amount of data available to society” (Higgins 1999: 1). In this area he included “designing scientific studies in a team-oriented environment, ensuring protocol compliance, ensuring data quality, managing the storage/transmission/retrieval of data, and providing descriptive and graphical analyses of data” (1). Here, again, we see a definition of data science as an improved version of statistics, in response to the persistent condition of data impedance, that would include data management and computing skills (and data design)—but not the methods of data mining. Consistent with the implied structural relationship between statistics and data science, data science was sometimes described as a part of statistics. Tellingly, both Bryce, et al., and Higgins equivocated on the use of data scientist, and suggested “data specialist” as an alternate name, perhaps so as not to overshadow the role of the data analyst. In any case, the choice of term was motivated by marketing; as Higgins wrote:

Guttman expressed the opinion that the term statistics carries such a negative connotation that it might be wise to rename our departments something like “Department of Data Science” or “Department of Information and Data Science.” In this vein, I have suggested the term “data specialist” (Higgins 1999).5


  1. Parzen’s use of the term “culture” here echoes his comments on Breiman’s famous essay on two cultures of statistical models, where he suggested that there are in fact several cultures, including his own, to which he devoted the majority of his response (Breiman 2001: 224–226).↩︎

  2. According to Ohsumi, “the term ‘data science’ appeared for the first time” in 1992, at a research exchange meeting between French and Japanese data analysts (so-called) at Montpelier University II in France (Ohsumi 2000: 331). He also claims to have “argued the urgency of the need to grasp the concept ‘data science’” in 1992 (329).↩︎

  3. In his eulogy for Professor Hayashi, Ohsumi explained in some detail the motivation for his mentor’s first usage of data science:

    In the last ten years of his life, Professor Hayashi asserted the importance of “data science as a theory of scientific methods.” The starting point was the meaning of the words “data science;” this arose in 1992, when the professor was discussing the titles and introductions of a collection of papers to be published at the 2nd Japanese-French Scientific Seminar. The professor used a Japanese phrase that translates as “data science” and for the rest of his life explored the concept behind this phrase. When the papers at the seminar were published, it was proposed for the first time that the term referring to quantification be standardized as “quantification method” and the subsequent English terms were standardized as such. I can recall the professor reprimanding us with the words “you are talking about different concepts” when we referred to “deta kagaku” while another group called it “deta no kagaku” (data science).

    The basis of data science is an extremely straightforward concept; the fact that it needed to be expressed is an indication of the chaos that exists today in statistical data analysis and the deteriorating research environment. I am certain that the professor had intended to lead the way in achieving a breakthrough on these problems [Ohsumi (2002): 5; emphasis added].

    5.1.1.1

    ↩︎
  4. The specific reference is to a poem, later set to music, written by the Japanese poet Saijoo Yaso (西条八十), who lived from 1892 to 1970. According to Miriam Davis, “The moral of the song is that if the canary loses its song it is not worth its existence so it should make the most of the gift of song it has been given.” (Davis, n.d.)↩︎

  5. Higgins here referred to a quote from Guttman found in a paper by Kettenring (Kettenring 1997a).↩︎