Throwback Thursday: Big Data in the Early 20th Century

Last week, we talked about the 1888 invention of one of the first tools that could be used to process “big data,” the Hollerith Machine. A fascinating book published in 1935, Practical Applications of the Punched Card Method in Colleges and Universities, records some of the “big data” research that academics undertook using this new technology, including an effort by an anthropology professor at Harvard University to determine precise anatomical profiles for various classes of criminals. He and his research team recorded information about 125 biometric variables for 17,000 criminals, then used Hollerith machines to look for correlations in the data. I’ll let Prof. E. A. Hooton tell you more about this in his own words:

In the course of elaborating our criminal data, one process was performed by the Hollerith sorter which in its complexity is probably unique in anthropometric research. In our series of native white criminals of native parentage there is included a group of 414 robbers. These robbers display as a group a number of statistically significant excesses and deficiencies of certain categories of morphological features…. It was desired to ascertain how many individual robbers manifested each one of every mathematically possible combination of these nine morphological peculiarities. Since there are 512 possible combinations of the presence and absence of these characters, the sorting task involved was stupendous and consumed several weeks of the entire working time of the sorter…. The outcome of the research was a conclusive demonstration that, by taking a sufficient number of peculiarities of the robber group in combination and selecting all of the individuals who possessed that combination, it was possible to pick out a type which was 100 per cent robber. At the same time it was demonstrated that only one robber out of 414 showed this complete and exclusive type combination. It was therefore apparent that morphological type combinations were of no practical use in determining the offenses of criminals, so far as our particular data were concerned.(1)
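
For a sense of what the sorter was being asked to do, here is a minimal Python sketch of the same tally. It assumes each of the nine features is recorded as a simple present/absent flag, and it uses randomly generated stand-in records, not Hooton's data; the names `N_TRAITS`, `N_RECORDS`, `records`, and `tally` are illustrative only.

```python
from collections import Counter
from itertools import product
import random

N_TRAITS = 9     # nine morphological features, per the quoted passage
N_RECORDS = 414  # size of the robber group, per the quoted passage

# Every presence/absence pattern over nine features: 2**9 = 512 patterns.
all_patterns = list(product((0, 1), repeat=N_TRAITS))

# Synthetic stand-in records (assumption: NOT the original data).
rng = random.Random(0)
records = [
    tuple(rng.randint(0, 1) for _ in range(N_TRAITS))
    for _ in range(N_RECORDS)
]

# Count how many records show each pattern, including patterns nobody shows.
counts = Counter(records)
tally = {pattern: counts.get(pattern, 0) for pattern in all_patterns}

print(len(all_patterns))    # number of possible combinations
print(sum(tally.values()))  # every record lands in exactly one pattern
```

What took the sorter several weeks of card passes is a single pass over the records here, which is roughly the point Spengler makes below about mechanization, only more so.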

This project is notable as much for how ridiculous its premise sounds to us 80 years later as for the scale of its undertaking. Other chapters in the book, however, record efforts that would not be out of place today, such as attempts to code information about large numbers of hospital patients in order to learn more about the causes of mortality, or a survey of over 30,000 businesses in three states to gauge the impact of newly imposed sales taxes. I’ll let Edwin H. Spengler, author of the latter chapter, conclude with a statement that, with only minor changes, could easily appear in any modern work on “big data”:

Much as the compilation of certain statistical data may be desired, however, the expense and the time involved in sorting and tabulating the information, have frequently deterred individuals from going ahead with a given project…. To a large extent, the introduction of mechanical methods of counting, sorting and tabulating numerical facts has eliminated these difficulties. Electric machinery, capable of performing routine operations at the rate of several hundred per minute, has increased the speed and lowered the expense of preparing statistical tabulations. This has resulted in a broadening of the field of statistical research and analysis and has stimulated the projection of studies which, without the use of such equipment, would no doubt have been considered impossible or impractical of accomplishment. (2)

What would Spengler and his colleagues have thought about today’s supercomputers, which can perform more than 10^12 operations per second? And what will researchers 80 years from now view as quaint when looking back at our “big data” research?

