Creating a Public Research Dataset for Millions of Books from the HathiTrust Digital Library

J. Stephen Downie, University of Illinois at Urbana-Champaign

Usage Details

J. Stephen Downie, Boris Capitanu

The HathiTrust Digital Library contains nearly 14 million scanned books, comprising 4.8 billion pages. As one of the largest digitized collections of published books in the world, it allows for unprecedented insights into history, language, and culture over the past few centuries. This proposal seeks a computing allocation to create a dataset of pre-processed, extracted features from the HathiTrust corpus to support public research activities across many disciplines. These extracted "features" are quantifiable facts about the pages of the books, most usefully counts of single words (unigrams) or of word sequences (bigrams, trigrams, and n-grams generally). The proposal builds upon an impactful earlier dataset release that was made possible by a Blue Waters allocation; that earlier dataset included unigrams drawn from over 4.8 million books. For the proposed dataset, we seek an allocation to extract bigrams, trigrams, word parts-of-speech, language probabilities, and 'entities' (persons, names, locations, etc.) over a larger subset of the HathiTrust data.
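
To make the notion of extracted features concrete, the following is a minimal sketch of computing per-page unigram and bigram counts from plain OCR text. It is illustrative only: the abstract does not specify the actual extraction pipeline, and the tokenizer and function names here are hypothetical assumptions.

    from collections import Counter
    from typing import Dict, List

    def tokenize(page_text: str) -> List[str]:
        # Naive whitespace tokenizer; a real pipeline would also handle
        # punctuation, line-break hyphenation, and OCR noise.
        return page_text.lower().split()

    def page_features(page_text: str) -> Dict[str, Counter]:
        # Compute unigram and bigram counts for a single page.
        tokens = tokenize(page_text)
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {"unigrams": unigrams, "bigrams": bigrams}

    if __name__ == "__main__":
        sample = "the quick brown fox jumps over the lazy dog"
        feats = page_features(sample)
        print(feats["unigrams"].most_common(3))  # [('the', 2), ...]
        print(feats["bigrams"].most_common(2))

Because counts like these are computed once per page and released in aggregate, researchers can study word usage at scale without access to the in-copyright page text itself.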