Text Analysis of Books from the HathiTrust Digital Library to Characterize Descriptiveness in Writing
J. Stephen Downie, University of Illinois at Urbana-Champaign
J. Stephen Downie, Boris Capitanu, Craig A Willis, Peter Organisciak
Our overall project develops approaches to quantifying the notion of descriptivity in text. The immediate objective of our inquiry is to explore descriptivity in forms of writing that have been characterized as exemplifying different writing styles, from the earliest times to the twentieth century. Digital text analysis offers an opportunity to operationalize the anecdotal notion of descriptivity by developing quantitative metrics for it. Our work will leverage the HathiTrust Digital Library corpus, which contains approximately 14 million scanned books comprising more than 4.8 billion pages. This proposal builds on previous work by the HathiTrust Research Center (HTRC). The requested allocation will be used to create an updated dataset of pre-processed features extracted from the HathiTrust corpus, together with exploratory methods, to support our research. These extracted "features" are quantifiable facts about the pages of the books, most usefully counts of words (unigrams) or of sequences of words (bigrams and trigrams). We will explore how the descriptivity of language changes with respect to a set of parameters (most importantly chronology, gender, and genre). We are seeking an allocation to extract bigrams, trigrams, word parts-of-speech, language probabilities, and "entities" (persons, names, locations, etc.) at the page level over the public domain subset of the HathiTrust Digital Library. The resulting dataset will also become a public resource supporting research activities across many disciplines, including one PI's ongoing NCSA Fellowship.
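To make the page-level features described above concrete, the following is a minimal sketch of counting unigrams, bigrams, and trigrams for each page. It is an illustration only, not the HTRC extraction pipeline: the function names (`tokenize`, `ngram_counts`, `page_features`), the simple regular-expression tokenizer, and the sample pages are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(page_text):
    """Lowercase the page and split it into word tokens.

    Assumption: a simple regex tokenizer; a real pipeline would likely
    use a language-aware tokenizer.
    """
    return re.findall(r"[a-z']+", page_text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens on the page."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def page_features(page_text):
    """Return unigram, bigram, and trigram counts for one page."""
    tokens = tokenize(page_text)
    return {
        "unigrams": ngram_counts(tokens, 1),
        "bigrams": ngram_counts(tokens, 2),
        "trigrams": ngram_counts(tokens, 3),
    }


# Hypothetical sample pages, one string per page of a book.
pages = [
    "The old house stood on the hill.",
    "The old house was empty.",
]

# One feature record per page, mirroring page-level extraction.
features = [page_features(page) for page in pages]

for page_number, record in enumerate(features, start=1):
    print(page_number, record["bigrams"].most_common(3))
```

Keeping one record per page, rather than one per book, preserves the page-level granularity the proposal asks for; counts for a whole volume can still be obtained by summing the per-page `Counter` objects.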