◎ JADH2016

Sep 12-14, 2016 The University of Tokyo

Trends in Centuries of Words: Progress on the HathiTrust+Bookworm Project
Peter Organisciak, J. Stephen Downie (University of Illinois at Urbana-Champaign)

The HathiTrust+Bookworm (HT+BW) project provides quantitative access to the millions of works in the HathiTrust Digital Library. Through a tool called Bookworm, digital humanities scholars can use out-of-the-box exploratory visualization tools to compare trends in all or parts of the collection, or use the API directly to pose more advanced queries. In this poster, we present the progress of the HT+BW project and discuss both its potential value to digital humanities scholars and its current limitations.

HT+BW[*1] is a quantitative text analytics tool built on top of the HathiTrust collection through improvements to a tool called Bookworm. HathiTrust, a consortium of library and cultural heritage institutions around the world, holds nearly 15 million scanned volumes, about 39% of which are in the US public domain. The current stage of HT+BW allows access to these public domain works, with ongoing work toward representing in-copyright works and those of unknown status.

Figure 1: HT+BW in its simplest form: comparing different words over time, corpus-wide

Figure 2: Clicking on the visualization calls up links to the original works in the HathiTrust Digital Library

The tool underlying HT+BW is called Bookworm, a spiritual successor to the Google Ngrams Viewer (Michel et al. 2011). As with the earlier tool, the primary unit of analysis in Bookworm is the word token and the most common interface is a time series line chart.

Likewise, because HT+BW is built on the HathiTrust collection, the trends it visualizes span centuries and millions of published works. However, HT+BW is significantly more robust than its popular predecessor: it allows more nuanced forms of inquiry, offers different visual interfaces for exploring results, and provides an application programming interface (API) that enables direct access to counts.

First, HT+BW can be queried by subsets of the data, rather than simply by year. Beyond searching for the trend of a word over time, one can compare that word's trends across different classes of books, different genres, and different geographic provenances.

Faceting by metadata opens the door to much more nuanced questions. With HT+BW, one does not even have to use a word as a query: one could simply compare text counts between facets. For example: what subject areas are seen in texts published in the United States? What genres are popular in Japanese texts? How did the popularity of serials grow between countries?

Figure 3: Comparing the same word over different subsets: in this case, books published in the US versus those published in the UK.
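The faceted comparisons described above can be illustrated with a minimal Python sketch. It does not use HT+BW itself: the records, field names (`country`, `year`, `tokens`), and function names are all hypothetical, and the counts are invented toy values. The sketch only shows the two kinds of question involved: the relative frequency of a word within each facet, and plain text counts per facet with no word in the query at all.

```python
from collections import defaultdict

# Hypothetical toy records, one per volume. The values are invented for
# illustration and are not drawn from the HathiTrust collection.
VOLUMES = [
    {"country": "US", "year": 1850, "tokens": {"color": 3, "colour": 0, "the": 997}},
    {"country": "US", "year": 1900, "tokens": {"color": 5, "colour": 1, "the": 994}},
    {"country": "UK", "year": 1850, "tokens": {"color": 0, "colour": 4, "the": 996}},
]


def words_per_million(volumes, word, facet):
    """Return occurrences of `word` per million tokens, grouped by `facet`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for vol in volumes:
        key = vol[facet]
        hits[key] += vol["tokens"].get(word, 0)
        totals[key] += sum(vol["tokens"].values())
    # Skip facet values with no tokens to avoid dividing by zero.
    return {k: 1_000_000 * hits[k] / totals[k] for k in totals if totals[k]}


def text_counts(volumes, facet):
    """Return the number of volumes per facet value (no word query needed)."""
    counts = defaultdict(int)
    for vol in volumes:
        counts[vol[facet]] += 1
    return dict(counts)


by_country = words_per_million(VOLUMES, "colour", "country")
volumes_by_country = text_counts(VOLUMES, "country")
```

Normalizing by the total tokens in each facet, rather than comparing raw counts, keeps a large subset (such as English-language works) from dominating a comparison simply because it contains more text.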

Another area where HT+BW moves beyond its antecedent is that not all questions need to be structured along years. Consequently, visualizations do not need to take the form of a time series line chart, and alternate visualizations are in development (Schmidt 2016).

However, the raw quantitative counts for highly customized queries can be returned using a public API, providing a path for scholars to move from exploration to more in-depth questions.
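As a rough sketch of what moving from exploration to an API query might look like, the following Python builds a JSON query and encodes it into a request URL. Everything specific here is an assumption rather than a description of the HT+BW deployment: the field names (`database`, `search_limits`, `groups`, `counttype`, `method`, `format`) are modeled on general Bookworm API conventions, while the `query` URL parameter, the database name, and the endpoint are placeholders. No request is sent.

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint (assumption); not the actual HT+BW API address.
EXAMPLE_ENDPOINT = "https://example.org/bookworm/api"


def build_query(word, groups, limits=None,
                counttype="WordsPerMillion", database="example_db"):
    """Build a Bookworm-style query dict (field names are assumptions)."""
    # HT+BW currently supports only single-word queries.
    if len(word.split()) != 1:
        raise ValueError("only single-word queries are supported")
    search_limits = {"word": [word]}
    search_limits.update(limits or {})  # extra metadata facets to filter on
    return {
        "database": database,
        "method": "data",
        "format": "json",
        "search_limits": search_limits,
        "groups": list(groups),       # fields to group the counts by
        "counttype": [counttype],     # which kind of count to return
    }


def query_url(endpoint, query):
    """Encode the query as JSON in a single `query` URL parameter."""
    return endpoint + "?" + urlencode({"query": json.dumps(query)})


# Example: one word, grouped by year and by a hypothetical facet name.
query = build_query("democracy", ["year", "publication_country"])
url = query_url(EXAMPLE_ENDPOINT, query)
```

The grouping fields play the role of the facets discussed earlier: grouping by year alone would correspond to the familiar time series, while adding or substituting metadata fields yields the more customized counts.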

HT+BW includes books from all around the world in 345 different languages. The materials held by HathiTrust are contributed mainly by Western institutions, meaning that English is the best-represented language in the collection, followed by other European languages. The best-represented Asian language is Japanese, with 73 thousand books, followed by Chinese with 32 thousand books. Bookworm supports extended Unicode characters, so Japanese is supported in the various uses of HT+BW. One limiting factor for scholars working with Japanese-language texts is that metadata and coverage are not as strong for these texts as for better-represented languages. For example, almost no Japanese texts in the current HT+BW have a subject class assigned.

HT+BW currently covers only public domain works, which biases the collection toward older works. This is a temporary limitation, and the ongoing project is prioritizing an expansion of the data to all 15 million works. Another limitation being addressed in future work is that searches can currently be performed only on single words, not on multi-word phrases.

HT+BW provides quantitative, flexible access to the millions of texts in the HathiTrust Digital Library. Currently it supports single-word queries against 4 million public domain works, with support for facets over a variety of metadata fields and even visualization of personal collections of texts. This poster describes the current state of HT+BW and outlines future work on supporting more words for more books.

Note

[*1] http://bookworm.htrc.illinois.edu


References

[1] Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014): 176–82. doi:10.1126/science.1199644.

[2] Schmidt, Benjamin M. 2016. “BookwormD3.” Tool. GitHub. https://github.com/bmschmidt/BookwormD3.