◎ JADH2016

Sep 12-14, 2016 The University of Tokyo

Development of Glyph Image Corpus for Studies of Writing System
Yifan Wang (University of Tokyo)

We have built a software suite to automatically generate, edit, and annotate glyph image databases. It supports our integrated text and glyph image corpus of the dictionaries Yiqiejing Yinyi (一切經音義) and Xu Yiqiejing Yinyi (續一切經音義) as printed in the Chinese Buddhist canon Taishō Tripiṭaka (大正新脩大藏經).

The software has three main components. 1) A character isolation system (fig. 1), which automatically detects and crops each character from digital facsimiles of the books. The program processed all input images with approx. 94% accuracy, whereas existing commercial OCR programs failed to correctly detect vertical lines and/or warichu-style layout (in-line annotation inserted as double lines of smaller characters). 2) A glyph image editor (fig. 2), which has mainly been used to correct the character coordinates output by the isolation system. The program allows users to browse each page visually and find errors quickly. 3) A glyph comparison and annotation interface (fig. 3), which runs as a web application and lets users search for a character and compare all (or some) of its occurrences en masse in the images stored in the corpus. It is also designed for quickly adding metadata that sorts glyphs into groups, each consisting of glyphs regarded as having the same shape.

All of the aforementioned programs, as well as the corpus itself, are built upon open-source libraries (OpenCV, Qt, Ruby, etc.) and are thus easily customizable according to actual use cases. They and their dependencies are also highly portable, running on Windows, Mac OS X, and Linux. The programs have enabled us to save a considerable amount of time and manual work, to develop the corpus efficiently, and to maintain and improve the data set continuously without expert knowledge of computing.
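To make the character isolation step more concrete, the following is a minimal sketch of one classical approach to segmenting vertically set text: projection profiles. This abstract does not state which detection method the isolation system uses, so this is an illustration under assumptions, not the actual implementation. The names `runs`, `isolate_characters`, `to_bitmap`, and `PAGE` are hypothetical, the page is assumed to be already binarized (1 = ink, 0 = background), and warichu handling is deliberately left out.

```python
from typing import List, Tuple

Bitmap = List[List[int]]          # rows of pixels, 1 = ink, 0 = background
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive


def runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) index ranges where profile >= min_ink."""
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i                      # a run of ink begins
        elif value < min_ink and start is not None:
            spans.append((start, i))       # the run ends at a gap
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def isolate_characters(img: Bitmap) -> List[Box]:
    """Split a binarized page into character boxes.

    Text columns are found from the per-x ink count and visited right to
    left (the reading order of vertical East Asian text); within each
    column, characters are found from the per-y ink count, top to bottom.
    """
    height = len(img)
    width = len(img[0]) if img else 0
    col_profile = [sum(img[y][x] for y in range(height)) for x in range(width)]
    boxes = []
    for left, right in reversed(runs(col_profile)):
        row_profile = [sum(img[y][left:right]) for y in range(height)]
        for top, bottom in runs(row_profile):
            boxes.append((left, top, right, bottom))
    return boxes


def to_bitmap(rows: List[str]) -> Bitmap:
    """Convert '#'/'.' strings into a 0/1 bitmap (toy input only)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


# Toy page: a right-hand column with two characters, a left-hand column with one.
PAGE = [
    "##..##.",
    "##..##.",
    "##.....",
    "....##.",
    "....##.",
    ".......",
]

if __name__ == "__main__":
    print(isolate_characters(to_bitmap(PAGE)))
    # [(4, 0, 6, 2), (4, 3, 6, 5), (0, 0, 2, 3)]
```

A plain profile split of this kind breaks down on warichu passages, where one text column holds two narrower sub-columns, and on characters whose components are separated by white space. A practical system would need extra logic for both cases, which is consistent with the difficulties reported above for general-purpose OCR.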

The corpus is aimed at analyzing and obtaining statistical data on the internal graphemic system of those documents (i.e. whether two distinct glyphs are considered the same in quality). It consists of text data derived from the SAT Project (which provides digitized text of the Taishō Tripiṭaka) and the generated glyph DB. Yiqiejing Yinyi and Xu Yiqiejing Yinyi in the Taishō Tripiṭaka show unique features even when compared with other parts of the collection. Although the Tripiṭaka is printed by letterpress, they contain a vast number of character variants: an estimated 30,000 different glyph types of varying degrees of similarity are recognized, and approx. 3,000 characters have been preliminarily identified as candidates for addition to the Unicode character set, roughly as many as we proposed to Unicode from all other portions of the publication. This exceptional diversity is explained by several intertwined factors, such as fidelity to Tang-dynasty handwriting conventions, the use of multiple source texts with mixed collation histories during editing, and the interaction of these with modern interpretation and possibly with technical errors in editing. As we are preparing a Unicode proposal to encode characters in the Taishō Tripiṭaka, we urgently need to understand the structure of the entangled writing system in these sections. They contain over 1,000,000 characters in total, which makes an exhaustive analysis difficult for a small group of researchers. This is why we introduced automatic processing.
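As an illustration of the kind of statistics such a corpus can yield, the sketch below counts how many distinct glyph groups are attested for each character. The record layout (`GlyphOccurrence` and its fields), the group labels, and the sample entries are all hypothetical toy values invented for this example; they are not the schema or data of the actual corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class GlyphOccurrence:
    """One cropped glyph image linked to the transcribed text (hypothetical layout)."""
    character: str                    # character in the transcribed text
    glyph_group: str                  # annotator-assigned shape category
    page: int                         # page on which the glyph appears
    box: Tuple[int, int, int, int]    # (left, top, right, bottom) in the page image


def glyph_type_counts(
    occurrences: Iterable[GlyphOccurrence],
) -> Dict[str, Dict[str, int]]:
    """Count occurrences per glyph group, keyed by character."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for occ in occurrences:
        counts[occ.character][occ.glyph_group] += 1
    return {ch: dict(groups) for ch, groups in counts.items()}


def distinct_glyph_types(counts: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Number of distinct glyph groups attested for each character."""
    return {ch: len(groups) for ch, groups in counts.items()}


# Toy records, made up purely for illustration.
SAMPLE = [
    GlyphOccurrence("經", "經-a", 1, (4, 0, 6, 2)),
    GlyphOccurrence("經", "經-b", 1, (4, 3, 6, 5)),
    GlyphOccurrence("經", "經-a", 2, (0, 0, 2, 3)),
    GlyphOccurrence("音", "音-a", 2, (4, 0, 6, 2)),
]

if __name__ == "__main__":
    counts = glyph_type_counts(SAMPLE)
    print(counts)                        # per-character, per-group frequencies
    print(distinct_glyph_types(counts))  # number of glyph types per character
```

Keeping the shape category separate from the transcribed character is what allows the question "are these two glyphs treated as the same?" to be answered from annotation rather than from the text encoding.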

As we are now working on accurate glyph categorization using the programs, we will share some of our findings at the conference in September.

We believe that the system is also applicable to other grammatological or philological studies that require fine-grained analysis of individual characters and that use printed East Asian documents with vertical layout as source materials.

Figure 1: Character isolation system

Figure 2: Glyph image editor

Figure 3: Glyph comparison and annotation interface