◎ JADH2016

Sep 12-14, 2016 The University of Tokyo

Reorganising a Japanese calligraphy dictionary into a grapheme database and beyond: The case of the Wakan Meien grapheme database
Kazuhiro Okada (Tokyo University of Foreign Studies)
Introduction

Hiragana, a Japanese moraic script, had a wide variety of letters until its standardisation in 1900. Our knowledge of the history of hiragana has deepened, from the historical relationships between letters to the distinctions in how individual letters were used. However, little of this knowledge has been translated into machine-readable form. Consequently, the 1000-year-long tradition of hiragana before 1900, or older hiragana, remains underrepresented in the computational world. This paper addresses issues in reorganising a Japanese calligraphy dictionary, Wakan Meien, into a grapheme database, and discusses its further use as a knowledge database of the older Japanese writing system.

Wakan Meien is a calligraphy dictionary specialising in hiragana materials, compiled by To Koei (birth and death dates unknown) and published in 1768 (Fig 1). Hiragana developed from cursivised Chinese characters. Today, it consists of 48 letters, whereas before the Meiji period, it had many more. Its cursivised origin makes it difficult to distinguish between levels of cursivisation, although some are distinguishable. Wakan Meien is one of the earliest kana dictionaries, and was compiled to meet growing demand from calligraphy students. The dictionary is unique in that it presents examples grouped by similarity of shape and not by genetic relationship. Genetic classification groups examples according to the Chinese character from which they were cursivised, and is still commonly used in later dictionaries (Fig 2).

Fig 1. Wakan Meien

Fig 2. An example of genetic classification in Kana Ruisan (Sekine Tametomi, 1768. Holding of NDL Digital Collection)

The organisation of Wakan Meien surpasses that of later dictionaries with regard to the representation of graphemes, the basic units of a writing system. Genetic classification is generally well regarded amongst academics as an objective method, because the relationships between hiragana and cursivised Chinese characters are philologically clear and, further, because it does not depend on the researcher’s own distinction between graphemes. However, a deep understanding of the distinction between graphemes is essential to ensure consistent computational encoding, such as in Unicode. By contrast, Wakan Meien is organised so that its groups of examples correspond to distinctions between graphemes[1]. Building a grapheme database from genetically classified dictionaries involves complex and uncertain differentiation. Thus, it is necessary to build a grapheme database from attested graphemes, such as those of Wakan Meien.

At first glance, the dictionary appears not to be well organised. It collates examples in Iroha order, following a common Japanese mnemonic of hiragana. Then, examples are ordered by similarity of shape: a group of more Japanised shapes includes only similar shapes, whilst a group of less Japanised shapes (in other words, shapes retaining more of the Sinitic original) includes many more variations in cursivisation, from barely to heavily cursivised. These groups are not strictly ordered, other than that more Japanised shapes tend to appear first. The source Chinese character is not considered in the ordering. This ordering may give readers used to genetic classification the impression that the examples are not well organised. Reflecting that structure, the database recognises the following three entities: Sections, Groups (of examples), and Examples (Fig 3). In the database, each Group carries the possible variations, that is, the distinction between graphemes. Considering that the original work is not strictly structured, the database represents relationships between entities rather than arranging them in a strict structure such as a tree. In addition, the entities have their own properties, such as heading images in Sections, source characters in Groups, and locations and authors of the examples in Examples. Some of these properties, such as source characters and authors of the examples, may have two or more sub-properties.
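
The three entities and their relationships can be sketched roughly as follows. This is only an illustrative model, not the schema of the actual database: all class names, field names, identifiers, the image file name, and the sample records are assumptions made for the example. Entities refer to one another by identifier rather than by nesting, reflecting the preference for relationships over a tree.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    section_id: str
    heading_image: str  # hypothetical file name of the heading image


@dataclass
class Group:
    group_id: str
    section_id: str  # relationship to a Section, held by reference
    source_characters: List[str] = field(default_factory=list)


@dataclass
class Example:
    example_id: str
    location: str  # hypothetical description of where the example appears
    authors: List[str] = field(default_factory=list)
    group_ids: List[str] = field(default_factory=list)  # references to Groups


def examples_in_group(examples: List[Example], group_id: str) -> List[Example]:
    """Return the examples that refer to the given group."""
    return [e for e in examples if group_id in e.group_ids]


# Hypothetical sample data for one section.
section = Section("sec-i", "heading-i.png")
groups = [
    Group("grp-i-1", "sec-i", ["以"]),
    Group("grp-i-2", "sec-i", ["伊"]),
]
examples = [
    Example("ex-1", "hypothetical leaf 1", ["author A"], ["grp-i-1"]),
    Example("ex-2", "hypothetical leaf 1", ["author B"], ["grp-i-2"]),
]

found = examples_in_group(examples, "grp-i-1")
```

Because the links are plain identifiers, an Example can later be attached to further Groups without restructuring any Section or Group record.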

As will be discussed later, the database will be offered as a reference for older hiragana. This requires that points of reference to Groups not be excessively altered. Substantial updates to Groups should therefore not affect existing references to them, but should be made by creating new Groups. This means that an Example can be related to two or more Groups. A document(-oriented) database is employed to manage such data. A major advantage of document databases over relational databases is that they allow structured data to be stored as they are. Whilst relational databases can also manage such data after normalisation, given the loose structure of the original work, accommodating that looseness at the schema level should help in developing better schemata.
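
The update policy described above might look like the following minimal sketch. Plain Python dictionaries stand in for a document database here, and the field names (`_id`, `supersedes`, `group_ids`, `note`), the identifiers, and the `revise_group` helper are all hypothetical. A substantial revision creates a new Group document and leaves the old one untouched, so existing references to it remain valid.

```python
import copy

# Hypothetical in-memory stand-in for two document collections.
store = {
    "groups": {
        "grp-1": {"_id": "grp-1", "source_characters": ["以"], "supersedes": None},
    },
    "examples": {
        "ex-1": {"_id": "ex-1", "group_ids": ["grp-1"]},
        "ex-2": {"_id": "ex-2", "group_ids": ["grp-1"]},
    },
}


def revise_group(store, old_id, new_id, changes, example_ids):
    """Create a new group derived from an old one without modifying the old one."""
    if new_id in store["groups"]:
        raise ValueError(f"group {new_id!r} already exists")
    new_group = copy.deepcopy(store["groups"][old_id])
    new_group.update(changes)
    new_group["_id"] = new_id
    new_group["supersedes"] = old_id  # keep the lineage explicit
    store["groups"][new_id] = new_group
    # Link the chosen examples to the new group as well as to the old one.
    for example_id in example_ids:
        group_ids = store["examples"][example_id]["group_ids"]
        if new_id not in group_ids:
            group_ids.append(new_id)
    return new_group


revise_group(store, "grp-1", "grp-1a", {"note": "narrower grouping"}, ["ex-2"])
```

After the call, `ex-2` is related to both `grp-1` and `grp-1a`, whilst the `grp-1` document is exactly as it was before.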

Fig 3. Views of Sections, Groups, and Examples

The database will be provided as a reference source for older hiragana. It is intended to serve an educational purpose, in learning to read materials written in older hiragana, such as manuscripts of Japanese classics, and to serve as a resource in corpus building, either in the form of linked data or simply as a link in an HTML page. First, with recent advances in mobile applications for learning older hiragana, such as ‘the Hentaigana app’ by the UCLA-Waseda alliance and ‘KuLA’ (Kuzushiji Learning Application) by Osaka University, the popularity of older hiragana is expected to grow. The database will provide supplemental materials for learners. Second, it will be a reference for corpus building. Whilst older hiragana will be registered in Unicode in the near future, the current specification declares that it will not deal with the details of distinctions between graphemes. Hence, building corpora in a way that preserves such distinctions must rely on other resources. The database will provide a reference for such details via either graphemes or actual examples, using stable IRIs (Internationalized Resource Identifiers). Moreover, the accumulation of links to the database, or links to other databases, will enable the formation of a knowledge database of older hiragana, and further of the entire Japanese writing system, encompassing its structure and history with firm examples.
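
As a rough illustration of how a corpus might refer to the database, the sketch below builds an IRI for a Group and uses it both in an HTML link and in a token annotation. The base address `https://example.org/wakan-meien/`, the path layout, the identifiers, and the helper names are assumptions made for this example only; the actual IRI scheme of the database is not specified here.

```python
from html import escape
from urllib.parse import quote

# Hypothetical base address; the real database may use a different scheme.
BASE = "https://example.org/wakan-meien/"
KINDS = ("section", "group", "example")


def mint_iri(kind: str, local_id: str) -> str:
    """Build an identifier of the form BASE/kind/local_id."""
    if kind not in KINDS:
        raise ValueError(f"unknown entity kind: {kind!r}")
    # Percent-encode the local part so the result is also usable as a URI.
    return f"{BASE}{kind}/{quote(local_id, safe='')}"


def html_link(text: str, iri: str) -> str:
    """Render a simple HTML link pointing at the identifier."""
    return f'<a href="{escape(iri, quote=True)}">{escape(text)}</a>'


def annotate_token(surface: str, group_iri: str) -> dict:
    """Attach a grapheme reference to a corpus token."""
    return {"surface": surface, "grapheme": group_iri}


group_iri = mint_iri("group", "grp-i-1")
link = html_link("い", group_iri)
token = annotate_token("い", group_iri)
```

The point of the sketch is that the corpus stores only the identifier; the distinction between graphemes is resolved by the database that the identifier points to.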


References

[1] Okada Kazuhiro. Wakan Meien ni okeru hiragana jitai ninshiki [Hiragana grapheme awareness in Wakan Meien]. Paper presented at the 2016 Spring Meeting of the Society for Japanese Linguistics, Gakushuin University, Tokyo, May 2016.