Sep 12-14, 2016 The University of Tokyo

Can a writer disguise the true identity under pseudonyms?: Statistical authorship attribution and the evaluation of variables
Miki Kimura (Meiji University)

This is a work-in-progress study on the quantitative authorship attribution of a lesbian writer who used more than one pseudonym: James Tiptree, Jr. and Raccoona Sheldon. Alice Bradley Sheldon (1915–1987) was a writer who published feminist science fiction stories for almost 20 years. As a commercial strategy, she hid her true identity under a male pseudonym, James Tiptree, Jr., for a little over a decade. She also used a female pseudonym, Raccoona Sheldon, as the name offered a thematic change.

Brinegar (1963) inspected the distribution of word length in order to verify the author of the QCS letters and concluded that the letters were not written by Mark Twain. Mosteller and Wallace’s study of the Federalist papers verified the author of a collection of eighteenth-century political documents, which argue for the Constitution of the United States, through the frequencies of individual words such as prepositions, which are considered irrelevant to the content of the papers. Burrows (1987) examined intra-author variation in Jane Austen’s novels by employing a statistical method called principal component analysis. In Japan as well, stylometry has developed over the past 50-plus years. In particular, Jin, Kabashima, and Murakami (1993) inspected intra-author variation in the works of a well-known Japanese author who used three pseudonyms. They could not detect intra-author variation in the Japanese author’s works, but they were able to show inter-author variation in comparison with the author’s contemporaries by using the distribution of commas in Japanese.

In this research, I examine intra- and inter-author variation in Alice Sheldon’s texts. As Le Guin (1976) indicated, Alice Sheldon’s works under the female pseudonym (Raccoona Sheldon) have less control and wit than her works under the male pseudonym (James Tiptree, Jr.). Using statistical analyses, this research primarily focuses on the intra-author variation between her works under these two pseudonyms. It not only seeks to distinguish Alice Sheldon’s works under the two pseudonyms but also compares the results of this quantitative authorship attribution with the work of literary critics such as Silverberg (1975), Lefanu (1989), Russ (1995), and Larbalestier (2002).

In addition to the examination of intra-author variation within the works of one author, this research also investigates inter-author variation between two authors. As Silverberg (1975), Lefanu (1989), and Kotani (1994) noted, in contrast to Ernest Hemingway, James Tiptree’s manner of writing is somewhat masculine. In order to address such criticisms, two corpora have been developed: the Alice Sheldon Corpus, which consists of all the works published under her two pseudonyms, and the Hemingway Corpus, which contains all his short stories.

Juola (2013) recently inspected intra-author variation in the works of Joanne Rowling, who uses the two pseudonyms J. K. Rowling and Robert Galbraith, and tried to attribute the works under Robert Galbraith to those written under J. K. Rowling. That study used specialized software called JGAAP and verified that the works under J. K. Rowling and those under Robert Galbraith have the same style as other British female writers. Further, according to a case study on the quantitative stylistics of Joanne Rowling presented by Kimura and Kubota (2015), the author skillfully differentiates her writing style by genre and pseudonym.

This result could possibly be useful for the analysis in the current study. However, another probable assumption is that author discriminators chosen from the corpora developed for this kind of research differentiate between the two authors but fail to discriminate between Alice Sheldon’s two pseudonyms. The latter result would mean that Alice Sheldon failed to disguise her true identity by using the two pseudonyms James Tiptree, Jr. and Raccoona Sheldon.

As variables, the 10, 25, and 50 most common words, considered effective for this kind of discrimination by, for example, Burrows and Hassall (1988) and Burrows (1992), are chosen for the analysis. In addition to these lexical variables, this research also uses syntactic variables, especially the distribution of parts of speech (POS), which are considered effective for discrimination according to Hirst and Feiguina (2007). I will apply two unsupervised statistical methods (principal component analysis and hierarchical cluster analysis) and two supervised classification methods (discriminant analysis and support vector machines, SVM). If the discrimination variables chosen from these two corpora are sensitive as identifiers, the results from SVM will show that they can capture inter-author variation between the works of Alice Sheldon and those of Ernest Hemingway but cannot detect intra-author variation between the works under Alice Sheldon’s two pseudonyms. In this analysis, the classification methods and the variables considered effective for such research will be evaluated simultaneously.
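
As an illustration of the lexical variables described above, the following is a minimal sketch of how relative frequencies of the most common words might be computed for each text. It is not the procedure used in this study: the tokenizer, the function names (tokenize, top_n_words, relative_frequencies), and the two toy texts are assumptions introduced only for this example, and the real analysis would draw on the two corpora with N set to 10, 25, or 50.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def top_n_words(texts, n):
    """Return the n most common words over all texts pooled together."""
    pooled = Counter()
    for text in texts:
        pooled.update(tokenize(text))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word within one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for an empty text
    return [counts[word] / total for word in vocabulary]


# Toy placeholder texts (hypothetical, not taken from either corpus).
texts = {
    "sample_a": "the man and the sea and the boat",
    "sample_b": "the star and a ship of the night",
}

vocab = top_n_words(texts.values(), 3)
features = {name: relative_frequencies(t, vocab) for name, t in texts.items()}

for name, vector in features.items():
    print(name, dict(zip(vocab, vector)))
```

Each text is thus reduced to one row of a text-by-word frequency matrix. A matrix of this shape is the kind of input that principal component analysis, cluster analysis, discriminant analysis, or an SVM would take in a statistics package.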


