This study aims to clarify the stylistic characteristics of the works of Agatha Christie, a female British mystery writer, by comparing them with those of Dorothy Leigh Sayers, a renowned female British mystery writer of the same period.
Agatha Christie is one of the most successful female mystery writers in history, and her novels are still read all over the world. Dorothy Leigh Sayers is likewise one of the most successful female mystery writers, and was acquainted with Christie. Christie and Sayers are called the two mistresses of mystery. Sayers’ mystery novels are not as popular as Christie’s. However, according to Mori (1998), Sayers was the more skilled writer and was better at characterisation than Christie, and leading female mystery writers of more recent times, including P. D. James and Ruth Rendell, have named Sayers, not Christie, as the ideal writer. Christie, on the other hand, has often been criticised by present-day mystery writers, who say that the characters in her novels are stereotyped and that her style is too plain. In this respect Christie forms a sharp contrast to Sayers. The purpose of this study is to examine how Christie’s style differs from Sayers’ by comparing their works using statistical analysis.
Quantitative studies of Christie’s works, such as Lancashire & Hirst (2009) and Le et al. (2011), do exist, but they are based on simple statistics such as word frequencies and type/token ratio. In addition, these studies do not deal with all of her works.
This study aims to answer the following questions:
(1) Can we distinguish Christie’s works from Sayers’ by using statistical methods?
(2) What are the stylistic differences between the works of Christie and Sayers?
The data used in this study consist of Christie’s works (221 texts, 5,230,256 words) and Sayers’ works (55 texts, 1,430,257 words). This study applies Random Forests to classify the two writers’ works and to extract characteristic words from each writer’s works. Random Forests, proposed by Breiman (2001), is an ensemble learning method for classification and regression. In recent studies it has been used for text classification and authorship attribution. For example, Jin & Murakami (2007) employed Random Forests for authorship identification of three different types of texts (novels, compositions and diaries), and showed that the method was more effective than other classifiers. Tabata (2012) used Random Forests to extract marker words that distinguish Charles Dickens from Wilkie Collins. Tabata reported that Random Forests overcame common problems in keyword measures such as log-likelihood or chi-squared scores, and proposed it as an alternative method. Following Tabata, this study employs Random Forests to extract characteristic words that differentiate Christie from Sayers.
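The ensemble idea behind Random Forests can be illustrated with a deliberately simplified sketch. It is not the implementation used in this study and not Breiman's full algorithm: each "tree" here is a single-split decision stump trained on a bootstrap sample with a random subset of features, and the ensemble predicts by majority vote. The function names, parameters and the toy frequencies for two marker words are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def best_stump(X, y, features):
    """Best single split (feature, threshold, left label, right label) by weighted Gini."""
    best = None
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # threshold halfway between adjacent observed values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    return None if best is None else best[1:]


def train_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples, each seeing a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        feats = rng.sample(range(len(X[0])), n_features)      # random feature subset
        stump = best_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        if stump is not None:  # None when the sample has no usable split
            forest.append(stump)
    return forest


def predict(forest, row):
    """Majority vote over all stumps in the ensemble."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Toy data (invented): columns are relative frequencies of "anyone" and "anybody".
X = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.1], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
y = ["Christie", "Christie", "Christie", "Sayers", "Sayers", "Sayers"]
forest = train_forest(X, y)
```

Calling `predict(forest, [0.95, 0.05])` returns the label that most stumps vote for. A full Random Forests implementation grows deep trees and also provides variable importance scores, which is what makes the method usable for extracting characteristic words.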
The variables used in Random Forests are the most frequent words. Random Forests was trained and validated on the 276 texts with different numbers of most frequent words, ranging from 1000 down to 100 in steps of 100 words. The Christie texts and the Sayers texts were correctly classified into two different groups with an accuracy of 92.7%-95.3%. The texts were classified most accurately with the top 600 words. Out of these, the 100 most characteristic words were identified. The results of this analysis revealed that Christie’s use of characteristic words contrasts with Sayers’, especially between synonyms: Christie tends to use anyone, someone and until, while Sayers tends to use anybody, somebody and till. Moreover, the analysis revealed that Christie tends to use words related to the female gender, movement and vision. We discuss how these words are used differently in the texts of the two authors, and attempt to reveal the stylistic characteristics of Christie when compared with her contemporary.
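As a rough illustration of how most-frequent-word variables of this kind can be built, the sketch below turns each text into a vector of relative frequencies for the top N words of the whole corpus. The tokenizer, the function names and the two sample sentences are assumptions made for this example; the abstract does not specify how the study tokenized or normalized its texts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def mfw_features(texts, n_words):
    """Return (vocab, rows): the n_words most frequent words in the corpus and,
    for each text, the relative frequency of every word in vocab."""
    token_lists = [tokenize(t) for t in texts]
    corpus_counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty texts
        rows.append([counts[w] / total for w in vocab])
    return vocab, rows


# Feature-set sizes matching the range described above: 1000 down to 100 in steps of 100.
MFW_SIZES = list(range(1000, 99, -100))

# Invented sample sentences, for illustration only.
sample_texts = [
    "someone came until someone left",
    "somebody stayed till somebody left",
]
vocab, rows = mfw_features(sample_texts, 3)
```

Each row can then be passed to a classifier together with its author label, and the procedure repeated for every size in `MFW_SIZES` to compare accuracy across feature-set sizes.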
 Breiman, L. (2001). Random forests. Machine Learning, 45: pp.5-23.
 Jin, M. and Murakami, M. (2007). Authorship Identification Using Random Forests. Proceedings of the Institute of Statistical Mathematics, 55(2): pp.255-26
 Lancashire, I. and Hirst, G. (2009, March). Vocabulary Changes in Agatha Christie’s Mysteries as an Indication of Dementia: A Case Study. Paper presented at the 19th Annual Rotman Research Institute Conference, Cognitive Aging: Research and Practice, Toronto.
 Le, X., Lancashire, I., Hirst, G. and Jokel, R. (2011). Longitudinal detection of dementia through lexical and syntactic changes in writing: a case study of three British novelists. Literary and Linguistic Computing, 26(4): pp.435-461.
 Mori, H. (1998). Sekai Mystery Sakka Jiten: Honkakuha-hen. Tokyo: Tosho Kankoukai.
 Tabata, T. (2012). Stylometry of co-authorship: Charles Dickens and Wilkie Collins. The Special Interest Group Technical Reports of Information Processing Society of Japan, CH-93(3): pp.1-7.