Literary Fun With Text Mining


My wife is doing her PhD in political science on the topic of political interest groups: how they use social media to disseminate information and reach new audiences, and how they utilize this new(ish; wow, we’re old) medium to affect voting behaviour. Part of this has meant learning how to mine Twitter data and analyze it in the R programming language; in order to provide technical support and to give her someone to troubleshoot coding issues with, I’ve also been learning to use R to mine and analyze texts. What I’ve been concentrating on, in order to learn the language and the processes, is using it to mine and visualize data gathered from fictional texts, specifically the bibliography of Stephen King. What I want to do is analyze plot trajectories drawn from sentiment data – quantitative measures of emotionally loaded words, scored against the established dictionaries used for that sort of thing. Research questions on this would include things like: is there a pattern that King has for his plots, based on emotional language cues? Is this pattern, if any, different from other well-known horror writers? Furthermore, are there established “archetypal” emotional plot patterns for horror books, and do these patterns differ when you switch genres – say, to fantasy, military science fiction, paranormal romance, etc. etc. down the fracture lines of human experience?

So to start I’ll be going book by book through the King bibliography and presenting what are basically preliminary findings based on the sentiment dictionaries included in the quanteda package for R: Afinn, Bing et al., and the NRC emotional sentiment dictionary. Ultimately none of these will be ideal. A custom dictionary built specifically for emotional sentiment in literature would be needed to capture a more accurate picture, but this is where linguistics comes in, and I don’t have much formal training in that area. My on-paper expertise is in English literature and political science, and while Stuart Soroka’s Lexicoder program is what I’ll likely use to build and code the custom dictionary, the Lexicoder topic dictionaries that exist to date are meant to examine political speech rather than literary texts. Building my own will take a lot of time and research.
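For anyone wondering what "dictionary-based sentiment" actually boils down to, here is a minimal sketch. It is written in Python rather than R purely so it runs with no packages installed, and it is not the code behind the graphs. The five-word lexicon is invented for illustration (it only imitates the AFINN style of giving each word an integer score), and `tokenize` and `score_chapter` are hypothetical names of my own.

```python
import re

# Toy lexicon in the style of AFINN (word -> integer score).
# These values are invented for illustration, not copied from the real list.
TOY_LEXICON = {"love": 3, "happy": 3, "blood": -2, "scream": -2, "dead": -3}

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def score_chapter(text, lexicon=TOY_LEXICON):
    """Sum the scores of every token that appears in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))

chapters = [
    "She was happy, and she felt love for the first time.",
    "Blood everywhere. A scream. Then everyone was dead.",
]

# One score per chapter: the chapter number is the x variable, the score is the y.
trajectory = [(n, score_chapter(ch)) for n, ch in enumerate(chapters, start=1)]
print(trajectory)
```

Plot those pairs as a line and you have a (very crude) emotional plot trajectory. The real dictionaries do the same thing with far more words, which is also why their blind spots for literary language matter so much.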

This should be fairly quick work up until about 1989 or so – The Dark Half, at any rate. I happen to think it’s the absolute nadir of his oeuvre, a self-indulgent author-insert story used to deal with the professional regret of outing his pseudonym and failing at his dream of being Donald Westlake/Richard Stark. From Carrie to The Tommyknockers, though, I’m familiar enough with the books that I can look at specific chapters pointed out on the graphs and quickly grasp what they mean in the story as a whole. After that…I’m going to end up having to read a number of King books I haven’t actually read yet, like Dolores Claiborne, The Girl Who Loved Tom Gordon, and any of the newer crime trilogy books he’s written. I mean, I guess it’s as good an excuse as any, right? For those, I plan on running the data through the process first and reading second, to see what kind of predictive power is embedded in the visualized data.

There is minimal pre-processing being done to the texts. They are epubs that are converted to .txt files through calibre. They are then trimmed to get rid of all the extraneous matter – the list of other books, the reviews of other books, the acknowledgements, the endless introductions, and in some cases the other stories tacked on after the main story is finished. An example of this is the inclusion of “One For The Road” and “Jerusalem’s Lot” at the end of ’Salem’s Lot. Both stories are also included in short fiction collections, so cutting them here isn’t much of a concern. I also go through each text to make the chapter breaks reasonably consistent; this is necessary because chapter breaks are how I am tracking plot progress as the x variable. Carrie, for example, has no chapter breaks; breaks were inserted at the beginning of each epistolary passage, since those marked natural breaks in the story. ’Salem’s Lot, meanwhile, has sixteen chapters, but each of those chapters has multiple sub-chapters within it; these were all used as breaks, after converting each one to a “Chapter (n)” format. Rage and The Stand, the two other books I have processed to date, have luckily been blessed with a more normal chapter break format. One other that I know off the top of my head will require greater pre-processing is The Running Man, since it has that weird “(n) And Counting” chapter heading.
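The break-normalizing step is easier to show than to describe, so here is a rough sketch of the idea. Again, this is Python for the sake of a dependency-free example, not my actual workflow, and the heading patterns in `HEADING` are guesses at what the converted .txt files look like; they would need adjusting book by book. `normalize_breaks` and `split_chapters` are hypothetical helper names.

```python
import re

# Assumed heading shapes (guesses, to be adjusted per book):
#   lines starting with "Chapter", e.g. "Chapter One: Ben"
#   bare sub-chapter numbers on a line of their own, e.g. "2"
#   countdown headings, e.g. "99 And Counting"
HEADING = re.compile(
    r"^\s*(?:chapter\b.*|\d+|.*\b\d+\s+and\s+counting\b.*)\s*$",
    re.IGNORECASE,
)

def normalize_breaks(text):
    """Rewrite every heading-like line as 'Chapter n', numbered in order."""
    out, n = [], 0
    for line in text.splitlines():
        if HEADING.match(line):
            n += 1
            out.append(f"Chapter {n}")
        else:
            out.append(line)
    return "\n".join(out)

def split_chapters(text):
    """Split normalized text into a list of non-empty chapter bodies."""
    parts = re.split(r"^Chapter \d+$", text, flags=re.MULTILINE)
    # parts[0] is whatever came before the first heading (front matter); drop it,
    # along with empty pieces left by a heading directly followed by another.
    return [p.strip() for p in parts[1:] if p.strip()]

sample = """Also by the author
Chapter One: Ben
1
He drove north.
2
The house was waiting.
99 And Counting
He ran.
"""

chapters = split_chapters(normalize_breaks(sample))
print(len(chapters), chapters)
```

Because empty pieces are dropped, a chapter's position in the resulting list, not the number in its heading, becomes the x value. The patterns are deliberately loose and will misfire on an ordinary sentence that happens to start with "Chapter", so the output still needs a look by eye.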

Some explanation of the text mining process and a number of glowing recommendations of the work of Julia Silge will follow, along with the data visualization of Carrie.

