As I mentioned at the end of my last post, and thanks to reading Matthew Wilkens's "Geographic Imagination of Civil War-Era American Fiction", I am interested in validating literary conceptions of regionalism in late-nineteenth- and early-twentieth-century American texts.
As with all topic modeling and text mining projects, though, the first step is to find a readily available corpus of texts and then apply a set of scripts that will help find whatever it is a user is looking for.
I made the mistake of first putting together a set of functions that would, albeit loosely, extract place-name references from a set of text files. The mistake became explicitly clear when I realized that I had a script but no set, actors, props, etc. A creeping sense of frustration began to settle in, as developing a corpus of American texts from 1930-1940 is, to put it lightly, heinous. It is worth mentioning that one reason literary DH scholars have an affinity for nineteenth-century literature (other than that it is a fabulous literary-historical moment) is that copyright law poses a significant obstacle to accessing digitized textual material published after 1923.
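The actual functions aren't reproduced here, but a "loose" place-name extractor of the kind described might be sketched as a simple gazetteer match: count occurrences of known place names in a text. The gazetteer below is a hypothetical stand-in; a real one would be loaded from a reference source such as a census place-name list.

```python
import re
from collections import Counter

# Hypothetical gazetteer for illustration; in practice this would be
# loaded from an external reference list of place names.
GAZETTEER = {"Boston", "Chicago", "New Orleans", "St. Louis"}

def extract_place_names(text, gazetteer=GAZETTEER):
    """Loosely count place-name references in a text.

    Each gazetteer entry is matched as a whole word or literal
    multi-word phrase; entries with zero hits are dropped.
    """
    counts = Counter()
    for place in gazetteer:
        pattern = r"\b" + re.escape(place) + r"\b"
        counts[place] = len(re.findall(pattern, text))
    return Counter({p: n for p, n in counts.items() if n})

sample = "The train left Boston for Chicago, then Boston again."
print(extract_place_names(sample))  # Counter({'Boston': 2, 'Chicago': 1})
```

This naive approach misses ambiguous names ("Washington" the person vs. the place) and unlisted locales, which is part of why the matching is only ever "loose".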
However, as has been a recurring graduate school experience, scholarship is a haphazard mix of motivation and skill catalyzed by a significant amount of luck. One of my professors in undergrad had been performing some scholarly pilfering of HathiTrust's non-consumptive research portal. It is a wonderful site that allows researchers to perform computations on texts without necessarily being in possession of them. A lovely, if a bit unwieldy, circumvention of copyright law.
The result of this clever rummaging was that I inherited a small corpus of 157 texts from 1890-1930. Not nearly enough to be statistically significant, but enough to develop a rudimentary version of Wilkens's computationally assisted methodology.
Which is great, considering that not too long ago my corpus consisted of 15 texts I had scraped from Project Gutenberg.
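One practical wrinkle with Gutenberg-scraped texts, whatever the download method, is that each plain-text file wraps the actual work in a license header and footer delimited by "*** START OF" and "*** END OF" marker lines, which have to be stripped before any text mining. A minimal sketch of that cleanup step:

```python
def strip_gutenberg_boilerplate(raw):
    """Remove the Project Gutenberg license header and footer.

    Gutenberg plain-text files place the work between marker lines
    beginning "*** START OF" and "*** END OF". If no markers are
    found, the text is returned unchanged.
    """
    lines = raw.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1  # body begins after the START marker
        elif line.startswith("*** END OF"):
            end = i        # body ends before the END marker
            break
    return "\n".join(lines[start:end]).strip()
```

The exact marker wording varies slightly across files, so a robust pipeline would match the prefixes case-insensitively, but the prefix check above covers the common case.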