A Case Study: Computational Analysis of 3,500 Nineteenth-Century Novels
This study is widely considered one of the clearest demonstrations of how algorithms and computational models can transform literary history.
1. Research Question
Jockers began with a simple but powerful question:
Can we use computational analysis to uncover large-scale patterns in literary history that traditional reading cannot detect?
Traditional literary criticism usually studies a small number of canonical works such as:
- Moby-Dick
- Great Expectations
- Jane Eyre
But thousands of novels were published in the nineteenth century, most of which scholars never read.
Jockers argued that literary history built on only a few canonical texts is incomplete.
Therefore he attempted to analyze 3,500 novels simultaneously using computational methods.
2. Building the Corpus
The first step in computational research is constructing a corpus.
A corpus is a large collection of texts prepared for analysis.
Jockers assembled thousands of novels from digital archives such as:
- HathiTrust Digital Library
- Project Gutenberg.
These novels included works by major authors such as:
- Charles Dickens
- Jane Austen
- George Eliot
as well as hundreds of forgotten authors who were widely read in their time.
This was important because the goal was to reconstruct the full literary ecosystem, not just the canon.
3. Text Processing
Before analysis, the texts had to be prepared for machine reading.
Several preprocessing steps were required.
Tokenization
The novels were broken into individual words.
Example:
“The gentleman walked slowly through the garden.”
becomes
- the
- gentleman
- walked
- slowly
- through
- the
- garden
(punctuation is typically stripped at this stage).
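The step above can be sketched in a few lines of Python. (Jockers worked largely in R; this is an illustration of the technique, not his code.)

```python
import re

def tokenize(text):
    # Lowercase the text and keep only alphabetic word runs,
    # dropping punctuation such as the final period.
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("The gentleman walked slowly through the garden.")
print(tokens)
# ['the', 'gentleman', 'walked', 'slowly', 'through', 'the', 'garden']
```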
Stop-word removal
Common words such as:
- the
- and
- of
- to
appear frequently but carry little thematic meaning.
These words were removed so the algorithm could focus on meaningful vocabulary.
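In code, this step is a simple filter. The stop-word list below is a tiny illustrative sample; real pipelines use standard lists of a hundred or more words.

```python
# Toy stop-word list for illustration only.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "through"}

def remove_stop_words(tokens):
    # Keep only words likely to carry thematic meaning.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["the", "gentleman", "walked", "slowly", "through", "the", "garden"]
print(remove_stop_words(tokens))
# ['gentleman', 'walked', 'slowly', 'garden']
```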
Lemmatization
Different forms of the same word were reduced to a single root.
Example:
- run
- running
- ran
become run.
This allows the algorithm to detect patterns more accurately.
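A toy sketch of lemmatization, using a hand-written table for irregular forms and a crude suffix rule. Real projects would use a dedicated tool such as NLTK's WordNet lemmatizer or spaCy rather than anything this simple.

```python
# Minimal lookup table for irregular forms (illustrative only).
IRREGULAR = {"ran": "run", "was": "be", "went": "go"}

def lemmatize(token):
    if token in IRREGULAR:
        return IRREGULAR[token]
    if token.endswith("ning"):   # running -> run (crude heuristic)
        return token[:-4]
    if token.endswith("ing"):    # walking -> walk
        return token[:-3]
    return token

print([lemmatize(t) for t in ["run", "running", "ran"]])
# ['run', 'run', 'run']
```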
4. Topic Modeling
One of the most important techniques used by Jockers was topic modeling.
Topic modeling algorithms identify clusters of words that frequently appear together across many texts.
For example, a topic might contain words such as:
- ship
- captain
- sea
- voyage
- harbor.
This cluster would likely represent a maritime narrative theme.
Another cluster might contain:
- marriage
- love
- family
- society
- propriety.
This would indicate domestic fiction, which was common in nineteenth-century novels.
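Actual topic models such as LDA are probabilistic and far more sophisticated, but the core intuition can be illustrated with a toy sketch: across a (here, invented) corpus, words that repeatedly appear in the same documents cluster together.

```python
from collections import Counter
from itertools import combinations

# Invented toy corpus: each "document" is the content-word set of one novel.
docs = [
    {"ship", "captain", "sea", "voyage"},
    {"captain", "sea", "harbor", "voyage"},
    {"marriage", "love", "family", "society"},
    {"love", "family", "propriety", "marriage"},
]

# Count how many documents each word pair shares.
pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(doc), 2):
        pair_counts[pair] += 1

# Pairs that recur across documents hint at a shared theme:
# the maritime words group together, and so do the domestic ones.
clusters = [pair for pair, n in pair_counts.items() if n >= 2]
print(clusters)
```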
What Topic Modeling Revealed
When applied to thousands of novels, topic modeling revealed recurring thematic structures across literary history.
Some major themes included:
- domestic life
- imperial travel
- religion
- economic struggle
- war and nationalism.
These patterns showed how literature reflected major historical forces of the nineteenth century.
5. Sentiment Analysis and Narrative Arcs
Jockers also used sentiment analysis to track emotional patterns in narratives.
The algorithm measured whether passages contained positive or negative emotional language.
By plotting sentiment across the length of a novel, researchers could visualize emotional trajectories.
For example, many nineteenth-century novels follow a structure in which:
- early hardship occurs
- emotional tension increases
- the narrative resolves in a positive ending.
This pattern is especially common in novels like David Copperfield.
Computational analysis demonstrated that this emotional arc appears across hundreds of novels, not just a few famous ones.
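Jockers later released this technique as the R package syuzhet. The sketch below shows the idea in Python with an invented three-word lexicon and an invented three-passage "novel"; real lexicons score thousands of words.

```python
# Toy sentiment lexicon (illustrative; real lexicons are far larger).
LEXICON = {"hardship": -1, "misery": -1, "fear": -1,
           "hope": 1, "joy": 1, "love": 1}

def sentiment_arc(passages):
    # Score each passage as the sum of its words' valences; plotting
    # these scores in order traces the novel's emotional trajectory.
    return [sum(LEXICON.get(w, 0) for w in p.split()) for p in passages]

novel = ["hardship and misery", "fear then hope", "joy and love"]
print(sentiment_arc(novel))
# [-2, 0, 2]  -> the hardship-to-happy-ending arc described above
```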
6. Gender Patterns in Nineteenth-Century Fiction
One particularly fascinating discovery concerned gender representation.
Jockers analyzed linguistic patterns associated with male and female characters.
The results revealed clear statistical tendencies.
Female characters were more often associated with words such as:
- home
- family
- marriage
- affection.
Male characters were more frequently linked with words like:
- power
- travel
- politics
- work.
These patterns reflected Victorian gender ideology embedded within narrative language.
Such patterns are extremely difficult to detect through traditional reading because uncovering them requires statistical analysis across thousands of texts.
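One simple way to surface such associations is to count which content words co-occur with gendered pronouns. The four sentences below are invented for illustration; the actual study worked at the scale of thousands of novels with proper statistics.

```python
from collections import Counter

# Invented toy sentences; "she"/"he" stand in for character references.
sentences = [
    "she stayed home with the family",
    "she spoke of marriage and affection",
    "he left for work and politics",
    "he sought power through travel",
]

SKIP = {"she", "he", "the", "with", "of", "and", "for", "through"}
male, female = Counter(), Counter()
for s in sentences:
    words = s.split()
    # Attribute each sentence's content words to the pronoun it contains.
    target = female if "she" in words else male
    target.update(w for w in words if w not in SKIP)

print(female.most_common())  # home/family/marriage-type vocabulary
print(male.most_common())    # work/politics/power-type vocabulary
```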
7. Geography and Literary Culture
Another computational analysis studied the geographical distribution of literary settings.
Algorithms extracted place names from thousands of novels.
This revealed a strong concentration of settings in:
- London
- Paris
- New York.
This finding demonstrated how urban modernity shaped nineteenth-century literary imagination.
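At its simplest, place-name extraction can be done by matching text against a gazetteer (a list of known place names), as in the sketch below with an invented passage; production work would instead use named-entity recognition, for example with spaCy.

```python
from collections import Counter

# Tiny illustrative gazetteer; real ones list thousands of places.
GAZETTEER = {"London", "Paris", "New York", "Dublin"}

def count_places(text):
    # Count literal occurrences of each known place name.
    return Counter({place: text.count(place) for place in GAZETTEER})

passage = ("He left London for Paris, though London drew him back "
           "before he sailed on to New York.")
counts = count_places(passage)
print(counts.most_common(3))
```

Aggregating such counts over thousands of novels yields the kind of geographical distribution described above.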
8. Rethinking Literary History
The most radical implication of Jockers’ research was methodological.
Traditional literary history focuses on individual masterpieces.
But computational analysis suggests that literature should also be studied as a system of thousands of interacting texts.
In this view:
- individual novels are data points
- literary history becomes a pattern across large datasets.
This approach complements the theory of distant reading developed by Franco Moretti.
Moretti argued that scholars should sometimes step back from individual texts and study large literary structures.
9. Criticism of Computational Literary Studies
Despite its innovations, computational literary analysis has also faced criticism.
Some scholars argue that algorithms cannot capture:
- metaphor
- irony
- symbolism
- aesthetic nuance.
For example, a computer cannot easily interpret the symbolic depth of a novel such as Ulysses.
Therefore many critics insist that close reading remains indispensable.
10. A Hybrid Model of Literary Scholarship
Today most digital humanities scholars advocate a hybrid model combining two approaches:
Close Reading
Traditional interpretation of individual texts.
Distant Reading
Computational analysis of large textual corpora.
Together these methods allow scholars to study literature at multiple scales:
- microscopic analysis of passages
- macroscopic analysis of literary systems.
Conclusion
The computational analysis conducted by Matthew Jockers demonstrates how digital methods can reshape literary scholarship. By examining thousands of novels simultaneously, algorithms reveal thematic patterns, emotional structures, and cultural ideologies embedded within literary history.
However, computational analysis does not replace interpretation. Instead it expands the scope of literary inquiry, allowing scholars to move between data-driven macroanalysis and interpretive close reading.