Topic Modeling the English Novel: Ted Underwood and the Historical Semantics of Genre

Introduction

If Matthew L. Jockers’s Macroanalysis (2013) established the large-scale application of topic modeling in literary studies, the work of Ted Underwood takes the field in a more theoretically refined and methodologically self-conscious direction. Underwood’s research, particularly Distant Horizons: Digital Evidence and Literary Change (2019), marks a significant development in the use of computational models, topic modeling among them, to investigate how literary discourse evolves over time.

Where earlier studies emphasized thematic discovery, Underwood’s work is distinguished by a more precise question:

How do literary categories—such as genre, period, and style—change over time, and how can these changes be measured?


1. Corpus and Research Design

Underwood’s research is grounded in large-scale corpora of English-language texts, including:

  • Nineteenth- and twentieth-century novels
  • Digitized volumes drawn from library collections, such as HathiTrust
  • Both canonical and non-canonical works

Unlike earlier approaches that focused primarily on thematic clustering, Underwood integrates:

  • Topic modeling
  • Classification algorithms
  • Statistical modeling

This multi-method approach allows for a more nuanced understanding of literary change.


2. Topic Modeling as a Tool for Historical Analysis

Underwood employs Latent Dirichlet Allocation not merely to identify themes, but to track their historical trajectories.

Key Idea

Topics are not static—they evolve across time.

Rather than asking:

  • What topics exist?

Underwood asks:

  • How do topics rise, decline, and transform across decades?
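The question above can be made concrete with a small sketch. It assumes that a topic model such as LDA has already assigned each text a vector of topic proportions. The years and proportions below are invented toy values, and the function names are hypothetical rather than taken from Underwood’s code. Averaging the proportions within each decade is one simple way to watch a topic rise or decline.

```python
from collections import defaultdict


def decade_of(year):
    """Map a publication year to the start of its decade, e.g. 1853 -> 1850."""
    return (year // 10) * 10


def topic_prevalence_by_decade(docs):
    """docs: iterable of (year, topic_proportions) pairs.

    Returns {decade: [mean proportion of each topic in that decade]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, proportions in docs:
        decade = decade_of(year)
        if decade not in sums:
            sums[decade] = [0.0] * len(proportions)
        for k, p in enumerate(proportions):
            sums[decade][k] += p
        counts[decade] += 1
    return {d: [s / counts[d] for s in sums[d]] for d in sorted(sums)}


# Invented toy data: (year, [share of topic 0, share of topic 1]).
toy_docs = [
    (1801, [0.70, 0.30]),
    (1808, [0.60, 0.40]),
    (1853, [0.40, 0.60]),
    (1899, [0.20, 0.80]),
    (1895, [0.10, 0.90]),
]

prevalence = topic_prevalence_by_decade(toy_docs)
for decade, means in prevalence.items():
    print(decade, [round(m, 2) for m in means])
```

In this toy data, topic 0 shrinks from decade to decade while topic 1 grows. A real study would also need to account for how many texts survive from each decade, since sparse decades give noisy averages.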

3. Modeling Genre as a Moving Target

One of Underwood’s most influential contributions is his reconceptualization of genre.

Traditional View

  • Genres are fixed categories (e.g., romance, realism, gothic)

Underwood’s View

  • Genres are statistical patterns that shift over time

Using topic modeling, he demonstrates that:

  • The vocabulary associated with a genre changes
  • The boundaries between genres are fluid
  • Genres overlap and evolve

Example: The Novel

Underwood shows that what counts as a “novel” in 1800 is not the same as what counts as one in 1900.

This is reflected in:

  • Changing word distributions
  • Emerging and disappearing topics
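One way to put a number on “changing word distributions” is to compare the word frequencies of two periods directly. The sketch below uses Jensen-Shannon divergence, a standard measure of distance between probability distributions. It is offered as an illustration and not as the measure Underwood himself uses. The token lists are invented, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_distribution(tokens):
    """Turn a list of tokens into relative word frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits.

    0 means the two distributions are identical;
    1 means they share no vocabulary at all.
    """
    vocab = set(p) | set(q)
    # m is the midpoint distribution between p and q.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_midpoint(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_midpoint(p) + 0.5 * kl_to_midpoint(q)


# Invented toy samples standing in for novels from two periods.
tokens_1800 = "virtue providence heart virtue duty".split()
tokens_1900 = "city money heart factory money".split()

dist_1800 = word_distribution(tokens_1800)
dist_1900 = word_distribution(tokens_1900)
divergence = jensen_shannon(dist_1800, dist_1900)
print(round(divergence, 3))
```

Computed for every pair of decades, a score like this would show whether the vocabulary of the category drifts gradually or shifts abruptly.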

4. Temporal Dynamics of Language

A central focus of Underwood’s work is the temporal dimension of language.

By applying topic modeling across chronological slices of data, he identifies:

(1) Topic Emergence

  • New thematic clusters appear over time

(2) Topic Persistence

  • Some themes remain stable across centuries

(3) Topic Decline

  • Certain discourses fade or disappear

Illustrative Patterns

For instance, analysis may reveal:

  • Decline in religious vocabulary over time
  • Rise in industrial and scientific discourse
  • Shifts in emotional and psychological language

These patterns provide:

A data-driven account of cultural transformation.
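The three patterns named in this section (emergence, persistence, decline) can be told apart in a rough way by fitting a straight line to a topic’s prevalence over time and reading the sign of the slope. The sketch below does this with ordinary least squares. The prevalence series are invented for illustration and are not results from any study, and the threshold is an arbitrary assumption that a real analysis would have to justify.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


def label_trend(years, prevalence, threshold=0.0005):
    """Label a topic by its change in prevalence per year.

    threshold is a hypothetical cutoff, not an established convention.
    """
    s = slope(years, prevalence)
    if s > threshold:
        return "emergence"
    if s < -threshold:
        return "decline"
    return "persistence"


# Invented prevalence series (share of all words) at 25-year intervals.
years = [1800, 1825, 1850, 1875, 1900]
religious = [0.12, 0.10, 0.07, 0.05, 0.03]
industrial = [0.01, 0.03, 0.06, 0.08, 0.11]
domestic = [0.080, 0.081, 0.079, 0.080, 0.080]

for name, series in [("religious", religious),
                     ("industrial", industrial),
                     ("domestic", domestic)]:
    print(name, label_trend(years, series))
```

A single slope is a blunt instrument: a topic that rises and then falls within the period would look flat. It is enough, though, to show how the three categories can be defined operationally.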


5. Combining Topic Modeling with Classification

Underwood’s work goes beyond unsupervised modeling.

He integrates:

  • Supervised machine learning
  • Predictive modeling

This allows him to:

  • Classify texts by period or genre
  • Measure how distinguishable different periods are

Key Insight

If a model can accurately predict the date of a text, then language has measurable historical signatures.

This reframes literary history, at least in part, as a problem of pattern recognition.
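The key insight above can be illustrated with a toy classifier: train on texts with known periods, then check how often the model assigns the right period to texts it has not seen. The sketch below swaps in a small multinomial naive Bayes model written from scratch so that it runs without external libraries. It is not the model used in Underwood’s studies. The labels, texts, and function names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (label, tokens). Returns the counts the model needs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, tokens in labeled_docs:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-probability for these tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokens:
            if token in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (label, word) pairs above zero.
                score += math.log((word_counts[label][token] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled_docs):
    """Share of documents whose predicted label matches the true label."""
    correct = sum(predict(model, tokens) == label for label, tokens in labeled_docs)
    return correct / len(labeled_docs)


# Invented toy corpus: "early" and "late" stand in for two periods.
train_docs = [
    ("early", "providence virtue duty heart".split()),
    ("early", "virtue heart sermon duty".split()),
    ("late", "factory city money heart".split()),
    ("late", "city telephone money office".split()),
]
held_out_docs = [
    ("early", "duty sermon virtue".split()),
    ("late", "office money city".split()),
]

model = train_naive_bayes(train_docs)
held_out_accuracy = accuracy(model, held_out_docs)
print(held_out_accuracy)
```

The held-out accuracy is the quantity of interest here. The further it rises above chance, the more distinguishable the two periods are by vocabulary alone. An accuracy near chance would suggest that the boundary between them is weak.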

6. Methodological Sophistication

Underwood emphasizes methodological rigor in several ways:

(1) Validation

  • Testing models on unseen data

(2) Replicability

  • Making methods transparent

(3) Interpretation

  • Combining statistical results with human analysis
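The validation step listed above, testing models on unseen data, is commonly done with k-fold cross-validation: the corpus is split into k parts, and each part takes a turn as the unseen test set. The sketch below is a generic illustration under that assumption and does not describe Underwood’s exact procedure. The `fit` and `score` callbacks, the majority-label baseline, and the toy labels are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) so each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(items, k, fit, score):
    """fit(train_items) -> model; score(model, test_items) -> float.

    Returns the mean score over the k held-out folds.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = fit([items[j] for j in train_idx])
        scores.append(score(model, [items[j] for j in test_idx]))
    return sum(scores) / len(scores)


def fit_majority(train_items):
    """Baseline 'model': always predict the most common training label."""
    return Counter(label for label, _ in train_items).most_common(1)[0][0]


def score_majority(model, test_items):
    """Accuracy of always guessing the majority label."""
    return sum(label == model for label, _ in test_items) / len(test_items)


# Invented toy corpus of (label, text_id) pairs: six "late" texts, three "early".
toy_items = [("late", i) for i in range(6)] + [("early", i) for i in range(6, 9)]

baseline_accuracy = cross_validate(toy_items, 3, fit_majority, score_majority)
print(round(baseline_accuracy, 3))
```

A majority-label baseline like this one gives the figure a real model has to beat. Reporting a classifier’s held-out accuracy without such a baseline makes it hard to judge whether the model has learned anything about period or genre.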

This marks a maturation of the field:

From experimentation to disciplined inquiry.


7. Conceptual Contributions

Underwood’s work introduces several important theoretical shifts:

(a) History as Gradient, Not Boundary

  • Literary periods are not sharply defined
  • They change gradually

(b) Genre as Distribution

  • Genres are not categories but tendencies

(c) Evidence in Literary Studies

  • Computational models provide a new form of evidence
  • Not replacing interpretation, but supplementing it

8. Tensions and Critiques

Despite its sophistication, Underwood’s approach raises critical questions.

(1) Quantification vs. Interpretation

  • Can statistical patterns capture literary meaning?

(2) Loss of Aesthetic Detail

  • Style, irony, and narrative complexity remain difficult to model

(3) Dependence on Archives

  • Results reflect available digitized texts
  • Not the totality of literary production

9. Relation to Broader Digital Humanities

Underwood’s work represents a second phase in digital humanities:

  • First phase: Demonstration (e.g., Jockers)
  • Second phase: Refinement and critique

It moves the field toward:

  • Methodological self-awareness
  • Theoretical integration

Conclusion

The research of Ted Underwood marks a significant advancement in the application of topic modeling to literary studies. By focusing on temporal dynamics, genre evolution, and methodological rigor, it transforms topic modeling from a tool of thematic discovery into an instrument of historical analysis.

Through the use of Latent Dirichlet Allocation and complementary techniques, Underwood demonstrates that literary change can be measured, modeled, and interpreted as a dynamic system of shifting linguistic patterns.

The broader implication is profound:

Literary history is not a sequence of fixed periods and stable genres, but a continuous process of transformation—one that can be traced through the statistical evolution of language itself.