Data and Source Materials for Mapping Texts
The Mapping Texts project team relied heavily on the Texas Digital Newspaper Collection’s archive of historic documents. The primary source material for this project was a set of 232,500 newspaper pages digitized and converted to plain text using optical character recognition (OCR). The team ran this collection through several computational analyses to explore the possibilities of computer-aided “distant reading” of large document collections.
List of Cities and Publications in Document Collection
Natural Language Processing Results
We are sharing data files containing the results of various natural language processing tools that we used on the document collection:
- Word Counts: Lists of words in descending order of frequency, sliced and diced by year, location, and title
- Named Entity Recognition Results: People, places, and organizations recognized in the text
- Topic Models: Clusters of commonly co-occurring words found in the collection as a whole, and as sliced and diced into different groupings (by historical era, by city and historical era, and by paper and historical era)
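To illustrate the simplest of these outputs, a frequency-ranked word list like the ones in this data set can be produced from OCR plain text with a few lines of Python. This is a minimal sketch, not the project's actual processing pipeline; the sample text and the tokenization rule are illustrative assumptions:

```python
from collections import Counter
import re

def word_counts(text):
    """Return (word, count) pairs in descending order of frequency."""
    # Lowercase and keep only alphabetic runs; real OCR output would
    # need extra cleanup (hyphenation across lines, OCR errors, etc.).
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()

# Hypothetical sample standing in for one OCR'd newspaper page.
sample = "Cotton prices rose. Cotton farmers in Texas watched cotton markets."
print(word_counts(sample)[:3])  # → [('cotton', 3), ('prices', 1), ('rose', 1)]
```

Grouping such counts by year, city, or publication title is then a matter of running the same tally over the corresponding subset of pages.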
Download the data set here: texas_newspapers_naturallanguageprocessing.tar.gz (748 MB, compressed archive)
Visualization Source Code
We are sharing the original source code of the interactive data visualizations created for this research project:
- Visualizing Digitization Quality
The source code for the interactive visualization of text recognition quality is available in a GitHub repository for downloading and re-use.
- Assessing Language Patterns
The source code for the interactive visualization of language patterns is available in a GitHub repository for downloading and re-use:
- Download Link: https://github.com/wi-design/Mapping-Texts