• Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset

    Benjamin Lee
    Critical theory, Data mining, Digital humanities, Digital libraries, Machine learning, Science--Study and teaching, Technology--Study and teaching
    Item Type:
    Chronicling America, data archaeology, digitized newspapers, Library of Congress, Newspaper Navigator, Critical data studies, Science and technology studies (STS)
    Permanent URL:
    The increasing role of machine learning in the construction of cultural heritage and humanities datasets necessitates critical examination of the myriad biases introduced by machines, algorithms, and the humans who build and deploy them. From image classification to OCR, the effects of decisions ostensibly made by machines compound through the digitization pipeline and redouble in each step, mediating our interactions with digitally rendered artifacts through the search and discovery process. Here, I consider the Library of Congress’s Newspaper Navigator dataset, which I created as part of the Library of Congress’s Innovator-in-Residence program. The dataset consists of visual content extracted from 16 million historic newspaper pages in the Chronicling America database using machine learning. In this data archaeology, I examine the ways in which a Chronicling America newspaper page is transmuted and decontextualized during its journey from a physical artifact to a series of probabilistic photographs, illustrations, maps, comics, cartoons, headlines, and advertisements in the Newspaper Navigator dataset. I consider the digitization journeys of four different pages in Black newspapers in Chronicling America that reproduce the same photograph of W.E.B. Du Bois. In tracing the pages’ journeys, I unpack how each step in the pipelines, such as the imaging process and the construction of training data, not only imprints bias on the resulting Newspaper Navigator dataset but also propagates the bias via the machine learning algorithms employed. I investigate the limitations of the Newspaper Navigator dataset and of machine learning as it relates to cultural heritage, from marginalization and erasure via algorithmic bias to unfair labor practices in the construction of commonly used datasets. I argue that any use of machine learning with cultural heritage must be done with an understanding of the broader socio-technical ecosystems in which the algorithms have been utilized.
    Last Updated:
    3 years ago


    Item Name: pdf bcgl-nn-data-archaeology.pdf