Evaluating a Machine Learning Approach to Identifying Expressive Content at Page Level in HathiTrust

    Author(s):
    NIKOLAUS PARULIAN
    Contributor(s):
    Stephen Downie, Ryan Dubnicek, Kristina Hall, Yuerong Hu
    Date:
    2020
    Group(s):
    DH2020
    Subject(s):
    Digital libraries, Machine learning, Natural language processing (Computer science)
    Item Type:
    Conference proceeding
    Conf. Title:
    Digital Humanities 2020
    Conf. Org.:
    ADVANCE ISSUE OF DIGITAL SCHOLARSHIP IN THE HUMANITIES
    Conf. Loc.:
    Carleton University and the University of Ottawa, Ottawa, Canada
    Conf. Date:
    22-24 July 2020
    Tag(s):
    Natural language processing
    Permanent URL:
    http://dx.doi.org/10.17613/3nfw-tx25
    Abstract:
    HathiTrust currently provides metadata, scanned images, and full text for all public domain volumes. However, the front matter of most volumes, regardless of rights status, likely contains content that is of interest to scholars and free from restriction. For example, a title page or table of contents is likely non-expressive and useful for understanding a volume’s structure and subject matter. Some volumes, however, likely include expressive or creative content within their first 20 pages, so front matter cannot be made open for all volumes without first understanding which types of content occur most frequently in those pages. Exploring this entirely by hand is time-prohibitive, so we evaluate a machine learning approach to the task.
    Metadata:
    Status:
    Published
    Last Updated:
    3 years ago
    License:
    All Rights Reserved

    Downloads

    Item Name: pdf ischool_poster_htrc_parulian.pdf