More than 1.5 Million Historical Newspaper Images Now Discoverable Online with Newspaper Navigator
(15 Sept 2020) The public can now explore more than 1.5 million historical newspaper images online and free of charge. The latest machine learning experience from Library of Congress Labs, Newspaper Navigator allows users to search visual content in American newspapers dating from 1789 to 1963.
The user begins by entering a keyword that returns a selection of photos. Then the user can choose photos to search against, allowing the discovery of related images that were previously undetectable by search engines.
For decades, partners across the United States have collaborated to digitize newspapers through the Library’s Chronicling America website, a database of historical U.S. newspapers. The text of the newspapers is made searchable by character recognition technology, but users looking for specific images were required to page through the individual issues. Through the creative ingenuity of Innovator in Residence Benjamin Lee and advances in machine learning, Newspaper Navigator now makes images in the newspapers searchable by enabling users to search by visual similarity.
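The release does not describe how Newspaper Navigator measures visual similarity. As a rough illustration of the general idea, the Python sketch below assumes each image has already been reduced to a numeric feature vector and ranks the rest of a collection by cosine similarity to the images a user selected. The function names, image names and three-number vectors are all hypothetical and are not taken from the project's code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_by_similarity(selected, library, top_k=3):
    """Rank unselected images by similarity to the user's selected images.

    `library` maps an image name to its feature vector (assumed precomputed).
    """
    dims = len(next(iter(library.values())))
    # Average the selected images' vectors into a single query vector.
    query = [
        sum(library[name][i] for name in selected) / len(selected)
        for i in range(dims)
    ]
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in library.items()
        if name not in selected
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: real feature vectors would have many more dimensions.
LIBRARY = {
    "airplane_photo_1": [0.9, 0.1, 0.0],
    "airplane_photo_2": [0.8, 0.2, 0.1],
    "city_map": [0.0, 0.1, 0.9],
    "political_cartoon": [0.1, 0.9, 0.2],
}

results = rank_by_similarity(["airplane_photo_1"], LIBRARY, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In this toy data, selecting `airplane_photo_1` ranks `airplane_photo_2` first, because the two vectors point in nearly the same direction.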
To create Newspaper Navigator, Lee trained computer algorithms to sort through 16 million Chronicling America newspaper pages in search of photographs, illustrations, maps, cartoons, comics, headlines and advertisements. The idea for Lee’s groundbreaking project began with a Library crowdsourcing experiment by 2017 Innovator in Residence Tong Wang called Beyond Words, which invited members of the public to help identify cartoons, illustrations, photographs and advertisements in World War I-era newspapers. Users could draw boxes around visual content on a page, transcribe captions or review other users’ transcriptions.
“When I first encountered Beyond Words, I was captivated by the thousands of photographs, illustrations, cartoons and maps identified by volunteers. I began to wonder whether this identified visual content was the key to throwing open the treasure chest of visual content throughout all 16 million pages in Chronicling America using machine learning,” Lee said. He applied to the Library’s Innovator in Residence Program to find out.
While image search techniques from tech companies are not new, Newspaper Navigator marries cultural heritage with computer science. Users encounter a real-time demonstration of how algorithms are trained to scan millions of pieces of data in seconds. All code used in the project is open source and placed in the public domain for unrestricted re-use. The dataset code can be accessed at github.com/LibraryOfCongress/newspaper-navigator.
“As I am writing a history of editors in the early United States, Newspaper Navigator will be an invaluable tool for charting the visual culture of the press,” said Jim Casey, an assistant professor of African American Studies at Penn State University who was part of a test group for Newspaper Navigator. “It provides us with a wealth of clues about the work of editors (behind the scenes) to forge the look and feel of the first drafts of history. Ben Lee’s work at the LC Labs is a first-rate example of how computing can help us understand our cultural heritage in new and unexpected ways. I expect that the Newspaper Navigator platform is going to open up many new areas of research because it allows us to ask new kinds of questions.”
The Library’s longtime collaboration with the National Endowment for the Humanities created the National Digital Newspaper Program, which produces Chronicling America.
“Newspaper Navigator affords a whole new dimension of access to Chronicling America,” said Molly O’Hagan Hardy of the National Endowment for the Humanities. “Images and words on the printed newspaper page interact to construct meaning for readers past and present, and we miss half of that meaning making when our searches rely exclusively on the written text.”
Newspaper Navigator will allow greater access to a large collection and can enable new discoveries from historical newspapers, Hardy said.
“What inspires me about Newspaper Navigator is that it’s possible only through decades of collective vision and innovation,” said Kate Zwaard, the director of digital strategy at the Library of Congress. “Ben’s creative work builds on other open-source software projects, open data from Chronicling America scanned by libraries and archives across America, and the shared contributions of Beyond Words users. It allows us to see the exponential effect of sharing information and technology.”
Through experimentation, research and collaboration, LC Labs works to realize the Library’s vision that “all Americans are connected to the Library of Congress” by enabling the Library’s Digital Strategy. LC Labs is home to the Library of Congress Innovator in Residence Program; has nurtured experiments in machine learning and the use of collections as data; and has incubated the Library’s popular crowdsourced transcription program By the People. Learn more and subscribe to the monthly LC Labs newsletter at labs.loc.gov.
The Library of Congress is the world’s largest library, offering access to the creative record of the United States — and extensive materials from around the world — both on-site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore collections, reference services and other programs and plan a visit at loc.gov; access the official site for U.S. federal legislative information at congress.gov; and register creative works of authorship at copyright.gov.