(5 Feb 2025) OCLC Metadata Quality teams implement a variety of measures—both manual and automated—to improve the quality and usefulness of WorldCat data. These extensive and ongoing efforts ensure that WorldCat data supports the needs of our membership and our global network of thousands of libraries across a wide range of services. As the technologies and tools that allow us to do this important work evolve, we are continually exploring new methods for enriching, repairing, and de-duplicating WorldCat records—data that powers the global discovery and sharing of library resources.
At OCLC, we believe Artificial Intelligence (AI) is at its best when guided by human expertise. Our journey with AI is a partnership—where the insights and values of library professionals shape how AI serves communities. A core component of many AI systems is machine learning, which involves training algorithms on data to enable them to make predictions or decisions without explicit programming.
In August 2023, we implemented our first machine learning model for detecting duplicate bibliographic records as part of our ongoing efforts to reduce their presence in WorldCat. In the lead-up to this, we invited the cataloging community to participate in data labeling exercises, and more than 300 users provided feedback on approximately 34,000 potential duplicates to help validate our model’s understanding of duplicate records in WorldCat. This initiative led to the removal of approximately 5.4 million duplicate records for printed books in English and in other languages, including French, German, Italian, and Spanish.
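To make the machine learning step concrete, the sketch below shows how a pairwise duplicate classifier can be trained on labeled record pairs like those gathered through the community labeling exercise. The record fields, similarity features, and model choice here are illustrative assumptions, not a description of OCLC’s production system.

```python
# Minimal sketch of pairwise duplicate detection for bibliographic records.
# All field names, features, and the model choice are illustrative assumptions;
# OCLC has not published the features or architecture of its production model.
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def similarity(a: str, b: str) -> float:
    """Normalized string similarity between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(rec_a: dict, rec_b: dict) -> list:
    """Turn a pair of bib records into a numeric feature vector."""
    return [
        similarity(rec_a["title"], rec_b["title"]),
        similarity(rec_a["author"], rec_b["author"]),
        1.0 if rec_a["year"] == rec_b["year"] else 0.0,
        1.0 if rec_a["isbn"] and rec_a["isbn"] == rec_b["isbn"] else 0.0,
    ]

# Hypothetical labeled pairs, standing in for community-reviewed examples:
# 1 = duplicate, 0 = not a duplicate.
pairs = [
    ({"title": "Moby Dick", "author": "Melville, Herman", "year": "1851", "isbn": ""},
     {"title": "Moby-Dick", "author": "Melville, H.", "year": "1851", "isbn": ""}, 1),
    ({"title": "Moby Dick", "author": "Melville, Herman", "year": "1851", "isbn": ""},
     {"title": "White Jacket", "author": "Melville, Herman", "year": "1850", "isbn": ""}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Score a new candidate pair; in practice a confidence threshold (and human
# review of borderline cases) would gate whether a pair is actually merged.
candidate = pair_features(
    {"title": "Moby Dick; or, The Whale", "author": "Melville, Herman", "year": "1851", "isbn": ""},
    {"title": "Moby Dick", "author": "Melville, H.", "year": "1851", "isbn": ""},
)
print(model.predict_proba([candidate])[0])  # [P(not duplicate), P(duplicate)]
```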
We’ve now enhanced and extended our AI model to de-duplicate all formats, languages, and scripts in WorldCat. Leveraging the labeled data collected through community participation, we’ve tuned and optimized the machine learning model, completed extensive internal testing, and engaged WorldCat Member Merge libraries to provide external verification of the model’s performance.
On 11 February 2025, we will do a test run on 500,000 record pairs, targeting only English-language print books in WorldCat, merging 500,000 duplicate records into their retained counterparts. English-language print books represent the largest category of duplicates in WorldCat and are the format that has been most rigorously tested and improved in our machine learning de-duplication work to date. After this initial run, we will pause to evaluate the results before completing further de-duplication passes of WorldCat to address the remaining duplicate pairs for English-language print books. Once this category of materials is complete, de-duplication runs will follow for all non-book and non-English materials. We will provide updates as we initiate additional runs.
We recommend that libraries not using WorldShare Management Services enable WorldCat updates in WorldShare Collection Manager to ensure they receive the updated OCLC number (OCN) for held records that were merged. If you suspect an incorrect merge, report it to [email protected]. WorldCat Metadata Quality staff can view the history of merged records and recover them if needed.
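For libraries that maintain local exports of their holdings, the hypothetical sketch below shows one way to remap merged OCNs once a list of retained and merged number pairs is in hand. The file formats and column names (merged_ocns.csv, holdings.csv, an "ocn" column) are assumptions for illustration; libraries receiving updated records through WorldShare Collection Manager would not need such a step.

```python
# Hypothetical local cleanup: remap merged OCNs in an exported holdings file.
# File names and column layout are assumptions, not an OCLC-provided format.
import csv

def load_merge_map(path: str) -> dict:
    """Read merged/retained OCN pairs, e.g. lines of 'old_ocn,new_ocn'."""
    with open(path, newline="") as f:
        return {old: new for old, new in csv.reader(f)}

def remap_holdings(holdings_path: str, merge_map: dict, out_path: str) -> int:
    """Rewrite a holdings CSV, replacing any merged OCN with its surviving OCN."""
    updated = 0
    with open(holdings_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["ocn"] in merge_map:
                row["ocn"] = merge_map[row["ocn"]]
                updated += 1
            writer.writerow(row)
    return updated

if __name__ == "__main__":
    mapping = load_merge_map("merged_ocns.csv")  # hypothetical input file
    n = remap_holdings("holdings.csv", mapping, "holdings_updated.csv")
    print(f"Updated {n} holdings to surviving OCNs")
```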
Cleaning up duplicate records is one of the most impactful ways to improve the quality of WorldCat. WorldCat’s scale presents challenges, with data coming from varied sources, cataloging practices, and languages. Amplifying the manual efforts of metadata professionals with the latest AI technology has led to significant success in reducing the number of duplicates. This approach reinforces our commitment to quality and helps ensure that AI enables libraries to deliver accurate, streamlined experiences for users.