(11 Mar 2026) DataSeer, in collaboration with The Michael J. Fox Foundation (MJFF), has developed a new large language model (LLM)-based system designed to detect and quantify dataset reuse across the scholarly literature at scale.
The system aims to address a longstanding challenge for research funders and institutions: measuring the downstream impact of shared research data.
By automating the detection of dataset reuse in published research, the platform is intended to remove a key bottleneck that has previously limited large-scale analysis of how data is reused.
Built with DataSeer's Open Science Indicator partner PLOS and with input from the broader open science community, the LLM was piloted on a corpus of 6,000 MJFF-funded articles. Unlike traditional approaches that rely on formal data citations or digital object identifiers (DOIs), the model analyses the full text of research articles to identify reused datasets.
This allows the system to detect reuse even when datasets are referenced indirectly, such as through accession numbers, repository names, URLs, or narrative descriptions.
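To make the idea of indirect dataset references concrete, the sketch below scans a passage for three of the cue types listed above: accession numbers, repository names, and URLs. It is a simple rule-based illustration and not DataSeer's LLM-based method. The pattern list, the repository names, the function name `find_dataset_mentions`, and the sample sentence are all assumptions made for the example.

```python
import re

# Hypothetical patterns for illustration only; these are not DataSeer's rules.
PATTERNS = {
    # A few common accession prefixes followed by digits (assumed list).
    "accession": re.compile(r"\b(?:GSE|SRP|PRJNA)\d+\b"),
    # A small, assumed list of repository names.
    "repository": re.compile(r"\b(?:Zenodo|Dryad|Figshare|GEO)\b"),
    # Any http(s) URL, stopping at whitespace or a closing bracket.
    "url": re.compile(r"https?://[^\s)\]]+"),
}


def find_dataset_mentions(text):
    """Return a dict mapping each cue type to its sorted unique matches."""
    return {kind: sorted(set(p.findall(text))) for kind, p in PATTERNS.items()}


# Made-up sentence with no formal data citation or DOI.
sample = (
    "We reanalysed expression data from GEO (GSE12345) and clinical "
    "records downloaded from https://example.org/data last year."
)
mentions = find_dataset_mentions(sample)
print(mentions)
```

Fixed patterns like these cannot recognise narrative descriptions of a dataset, which is one of the reference types the full-text model is described as handling.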