(11 Nov 2024) OpenAI, Google, Meta and Anthropic all rely deeply on content from premium publishers to train the large language models, or LLMs, at the heart of their AI efforts, even as these companies have regularly underplayed their use of such copyrighted content, according to new research released this week from online publishing giant Ziff Davis.
Ziff Davis owns CNET, as well as a host of other brands, including IGN, PCMag, Mashable and Everyday Health.
A paper detailing the research, authored by Ziff Davis' George Wukoson, lead attorney on AI, and Chief Technology Officer Joey Fortuna, reports that AI companies intentionally filtered out low-quality content in favor of high-quality, human-made content to train their models. Given that AI companies want their models to perform well, it makes sense that they'd favor quality content in their training data. The paper says AI companies used websites' domain authority, essentially their ranking in Google search, to make those distinctions. Generally, sources that rank higher on Google tend to be of higher quality and trustworthiness.
Read more on CNET.