(11 Nov 2024) OpenAI, Google, Meta and Anthropic all rely deeply on content from premium publishers to train the large language models, or LLMs, at the heart of their AI efforts, even as these companies have regularly underplayed their use of such copyrighted content, according to new research released this week from online publishing giant Ziff Davis.
Ziff Davis owns CNET, as well as a host of other brands, including IGN, PCMag, Mashable and Everyday Health.
A paper detailing the research, authored by Ziff Davis' George Wukoson, lead attorney on AI, and Chief Technology Officer Joey Fortuna, reports that AI companies intentionally filtered out low-quality content in favor of high-quality, human-made content to train their models. Given that AI companies want their models to perform well, it makes sense that they'd favor quality content in their training data. The paper says AI companies used websites' domain authority, essentially their ranking in Google search, to make those distinctions. Generally, sources that rank higher on Google tend to be of higher quality and trustworthiness.
Read more on CNET.