Longer online version: print version here

Google ’fessed up to mass copyright breach

...then it apparently unconfesses

INTERNET giant Google has confessed to keeping all the text it can find - for the purposes of training machine learning systems, so-called "artificial intelligence". Gizmodo notes that over the weekend of 2-3 July Google changed its privacy policy to announce that "we use publicly available information to help train Google's AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities."

Previous iterations of the policy referred to use for "language models". We had long assumed that Google was pouring everything it could find into these - Google Translate and Bard, Google's rival to ChatGPT. All these are "large language models" - and are labelled "artificial intelligence" for marketing purposes.

And we assume that this practice is unlawful - a breach of copyright. Certainly, research has shown that it is possible to ask such a system to spit out text on which it was trained. Indeed, in late June it was found that the instance of ChatGPT attached to Microsoft's Bing search engine could easily be instructed to reveal the text of an article behind a paywall - though it's far from clear whether the "AI" component is involved in this, rather than the web archive kept by the plain old search engine. Microsoft withdrew the feature after less than a week.

The rub is that we cannot afford to sue Google to find out. The practice is certainly contrary to online licences such as that under which the Freelance makes words available online.

One implication of Google's change is that the privacy policy now clearly states that Google will ingest images and music as well as text. And a privacy policy is a strange place to make such an announcement.

Oddly, as we wrote, the current version of the Google privacy policy, accessed from the UK, does not include that section on "publicly accessible sources". Has Google unconfessed?