Large Language Models Are Blurry JPEGs of the Web

Ted Chiang, the New Yorker:

[…] Think of ChatGPT as a blurry JPEG of all the text on the Web. It retains much of the information on the Web, in the same way that a JPEG retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation. But, because the approximation is presented in the form of grammatical text, which ChatGPT excels at creating, it’s usually acceptable. You’re still looking at a blurry JPEG, but the blurriness occurs in a way that doesn’t make the picture as a whole look less sharp.

This analogy to lossy compression is not just a way to understand ChatGPT’s facility at repackaging information found on the Web by using different words. It’s also a way to understand the “hallucinations,” or nonsensical answers to factual questions, to which large language models such as ChatGPT are all too prone. […]

I shamelessly cribbed the Xerox example in the last post from this excellent article. The similarities between many of these machine learning models are apparent: computational photography uses trained guesswork to reconstruct detail in blurry images, just as GPT and similar models draw on a vast library of language to produce a best guess at seemingly precise phrases.