Google’s Tricky Gemini Demo

Surely by now you have seen Google’s Gemini demo. The company opens the video with this description:

We’ve been testing the capabilities of Gemini, our new multimodal AI model.

We’ve been capturing footage to test it on a wide range of challenges, showing it a series of images, and asking it to reason about what it sees.

What follows is a series of split-screen demos with a video on the left, Gemini’s seemingly live interpretation on the right, and a voiceover conversation between — I assume — a Google employee and a robotic voice reading the Gemini interpretation.

Google acknowledges in the video description that “latency has been reduced and Gemini outputs have been shortened for brevity”. Other than that, you might expect the video to show a real experience, albeit sped up; that is how I interpreted it.

Parmy Olson, Bloomberg:

In reality, the demo also wasn’t carried out in real time or in voice. When asked about the video by Bloomberg Opinion, a Google spokesperson said it was made by “using still image frames from the footage, and prompting via text,” and they pointed to a site showing how others could interact with Gemini with photos of their hands, or of drawings or other objects. In other words, the voice in the demo was reading out human-made prompts they’d made to Gemini, and showing them still images. That’s quite different from what Google seemed to be suggesting: that a person could have a smooth voice conversation with Gemini as it watched and responded in real time to the world around it.

If you read the disclaimer at the beginning of the demo in its most literal sense, Google did not lie, but that does not mean it was fully honest. I do not understand the need for trickery. The real story would undoubtedly have come to light, if not from an unnamed Google spokesperson then perhaps from someone internally feeling a guilty pang, and it undermines how impressive this demo is. And it is remarkable — so why not make the true version part of the story? I do not think I would have found it any less amazing if I had first seen a real-time demonstration of Gemini processing still frames from the video with its actual output, and then watched this simplified version.

Instead, I feel cheated.