Google's generative AI can now analyze hours of video
Gemini, Google’s family of generative AI models, can now analyze longer documents, codebases, videos and audio recordings than before.
During a keynote at the Google I/O 2024 developer conference Tuesday, Google announced the private preview of a new version of Gemini 1.5 Pro, the company’s current flagship model, that can take in up to 2 million tokens. That’s double the previous maximum amount.
At 2 million tokens, the new version of Gemini 1.5 Pro supports the largest input of any commercially available model. The next-largest, Anthropic’s Claude 3, tops out at 1 million tokens.
In the AI field, “tokens” refer to subdivided bits of raw data, like the syllables “fan,” “tas” and “tic” in the word “fantastic.” Two million tokens is equivalent to around 1.4 million words, two hours of video or 22 hours of audio.
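For a rough sense of how that accounting works in practice, here's a minimal sketch using the count_tokens method in Google's google-generativeai Python SDK; the API key and prompt are placeholders:

```python
import google.generativeai as genai

# The API key and prompt here are placeholders, not from Google's announcement.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# count_tokens reports how many tokens a prompt consumes — the same unit
# that counts against the model's 2-million-token context window.
response = model.count_tokens("What a fantastic keynote!")
print(response.total_tokens)
```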
Beyond being able to analyze large files, models that can take in more tokens can sometimes achieve improved performance.
Unlike models with small maximum token inputs (otherwise known as context windows), models such as the 2-million-token Gemini 1.5 Pro won’t easily “forget” content from earlier in long conversations and veer off topic. Large-context models can also, hypothetically at least, better grasp the flow of the data they take in and generate contextually richer responses.
Developers interested in trying Gemini 1.5 Pro with a 2-million-token context can add their names to the waitlist in Google AI Studio, Google’s generative AI dev tool. (Gemini 1.5 Pro with 1-million-token context launches in general availability across Google’s developer services and surfaces in the next month.)
Beyond the larger context window, Google says that Gemini 1.5 Pro has been “enhanced” over the last few months through algorithmic improvements. It’s better at code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding, Google says. And in the Gemini API and AI Studio, 1.5 Pro can now reason across audio in addition to images and video — and be “steered” through a capability called system instructions.
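Here’s what that steering looks like in a minimal sketch against the google-generativeai Python SDK; the model name and instruction text are illustrative:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# A system instruction "steers" the model's behavior for every
# prompt in the session, without repeating it in each request.
model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    system_instruction="You are a code reviewer. Answer tersely, in bullet points.",
)

print(model.generate_content(
    "Review this function: def add(a, b): return a - b"
).text)
```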
Gemini 1.5 Flash, a faster model
For less demanding applications, Google is launching Gemini 1.5 Flash in public preview, a “distilled” version of Gemini 1.5 Pro: a small, efficient model built for “narrow,” “high-frequency” generative AI workloads. Flash, which also has up to a 2-million-token context window, is multimodal like Gemini 1.5 Pro, meaning it can analyze audio, video and images as well as text (but it generates only text).
“Gemini Pro is for much more general or complex, often multi-step reasoning tasks,” Josh Woodward, VP of Google Labs, one of Google’s experimental AI divisions, said during a briefing with reporters. “[But] as a developer, you really want to use [Flash] if you care a lot about the speed of the model output.”
Woodward added that Flash is particularly well-suited for tasks such as summarization, chat apps, image and video captioning and data extraction from long documents and tables.
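As an illustration of one of those high-frequency tasks, here’s a hedged sketch of image captioning with Flash via the Python SDK; the model identifier is an assumption modeled on the naming used for 1.5 Pro, and the image file is a placeholder:

```python
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Flash trades some reasoning depth for latency; the model name here
# is assumed to follow the "gemini-1.5" naming convention.
model = genai.GenerativeModel("gemini-1.5-flash")

# Multimodal input, text-only output: caption a local image.
image = PIL.Image.open("keynote_slide.png")  # placeholder file
print(model.generate_content([image, "Caption this image in one sentence."]).text)
```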
Flash appears to be Google’s answer to small, low-cost models served via APIs like Anthropic’s Claude 3 Haiku. Along with Gemini 1.5 Pro, it’s now available in over 200 countries and territories, including the European Economic Area, the U.K. and Switzerland. (The 2-million-token context version is gated behind a waitlist, however.)
In another update aimed at cost-conscious devs, all Gemini models, not just Flash, will soon be able to take advantage of a feature called context caching. This lets devs store large amounts of information (say, a knowledge base or a database of research papers) in a cache that Gemini models can access quickly and, on a per-use basis, relatively cheaply.
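Google hasn’t published the caching interface yet, so the sketch below is an assumption about how such a flow could look in the Python SDK; the caching module and the CachedContent.create and from_cached_content names are not confirmed API:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching  # assumed module; feature is "coming soon"

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Upload the large corpus once and keep it server-side for an hour.
# These names and parameters are assumptions, not announced API.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    contents=[open("research_papers.txt").read()],  # placeholder corpus
    ttl=datetime.timedelta(hours=1),
)

# Subsequent prompts reference the cache instead of resending the files.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("Summarize the key findings.").text)
```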
The complementary Batch API, available in public preview today in Vertex AI, Google’s enterprise-focused generative AI development platform, offers a more cost-effective way to handle workloads such as classification and sentiment analysis, data extraction and description generation; it lets developers send multiple prompts to Gemini models in a single request.
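A batch job of that kind is typically fed a JSONL file where each line is one standalone request; the envelope format in this sketch is an assumption modeled on Vertex AI’s batch prediction conventions, not something spelled out in the announcement:

```python
import json

# Two illustrative prompts for a classification/extraction batch.
prompts = [
    "Classify the sentiment of: 'The keynote ran long but the demos landed.'",
    "Extract the product names from: 'Gemini 1.5 Pro and Flash debuted at I/O.'",
]

# Each JSONL line wraps one Gemini request; this envelope shape is an
# assumption based on Vertex AI's batch prediction input format.
with open("batch_requests.jsonl", "w") as f:
    for prompt in prompts:
        request = {"request": {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}}
        f.write(json.dumps(request) + "\n")

# The file would then be uploaded to Cloud Storage and submitted as a
# Vertex AI batch prediction job, which fans the prompts out to Gemini.
```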
Another new feature arriving later in the month in preview in Vertex, controlled generation, could lead to further cost savings, Woodward suggests, by allowing users to define Gemini model outputs according to specific formats or schemas (e.g. JSON or XML).
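Vertex’s controlled generation isn’t out yet, but the Gemini API’s existing JSON mode hints at the shape; in this sketch, response_mime_type constrains the output to valid JSON, standing in for the fuller schema controls to come:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")

# response_mime_type is the JSON-mode knob already in the Gemini API;
# the prompt and field names below are illustrative only.
response = model.generate_content(
    "List two Gemini 1.5 models with fields 'name' and 'context_tokens'.",
    generation_config=genai.GenerationConfig(response_mime_type="application/json"),
)
print(response.text)  # a JSON string the caller can parse directly
```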
“You’ll be able to send all of your files to the model once and not have to resend them over and over again,” Woodward said. “This should make the long context [in particular] way more useful — and also more affordable.”