Computer Science > Computation and Language
Title: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
(Submitted on 8 Mar 2024 (v1), last revised 25 Apr 2024 (this version, v2))
Abstract: In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submission history
From: Rhys May
[v1] Fri, 8 Mar 2024 18:54:20 GMT (7059kb)
[v2] Thu, 25 Apr 2024 16:34:26 GMT (21758kb,D)