Top Papers of the Week (Mar 11 - Mar 17)
2. Scaling Instructable Agents Across Many Simulated Worlds ( webpage | paper )
DeepMind presents new research on a Scalable Instructable Multiworld Agent (SIMA) that can follow natural-language instructions to carry out tasks in a variety of video game settings.
3. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context ( paper )
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio.
4. Multistep Consistency Models ( paper )
In this paper we propose Multistep Consistency Models: a unification of Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model, trading off sampling speed against sampling quality. Specifically, a 1-step consistency model is a conventional consistency model, whereas we show that an ∞-step consistency model is a diffusion model.
5. StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control ( code | paper )
StreamMultiDiffusion is a real-time, interactive framework for generating images from multiple user-assigned regional text prompts.
6. Video Editing via Factorized Diffusion Distillation ( paper )
We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model.
7. Stealing Part of a Production Language Model ( paper )
We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access.
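A key intuition reported for this attack is that a transformer's final logits are a linear projection of a hidden state whose width is much smaller than the vocabulary, so stacking logit vectors from many queries yields a matrix whose numerical rank reveals the hidden dimension. The toy simulation below is a minimal sketch of that idea only, not the paper's actual API procedure; the model here is a hypothetical stand-in with known dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_vocab = 64, 1000

# Secret embedding projection layer of the "black-box" model.
W = rng.normal(size=(d_vocab, d_hidden))

def query_logits():
    """Simulate one API query: an unknown hidden state projected to logits."""
    h = rng.normal(size=d_hidden)
    return W @ h

# Collect logit vectors from many queries and stack them into a matrix.
Q = np.stack([query_logits() for _ in range(200)])  # shape (200, d_vocab)

# Every row lies in the d_hidden-dimensional row space of W,
# so the numerical rank of Q exposes the hidden dimension.
s = np.linalg.svd(Q, compute_uv=False)
rank = int((s > s[0] * 1e-8).sum())
print(rank)  # 64
```

Recovering the projection matrix itself (up to symmetries, as the paper states) requires further steps beyond this rank argument.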
8. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training ( paper )
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons.
9. Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking ( paper )
When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text.
10. RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Horizon Generation ( webpage | paper )
We explore how iteratively revising a chain of thoughts with the help of information retrieval significantly improves large language models' reasoning and generation ability in long-horizon generation tasks, while substantially mitigating hallucination. In particular, the proposed method -- *retrieval-augmented thoughts* (RAT) -- revises each thought step one by one with information retrieved for the task query and the current and past thought steps, after the initial zero-shot CoT is generated.
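The RAT procedure described above can be sketched as a simple loop: draft a zero-shot chain of thought, then revise each step with documents retrieved for the query plus the thoughts produced so far. This is a hypothetical skeleton, assuming stub `llm()` and `retrieve()` functions in place of a real LLM API and a real search index:

```python
def llm(prompt: str) -> str:
    """Stub for an LLM call; a real system would query a model API here."""
    return f"<thought for: {prompt[:40]}>"

def retrieve(query: str) -> list[str]:
    """Stub for a retriever; a real system would search an index or the web."""
    return [f"<doc relevant to: {query[:30]}>"]

def rat(task: str, num_steps: int = 3) -> list[str]:
    # 1. Generate an initial zero-shot chain of thought.
    thoughts = [llm(f"Step {i + 1} of reasoning about: {task}")
                for i in range(num_steps)]
    # 2. Revise each thought step one by one, retrieving with the
    #    task query plus the current and past thought steps.
    for i in range(num_steps):
        evidence = "\n".join(retrieve(task + " " + " ".join(thoughts[:i + 1])))
        thoughts[i] = llm(
            f"Revise this step using the evidence.\n"
            f"Evidence:\n{evidence}\nStep: {thoughts[i]}"
        )
    return thoughts

final_thoughts = rat("summarize the causes of the 2008 financial crisis")
print(len(final_thoughts))  # 3
```

The retrieval query growing with past thoughts is what makes each revision context-aware rather than a one-shot lookup.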
AIGC News of the Week (Mar 11 - Mar 17)
1. Open-Sora: Democratizing Efficient Video Production for All ( repo )
2. OpenAI transformer-debugger ( link )
3. What I learned from looking at 900 most popular open source AI tools ( link )
4. Intro to LLM Agents with Langchain: When RAG is Not Enough ( link )
5. AI safety is not a model property ( link )
more AIGC News: AINews