Top Papers of the Week (Feb 12 - Feb 18)
1.) OpenAI Sora: Video generation models as world simulators ( webpage | technical report )
Sora is an AI model that can create realistic and imaginative scenes from text instructions.
More: I have created a collection of Sora reference papers on Hugging Face; I hope it is useful to you.
link: Sora Reference Papers
2.) Meta AI V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video ( webpage | paper | code )
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely with a feature prediction objective, without pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and evaluated on downstream image and video tasks. The results show that learning by predicting video features yields versatile visual representations that perform well on both motion- and appearance-based tasks without adapting the model's parameters; e.g., using a frozen backbone, the largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K.
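The feature-prediction objective above can be sketched in a few lines. This is a toy illustration only, not V-JEPA's actual architecture: the real model uses transformer encoders, an EMA target encoder, and a masked-token predictor, whereas here plain linear maps and a deterministic mask stand in, and all names and shapes are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat, n_patches = 16, 8, 10

# Toy stand-ins for the three networks (illustrative shapes, not the real ones)
W_ctx = rng.normal(size=(d_in, d_feat))     # "context encoder"
W_tgt = rng.normal(size=(d_in, d_feat))     # "target encoder" (held fixed; no gradient in practice)
W_pred = rng.normal(size=(d_feat, d_feat))  # "predictor"

patches = rng.normal(size=(n_patches, d_in))   # flattened video patches
mask = np.arange(n_patches) % 2 == 0           # every other patch hidden from the context

ctx_feats = patches[~mask] @ W_ctx   # encode only the visible patches
pooled = ctx_feats.mean(axis=0)      # crude context summary
pred = pooled @ W_pred               # predict features of the masked patches
target = patches[mask] @ W_tgt       # targets are features, not pixels

# L2 regression in feature space: no pixel reconstruction, no negatives
loss = float(np.mean((pred - target) ** 2))
```

The key design point the paper argues for is that the regression target lives in feature space (the target encoder's output) rather than pixel space, which avoids modeling irrelevant low-level detail.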
3.) Google: Our next-generation model: Gemini 1.5 ( webpage | paper )
Gemini 1.5 delivers dramatically enhanced performance and is more efficient to train and serve, thanks to a new Mixture-of-Experts (MoE) architecture.
4.) World Model on Million-Length Video And Language With RingAttention ( paper )
This paper presents the Large World Model (LWM), an AI model that combines video and language data, using RingAttention to train on sequences up to a million tokens long. LWM aims to understand human knowledge and the multimodal world, demonstrating strong performance on long-video understanding and text retrieval tasks, with an open-source release of a 7B-parameter model.
5.) OS-Copilot: Towards Generalist Computer Agents with Self-Improvement ( paper )
The paper introduces OS-Copilot, a framework for building computer agents that interact with various elements of an operating system (OS), including the web, code terminals, files, multimedia, and third-party applications. On top of it, the authors build FRIDAY, a self-improving embodied agent that automates general computer tasks. FRIDAY outperforms previous methods on the GAIA benchmark for general AI assistants by 35% and shows strong generalization by accumulating skills from earlier tasks; it also learns to control Excel and PowerPoint with minimal supervision. The framework and empirical findings lay a foundation for future research into more capable, general-purpose computer agents.
6.) LLM Agents can Autonomously Hack Websites ( paper )
The paper shows that advanced LLMs such as GPT-4 can autonomously hack websites, performing complex attacks without prior knowledge of the vulnerabilities. This capability raises concerns about the deployment of LLMs and their potential misuse in cybersecurity.
7.) Boximator: Generating Rich and Controllable Motions for Video Synthesis ( webpage | paper )
Boximator is a novel video synthesis method that uses hard and soft boxes to constrain object motion. It plugs into existing video diffusion models, improving motion control and video quality without altering the base model's weights; a self-tracking technique simplifies training, and human evaluators preferred Boximator's results.
8.) Keyframer: Empowering Animation Design using Large Language Models ( paper )
The paper introduces Keyframer, an AI-powered animation prototyping tool that leverages large language models (LLMs) to generate animations from static SVG images using natural language prompts. It supports iterative design through a combination of prompting and editing, allowing users to refine animations and request design variants. The study reveals that Keyframer empowers both novices and experts, enabling them to create animations with high-level design goals and maintain creative control throughout the process.
9.) DoRA: Weight-Decomposed Low-Rank Adaptation ( paper )
The paper introduces DoRA, a novel parameter-efficient fine-tuning method that decomposes pre-trained weights into magnitude and direction components, enhancing learning capacity without additional inference overhead. DoRA outperforms LoRA on various tasks, including commonsense reasoning, visual instruction tuning, and image/video-text understanding, while maintaining efficiency.
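The decomposition DoRA describes can be sketched directly: the pre-trained weight is split into a per-column magnitude and a direction, the direction is updated with a LoRA-style low-rank delta, and the result is renormalized and rescaled. The sketch below is a minimal NumPy illustration of the merged weight; the function name and shapes are my own, and the real method trains the magnitude and the low-rank factors while keeping the base weight frozen.

```python
import numpy as np

def dora_merge(W0, A, B, m):
    """Illustrative DoRA-style weight merge (hypothetical helper, not the paper's code).

    W0: frozen pre-trained weight, shape (d_out, d_in)
    A, B: low-rank factors; the direction update is B @ A, shapes (r, d_in) and (d_out, r)
    m: learned per-column magnitude vector, shape (d_in,)
    """
    V = W0 + B @ A                                        # direction updated via low-rank adaptation
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # per-column L2 norm
    return m * (V / col_norm)                             # unit directions rescaled by learned magnitudes
```

Because the merge produces an ordinary dense matrix, inference incurs no extra overhead relative to the base model, which is the efficiency property the summary mentions.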
10.) LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing ( paper )
LAVE is a video editing tool that integrates large language models (LLMs) to assist users in editing tasks through natural language commands. It features a language-augmented video gallery, storyboarding, and clip trimming functions, enhancing the editing process by reducing barriers and preserving user agency. A user study demonstrated LAVE's effectiveness and its positive impact on creativity and co-creation.
AIGC News of the Week (Feb 12 - Feb 18)
1.) NVIDIA CEO Jensen Huang downplays Altman’s AI chip fundraising ( link )
2.) Memory and new controls for ChatGPT ( link )
3.) Andrej Karpathy departs OpenAI and releases open-source minbpe ( X | minbpe )
4.) Thinking about High-Quality Human Data ( link )
5.) The Groq LPU™ Inference Engine: Purpose-built for inference performance and precision, all in a simple, efficient design ( link )