The Evolving LLM: Multimodality, Agents, and the Future of AI

Introduction:

We've explored the foundations of LLM training. Now let's look at what's coming next: multimodality, AI agents, pervasive integration, and why staying informed matters in this rapidly evolving field.

1. Multimodality: Beyond Text
  • The Concept:
    • LLMs are evolving beyond text to incorporate audio and images.
    • This means LLMs will be able to "hear," "speak," and "see," enabling more natural and intuitive interactions.
  • How it Works:
    • Audio and images are tokenized, just like text.
    • Audio spectrogram slices and image patches are converted into sequences of tokens (a patchification sketch follows this list).
    • These tokens are integrated into the LLM's context window, allowing it to process and generate multimodal content.
  • The Impact:
    • Multimodality will enable LLMs to understand and respond to a wider range of inputs.
    • This will lead to more versatile and powerful AI applications.
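
To make the tokenization step concrete, here is a minimal NumPy sketch of ViT-style image patchification. The patch size, image size, and projection matrix are illustrative assumptions, not any particular model's values; audio works analogously, with spectrogram slices taking the place of image patches.

```python
import numpy as np

def image_to_patch_tokens(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches, one "token" per patch.

    Assumes H and W are divisible by patch_size, for simplicity.
    """
    h, w, c = image.shape
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)        # group the two patch-grid axes together
             .reshape(-1, patch_size * patch_size * c)
    )
    return patches  # shape: (num_patches, patch_dim)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for a real image
tokens = image_to_patch_tokens(image)         # (196, 768)
projection = rng.random((16 * 16 * 3, 4096))  # hypothetical embedding matrix
patch_embeddings = tokens @ projection        # (196, 4096): rows that can sit
                                              # alongside text tokens in the context window
```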
2. AI Agents: Long-Running Tasks
  • The Challenge:
    • Current LLMs excel at individual tasks but struggle with long-running, complex jobs.
    • They lack the ability to coherently string together multiple tasks and perform error correction over extended periods.
  • The Solution:
    • AI agents are being developed to perform tasks over time, with human supervision.
    • These agents will be able to plan, execute, and report on progress for complex projects (a minimal loop is sketched after this list).
  • The Analogy:
    • Just as factories have human-to-robot ratios, we'll see human-to-agent ratios in the digital domain.
    • Humans will act as supervisors, guiding and overseeing the work of AI agents.
  • The Importance:
    • AI agents will enable LLMs to tackle more ambitious and complex tasks.
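
The loop below is a minimal sketch of that plan/execute/report cycle with a human approval gate. All of the callables (`llm`, `run_tool`, `approve`) are hypothetical stand-ins, not any real agent framework's API.

```python
from typing import Callable

def run_agent(goal: str,
              llm: Callable[[str], str],        # model call: prompt -> next action
              run_tool: Callable[[str], str],   # executes an action, returns a result
              approve: Callable[[str], bool],   # the human supervisor's approval gate
              max_steps: int = 20) -> list[str]:
    history: list[str] = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        # Plan: ask the model for the next action given progress so far.
        action = llm("\n".join(history) + "\nNext action?")
        if action.strip() == "DONE":
            break
        # Supervision: the agent pauses until a person approves the step.
        if not approve(action):
            history.append(f"REJECTED: {action}")
            continue
        # Execute and record the result, so the agent can error-correct.
        result = run_tool(action)
        history.append(f"ACTION: {action}\nRESULT: {result}")
    return history  # the transcript doubles as the progress report
```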
3. Action Taking and Pervasive Integration
  • Action Taking:
    • LLMs are gaining the ability to take actions on behalf of users, such as performing keyboard and mouse actions.
    • This will enable LLMs to automate tasks and interact with digital environments more directly (see the dispatch sketch after this section).
  • Pervasive Integration:
    • LLMs are becoming increasingly integrated into everyday tools and applications.
    • They're becoming more "invisible," seamlessly enhancing our digital experiences.
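
As a sketch of what keyboard-and-mouse action taking can look like, the function below dispatches a model-emitted action to the operating system via the pyautogui GUI-automation library. The JSON action schema is a made-up example for illustration; real systems each define their own.

```python
import json

import pyautogui  # GUI-automation library (pip install pyautogui)

def execute_action(action_json: str) -> None:
    """Dispatch one model-emitted action to the OS.

    The {"type": ...} schema here is hypothetical, not a standard.
    """
    action = json.loads(action_json)
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])  # move and left-click
    elif action["type"] == "type":
        pyautogui.write(action["text"])            # type a string
    elif action["type"] == "press":
        pyautogui.press(action["key"])             # press one key, e.g. "enter"
    else:
        raise ValueError(f"unknown action type: {action['type']}")

# Example: the model emits an action as a tool call, and we execute it.
# execute_action('{"type": "click", "x": 640, "y": 360}')
```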
4. The Need for Continued Research: Test-Time Training
  • The Current Paradigm:
    • LLMs are trained offline and then deployed for inference with fixed parameters.
    • They learn through in-context learning, but they don't update their parameters based on test-time experiences.
  • The Human Analogy:
    • Humans learn and adapt throughout their lives, including during sleep.
    • LLMs lack this ability to continuously update their knowledge and skills.
  • The Challenge:
    • The context window is a finite resource, especially for long-running multimodal tasks.
    • New approaches are needed to enable LLMs to learn and adapt over time.
  • The Opportunity:
    • Test-time training, updating a model's weights during or after deployment, could make LLMs more adaptable and efficient (a toy version is sketched below).
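
The PyTorch sketch below shows one toy form of test-time training: before answering, the model takes a few gradient steps on the test input's own next-token loss, so what it learns outlives the context window. It assumes a causal LM whose forward pass maps token ids to (batch, seq, vocab) logits; the loss choice, step count, and learning rate are illustrative.

```python
import copy

import torch
import torch.nn.functional as F

def predict_with_ttt(model: torch.nn.Module,
                     x: torch.Tensor,        # (batch, seq) token ids
                     steps: int = 3,
                     lr: float = 1e-4) -> torch.Tensor:
    """Adapt a *copy* of the model on the test input's own next-token loss,
    then predict with the adapted weights. The deployed model stays frozen."""
    adapted = copy.deepcopy(model)
    adapted.train()
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        logits = adapted(x[:, :-1])          # predict each next token of x
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               x[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(x)                    # inference with the adapted parameters
```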
5. Staying Updated: Key Resources
  • LM Arena:
    • An LLM leaderboard that ranks models based on blind, head-to-head human comparisons (the rating update behind such rankings is sketched below).
    • Provides insights into the relative performance of different LLMs.
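
Arena-style leaderboards typically turn those pairwise human votes into rankings with an Elo-style rating system. Below is a sketch of a single rating update; k = 32 and the 400-point scale are conventional Elo defaults, and the real leaderboard's aggregation may differ.

```python
def elo_update(rating_a: float, rating_b: float,
               a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """One Elo rating update from a single human A-vs-B preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two models start equal; model A wins one blind human comparison.
print(elo_update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```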