The Evolution of AI Research: Examining Sakana's LLM Agent

Published On Wed Aug 21 2024

TAI #113; Sakana's AI Scientist: Are LLM Agents Ready To Assist AI Research?

This week, xAI’s Grok-2 joined the growing crowd of broadly GPT-4-class models, which now includes models from OpenAI, Anthropic, Google DeepMind, xAI, Meta, Mistral, and DeepSeek. Anthropic also launched a prompt caching option that cuts the cost of reused input tokens by up to 10x. We recently flagged that context caching opens up many new opportunities, including for complex LLM agent pipelines, and on this note, Sakana AI introduced “The AI Scientist,” an LLM agent for assisting machine learning research.
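
As a rough sketch, marking a large reused prefix as cacheable with Anthropic’s Python SDK looks something like the snippet below. The beta header and exact parameter shapes reflect the feature at launch and may have changed since, so treat this as illustrative rather than canonical:

```python
# Illustrative sketch of Anthropic prompt caching (assumes the anthropic Python SDK
# and access to the caching beta; details may differ from the current API).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A large prefix that many calls reuse (e.g., a codebase or a pile of papers).
reference_text = open("reference_material.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": reference_text,
            # Marking the block as cacheable lets later requests reuse it
            # at a reduced input-token price instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key ideas above."}],
    # Prompt caching launched behind a beta header.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.content[0].text)
```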

Sakana’s agent begins by brainstorming new ideas using an initial topic and codebase (provided by a human researcher) and performs a literature search to review its ideas for novelty. It then plans and executes code-based experiments and gathers and visualizes data before writing a full research paper. It also includes an automated LLM peer review process that evaluates these papers. We think Sakana’s agent includes a strong feedback loop that can drive continuous improvement. In particular, its “peer reviewer” agent can be used to filter and label good and bad examples of ideas, experiments, and papers, and the agent can learn from both in the future.
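
To make that loop concrete, here is a heavily simplified, hypothetical sketch of the idea → novelty check → experiment → paper → review cycle. None of the prompts or function names come from Sakana’s actual codebase, and each stage is reduced to a single LLM call:

```python
from typing import Callable

def ai_scientist_cycle(llm: Callable[[str], str], topic: str, codebase: str) -> list[dict]:
    """One hypothetical pass of an AI-Scientist-style loop (illustrative only, not Sakana's code)."""
    # 1. Brainstorm candidate ideas from the human-provided topic and codebase.
    ideas = llm(
        f"Propose 3 novel, testable ML research ideas on '{topic}' "
        f"given this codebase:\n{codebase}"
    ).split("\n\n")

    outputs = []
    for idea in ideas:
        # 2. Novelty check (the real agent also queries a literature search API).
        verdict = llm(f"Is this idea novel relative to prior work? Answer yes/no, then explain.\n{idea}")
        if not verdict.strip().lower().startswith("yes"):
            continue

        # 3. Plan experiments; the real agent then edits and runs code to produce results.
        plan = llm(f"Write an experiment plan (code changes, metrics, plots) for:\n{idea}")
        results = "<experiment results and figures would be produced here>"

        # 4. Write the paper, then have a reviewer agent score it.
        paper = llm(f"Write a short research paper.\nIdea: {idea}\nPlan: {plan}\nResults: {results}")
        review = llm(f"Act as a conference reviewer. Give a 1-10 score and justification:\n{paper}")

        # 5. Reviews can later be used to filter outputs and label good/bad examples.
        outputs.append({"idea": idea, "paper": paper, "review": review})
    return outputs
```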

Challenges and Future Improvements

Currently, this agent has many shortcomings, and the papers it produces are not of great quality. Sakana measures the average cost of these papers at under $15. Given that plausible-looking papers can be created at such a low cost, the agent even poses a risk to research integrity, with journals and peer reviewers facing inboxes flooded with hard-to-identify, low-quality AI-generated submissions from people using these agents irresponsibly. However, the results are still impressive, and I see many obvious next steps to improve the agent, e.g., adding multimodal capabilities, giving relevant papers to the model via long context, RAG, or fine-tuning, and scaling up the inference budget for parts of the pipeline.
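
On the RAG point specifically, a minimal version of “give relevant papers to the model” is just a retrieval layer over paper abstracts. The sketch below is not part of Sakana’s pipeline; it uses TF-IDF via scikit-learn purely as a stand-in for a real embedding model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_related_papers(idea: str, abstracts: list[str], k: int = 3) -> list[str]:
    """Return the k abstracts most similar to the idea (toy TF-IDF retrieval)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    n = len(abstracts)
    matrix = vectorizer.fit_transform(abstracts + [idea])  # rows 0..n-1 are papers, row n is the query
    scores = cosine_similarity(matrix[n], matrix[:n]).ravel()
    top_k = scores.argsort()[::-1][:k]
    return [abstracts[i] for i in top_k]

# The retrieved abstracts would then be prepended to the agent's prompts so that
# its novelty checks and related-work sections are grounded in real literature.
```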

I think Sakana’s implementation is impressive and ties into the power of the “inference-time scaling laws” we discussed in recent weeks. Many people criticize the “scale is all you need” hypothesis of LLMs’ march to AGI, but in reality, very few people believe in this on its own, and many different avenues are being pursued to progress LLM capabilities. We can achieve new capabilities via agent pipelines or research breakthroughs without larger training budgets. In fact, one of the key benefits of the training compute vs. capability scaling laws for LLMs is that even risking very small compute budgets on small-scale (and perhaps LLM-agent-managed) experiments can potentially produce insights that scale up 5+ orders of magnitude and get integrated into SOTA models.
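
For context, one well-known instance of such a law is the Chinchilla-style parametric fit from Hoffmann et al. (2022), which models loss as a smooth function of parameter count and training tokens; that smoothness is exactly what makes extrapolating from small, cheap runs plausible:

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

Here, N is the number of parameters, D the number of training tokens, E the irreducible loss, and A, B, α, and β are constants fitted from relatively small training runs.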

Human Amplifier and Research Support

Sakana’s agent does, however, touch on a sensitive subject: many people are resistant to the rush to hand human work over to AI and are also very skeptical that we are remotely close to LLMs helping in actual scientific research. In this case, however, we still see Sakana’s agent primarily as a human amplifier to aid in incremental research, which will work best with an experienced AI scientist proposing interesting ideas and codebases that they think are a promising research direction. As with any GenAI tool, many people are likely to be lazy and use these agents irresponsibly; however, I can imagine many ways to use an AI scientist agent effectively and diligently.

ML Research and LLM Research Agents

Other things also make ML research particularly well suited to LLM research agent assistants: the high availability of open-source code and papers, purely cloud-based experiments, and the fact that the ML engineers using the agent can understand both the agent and the papers it produces well enough to judge their quality. Sakana is a respected AI research lab, and it wouldn’t surprise me if other leading AI labs like OpenAI and DeepMind were working on similar technologies in-house. It remains to be seen whether any of these agents can really be used to aid scientists in truly novel research.

  1. xAI’s Grok-2 Beta release
  2. Anthropic Introduced Prompt Caching
  3. Perplexity Answers 250 Million Questions a Month, Showing Growing Appetite for AI Search