From Testing to Tuning

Table of contents
  1. From Testing to Tuning
    1. Tuning Ideas for Further Exploration
      1. Reinforcement Finetuning
    2. The Impact on Architecture and Design
      1. Tuning Tools
        1. InstructLab
        2. Open Instruct
    3. Experiments to Try
    4. What’s Next?
    5. Appendix: How Science Changes Its Mind…

Finally, could it be we are thinking about this all wrong? It is normal to attempt to bend our current Paradigm (see also the Appendix) to meet a new reality, rather than rethink the situation from first principles. We are still early in the generative AI “revolution”. We don’t really know what radically different approaches will emerge for any aspect of our use of AI, including how to perform sufficiently reliable testing.

With that in mind, are there more AI-native alternatives to our conventional ideas about testing, alternatives that work better with the Stochastic Behaviors of generative AI components? This chapter speculates on one possibility.

Highlights:

  1. It is still early in the generative AI “revolution”.
  2. We tend to apply our traditional approaches to new problems. Often this works well.
  3. However, we should expect completely new AI-driven approaches to problem solving to emerge, especially for new AI-driven challenges.
  4. One possible new approach is to shift attention from the traditional cycle of evolving code and tests together, where we use the tests to ensure compliance, to a more “active” process of continuous Tuning of models to meet evolving requirements.

Our standard approach to software development involves writing software and then testing that it works¹. Since models are Tunable, what if instead our development cycle includes routine, incremental model tuning steps that run until satisfactory behavior is achieved? In other words, what if we go from verifying desired behavior after the fact to coercing the desired behavior as part of the “building” process?

The verification role is still required for measuring when tuning is needed and how well it worked, so we will still need to write tests, i.e., Unit Benchmarks of some kind.
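
To make this idea concrete, here is a minimal sketch of what such a “tune until satisfactory” build step might look like. Everything here is a hypothetical placeholder, not a real API; the stubs stand in for whatever tuning framework and benchmark harness a team actually uses, and the threshold and round limit are assumptions. The essential shape is that the Unit Benchmark drives the build loop, rather than serving only as an after-the-fact check.

```python
# Sketch of a hypothetical "tune until satisfactory" build step. The
# two stubs stand in for a real tuning framework and benchmark harness.

TARGET_SCORE = 0.90  # assumed acceptance threshold for this feature
MAX_ROUNDS = 5       # escalate to a human after this many attempts

def run_unit_benchmark(model, benchmark) -> float:
    """Stub: score `model` against a Unit Benchmark, from 0.0 to 1.0."""
    raise NotImplementedError("plug in your benchmark harness here")

def tune_model(model, examples):
    """Stub: run one incremental tuning pass over `examples`."""
    raise NotImplementedError("plug in your tuning framework here")

def build_with_tuning(model, examples, benchmark):
    """Tune `model` until `benchmark` is satisfied, or give up."""
    for round_num in range(1, MAX_ROUNDS + 1):
        score = run_unit_benchmark(model, benchmark)
        print(f"round {round_num}: benchmark score = {score:.3f}")
        if score >= TARGET_SCORE:
            return model  # desired behavior coerced; the "build" succeeds
        model = tune_model(model, examples)  # one incremental tuning pass
    raise RuntimeError("tuning did not converge; revisit the data or design")
```

Note that the loop is bounded. When tuning fails to converge, a human needs to revisit the tuning data or the design, much as a persistently failing test suite prompts a rethink in conventional development.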

Of course, tuning is already used by model builders to improve their models’ performance in various categories, such as safety, question answering, etc. Domain-specific models are also tuned from popular “foundation” models to provide more effective behavior for the use cases in the domain. Tuning is still considered a specialized skill and is not yet widely practiced, but we anticipate that tuning technology will become easier to use and more efficient, with the result that more organizations will tune their own models for their specific domains and use cases.

What’s still missing is active integration of tuning into iterative and incremental development processes, so that an incremental tuning step can accompany each new use case or feature implemented.

This kind of fine-grained tuning of models is still a research and development topic, in part because each incremental improvement needs to be evaluated automatically, both to detect regressions in behavior and to confirm improved performance in the area where tuning is focused. This continuous verification is exactly how tests are used for traditional software in organizations with mature testing practices; it is integral to DevOps, specifically. Our hope is that AI benchmarking and testing practices will evolve similarly, so that rapid, targeted, and automatic execution of these tools can be performed when doing incremental tuning.
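
As a thought experiment, the corresponding “gate” for an incremental tuning step might look like the following sketch. As before, all the names are hypothetical and the tolerance value is an assumption for illustration. The logic mirrors how mature CI pipelines gate code changes: the tuned candidate must improve the targeted benchmark without regressing the rest of the suite.

```python
# Hypothetical regression gate for an incremental tuning step. The
# scoring function is a stub; the tolerance value is an assumption.

TOLERANCE = 0.02  # assumed allowable drop on any existing benchmark

def score_suite(model, benchmarks) -> dict:
    """Stub: return {benchmark_name: score} for the whole suite."""
    raise NotImplementedError("plug in your benchmark harness here")

def gate_tuned_model(baseline, candidate, benchmarks, targeted: str) -> bool:
    """Accept `candidate` only if it improves the targeted benchmark
    and does not regress any other benchmark beyond TOLERANCE."""
    before = score_suite(baseline, benchmarks)
    after = score_suite(candidate, benchmarks)

    if after[targeted] <= before[targeted]:
        return False  # tuning did not improve the targeted skill

    for name in benchmarks:
        if name != targeted and after[name] < before[name] - TOLERANCE:
            return False  # regression elsewhere; reject the candidate

    return True
```

Making acceptance and rejection automatic in this way is the point: incremental tuning can only become a routine build step once it is gated as ruthlessly as code changes are.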

Tuning Ideas for Further Exploration

Here are some ideas we are investigating.

Reinforcement Finetuning

For some inspiration, consider slide 25 of this NeurIPS 2024 presentation by Nathan Lambert, where he discusses a recent evolution of Reinforcement Learning, called reinforcement finetuning:

What is reinforcement finetuning?

Reinforcement finetuning uses repeated passes over the data with reinforcement learning (RL) to encourage the model to figure out more robust behaviors in domains.

Requires:

  1. Training data with explicitly correct answers.
  2. A grader (or extraction program) for verifying outputs.
  3. A model that can sometimes generate a correct solution. Otherwise, no signal for RL to learn from.

Key innovation:

Improving targeted skills reliably without degradation on other tasks.

Nathan also discusses this work in this Interconnects post. It is based on this OpenAI paper, which is entirely focused on conventional model tuning, but if you consider the bullets quoted here, reinforcement finetuning also fits nicely with our goals of finding general ways to assure desired behavior.

For example, a grader is used to verify outputs, analogous to LLM as a Judge. Hence, it is worth exploring which suites of graders would be useful across AI-centric Use Cases. John Allard from OpenAI describes graders in this X post. Graders may be useful for testing, as well as tuning; a sketch of a simple grader follows.
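
To illustrate the concept, here is a minimal grader sketch under our own assumptions, an “Answer:” output marker and binary pass/fail scoring, which are choices made for illustration, not the paper’s specification. Real graders can award partial credit, apply numeric tolerances, or delegate to an LLM as a Judge. The contract is what matters: map a model output and a known-correct answer to a score that can serve both as an RL reward and as a test verdict.

```python
import re

# A minimal, hypothetical grader: extract the candidate answer from the
# model's free-form output, then score it against the known-correct
# answer. The "Answer:" marker and binary scoring are assumptions made
# for illustration.

ANSWER_PATTERN = re.compile(r"Answer:\s*(.+)", re.IGNORECASE)

def extract_answer(model_output: str) -> str | None:
    """Return the text after the last 'Answer:' marker, if any."""
    matches = ANSWER_PATTERN.findall(model_output)
    return matches[-1].strip() if matches else None

def grade(model_output: str, correct_answer: str) -> float:
    """Binary grader: 1.0 for an exact, case-insensitive match."""
    answer = extract_answer(model_output)
    if answer is None:
        return 0.0  # unparseable output yields no reward signal
    return 1.0 if answer.lower() == correct_answer.lower() else 0.0

# The same grader can drive RL (as a reward) or testing (as an oracle):
print(grade("The capital of France is Paris.\nAnswer: Paris", "Paris"))  # 1.0
print(grade("I am not sure.", "Paris"))                                  # 0.0
```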

Subsequent slides go into the tuning data format, how answers are analyzed for correctness, etc.

TODO: More investigation and summarization here, especially the concept of graders. Provide an example??

The Impact on Architecture and Design

In Architecture and Design, we discussed techniques with testing needs in mind. Making tuning an integral part of the software development process could impact the architecture and design, too, as we explore in this section.

TODO: How tuning becomes a part of the development lifecycle, how testing processes might change.

For example, there are several tools designed to organize tuning data and processes.

Tuning Tools

InstructLab

InstructLab is a project started by IBM Research and developed by Red Hat. InstructLab provides conventions for organizing specific, manually created examples into a domain hierarchy, along with tools to perform model tuning, including synthetic data generation. Hence, InstructLab is an alternative way to generate synthetic data for Unit Benchmarks.

Open Instruct

Open Instruct from the Allen Institute for AI pursues goals similar to InstructLab’s. Nathan Lambert mentions it in the Reinforcement Finetuning content discussed above.

Experiments to Try

TODO: We will provide some examples to try along with suggestions for further experimentation.

What’s Next?

Review the highlights summarized above and, optionally, the Appendix below, then review the Glossary terms and see the References for more information.

Appendix: How Science Changes Its Mind…

The idea of a complete reset is well established. Thomas Kuhn’s The Structure of Scientific Revolutions, published in 1962, studied how scientists approach new evidence that appears to contradict an established theory. They don’t immediately discard the established theory. Instead, they first attempt to accommodate the new evidence within the existing theory, making modifications as necessary.

Eventually, if the contradictions become too glaring and the modifications too strained, some researchers will abandon the established theory and allow the evidence to lead them to a fundamentally new one. Two examples from the history of Physics are the transition from Newtonian (“Classical”) Mechanics to Quantum Mechanics and the emergence of the Special and General Theories of Relativity, all of which occurred in the early decades of the twentieth century.

In Astronomy, it took nearly two millennia for astronomers to discard the geocentric view of the solar system, in which the Earth was believed to sit at the center with everything else revolving around it. Astronomers developed elaborate theories of orbital mechanics involving epicycles, circular orbits nested within circular orbits, which were needed to explain the observed retrograde motion of the planets. An important breakthrough for considering a heliocentric solar system, where the Sun is at the center, was the way this model greatly simplified orbital mechanics, removing the need for epicycles.


¹ When doing Test-Driven Development, the tests are written before the code, in part to drive thinking about the design.