From Testing to Tuning

Table of contents
  1. From Testing to Tuning
    1. Tuning Ideas for Further Exploration
      1. Reinforcement Finetuning
    2. The Impact on Architecture and Design
      1. Tuning Tools
        1. InstructLab
        2. Open Instruct
    3. Experiments to Try
    4. What’s Next?
    5. Appendix: How Science Changes Its Mind…

Finally, could it be we are thinking about this all wrong? It is normal to attempt to bend our current Paradigm (see also the Appendix) to meet a new reality, rather than rethink the situation from first principles. We are still early in the generative AI “revolution”. We don’t really know what radically different approaches will emerge for any aspect of our use of AI, including how to perform sufficiently reliable testing.

With that in mind, are there more AI-native alternatives to our conventional ideas about testing, alternatives that work better with the Stochastic Behaviors of generative AI components? This chapter speculates on one possibility.

Highlights:

  1. It is still early in the generative AI “revolution”.
  2. We tend to apply our traditional approaches to new problems. Often this works well.
  3. However, we should expect completely new AI-driven approaches to problem solving to emerge, especially for new AI-driven challenges.
  4. One possible new approach is to shift attention from the traditional cycle of evolving code and tests together, where we use the tests to ensure compliance, to a more “active” process of continuous Tuning of models to meet evolving requirements.

Our standard approach to software development involves writing software and then testing that it works¹. Since models are Tunable, what if instead our development cycle includes routine, incremental model tuning steps that run until satisfactory behavior is achieved? In other words, what if we go from verifying desired behavior after the fact to coercing the desired behavior as part of the “building” process?

The verification role is still required for measuring when tuning is needed and how well it worked, so we will still need to write tests, i.e., Unit Benchmarks of some kind.
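
To make this idea concrete, here is a minimal sketch of what such a “tune until satisfactory” build step might look like. Everything here is a hypothetical placeholder, not a real API; the stubs stand in for whatever tuning framework and benchmark harness a team actually uses, and the threshold and round limit are assumptions. The essential shape is that the Unit Benchmark drives the build loop, rather than serving only as an after-the-fact check.

```python
# Sketch of a hypothetical "tune until satisfactory" build step. The
# two stubs stand in for a real tuning framework and benchmark harness.

TARGET_SCORE = 0.90  # assumed acceptance threshold for this feature
MAX_ROUNDS = 5       # escalate to a human after this many attempts

def run_unit_benchmark(model, benchmark) -> float:
    """Stub: score `model` against a Unit Benchmark, from 0.0 to 1.0."""
    raise NotImplementedError("plug in your benchmark harness here")

def tune_model(model, examples):
    """Stub: run one incremental tuning pass over `examples`."""
    raise NotImplementedError("plug in your tuning framework here")

def build_with_tuning(model, examples, benchmark):
    """Tune `model` until `benchmark` is satisfied, or give up."""
    for round_num in range(1, MAX_ROUNDS + 1):
        score = run_unit_benchmark(model, benchmark)
        print(f"round {round_num}: benchmark score = {score:.3f}")
        if score >= TARGET_SCORE:
            return model  # desired behavior coerced; the "build" succeeds
        model = tune_model(model, examples)  # one incremental tuning pass
    raise RuntimeError("tuning did not converge; revisit the data or design")
```

Note that the loop is bounded. When tuning fails to converge, a human needs to revisit the tuning data or the design, much as a persistently failing test suite prompts a rethink in conventional development.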

Of course, tuning is already used by model builders to improve their models’ performance in various categories, such as safety, question answering, etc. Domain-specific models are also tuned from popular “foundation” models to provide more effective behavior for the use cases in the domain. Tuning is still considered a specialized skill and is not yet widely practiced, but we anticipate that tuning technology will become easier to use and more efficient, with the result that more organizations will tune their own models for their specific domains and use cases.

What’s still missing is active integration of tuning into iterative and incremental development processes, so that an incremental tuning step can accompany each new use case or feature implemented.

This kind of fine-grained tuning of models is still a research and development topic, in part because each incremental improvement needs to be evaluated automatically, both to detect regressions in behavior and to confirm improved performance in the area where tuning is focused. This continuous verification is exactly how tests are used for traditional software in organizations with mature testing practices; it is integral to DevOps, specifically. Our hope is that AI benchmarking and testing practices will evolve similarly, so that rapid, targeted, and automatic execution of these tools can be performed when doing incremental tuning.
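
As a thought experiment, the corresponding “gate” for an incremental tuning step might look like the following sketch. As before, all the names are hypothetical and the tolerance value is an assumption for illustration. The logic mirrors how mature CI pipelines gate code changes: the tuned candidate must improve the targeted benchmark without regressing the rest of the suite.

```python
# Hypothetical regression gate for an incremental tuning step. The
# scoring function is a stub; the tolerance value is an assumption.

TOLERANCE = 0.02  # assumed allowable drop on any existing benchmark

def score_suite(model, benchmarks) -> dict:
    """Stub: return {benchmark_name: score} for the whole suite."""
    raise NotImplementedError("plug in your benchmark harness here")

def gate_tuned_model(baseline, candidate, benchmarks, targeted: str) -> bool:
    """Accept `candidate` only if it improves the targeted benchmark
    and does not regress any other benchmark beyond TOLERANCE."""
    before = score_suite(baseline, benchmarks)
    after = score_suite(candidate, benchmarks)

    if after[targeted] <= before[targeted]:
        return False  # tuning did not improve the targeted skill

    for name in benchmarks:
        if name != targeted and after[name] < before[name] - TOLERANCE:
            return False  # regression elsewhere; reject the candidate

    return True
```

Making acceptance and rejection automatic in this way is the point: incremental tuning can only become a routine build step once it is gated as ruthlessly as code changes are.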

Tuning Ideas for Further Exploration

Here are some ideas we are investigating.

Reinforcement Finetuning

For some inspiration, consider slide 25 of this NeurIPS 2024 presentation by Nathan Lambert, where he discusses a recent evolution of Reinforcement Learning, called reinforcement finetuning:

What is reinforcement finetuning?

Reinforcement finetuning uses repeated passes over the data with reinforcement learning (RL) to encourage the model to figure out more robust behaviors in domains.

Requires:

  1. Training data with explicitly correct answers.
  2. A grader (or extraction program) for verifying outputs.
  3. A model that can sometimes generate a correct solution. Otherwise, no signal for RL to learn from.

Key innovation:

Improving targeted skills reliably without degradation on other tasks.

Nathan also discusses this work in this Interconnects post. It is based on this OpenAI paper, which is entirely focused on conventional model tuning, but if you consider the bullets quoted here, reinforcement finetuning also fits nicely with our goals of finding general ways to assure desired behavior.

For example, a grader is used to verify outputs, analogous to LLM as a Judge. Hence, it is worth exploring which suites of graders would be useful across AI-centric Use Cases. John Allard from OpenAI describes graders in this X post. Graders may be useful for testing, as well as tuning; a sketch of a simple grader follows.
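
To illustrate the concept, here is a minimal grader sketch under our own assumptions, an “Answer:” output marker and binary pass/fail scoring, which are choices made for illustration, not the paper’s specification. Real graders can award partial credit, apply numeric tolerances, or delegate to an LLM as a Judge. The contract is what matters: map a model output and a known-correct answer to a score that can serve both as an RL reward and as a test verdict.

```python
import re

# A minimal, hypothetical grader: extract the candidate answer from the
# model's free-form output, then score it against the known-correct
# answer. The "Answer:" marker and binary scoring are assumptions made
# for illustration.

ANSWER_PATTERN = re.compile(r"Answer:\s*(.+)", re.IGNORECASE)

def extract_answer(model_output: str) -> str | None:
    """Return the text after the last 'Answer:' marker, if any."""
    matches = ANSWER_PATTERN.findall(model_output)
    return matches[-1].strip() if matches else None

def grade(model_output: str, correct_answer: str) -> float:
    """Binary grader: 1.0 for an exact, case-insensitive match."""
    answer = extract_answer(model_output)
    if answer is None:
        return 0.0  # unparseable output yields no reward signal
    return 1.0 if answer.lower() == correct_answer.lower() else 0.0

# The same grader can drive RL (as a reward) or testing (as an oracle):
print(grade("The capital of France is Paris.\nAnswer: Paris", "Paris"))  # 1.0
print(grade("I am not sure.", "Paris"))                                  # 0.0
```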

Subsequent slides go into the tuning data format, how answers are analyzed for correctness, etc.

TODO: More investigation and summarization here, especially the concept of graders. Provide an example??

The Impact on Architecture and Design

In Architecture and Design, we discussed techniques with testing needs in mind. Making tuning an integral part of the software development process could impact the architecture and design, too, as we explore in this section.

TODO: How tuning becomes a part of the development lifecycle, how testing processes might change.

For example, there are several tools designed to organize tuning data and processes.

Tuning Tools

InstructLab

InstructLab is a project started by IBM Research and developed by Red Hat. InstructLab provides conventions for organizing specific, manually created examples into a domain hierarchy, along with tools to perform model tuning, including synthetic data generation. Hence, InstructLab is an alternative way to generate synthetic data for Unit Benchmarks.

Open Instruct

Open Instruct from the Allen Institute for AI pursues goals similar to InstructLab’s. Nathan Lambert mentions it in the Reinforcement Finetuning content discussed above.

Experiments to Try

TODO: We will provide some examples to try along with suggestions for further experimentation.

What’s Next?

Review the highlights summarized above and, optionally, the Appendix below, then review the Glossary terms and see the References for more information.

Appendix: How Science Changes Its Mind…

The idea of a complete reset is well established. Thomas Kuhn’s The Structure of Scientific Revolutions, published in 1962, studied how scientists approach new evidence that appears to contradict an established theory. They don’t immediately discard the established theory. Instead, they first attempt to accommodate the new evidence within the existing theory, making modifications as necessary.

Eventually, if the contradictions become too glaring and the modifications too strained, some researchers will abandon the established theory and allow the evidence to lead them to a fundamentally new one. Two examples from the history of Physics are the transition from Newtonian (“Classical”) Mechanics to Quantum Mechanics and the emergence of the Special and General Theories of Relativity, all of which occurred in the early decades of the twentieth century.

In Astronomy, it took nearly two millennia for astronomers to discard the geocentric view of the solar system, in which the Earth was believed to sit at the center with everything else revolving around it. Astronomers developed elaborate theories of orbital mechanics involving epicycles, circular orbits nested within circular orbits, which were needed to explain the observed retrograde motion of the planets. An important breakthrough for considering a heliocentric solar system, where the Sun is at the center, was the way this model greatly simplified orbital mechanics, removing the need for epicycles.


¹ When doing Test-Driven Development, the tests are written before the code, in part to drive thinking about the design.