
From Testing to Training
Finally, could it be we are thinking about this all wrong? It is natural to try to bend your current paradigm to fit a new reality, rather than rethink the situation from fundamentals. Should we abandon the idea of deterministic testing, at least for nondeterministic model behaviors, in favor of something entirely new?
This kind of complete reset is a well-established idea. The Structure of Scientific Revolutions, published by Thomas Kuhn in 1962, studied how scientists approach new evidence that appears to contradict an established theory. They don’t immediately discard the established theory. Instead, they first attempt to accommodate the new evidence within the existing theory, making small modifications as necessary.
Eventually, however, the willingness of some researchers to abandon the orthodoxy, combined with the mounting weight of the evidence, leads to the emergence of a fundamentally new theory that explains the data. Examples from physics include the transitions from Newtonian mechanics to quantum mechanics and to the special and general theories of relativity, all of which emerged in the early decades of the twentieth century. In astronomy, it took millennia for astronomers to discard the geocentric view of the solar system, in which the Earth was believed to be at the center and everything else revolved around it. Astronomers had developed elaborate theories of orbital mechanics involving epicycles, circles nested within circular orbits, which were needed to explain the observed retrograde motion of the planets. An important argument for the heliocentric model, with the Sun at the center, was that it greatly simplified orbital mechanics, eventually eliminating the need for epicycles.
Back to generative AI: what if we relax the usual approach of writing software and then testing that it works? Since models are tunable, what if our development cycle instead includes routine, incremental model tuning steps that run until satisfactory behavior is achieved? In other words, what if we go from verifying desired behavior to coercing desired behavior? How would this work, and what do we need that we don’t already have? Should we actually strive for some combination of verification and coercion?
In practical terms, this may not look all that different from a typical TDD cycle: unit benchmarks are written first for the desired behavior, but instead of writing code and running conventional tests, a tuning cycle runs until the benchmarks pass. As in normal TDD practice, the existing unit benchmarks would be executed regularly to catch regressions in behavior.
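As a rough sketch of what such a cycle might look like, consider the following. Everything here is hypothetical, not an existing framework API: `tune` stands in for whatever incremental tuning step is used, and each unit benchmark is a callable that checks one desired behavior against the model.

```python
# A minimal sketch of a "tune until the benchmarks pass" development cycle.
# All names here (tune, the benchmark callables) are hypothetical placeholders.

MAX_TUNING_ROUNDS = 10    # guard against benchmarks the model may never satisfy
PASS_THRESHOLD = 0.95     # fraction of benchmarks that must pass ("green")

def development_cycle(model, unit_benchmarks, tuning_data):
    """Tune `model` until `unit_benchmarks` pass, mirroring a TDD red/green loop.

    Each unit benchmark is a callable taking a model and returning True/False.
    `tune` performs one incremental tuning step and returns the updated model.
    """
    for _ in range(MAX_TUNING_ROUNDS):
        results = [benchmark(model) for benchmark in unit_benchmarks]
        pass_rate = sum(results) / len(results)
        if pass_rate >= PASS_THRESHOLD:
            return model  # "green": desired behavior achieved
        model = tune(model, tuning_data)  # "red": tune incrementally, re-check
    raise RuntimeError("Benchmarks still failing after the maximum tuning rounds")
```

Note that the pass threshold is deliberately below 100%, reflecting the fact that nondeterministic behaviors have to be assessed statistically rather than with exact assertions.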
For More Information
For some inspiration, consider slide 25 of this NeurIPS 2024 presentation by Nathan Lambert, where he discusses a recent evolution of reinforcement learning, called reinforcement finetuning:
What is reinforcement finetuning?
Uses repeated passes over the data with RL to encourage model to figure out more robust behaviors in domains.
Requires:
- Training data with explicitly correct answers.
- A grader (or extraction program) for verifying outputs.
- A model that can sometimes generate a correct solution. Otherwise, no signal for RL to learn from.
Key innovation:
Improving targeted skills reliably without degradation on other tasks.
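The last requirement is worth dwelling on, and a hypothetical sketch makes it concrete: if the reward comes from grading sampled completions against the correct answer, a model that never succeeds yields all-zero rewards, leaving RL nothing to reinforce. The function names below are placeholders, not any particular RL library.

```python
# Hypothetical sketch of why RL needs a model that sometimes succeeds.
# `sample_completions` and `grade` are placeholders, not a real API.

def rewards_for_prompt(model, prompt, correct_answer, num_samples=8):
    """Grade several sampled completions; `grade` returns 1.0 (correct) or 0.0."""
    completions = sample_completions(model, prompt, n=num_samples)
    return [grade(completion, correct_answer) for completion in completions]

# If every reward in the list is 0.0, all samples look equally bad: there is
# no contrast between better and worse completions for RL to learn from.
```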
Nathan also discusses this work in this Interconnects post.
Nathan is discussing this OpenAI paper, which is entirely focused on model tuning, but if you consider the bullets above, the approach also fits nicely with our goal of finding general ways to assure desired behavior.
In particular, note that a grader is used to verify outputs, a key component of any test framework! Hence, it is worth exploring which suite of graders would be useful for common AI-centric use cases. John Allard from OpenAI describes graders in this X post. Graders may be useful for testing, as well as for tuning.
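To make the grader idea concrete, here is a hypothetical sketch of a grader interface that could serve both roles. The names and structure are my assumptions, not OpenAI’s grader API.

```python
# Hypothetical grader interface, usable for both testing and tuning.
# Nothing here is OpenAI's grader API; it is only a sketch of the concept.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GradeResult:
    score: float   # 0.0-1.0; a tuning loop can use this as a reward signal
    passed: bool   # a test suite can assert on this

# A grader maps (model_output, reference_answer) to a GradeResult.
Grader = Callable[[str, str], GradeResult]

def exact_match_grader(output: str, reference: str) -> GradeResult:
    """Strictest grader: normalized string equality."""
    passed = output.strip().lower() == reference.strip().lower()
    return GradeResult(score=1.0 if passed else 0.0, passed=passed)

def contains_answer_grader(output: str, reference: str) -> GradeResult:
    """Looser grader: the reference answer appears somewhere in the output."""
    passed = reference.strip().lower() in output.lower()
    return GradeResult(score=1.0 if passed else 0.0, passed=passed)
```

The same grader could back an assertion in a unit benchmark today and supply the reward signal in a reinforcement finetuning run later.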
Subsequent slides in Nathan’s presentation go into the tuning data format, how answers are analyzed for correctness, etc.
Next steps
(TODO: this needs refinement)
- Explore graders.
- We need very fine-grained tuning techniques for use-case-specific tuning; see Unit Benchmarks. One technology to investigate for organizing these tuning runs is InstructLab. Open Instruct from the Allen Institute for AI has similar potential (and Nathan mentions it in the presentation above).
- We still need regression “testing”, so whatever we construct for fine-grained tuning should be reusable in some way for repeated test runs; a sketch of this reuse follows the list.
- …
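To picture that reuse, here is a hypothetical sketch that continues the placeholder names from the earlier tuning-cycle example: the same benchmark suite that gates tuning later serves as the regression suite.

```python
# Hypothetical reuse of one benchmark suite for tuning and for regression
# testing, continuing the placeholder names from the earlier sketch.

def regression_check(model, unit_benchmarks, pass_threshold=0.95):
    """Re-run the same unit benchmarks (e.g., in CI) to catch regressions."""
    results = [benchmark(model) for benchmark in unit_benchmarks]
    pass_rate = sum(results) / len(results)
    assert pass_rate >= pass_threshold, (
        f"Regression: pass rate {pass_rate:.2%} is below {pass_threshold:.2%}"
    )

# During development: tuned_model = development_cycle(model, benchmarks, data)
# Later, on every build: regression_check(tuned_model, benchmarks)
```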