From Testing to Training
Finally, maybe we are thinking about this all wrong. It’s normal to attempt to bend your current paradigm to meet a new reality, rather than rethink the situation from the fundamentals. Should we abandon the idea of developer testing in favor of something entirely new?
Resetting completely is not a new idea. For example, Thomas Kuhn's The Structure of Scientific Revolutions, published in 1962, studied how scientists approach new evidence that appears to contradict established theories. They don’t immediately discard the established theories, but first attempt to accommodate the new evidence within them, with modifications as required. Eventually, however, the weight of the evidence and the willingness of some researchers to abandon the orthodoxy lead to fundamentally new theories about reality.
Examples from Physics include the transition from Newtonian Mechanics to Quantum Mechanics and to the Special and General Theories of Relativity, all of which emerged in the early decades of the twentieth century. In Astronomy, it took centuries for astronomers to discard the geocentric view of the solar system, where the Earth is at the center and everything else revolves around it. Astronomers developed elaborate theories of orbital mechanics involving epicycles, essentially circles nested on circles, which they needed to explain the observed retrograde motion of the planets. An important clue in favor of a heliocentric solar system, where the Sun is at the center, was the greatly simplified orbital mechanics that resulted from this change.
Returning to generative AI, various model-tuning techniques are established and necessary practices for ensuring that models perform as desired. So, what if we abandon the usual approach of writing software and testing that it works, and instead keep tuning the model until it behaves satisfactorily? In other words, what if we go from verifying desired behavior to coercing desired behavior?
How would this work and what’s needed that we don’t already have? Should we actually strive for some combination of verification and coercion?
For some inspiration, consider slide 25 of this NeurIPS 2024 presentation by Nathan Lambert, where he discusses a recent evolution of reinforcement learning, called reinforcement finetuning:
What is reinforcement finetuning?
Uses repeated passes over the data with RL to encourage model to figure out more robust behaviors in domains.
Requires:
- Training data with explicitly correct answers.
- A grader (or extraction program) for verifying outputs.
- A model that can sometimes generate a correct solution. Otherwise, no signal for RL to learn from.
Key innovation:
Improving targeted skills reliably without degradation on other tasks.
Nathan also discusses this work in this Interconnects post.
Nathan is talking about this OpenAI paper, which is entirely focused on model tuning, but I think if you consider the bullets above, it also fits nicely with our goals of finding general ways to assure desired behavior.
In particular, note that a grader is used to verify outputs, a key component of any test framework! Hence, it is worth exploring which suite of graders would be useful for many AI-centric use cases. John Allard from OpenAI describes them in this X post. Graders may be useful for testing, as well as tuning.
Subsequent slides go into the tuning data format, how answers are analyzed for correctness, etc.
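To make the grader idea concrete, here is a minimal sketch of the simplest kind of grader: extract a final answer from free-form model output and compare it to the known-correct answer. Everything in it is an assumption made for illustration, not the OpenAI or slide-deck implementation: the `Answer:` output convention, the names `extract_answer`, `exact_match_grader`, and `mean_score`, and the 0.0/1.0 scoring.

```python
import re
from typing import Callable, Optional, Sequence, Tuple


def extract_answer(output: str) -> Optional[str]:
    """Pull the final answer out of free-form model output.

    Hypothetical convention: the model was asked to finish with a line
    such as 'Answer: 42'. If several such lines appear, the last one wins.
    """
    matches = re.findall(r"Answer:\s*(.+)", output)
    return matches[-1].strip() if matches else None


def exact_match_grader(output: str, expected: str) -> float:
    """Return 1.0 if the extracted answer matches `expected`, else 0.0."""
    answer = extract_answer(output)
    if answer is None:
        return 0.0  # nothing to grade counts as incorrect
    return 1.0 if answer.lower() == expected.strip().lower() else 0.0


def mean_score(
    generate: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
    grader: Callable[[str, str], float] = exact_match_grader,
) -> float:
    """Average grader score over (prompt, expected_answer) pairs.

    `generate` stands in for whatever call produces model output.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    scores = [grader(generate(prompt), expected) for prompt, expected in examples]
    return sum(scores) / len(scores)
```

The same function could serve both purposes discussed above: as a reward signal during tuning and as an assertion inside a test. Real graders would likely need to be more forgiving (numeric tolerance, semantic similarity, or a model acting as judge), which is part of what a suite of graders would have to cover.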
Next steps
(TODO: this needs refinement)
- Explore graders.
- We need very fine-grained tuning techniques for use-case-specific tuning. See Unit Benchmarks. Another technology to investigate for organizing these tuning runs is InstructLab. Open Instruct from the Allen Institute for AI has similar potential (and it is mentioned by Nathan above).
- We still need regression “testing”, so whatever we construct for fine-grained tuning should be reusable in some way for repeated test runs.
- …
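As a rough illustration of the regression "testing" bullet above, the sketch below reuses a grader to compute a pass rate over a fixed set of cases and compares it to a threshold. Because model output is not deterministic, it samples each prompt several times instead of asserting on a single response. All names here (`pass_rate`, `meets_threshold`, `contains_grader`, `stub_model`), the sample count, and the threshold are hypothetical choices for illustration only.

```python
from typing import Callable, Sequence, Tuple

Grader = Callable[[str, str], float]  # (model_output, expected) -> score in [0, 1]
Model = Callable[[str], str]          # prompt -> model_output


def contains_grader(output: str, expected: str) -> float:
    """Toy grader: 1.0 if the expected text appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def pass_rate(
    model: Model,
    cases: Sequence[Tuple[str, str]],
    grader: Grader,
    samples_per_case: int = 5,
) -> float:
    """Mean grader score across all cases and repeated samples."""
    if not cases or samples_per_case < 1:
        raise ValueError("need at least one case and one sample per case")
    total = 0.0
    for prompt, expected in cases:
        for _ in range(samples_per_case):
            total += grader(model(prompt), expected)
    return total / (len(cases) * samples_per_case)


def meets_threshold(rate: float, threshold: float) -> bool:
    """Regression check: the tuned model must not fall below the threshold."""
    return rate >= threshold


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text."""
    canned = {
        "Capital of France?": "The capital is Paris.",
        "2 + 2?": "That would be 4.",
    }
    return canned.get(prompt, "I don't know.")


CASES = [
    ("Capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("Largest planet?", "Jupiter"),
]

if __name__ == "__main__":
    rate = pass_rate(stub_model, CASES, contains_grader, samples_per_case=3)
    print(f"pass rate: {rate:.2f}, ok: {meets_threshold(rate, 0.6)}")
```

The design choice worth noting is that the test outcome is a rate compared against a tolerance, not a strict pass or fail per case, so the same cases and graders could plausibly be shared between tuning runs and repeated test runs.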