Chapter 6: The Iterative Journey — Iterate Until Satisfied
Building on the version control practices described in Chapters 4 and 5, capture each iteration of testing and refining in your repository. Commit or branch your changes as you address issues, add new features, or revert problematic updates. That way, iterating until satisfied also preserves a complete record of every revision, and you can quickly return to any previous state of the code.
flowchart TB
accTitle: Iterative testing workflow
accDescr: A compact loop showing testing, assessment, focused refinement, and the exit point when the script meets success criteria.
A[Current script version] --> B[Test on sample data]
B --> C[Assess outputs and errors]
C --> D{Meets success criteria?}
D -->|No| E[Identify next focused change]
E --> F[Prompt or edit refinement]
F --> A
D -->|Almost| G[Fine-tune edge cases]
G --> A
D -->|Yes| H[Move to standardization]
The process of refining code with an LLM is rarely a one-time event. More often, it involves a cycle of review and refinement that continues until the generated script meets all the necessary requirements and performs as expected. This chapter focuses on the Iterate Until Satisfied step, emphasizing the importance of continuous testing and assessment throughout this journey [Self-Refine].
After each round of edits or feature additions in the previous step, test and assess the code [LLM code/test analysis]. Usually this means running the script on a sample of your data to see whether it produces the expected outputs and whether any errors occur. For smaller changes, or when running the full script is time-consuming, you might instead walk through the modified code mentally with some sample data to anticipate its behavior. The key is to verify that the changes work as intended and that no new, unintended issues have been introduced.
For LLM-assisted work, push this from informal testing toward a repeatable verification loop. At minimum, keep a small synthetic dataset and a smoke-test command that should run quickly. For higher-risk scripts, add testthat tests for boundary conditions, expected row counts, missing-value behavior, output schemas, and known edge cases. Ask the LLM or agent to run these commands and show the output, but do not treat “I tested it” as evidence unless the command and result are visible [GitHub Copilot agent best practices], [Claude Code best practices].
“I tested it” is not evidence. Require the command, the output, and enough context to know what was actually verified.
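One way to make that evidence visible is to wrap the smoke test in a small runner that records the exact command, its exit code, and its output. The sketch below is an illustration in Python, not a prescribed tool: `run_check` is a hypothetical helper, and the inline command is a stand-in for whatever your project actually runs (in an R project, that would more likely be an `Rscript` call or a testthat run).

```python
import subprocess
import sys


def run_check(command):
    """Run a command and return a record of what was actually executed."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Stand-in smoke test (assumption): a real project would call its own script.
smoke_command = [sys.executable, "-c", "print('rows_out=3')"]
record = run_check(smoke_command)

# Keep the command and its result together so a reviewer can see both.
print("command:   ", record["command"])
print("returncode:", record["returncode"])
print("stdout:    ", record["stdout"].strip())
```

Saving a record like this alongside the commit gives a reviewer the command, the result, and the context in one place, rather than a bare claim that testing happened.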
A useful BMI harmonization verification set might include:
- BMI values exactly on category boundaries, such as 18.5, 24.9, 25.0, 29.9, 30.0, 34.9, 35.0, 39.9, and 40.0.
- Implausible heights, weights, dates, and mixed units.
- Duplicate measurements with tied dates.
- Missing demographic values, multiracial values, and categories that should not receive lower BMI thresholds.
- Empty input files and files with unexpected columns.
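The boundary values in the first item can be turned into a table-driven check. The following Python sketch is illustrative only: `classify_bmi`, the category labels, and the cut-points (the commonly used adult classes, with each lower bound inclusive) are assumptions, so confirm them against your own protocol. Population-specific thresholds are deliberately left out. The same table could be expressed as testthat expectations in an R project.

```python
# Assumed cut-points: each value is the exclusive upper bound of its class.
BMI_CLASSES = [
    (18.5, "underweight"),
    (25.0, "normal"),
    (30.0, "overweight"),
    (35.0, "obesity class I"),
    (40.0, "obesity class II"),
]


def classify_bmi(bmi):
    """Return a category label, or None for missing or non-positive values."""
    if bmi is None or bmi != bmi or bmi <= 0:  # bmi != bmi is True only for NaN
        return None
    for upper_bound, label in BMI_CLASSES:
        if bmi < upper_bound:
            return label
    return "obesity class III"


# Expected label for every value that sits on or just below a boundary.
BOUNDARY_CASES = {
    18.5: "normal",
    24.9: "normal",
    25.0: "overweight",
    29.9: "overweight",
    30.0: "obesity class I",
    34.9: "obesity class I",
    35.0: "obesity class II",
    39.9: "obesity class II",
    40.0: "obesity class III",
}

# Collect (value, expected, actual) for every mismatch instead of stopping
# at the first one, so a single run shows all boundary problems.
failures = [
    (value, expected, classify_bmi(value))
    for value, expected in BOUNDARY_CASES.items()
    if classify_bmi(value) != expected
]
print("boundary failures:", failures)
```

Keeping the expected labels in one table makes it easy to see which boundary rule the script implements, and to update the checks if the protocol defines the cut-points differently.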
Recent productivity evidence also argues for disciplined verification. A 2025 randomized study by METR found that AI tools slowed a group of experienced open-source developers in one realistic setting, while a 2026 update suggested that the measured effect is changing and is difficult to interpret as agentic tools become more common [METR 2025 productivity study], [METR 2026 update]. The practical lesson for this guide is not “avoid AI”; it is that perceived speed is unreliable unless you clearly define the task, tests, quality bar, and stopping rule.
Based on the results of your testing, you will likely need to repeat the review and refine cycle. This means going back to Chapter 4 to review the current state of the code and then to Chapter 5 to make further refinements based on any issues or missing features you identified during testing. This iterative process of reviewing and refining with the LLM is a normal and expected part of working with these tools. Think of it as a back-and-forth collaboration where you provide feedback, and the AI assistant makes adjustments accordingly.
As you get closer to a satisfactory solution, you will often find yourself fine-tuning smaller details [Google code review]. This could involve addressing edge cases that you hadn’t initially considered, making small performance tweaks to improve the efficiency of the code, or enhancing the formatting of the output to make it more presentable. For example, in our BMI harmonization project, you might realize that you need to handle cases where two patient encounters have the same measurement date and decide on a rule for selecting the representative BMI in such scenarios. Or, you might identify a slow loop in the code and prompt the LLM to suggest a more efficient, vectorized approach.
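To make the tied-date example concrete, here is a minimal Python sketch of one possible rule: take each patient's most recent measurement date and, if several BMI values share that date, use their median. The rule itself, the function name `representative_bmi`, and the tuple layout are all assumptions for illustration; the right choice depends on your study protocol.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def representative_bmi(measurements):
    """Pick one BMI per patient: latest date, median across ties on that date.

    `measurements` is an iterable of (patient_id, measurement_date, bmi).
    Returns {patient_id: (date, bmi, number_of_tied_rows)}.
    """
    by_patient = defaultdict(list)
    for patient_id, measured_on, bmi in measurements:
        by_patient[patient_id].append((measured_on, bmi))

    result = {}
    for patient_id, rows in by_patient.items():
        latest = max(measured_on for measured_on, _ in rows)
        tied = [bmi for measured_on, bmi in rows if measured_on == latest]
        # Report how many rows were tied so the rule's effect is auditable.
        result[patient_id] = (latest, median(tied), len(tied))
    return result


sample = [
    ("p1", date(2024, 3, 1), 27.0),
    ("p1", date(2024, 5, 2), 28.0),
    ("p1", date(2024, 5, 2), 30.0),  # same date as the row above
    ("p2", date(2024, 1, 15), 22.5),
]
selected = representative_bmi(sample)
for patient_id, (measured_on, bmi, n_tied) in sorted(selected.items()):
    print(patient_id, measured_on, bmi, n_tied)
```

Returning the number of tied rows alongside the chosen value makes it easy to count how often the rule was actually applied, which is worth reporting whichever tie-breaking rule you settle on.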
Periodic red teaming during the iterative process can also be valuable. As your code evolves and becomes more complex, applying the red teaming techniques from Chapter 4 at key iterations can help identify new edge cases or methodological concerns that emerged during development. Consider red teaming your code after major feature additions or when preparing for final validation.
Re-run targeted checks after each meaningful change, then broaden validation before moving on. Do not wait until the end to discover that an early iteration broke the analysis.
Throughout this iterative process, it is important to know when to stop. Once you are satisfied that the code is correct, meaning it runs without errors and produces the expected outputs, and that it is reasonably efficient and readable, you can conclude the code generation phase. At this point, you can move on to the subsequent steps of refactoring and documenting your final solution. Recognizing when the code is “good enough” is a key skill that helps to prevent over-engineering and ensures that you are using your time effectively.
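A stopping rule is easier to apply when it is written down before iterating. The Python sketch below shows one hypothetical way to do that, mapping named pass/fail checks onto the three exits of the loop (refine, fine-tune, stop). The function `next_step`, the split into core and edge-case checks, and the check names are assumptions made for illustration.

```python
def next_step(core_checks, edge_case_checks):
    """Decide the next move from named boolean check results.

    core_checks: {name: passed} for the agreed success criteria.
    edge_case_checks: {name: passed} for lower-priority edge cases.
    Returns (decision, names_of_failed_checks).
    """
    if not core_checks:
        raise ValueError("define at least one success criterion first")

    failed_core = [name for name, passed in core_checks.items() if not passed]
    if failed_core:
        return "refine", failed_core

    failed_edge = [name for name, passed in edge_case_checks.items() if not passed]
    if failed_edge:
        return "fine-tune", failed_edge

    return "stop", []


decision, failed = next_step(
    core_checks={"runs_without_error": True, "row_count_matches": True},
    edge_case_checks={"tied_dates_handled": False},
)
print(decision, failed)
```

With every core check passing and one edge case still failing, the sketch returns the fine-tune branch; once both dictionaries contain only passing checks, it returns `"stop"`, which is the point at which further iteration stops paying for itself.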
This cycle of testing, reviewing, and refining is central to a successful outcome when working with LLMs on coding tasks. It allows for continuous improvement and helps ensure that the final code not only addresses your research question but also meets your standards for quality and functionality. The process often mirrors collaboration with a human assistant, where ongoing feedback and adjustments are essential for reaching good results.
Stop when the script meets the success criteria, passes the agreed checks, and is readable enough to maintain. More iteration is not automatically better.