Chapter 3: The AI Assistant — Prompt LLM to Generate Code

With a plan in place and the background research done, we can now ask our AI assistant to generate the code we need. This step, Prompt LLM to Generate Code, is pivotal because the quality and relevance of the LLM’s output depend heavily on the clarity and precision of our instructions.

The cornerstone of this step is crafting clear and explicit prompts [GitHub Copilot best practices]. This involves translating our well-defined plan from Chapter 1 and the insights gained from our research in Chapter 2 into a set of instructions that the LLM can readily understand and follow.

If you created a detailed plan, such as the Advanced Project Plan, share only a sanitized version with the LLM. Include schema, method choices, validation rules, and output requirements; remove individual-level records, PHI, PII, controlled-access data, private paths, credentials, and protected system details.

flowchart TB
    accTitle: Prompt construction workflow
    accDescr: A vertical convergence diagram showing multiple sources of context forming a prompt package that produces a draft script for review.
    A[Project plan] --> F[Prompt package]
    B[Sanitized schema and file formats] --> F
    C[Method decisions and thresholds] --> F
    D[Output requirements] --> F
    E[Coding standards] --> F
    F --> G[LLM-generated draft script]
    G --> H[Human review]

We must explain the task explicitly and give the LLM enough context to grasp the overall objective. For our BMI harmonization example, this means stating that we need code to process EHR data to calculate and categorize BMI. It is also crucial to name the programming language we intend to use, such as R or Python, so that the LLM generates code in the correct syntax. Finally, we need to describe each required operation step by step, mirroring the sub-tasks in our initial plan. If the plan called for reading the data, filtering outliers, selecting one BMI per person, and then categorizing those values, the prompt should list those actions in the same order.

In addition to the high-level instructions, it is essential to include schema-level data details in our prompts [Chauhan]. This involves providing the LLM with information about the format of our data (e.g., tab-delimited file, CSV file, or database table) and listing the column names and expected types that the generated code must handle. For our EHR data, we might specify that the approved local input is tab-delimited and list columns such as person_id, encounter_id, bmi, height_cm, weight_kg, and measurement_date. Do not provide real rows, individual-level values, private file paths, small-cell outputs, or protected identifiers. Use synthetic fixtures when examples are needed.
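One way to keep the data description at schema level is to hold it in a small structure and render it into prompt text. The sketch below is only an illustration in Python: the column names come from the example above, while the type descriptions, the date format, and the names `SCHEMA` and `render_schema` are assumptions made for this example.

```python
# Schema-only description of the input. Column names follow the chapter's
# example; the type descriptions and date format are assumptions.
SCHEMA = {
    "format": "tab-delimited text (TSV), one header row",
    "columns": [
        ("person_id", "string"),
        ("encounter_id", "string"),
        ("bmi", "numeric, kg/m^2"),
        ("height_cm", "numeric, centimeters"),
        ("weight_kg", "numeric, kilograms"),
        ("measurement_date", "date, YYYY-MM-DD"),
    ],
}


def render_schema(schema: dict) -> str:
    """Render names and types only; no rows or values are ever included."""
    lines = [f"Input format: {schema['format']}", "Columns:"]
    for name, col_type in schema["columns"]:
        lines.append(f"- {name}: {col_type}")
    return "\n".join(lines)


schema_text = render_schema(SCHEMA)
print(schema_text)
```

Because the structure holds only names and types, there is no place for a real record to slip into the prompt by accident.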

To ensure that the LLM’s output is directly usable and aligns with our needs, we must also be clear about requesting specific outputs [PharmaSUG prompt engineering]. This means stating the desired format of the code we want the LLM to produce. For example, we might ask for “an R script” or “a Python function.” Moreover, if we have any requirements for clarity and reproducibility—such as the inclusion of comments within the code—we should explicitly state these in our prompt.

Prompting Repo-Native Coding Agents

Many current coding assistants are no longer only chat windows. They can read files, edit repositories, run commands, open pull requests, and use external tools. As of May 2026, common patterns include repository instruction files such as AGENTS.md, .github/copilot-instructions.md, and tool-specific files such as CLAUDE.md; the Linux Foundation has also highlighted AGENTS.md and MCP as part of the emerging agentic AI ecosystem [GitHub Copilot agent best practices], [Claude Code best practices], [Linux Foundation agentic AI announcement].

When you use a repo-native coding agent, turn your prompt into a work order rather than a general request:

Name the target files or directories.
State the exact behavior to add or preserve.
Point to the project plan, data schema, coding template, and existing examples.
Specify allowed tools and constraints, such as “do not access external services” or “do not modify files outside scripts/.”
Run the agent in a code-only workspace with no mounted protected data directories, secrets, private paths, or sensitive outputs.
Provide verification commands that use synthetic fixtures, such as Rscript scripts/simulate_ehr_data.R --outdir tmp/example, Rscript path/to/script.R ... --input tmp/example/synthetic.tsv, R -e 'testthat::test_dir("tests/testthat")', or bundle exec jekyll build.
Ask the agent to summarize changed files, assumptions, and verification output.

For repo-native coding agents, write a work order: target files, behavior, constraints, and verification commands. Do not rely on a general chat-style request.

This format gives the agent enough context to work in the repository while preserving human control. It is especially important for research code because missing assumptions, implicit data rules, and unrun tests can turn a plausible script into a misleading analysis.
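The work-order elements listed above can be kept in one small structure so that none of them is forgotten. The following is a minimal sketch, not a format that any particular agent requires: `WorkOrder`, its field names, and the target file `scripts/harmonize_bmi.R` are hypothetical, and the verification commands are the ones quoted earlier in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkOrder:
    """Hypothetical container for the parts of a repo-native agent request."""
    target_files: List[str]
    behavior: str
    references: List[str]
    constraints: List[str]
    verification_commands: List[str]

    def render(self) -> str:
        """Return the work order as plain text, one titled section per part."""
        sections = [
            ("Target files", self.target_files),
            ("Behavior", [self.behavior]),
            ("References", self.references),
            ("Constraints", self.constraints),
            ("Verification commands", self.verification_commands),
            ("Report back", ["Changed files, assumptions, and verification output."]),
        ]
        out = []
        for title, items in sections:
            out.append(f"{title}:")
            out.extend(f"- {item}" for item in items)
        return "\n".join(out)


order = WorkOrder(
    target_files=["scripts/harmonize_bmi.R"],  # hypothetical file name
    behavior="Add BMI categorization; preserve the existing outlier filtering.",
    references=["Sanitized project plan", "Data schema", "Coding template"],
    constraints=[
        "Do not access external services.",
        "Do not modify files outside scripts/.",
    ],
    verification_commands=[
        "Rscript scripts/simulate_ehr_data.R --outdir tmp/example",
        "R -e 'testthat::test_dir(\"tests/testthat\")'",
    ],
)
print(order.render())
```

Rendering from a fixed set of sections makes an empty section easy to spot before the request is sent.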

Requesting Structured Outputs

For tasks such as data dictionaries, extraction of phenotype definitions, literature-screening tables, or validation summaries, ask for structured output whenever possible. A JSON schema, Markdown table schema, or explicit column specification can make the LLM’s answer easier to parse and check. Structured-output APIs can constrain responses to a schema, and R packages such as ellmer support structured data extraction workflows [OpenAI structured outputs], [ellmer].

Structured output does not remove the need for scientific review. Google Gemini’s structured-output guidance explicitly cautions that syntactically valid JSON does not guarantee semantically correct values [Gemini structured output]. In practice, treat structured output as an interface contract: validate required fields, allowed values, dates, units, and cross-field logic before using the result in an analysis.

A valid JSON object can still contain scientifically wrong values. Validate structured outputs against domain rules before using them in an analysis.
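To make the interface-contract idea concrete, here is a minimal Python sketch that checks one structured record for required fields, allowed values, date format, and cross-field logic. The field names, the category labels, and the 0.5 tolerance are assumptions for illustration, and the records are synthetic; a real project would take these rules from its own plan.

```python
import json
from datetime import date

# Assumed contract for one record; adjust to the project's own plan.
REQUIRED_FIELDS = {"person_id", "height_cm", "weight_kg", "bmi",
                   "bmi_category", "measurement_date"}
ALLOWED_CATEGORIES = {"underweight", "normal", "overweight", "obese"}


def is_number(value) -> bool:
    """True for int or float, but not bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_record(raw: str) -> list:
    """Return a list of problems for one JSON record; empty means it passed."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(rec, dict):
        return ["top-level value is not an object"]
    missing = REQUIRED_FIELDS - rec.keys()
    if missing:
        return [f"missing fields: {sorted(missing)}"]

    problems = []
    if rec["bmi_category"] not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {rec['bmi_category']!r}")
    try:
        date.fromisoformat(rec["measurement_date"])
    except (TypeError, ValueError):
        problems.append("measurement_date is not an ISO date")
    if not all(is_number(rec[k]) for k in ("height_cm", "weight_kg", "bmi")):
        problems.append("height_cm, weight_kg, and bmi must be numbers")
    elif rec["height_cm"] <= 0:
        problems.append("height_cm must be positive")
    else:
        # Cross-field check: the reported BMI should match weight / height^2.
        expected = rec["weight_kg"] / (rec["height_cm"] / 100) ** 2
        if abs(expected - rec["bmi"]) > 0.5:  # tolerance is an assumption
            problems.append(
                f"bmi {rec['bmi']} disagrees with height/weight ({expected:.1f})"
            )
    return problems


# Synthetic records: the second is valid JSON but gives height in meters.
good = ('{"person_id": "SYN001", "height_cm": 170, "weight_kg": 65, '
        '"bmi": 22.5, "bmi_category": "normal", "measurement_date": "2020-01-15"}')
bad = good.replace('"height_cm": 170', '"height_cm": 1.70')
print(validate_record(good))
print(validate_record(bad))
```

The second record parses without error and has every required field, yet the cross-field check flags it, which is exactly the kind of error that schema conformance alone does not catch.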

For the BMI example, a prompt that requests specific outputs could read:

Example Prompt

Return an R script that performs the following steps and includes comments explaining each section of the code:
Reads BMI data from a user-supplied local TSV path
Filters out implausible heights and weights
Selects a single representative BMI measurement per person
Categorizes BMI based on pre-defined thresholds
Outputs a cleaned dataset and a summary table
Includes a smoke test that uses only synthetic data
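When reviewing the draft an LLM returns for a prompt like this, it can help to have a small reference version of the same steps to compare against. The sketch below is written in Python rather than the requested R, purely as an illustration. The height and weight limits are placeholders, choosing the most recent measurement per person is an assumed selection rule, and the category cut-points are the standard WHO adult values (18.5, 25, 30). All rows are synthetic.

```python
import csv
import io

# Placeholder plausibility limits for illustration; use the plan's own values.
HEIGHT_CM_RANGE = (100.0, 250.0)
WEIGHT_KG_RANGE = (25.0, 350.0)


def categorize(bmi: float) -> str:
    """Standard WHO adult cut-points: 18.5, 25, and 30 kg/m^2."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def harmonize(tsv_text: str) -> dict:
    """Return {person_id: (bmi, category)} from each person's latest plausible row."""
    latest = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        height, weight = float(row["height_cm"]), float(row["weight_kg"])
        if not HEIGHT_CM_RANGE[0] <= height <= HEIGHT_CM_RANGE[1]:
            continue
        if not WEIGHT_KG_RANGE[0] <= weight <= WEIGHT_KG_RANGE[1]:
            continue
        kept = latest.get(row["person_id"])
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if kept is None or row["measurement_date"] > kept["measurement_date"]:
            latest[row["person_id"]] = row
    result = {}
    for pid, row in latest.items():
        bmi = round(float(row["weight_kg"]) / (float(row["height_cm"]) / 100) ** 2, 1)
        result[pid] = (bmi, categorize(bmi))
    return result


# Smoke test on synthetic rows only.
SYNTHETIC_TSV = "\n".join([
    "person_id\theight_cm\tweight_kg\tmeasurement_date",
    "SYN001\t170\t65\t2020-01-15",
    "SYN001\t170\t80\t2021-03-02",
    "SYN002\t17\t60\t2020-06-30",  # implausible height, should be dropped
    "SYN003\t160\t90\t2019-11-11",
])
smoke = harmonize(SYNTHETIC_TSV)
print(smoke)
```

A smoke test this small only shows that the steps run in the right order on made-up rows; it says nothing about whether the thresholds suit a real cohort.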

To further enhance the effectiveness of our prompting, we can leverage prompt templates [Datasette prompt templates]. The provided file CodingPromptForRScript.md is an excellent example of a structured approach to guide the LLM. This template begins by assigning the LLM the role of an expert R programmer with experience in production-quality scripts using specific libraries like data.table and optparse. It then instructs the LLM to:

Review a detailed project description and analysis plan (e.g., the one in ProjectPlan_Advanced.md).
Ask clarifying questions as needed.
Only begin writing code once it is confident in its understanding.
Follow specific coding guidelines (e.g., using optparse for command-line arguments, data.table for data manipulation, including clear comments and function definitions, ensuring the script can be run from the command line, and outputting useful logs).

Practical flow:

Create a sanitized copy of your chosen project plan (Initial, Improved, or Advanced).

Remove individual-level records, PHI, PII, private paths, credentials, small-cell outputs, and controlled-access content.

Paste the sanitized plan together with the CodingPromptForRScript.md template.

Wait for the LLM to generate an initial script.

Review, refine, and iterate as described in Chapters 4–6.
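Before pasting a sanitized plan, a quick automated scan can catch a few obvious leftovers. The sketch below is a rough Python illustration under stated assumptions: the patterns and the names `RED_FLAGS` and `find_red_flags` are invented for this example, they are far from a complete PHI or PII detector, and they do not replace reading the plan yourself.

```python
import re

# Illustrative patterns only; an assumption, not a complete PHI/PII detector.
RED_FLAGS = {
    "absolute path": re.compile(r"(?:/(?:home|Users|data|mnt)/\S+|[A-Za-z]:\\\S+)"),
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN-like number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credential assignment": re.compile(
        r"\b(?:password|api[_-]?key|token|secret)\s*[:=]\s*\S+", re.IGNORECASE
    ),
}


def find_red_flags(text: str) -> list:
    """Return (line_number, label) pairs for lines matching a red-flag pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in RED_FLAGS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


# Made-up plan text with two deliberate problems.
plan = "\n".join([
    "Input: tab-delimited file with columns person_id, bmi, measurement_date",
    "Data lives at /data/ehr/cohort_v2/bmi.tsv",
    "api_key = EXAMPLE-NOT-A-REAL-KEY",
])
flags = find_red_flags(plan)
print(flags)
```

An empty result means only that none of these few patterns matched, so the manual review in the steps above still applies.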

Finally, to see how planning and research translate into an actionable set of instructions, consider an example prompt for BMI harmonization. Such a prompt would combine the overarching goal of harmonizing BMI data, the specific context of EHR data with its columns and format, the sub-tasks identified in Chapter 1 (reading, cleaning, filtering, selecting, categorizing, outputting), and any relevant research insights from Chapter 2 (e.g., specific BMI category definitions or outlier thresholds).

Below is a short example showing how you might combine the plan with a prompt template:

Example Prompt Template

You are an expert R programmer with deep experience in writing production-quality scripts using `data.table` and `optparse`.

Here is my sanitized project plan. It contains schema and methods only, with no real records, PHI, PII, private paths, credentials, or controlled-access data:
```markdown
<paste sanitized contents of docs/templates/ProjectPlan_Advanced.md here>
```

Please review the plan, ask any clarifying questions, and then generate an R script that follows the instructions precisely. Use synthetic fixtures for examples and tests.

The LLM then evaluates the sanitized project plan and desired coding format. Once it understands them, it can produce an R script that will be reviewed, tested with synthetic fixtures, and later run against approved data only inside the proper environment.
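If you assemble this prompt in a script rather than by hand, the template text above can be joined with the sanitized plan programmatically. The following Python sketch assumes the plan is already sanitized; `build_prompt` and the constant names are hypothetical. It picks a code fence longer than any run of backticks inside the plan so that the plan's own fences cannot close the outer one early.

```python
import re

# Template text taken from the example prompt template above.
ROLE = ("You are an expert R programmer with deep experience in writing "
        "production-quality scripts using `data.table` and `optparse`.")
INTRO = ("Here is my sanitized project plan. It contains schema and methods only, "
         "with no real records, PHI, PII, private paths, credentials, or "
         "controlled-access data:")
REQUEST = ("Please review the plan, ask any clarifying questions, and then generate "
           "an R script that follows the instructions precisely. Use synthetic "
           "fixtures for examples and tests.")


def build_prompt(sanitized_plan: str) -> str:
    """Wrap an already-sanitized plan in the prompt template."""
    plan = sanitized_plan.strip()
    if not plan:
        raise ValueError("sanitized plan is empty")
    # Use a fence longer than any backtick run inside the plan.
    longest = max((len(run) for run in re.findall(r"`+", plan)), default=0)
    fence = "`" * max(3, longest + 1)
    return "\n".join(
        [ROLE, "", INTRO, f"{fence}markdown", plan, fence, "", REQUEST]
    )


prompt = build_prompt("# Plan\n- Read the TSV\n- Categorize BMI\n")
print(prompt)
```

Raising an error on an empty plan is a deliberate choice here: sending the template without the plan would invite the LLM to guess at requirements.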

In conclusion, effective prompting is a skill honed with practice [Alonso]. By adhering to the principles of clarity, specificity, and sufficient context, often by including your project plan and the coding prompt template directly in the conversation, you can harness the LLM’s code-generation capabilities to significantly accelerate your research and coding workflows. The more specific and clear your prompt, the more closely the generated code will align with your needs.

Specific prompts produce more reviewable code. Include the plan, data details, output requirements, constraints, and checks in the same request.