simulate_ehr_data.R

simulate_ehr_data.R generates synthetic EHR BMI measurements and demographic records that can be used to practice the 8-step GenAI coding workflow.

Purpose

The script creates tab-delimited synthetic data with realistic quality issues, such as missing values and implausible measurements, and writes a separate data dictionary for each dataset. This gives readers a safe example dataset for prompt engineering, code review, data cleaning, and documentation exercises.
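
To make the kinds of issues concrete, here is a minimal sketch of how missing and implausible BMI values might be flagged in a cleaning exercise. The toy rows are made up for illustration, and the plausibility bounds (`bmi_min`, `bmi_max`) are assumptions, not values taken from the script.

```r
# Toy data frame using column names from the output schema; values are made up.
ehr <- data.frame(
  encounter_id = c("e1", "e2", "e3", "e4"),
  person_id    = c("p1", "p1", "p2", "p3"),
  bmi          = c(24.5, NA, 310.0, 18.2),
  stringsAsFactors = FALSE
)

# Assumed plausibility bounds for BMI; adjust to your own cleaning rules.
bmi_min <- 10
bmi_max <- 100

# Flag missing values and out-of-range values separately so each can be reviewed.
ehr$bmi_missing     <- is.na(ehr$bmi)
ehr$bmi_implausible <- !is.na(ehr$bmi) & (ehr$bmi < bmi_min | ehr$bmi > bmi_max)

print(ehr)
```

Keeping the two flags separate makes it easier to report how many records were affected by each issue before deciding how to handle them.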

Example Usage

Rscript scripts/simulate_ehr_data.R \
  --output_ehr "./data/raw/ehr_bmi_simulated_data.tsv" \
  --output_ehr_dict "./data/raw/data_dictionary.txt" \
  --output_demo "./data/raw/demographics_simulated_data.tsv" \
  --output_demo_dict "./data/raw/demographics_data_dictionary.txt" \
  --seed 123 \
  --n_individuals 1000
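
Once the command above has run, the EHR output could be loaded back into R along these lines. This is a sketch that assumes the file has a header row and that the working directory is the project root; how the script encodes missing values is not documented here, so the `na.strings` setting is an assumption.

```r
# Path taken from the example command; adjust if you used a different --output_ehr.
ehr_path <- "./data/raw/ehr_bmi_simulated_data.tsv"

ehr <- NULL
if (file.exists(ehr_path)) {
  # read.delim() defaults to tab-separated input with a header row.
  # Treating both "NA" and empty strings as missing is an assumption.
  ehr <- read.delim(ehr_path, na.strings = c("NA", ""), stringsAsFactors = FALSE)
  str(ehr)
} else {
  message("File not found; run simulate_ehr_data.R first: ", ehr_path)
}
```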

Outputs

  • Synthetic EHR BMI data in TSV format.
  • An EHR data dictionary.
  • Synthetic demographic data in TSV format.
  • A demographic data dictionary.

erDiagram
    accTitle: Synthetic EHR output relationships
    accDescr: An entity relationship diagram showing that each demographic record can have many synthetic EHR measurement records linked by person_id.
    DEMOGRAPHICS ||--o{ EHR_MEASUREMENT : has
    DEMOGRAPHICS {
        string person_id PK
        date date_of_birth
        int age
        string age_bin
        string race_ethnicity_harmonized
        string sex_gender
        string zip3
    }
    EHR_MEASUREMENT {
        string encounter_id PK
        string person_id FK
        float bmi
        float height_cm
        float weight_kg
        datetime measurement_date
    }
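
The relationship in the diagram can be exercised with a short join. The sketch below uses made-up toy rows with column names from the diagram; the `sex_gender` codes are placeholders, and the real files would be read from disk instead. It checks that `person_id` is unique in the demographic table, looks for EHR rows without a matching person, and recomputes BMI from height and weight as a consistency check.

```r
# Toy tables using column names from the diagram; all values are made up.
demographics <- data.frame(
  person_id  = c("p1", "p2", "p3"),
  sex_gender = c("F", "M", "F"),   # placeholder codes, not the script's encoding
  stringsAsFactors = FALSE
)
ehr <- data.frame(
  encounter_id = c("e1", "e2", "e3"),
  person_id    = c("p1", "p1", "p2"),
  height_cm    = c(165, 165, 180),
  weight_kg    = c(60, 62, 81),
  stringsAsFactors = FALSE
)

# person_id is the primary key of the demographic table, so it must be unique.
stopifnot(!anyDuplicated(demographics$person_id))

# Foreign key check: EHR rows whose person_id has no demographic record.
orphans <- setdiff(ehr$person_id, demographics$person_id)

# Left join: all.x = TRUE keeps people with zero measurements,
# matching the "zero or many" side of the relationship.
joined <- merge(demographics, ehr, by = "person_id", all.x = TRUE)

# BMI = weight (kg) / height (m)^2; compare against the recorded bmi column
# in real data to spot inconsistent rows.
joined$bmi_recomputed <- joined$weight_kg / (joined$height_cm / 100)^2

print(joined)
```

A left join is used here so that individuals without any measurement remain visible; an inner join would silently drop them.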