simulate_ehr_data.R
simulate_ehr_data.R generates synthetic EHR BMI measurements and demographic records that can be used to practice the 8-step GenAI coding workflow.
Purpose
The script creates tab-delimited synthetic data containing realistic data-quality issues, such as missing values and implausible measurements, and writes a separate data dictionary for each dataset. This gives readers a safe example dataset for prompt engineering, code review, data cleaning, and documentation exercises.
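To make the idea of seeded synthetic data with injected issues concrete, here is a minimal sketch in base R. It is not the logic of simulate_ehr_data.R: the function name simulate_bmi, the distribution parameters, and the p_missing and p_implausible rates are all hypothetical choices for illustration. Only the column names come from the entity diagram in this README.

```r
# Hypothetical sketch only; the real script may simulate data differently.
simulate_bmi <- function(n_individuals = 10, seed = 123,
                         p_missing = 0.1, p_implausible = 0.05) {
  set.seed(seed)  # the same seed reproduces the same synthetic data

  # Assumed distributions, chosen only to give plausible-looking values.
  height_cm <- round(rnorm(n_individuals, mean = 168, sd = 10), 1)
  weight_kg <- round(rnorm(n_individuals, mean = 75, sd = 15), 1)

  # Inject implausible weights by inflating a random subset tenfold.
  bad <- runif(n_individuals) < p_implausible
  weight_kg[bad] <- weight_kg[bad] * 10

  # Inject missing heights in a random subset.
  miss <- runif(n_individuals) < p_missing
  height_cm[miss] <- NA

  data.frame(
    person_id = sprintf("P%04d", seq_len(n_individuals)),
    height_cm = height_cm,
    weight_kg = weight_kg,
    # BMI = kg / m^2; it is NA wherever height is missing.
    bmi = round(weight_kg / (height_cm / 100)^2, 1),
    stringsAsFactors = FALSE
  )
}

sim_a <- simulate_bmi(n_individuals = 1000, seed = 123)
sim_b <- simulate_bmi(n_individuals = 1000, seed = 123)
identical(sim_a, sim_b)  # TRUE: same seed, same data
```

Fixing the seed is what makes an exercise repeatable: two readers who run the simulation with the same arguments can compare cleaning results row by row.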
Example Usage
Rscript scripts/simulate_ehr_data.R \
--output_ehr "./data/raw/ehr_bmi_simulated_data.tsv" \
--output_ehr_dict "./data/raw/data_dictionary.txt" \
--output_demo "./data/raw/demographics_simulated_data.tsv" \
--output_demo_dict "./data/raw/demographics_data_dictionary.txt" \
--seed 123 \
--n_individuals 1000
Outputs
- Synthetic EHR BMI data in TSV format.
- An EHR data dictionary.
- Synthetic demographic data in TSV format.
- A demographic data dictionary.
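The TSV outputs listed above can be loaded with base R. The sketch below writes a small stand-in file to a temporary path so that it runs without the real outputs; the rows are made up, and the assumption that missing values appear as empty fields or the string NA is not confirmed by the script.

```r
# Stand-in file with made-up rows, so the sketch runs on its own.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "person_id\tbmi\theight_cm\tweight_kg",
  "P0001\t24.2\t170.0\t70.0",
  "P0002\tNA\tNA\t82.5",
  "P0003\t\t165.0\t"
), tsv_path)

# read.delim() defaults to header = TRUE and tab as the separator.
# Treating "" and "NA" as missing is an assumption about the file encoding.
ehr <- read.delim(tsv_path, na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)

str(ehr)              # inspect column types before any cleaning
colSums(is.na(ehr))   # count missing values per column
```

For the real data, replace tsv_path with the path passed to --output_ehr or --output_demo.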
erDiagram
accTitle: Synthetic EHR output relationships
accDescr: An entity relationship diagram showing that each demographic record can have many synthetic EHR measurement records linked by person_id.
DEMOGRAPHICS ||--o{ EHR_MEASUREMENT : has
DEMOGRAPHICS {
string person_id PK
date date_of_birth
int age
string age_bin
string race_ethnicity_harmonized
string sex_gender
string zip3
}
EHR_MEASUREMENT {
string encounter_id PK
string person_id FK
float bmi
float height_cm
float weight_kg
datetime measurement_date
}
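The relationship in the diagram can be exercised with a small join. This is a sketch on made-up rows: the sex_gender codes, the IDs, and the BMI plausibility bounds (below 10 or above 100) are illustrative assumptions, not values defined by the script.

```r
# Made-up rows using column names from the diagram.
demographics <- data.frame(
  person_id = c("P0001", "P0002", "P0003"),
  sex_gender = c("F", "M", "F"),
  stringsAsFactors = FALSE
)
ehr <- data.frame(
  encounter_id = c("E1", "E2", "E3", "E4"),
  person_id = c("P0001", "P0001", "P0002", "P0009"),
  bmi = c(23.5, 24.1, 310.0, 27.8),
  stringsAsFactors = FALSE
)

# Foreign-key check: measurement rows whose person_id has no demographic record.
orphans <- setdiff(ehr$person_id, demographics$person_id)

# Left join from DEMOGRAPHICS keeps people with zero measurements,
# matching the one-to-zero-or-many cardinality in the diagram.
joined <- merge(demographics, ehr, by = "person_id", all.x = TRUE)

# Number of measurements per person (0 for people with no encounters).
n_meas <- tapply(!is.na(joined$encounter_id), joined$person_id, sum)

# Flag implausible BMI values; the bounds are illustrative assumptions.
joined$bmi_implausible <- !is.na(joined$bmi) &
  (joined$bmi < 10 | joined$bmi > 100)
```

Here orphans contains only "P0009", P0003 appears in joined with missing measurement columns, and the single flagged row is the BMI of 310.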