Project Plan - Advanced

This example is written for synthetic or otherwise approved training data. Do not paste real EHR rows, individual-level records, PHI, PII, private paths, credentials, controlled-access data, or sensitive small-cell outputs into an LLM. For real projects, keep the code repository separate from protected data and run generated code only inside the approved environment.

Data Description

The primary source file contains rows of BMI-related information (tab-delimited) with the following fields:

  • person_id: A unique identifier for each person
  • encounter_id: An identifier for each clinical encounter
  • bmi: The numerical BMI value
  • height_cm: Height in centimeters
  • weight_kg: Weight in kilograms
  • measurement_date: Date of the BMI measurement

Note: This raw EHR data was not originally collected for research purposes and may contain multiple rows per person.
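
A minimal Python sketch of reading such a file follows. It assumes a header row that uses exactly the field names listed above and ISO-formatted (YYYY-MM-DD) measurement dates; the plan states neither. The function name read_bmi_rows and the sample values are hypothetical and synthetic.

```python
import csv
import io
from datetime import date


def _to_float(text):
    """Return a float, or None when the cell is blank, 'NA', or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def _to_date(text):
    """Return a date for an ISO 'YYYY-MM-DD' string, or None otherwise."""
    try:
        return date.fromisoformat(text)
    except (TypeError, ValueError):
        return None


def read_bmi_rows(handle):
    """Parse tab-delimited BMI rows into dicts with typed values."""
    rows = []
    for raw in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "person_id": raw["person_id"],
            "encounter_id": raw["encounter_id"],
            "bmi": _to_float(raw["bmi"]),
            "height_cm": _to_float(raw["height_cm"]),
            "weight_kg": _to_float(raw["weight_kg"]),
            "measurement_date": _to_date(raw["measurement_date"]),
        })
    return rows


# Synthetic two-row example; the second row has a missing BMI and weight.
sample = io.StringIO(
    "person_id\tencounter_id\tbmi\theight_cm\tweight_kg\tmeasurement_date\n"
    "P1\tE1\t24.2\t170\t70\t2020-01-15\n"
    "P1\tE2\tNA\t170\t\t2020-06-01\n"
)
bmi_rows = read_bmi_rows(sample)
```

Unparseable cells become None rather than raising an error, so that later cleaning steps can flag them instead of the load failing.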

In addition, a separate demographics file is available that provides key background information:

  • person_id: De-identified Patient Identifier
  • date_of_birth: Date of Birth (YYYY-MM-DD)
  • age: Age in years (with invalid values set to NA)
  • age_bin: Age category (e.g., <18, 18-34, etc.)
  • deceased: Indicator if the person is deceased
  • race: Race (with NA for ‘Patient Refused’, ‘Unknown’, or blank)
  • ethnicity: Ethnicity (with NA for ‘Patient Refused’, ‘Unknown’, or blank)
  • race_ethnicity: Combined Race and Ethnicity
  • race_ethnicity_harmonized: Harmonized classification (e.g., Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian/Pacific Islander/Native American, Other)
  • sex_gender: Consolidated Sex/Gender (using sex assigned at birth if available)
  • marital_status_name: Marital Status
  • zip3: Three-digit ZIP code

For processing the height and weight data, the R package growthcleanr will be used. This requires calculating an additional variable, agedays (age in days at each measurement), from the individual’s date of birth and the measurement date.
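
As a sketch of the agedays calculation in Python, assuming agedays means the number of days from date_of_birth to measurement_date: the helper name age_in_days is hypothetical, and returning None for a measurement dated before birth is an assumption about how such rows should be handled.

```python
from datetime import date


def age_in_days(date_of_birth, measurement_date):
    """Days from birth to measurement.

    Returns None if either date is missing or the measurement precedes birth.
    """
    if date_of_birth is None or measurement_date is None:
        return None
    days = (measurement_date - date_of_birth).days
    return days if days >= 0 else None


print(age_in_days(date(2000, 1, 1), date(2020, 1, 1)))  # 7305
```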


Tasks to Be Accomplished

  1. Data Ingestion
    • BMI Data:
      • Read the BMI file using the appropriate delimiter (e.g., tab).
      • Convert columns to appropriate data types, especially the measurement dates.
    • Demographic Data:
      • Load the demographics file ensuring proper parsing of key fields such as person_id, sex_gender, date_of_birth, and race.
      • Compute agedays (age in days at each measurement, from date_of_birth and measurement_date) to support processing with growthcleanr.
  2. Data Cleaning and Filtering
    • General Cleaning:
      • Flag rows with missing or implausible values in height or weight.
      • Identify and flag rows where the reported BMI greatly differs from the BMI calculated using height and weight.
      • Output flagged entries to a separate file and continue processing only valid rows.
    • Outlier Detection:
      • For each person, identify extreme outlier measurements.
      • Write these outlier records to a separate file and retain the remaining valid records.
  3. Representative Record Selection
    • For each individual, from the valid records, determine a “typical” measurement:
      • Exclude any measurements flagged as extreme outliers.
      • Compute the median BMI across the remaining valid measurements.
      • Select the row whose BMI equals the median or, if no row matches it exactly (e.g., with an even number of measurements), the row closest to it.
      • If no valid measurements remain for a person, flag that individual and record them separately.
  4. Categorization of BMI, Height, and Weight
    • BMI Categorization:
      • General:
        • Underweight: BMI < 18.5 kg/m²
        • Normal: 18.5 ≤ BMI < 25 kg/m²
        • Overweight: 25 ≤ BMI < 30 kg/m²
        • Obesity I: 30 ≤ BMI < 35 kg/m²
        • Obesity II: 35 ≤ BMI < 40 kg/m²
        • Obesity III: BMI ≥ 40 kg/m²
      • Race-Specific: For individuals identified (using race) as Black, Asian, Native American, or Pacific Islander:
        • Underweight: BMI < 18.5 kg/m²
        • Normal: 18.5 ≤ BMI < 23 kg/m²
        • Overweight: 23 ≤ BMI < 27.5 kg/m²
        • Obesity: BMI ≥ 27.5 kg/m², subdivided as:
          • Obesity I: 27.5 ≤ BMI < 32.5 kg/m²
          • Obesity II: 32.5 ≤ BMI < 37.5 kg/m²
          • Obesity III: BMI ≥ 37.5 kg/m²
    • Height Categorization:
      • Short: height < 150 cm
      • Average: 150 cm ≤ height < 180 cm
      • Tall: height ≥ 180 cm
    • Weight Categorization:
      • Light: weight < 50 kg
      • Medium: 50 kg ≤ weight < 80 kg
      • Heavy: 80 kg ≤ weight < 100 kg
      • Very Heavy: weight ≥ 100 kg
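
The category boundaries in step 4 are all half-open intervals (lower bound inclusive), so they can be expressed as sorted cutoffs with one more label than there are cutoffs. The Python sketch below takes that approach; the names categorize, GENERAL_BMI, RACE_SPECIFIC_BMI, HEIGHT_CM, and WEIGHT_KG are hypothetical. Deciding which individuals receive the race-specific scheme depends on how race is coded in the demographics file, so that choice is left to the caller.

```python
from bisect import bisect_right

BMI_LABELS = ["Underweight", "Normal", "Overweight",
              "Obesity I", "Obesity II", "Obesity III"]

# Each scheme is (sorted cutoffs, labels); len(labels) == len(cutoffs) + 1.
GENERAL_BMI = ([18.5, 25, 30, 35, 40], BMI_LABELS)
RACE_SPECIFIC_BMI = ([18.5, 23, 27.5, 32.5, 37.5], BMI_LABELS)
HEIGHT_CM = ([150, 180], ["Short", "Average", "Tall"])
WEIGHT_KG = ([50, 80, 100], ["Light", "Medium", "Heavy", "Very Heavy"])


def categorize(value, scheme):
    """Map a numeric value to its label; a value equal to a cutoff
    falls into the higher category. Returns None for a missing value."""
    if value is None:
        return None
    cutoffs, labels = scheme
    return labels[bisect_right(cutoffs, value)]


print(categorize(27.5, GENERAL_BMI))        # Overweight
print(categorize(27.5, RACE_SPECIFIC_BMI))  # Obesity I
print(categorize(180, HEIGHT_CM))           # Tall
print(categorize(49.9, WEIGHT_KG))          # Light
```

The same BMI of 27.5 kg/m² is labeled differently under the two schemes, which is why the cleaned dataset carries both columns.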

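The row-level flagging in step 2 and the representative-record selection in step 3 can be sketched in Python as follows. The plausibility ranges and the BMI discrepancy tolerance are placeholder assumptions, since the plan does not specify thresholds. The function names are hypothetical, and per-person outlier detection (planned for growthcleanr) is not covered here.

```python
from statistics import median


def computed_bmi(weight_kg, height_cm):
    """BMI in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def flag_row(row, height_range=(50, 250), weight_range=(2, 400),
             max_bmi_diff=2.0):
    """Return a reason string if the row should be flagged, else None.

    All three thresholds are placeholder assumptions, not values from the
    plan. A missing reported BMI is also treated as 'missing' (assumption).
    """
    h, w, bmi = row.get("height_cm"), row.get("weight_kg"), row.get("bmi")
    if h is None or w is None or bmi is None:
        return "missing"
    if not height_range[0] <= h <= height_range[1]:
        return "implausible"
    if not weight_range[0] <= w <= weight_range[1]:
        return "implausible"
    if abs(bmi - computed_bmi(w, h)) > max_bmi_diff:
        return "bmi_mismatch"
    return None


def representative_row(valid_rows):
    """Pick the row whose BMI is closest to the person's median BMI.

    Ties go to the earliest row in input order (assumption).
    Returns None when the person has no valid rows.
    """
    if not valid_rows:
        return None
    target = median(r["bmi"] for r in valid_rows)
    return min(valid_rows, key=lambda r: abs(r["bmi"] - target))


# Synthetic rows for a single person.
rows = [
    {"bmi": 24.2, "height_cm": 170, "weight_kg": 70},
    {"bmi": 31.0, "height_cm": 170, "weight_kg": 70},    # reported BMI disagrees
    {"bmi": 24.9, "height_cm": 170, "weight_kg": 72},
    {"bmi": 25.6, "height_cm": 170, "weight_kg": 74},
    {"bmi": 24.2, "height_cm": None, "weight_kg": 70},   # missing height
]
flags = [flag_row(r) for r in rows]
valid = [r for r, f in zip(rows, flags) if f is None]
typical = representative_row(valid)
```

Returning a reason string rather than a boolean makes it straightforward to count removals by cause for the summary report.
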
Expected Output

  1. Cleaned Dataset
    • A TSV file containing one representative row per person, which includes:
      • The “typical” BMI measurement along with its corresponding height, weight, and measurement date.
      • Demographic information (including date_of_birth, sex_gender, computed agedays, and race).
      • Categorical variables for BMI (both general and race-specific), height, and weight.
  2. Summary Report
    • A detailed summary (in text or Markdown) that documents:
      • The number of rows removed due to missing or implausible values.
      • The number of rows flagged and removed because of large discrepancies between reported and computed BMI.
      • The number of extreme outlier measurements removed.
      • The count of individuals with only invalid measurements versus those with valid measurements.
      • Descriptive statistics on the number of BMI measurements per person (e.g., mean, median, standard deviation).
    • A tableone-based table of each individual’s “typical” BMI, height, and weight, presenting:
      • Results overall and stratified by sex_gender.
      • Breakdown of individuals across BMI categories (both general and race-specific) as well as height and weight categories.
      • A summary of the distribution of BMI, height, and weight categories.
  3. Data Dictionary
    • A separate document detailing each column along with a brief description (e.g., person_id: a unique identifier for each person, date_of_birth: the patient’s birth date, etc.).
  4. Additional Considerations
    • Utilize the R package growthcleanr for processing height and weight, leveraging the computed agedays variable from the demographic data.
    • Incorporate race-specific BMI categorizations to better capture cardiometabolic risk profiles in populations prone to central adiposity at lower BMI thresholds.
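
For the Summary Report item on descriptive statistics of BMI measurements per person, a minimal Python sketch is shown below. The function name measurements_per_person_stats is hypothetical, and the use of the sample standard deviation is an assumption, as the plan does not say which form to report.

```python
from collections import Counter
from statistics import mean, median, stdev


def measurements_per_person_stats(person_ids):
    """Summarize how many rows each person contributes.

    person_ids holds one entry per measurement row. The standard deviation
    is the sample form (n - 1 denominator) and is None with fewer than two
    persons; every statistic is None when there are no rows.
    """
    counts = list(Counter(person_ids).values())
    if not counts:
        return {"n_persons": 0, "mean": None, "median": None, "sd": None}
    return {
        "n_persons": len(counts),
        "mean": mean(counts),
        "median": median(counts),
        "sd": stdev(counts) if len(counts) > 1 else None,
    }


# Synthetic IDs: P1 has 3 rows, P2 has 1, P3 has 2.
stats = measurements_per_person_stats(["P1", "P1", "P1", "P2", "P3", "P3"])
print(stats)
```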