DREAM DATA CAREER

Aspiring Biomedical Data Scientist with a laboratory diagnostics background, bringing curiosity, clinical perspective, and data-driven thinking to biomedical research.

Portfolio Case Study · Data Science · Healthcare Analytics

Voice-Based Biomarkers and Motor Severity in Parkinson’s Disease

This portfolio project explored whether motor symptom severity in Parkinson’s disease can be estimated from voice-based acoustic biomarkers and basic demographic variables.

The analysis was based on the Oxford Parkinson’s Telemonitoring Dataset and focused on predicting motor_UPDRS, a clinical score describing motor symptom severity.

Rather than building a diagnostic classifier, the goal was to frame the task as a regression problem and examine how much clinically meaningful information voice features can provide.

Project Focus

The project investigated the relationship between voice measurements, patient characteristics, and motor symptom severity in Parkinson’s disease.

The main target variable was motor_UPDRS. Higher values indicate more severe motor symptoms, while lower values suggest milder motor involvement.

Task: Regression-based estimation of motor symptom severity
Target: motor_UPDRS score
Data Type: Repeated voice recording-based measurements
Main Predictors: Age, sex, HNR, PPE, Shimmer.APQ11, Jitter.RAP

Dataset at a Glance

The dataset contained 5,875 voice recording-based measurement records from patients with early-stage Parkinson’s disease.

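A key methodological point is that these 5,875 records do not represent 5,875 independent patients. Each patient contributed multiple recordings, so records from the same patient are correlated, and a purely record-level train/test split can place the same patient on both sides and make performance look better than it is.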

Records: 5,875 measurement records
Population: Early-stage Parkinson’s disease patients
Missing Values: No missing values were identified
Important Note: Repeated measurements were present
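
Because repeated measurements were present, one common precaution is to split the data by patient rather than by record. The sketch below illustrates the idea with the standard library only. The field name `patient_id`, the helper `split_by_patient`, and the synthetic records are assumptions made for illustration and are not taken from the original analysis.

```python
import random


def split_by_patient(records, test_fraction=0.25, seed=0):
    """Split records so that no patient appears in both train and test."""
    # Collect the unique patient identifiers and shuffle them reproducibly.
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)

    # Reserve a fraction of patients (not records) for testing.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


# Synthetic illustration: 8 hypothetical patients with 5 recordings each.
records = [
    {"patient_id": pid, "recording": k, "motor_UPDRS": 20.0 + pid + 0.1 * k}
    for pid in range(8)
    for k in range(5)
]

train, test = split_by_patient(records, test_fraction=0.25, seed=42)
train_ids = {r["patient_id"] for r in train}
test_ids = {r["patient_id"] for r in test}

# All recordings of a patient end up on the same side of the split.
print(len(train), len(test), train_ids.isdisjoint(test_ids))
```

With a patient-level split, the test error reflects how well the model generalizes to people it has never seen, which is the more relevant question for telemonitoring.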

Target Variable: motor_UPDRS

The distribution of motor_UPDRS showed that most observations were concentrated in the mild-to-moderate and moderate severity ranges.

The mean and median values were close to each other, suggesting that the target variable was not strongly skewed in one direction.

Distribution of motor_UPDRS values.
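
The mean-versus-median comparison can be checked numerically. The following is a minimal sketch, assuming the scores are available as a plain list of numbers. The helper `skew_summary` and both example lists are hypothetical and do not contain values from the dataset.

```python
import statistics


def skew_summary(values):
    """Return mean, median, and the moment coefficient of skewness."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # population standard deviation
    # Third standardized moment: about 0 for symmetric data,
    # positive for a long right tail, negative for a long left tail.
    skewness = sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)
    return {"mean": mean, "median": median, "skewness": skewness}


# Synthetic examples (not dataset values).
symmetric = [10, 15, 20, 25, 30]
right_skewed = [10, 11, 12, 13, 40]

print(skew_summary(symmetric))     # mean equals median, skewness is 0
print(skew_summary(right_skewed))  # mean above median, skewness is positive
```

A mean close to the median is a quick indication of symmetry, while the skewness coefficient gives a single number that can be reported alongside the histogram.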

Variability in Motor Severity

The boxplot provided a clearer view of the median, interquartile range, and overall spread of motor_UPDRS values.

The central 50% of observations were concentrated around the middle range, while both milder and more severe motor states were also present.

Boxplot of motor_UPDRS scores.
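
The quantities a boxplot displays can also be computed directly. The sketch below uses the standard library and the common 1.5 × IQR rule for the whisker limits. The helper `boxplot_summary` and the example scores are assumptions for illustration, and the quartile method used by a given plotting library may differ slightly.

```python
import statistics


def boxplot_summary(values):
    """Compute quartiles, IQR, and 1.5 * IQR fences for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Points outside these fences are usually drawn as outliers.
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


# Synthetic scores (not dataset values).
scores = [12, 15, 18, 20, 21, 24, 27, 30, 36]
summary = boxplot_summary(scores)
print(summary)
```

Here the central 50% of the synthetic scores lies between `q1` and `q3`, which is the box drawn in the plot.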

Age as a Key Background Factor

Age showed a visible positive relationship with motor_UPDRS. Higher motor_UPDRS values generally appeared in older age ranges.

At the same time, age alone did not fully explain motor symptom severity, since patients of similar age could still show different motor_UPDRS scores.

Relationship between age and motor_UPDRS, stratified by sex.

Voice Quality and Motor Severity

Among the acoustic variables, HNR (harmonics-to-noise ratio) was examined as a measure of voice clarity. It describes how strong the harmonic (periodic) component of the voice signal is relative to the noise component, so lower HNR indicates a noisier voice.

A weak negative relationship was observed between HNR and motor_UPDRS, suggesting that a noisier voice (lower HNR) may be associated with greater motor symptom severity.

Relationship between HNR and motor_UPDRS.
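
The strength of a relationship like this is often summarized with the Pearson correlation coefficient. The sketch below implements it from its definition. The function `pearson_r` and the paired values are synthetic and chosen only to produce a weak negative association. They are not results from this project.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic paired measurements (not dataset values).
hnr = [18.0, 20.0, 22.0, 24.0, 26.0, 28.0]
motor_updrs = [24.0, 19.0, 25.0, 18.0, 22.0, 20.0]

r = pearson_r(hnr, motor_updrs)
print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}")
```

The squared coefficient gives the share of variance a straight-line fit would explain, which helps put a "weak" relationship into perspective. With repeated recordings per patient, the coefficient describes the pooled records and should be read with that in mind.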

Model Interpretation

A Random Forest model was used to estimate motor_UPDRS and examine feature importance.

Age emerged as the strongest predictor, followed by the voice-based variables HNR, PPE, Shimmer.APQ11, and Jitter.RAP. Sex contributed comparatively little.

Feature importance in the Random Forest model.
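
A feature-importance ranking of this kind can be obtained from scikit-learn's `RandomForestRegressor`. The following is a minimal sketch on synthetic stand-in data. The hyperparameters, the generated feature values, and the way the target is constructed are all assumptions, since the original settings are not stated here, and the printed numbers are not the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
feature_names = ["age", "sex", "HNR", "PPE", "Shimmer.APQ11", "Jitter.RAP"]

# Synthetic stand-in features (not the real dataset).
X = np.column_stack([
    rng.normal(65, 9, n),         # age in years
    rng.integers(0, 2, n),        # sex coded as 0/1
    rng.normal(21, 4, n),         # HNR
    rng.normal(0.2, 0.09, n),     # PPE
    rng.normal(0.03, 0.02, n),    # Shimmer.APQ11
    rng.normal(0.003, 0.003, n),  # Jitter.RAP
])
# Assumed target: depends on age and HNR plus noise, for illustration only.
y = 0.3 * X[:, 0] - 0.2 * X[:, 2] + rng.normal(0, 3, n)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: {value:.3f}")
```

Impurity-based importances describe how the fitted trees used each variable, not a causal effect. Permutation importance on held-out patients (`sklearn.inspection.permutation_importance`) is a common cross-check.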

Main Takeaways

  • Voice-based acoustic features can reflect clinically relevant aspects of Parkinson’s disease.
  • Age was the strongest predictor of motor_UPDRS in this analysis.
  • HNR, PPE, Shimmer.APQ11, and Jitter.RAP contributed additional information.
  • The model’s predictive performance remained limited.
  • Voice-based telemonitoring appears promising as a complementary research direction, not as a standalone clinical decision-support tool.
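
One way to make "limited performance" concrete is to compare the model's error with a naive baseline that always predicts the mean score. The sketch below computes mean absolute error and R² from their definitions. The helper functions and all numbers are synthetic illustrations, not the project's results.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic true scores and model predictions (not project results).
y_true = [12.0, 18.0, 22.0, 27.0, 31.0]
y_pred = [20.0, 21.0, 21.0, 24.0, 25.0]

# Naive baseline: always predict the mean of the true scores.
baseline = [sum(y_true) / len(y_true)] * len(y_true)

print("model    MAE:", mean_absolute_error(y_true, y_pred))
print("baseline MAE:", mean_absolute_error(y_true, baseline))
print("model    R^2:", round(r_squared(y_true, y_pred), 3))
print("baseline R^2:", r_squared(y_true, baseline))
```

A model is only useful to the extent that it clearly beats this baseline on patients it was not trained on.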

Limitations

  • The dataset contained repeated measurement records, so the number of records was not equal to the number of individual patients.
  • Voice features may be influenced by recording conditions, microphone quality, background noise, and the patient’s current state.
  • The dataset focused on early-stage Parkinson’s disease, which limits generalizability to more advanced disease stages.

Portfolio Summary

This project demonstrated how demographic and voice-based acoustic variables can be used to explore motor symptom severity in Parkinson’s disease.

The results suggest that voice biomarkers may carry information that complements demographic variables such as age.

The analysis also highlighted an important practical limitation: voice features, even when combined with basic demographics, were not sufficient for accurate clinical-level prediction. More robust models would likely require longitudinal modeling, additional clinical variables, and external validation.