Abstract
Although large language models (LLMs) show promise for clinical healthcare applications, their utility for personalized health monitoring using wearable device data remains underexplored. Here we introduce the Personal Health Large Language Model (PH-LLM), designed for applications in sleep and fitness. PH-LLM is a version of the Gemini LLM that was finetuned for text understanding and reasoning when applied to aggregated daily-resolution numerical sensor data. We created three benchmark datasets to assess multiple complementary aspects of sleep and fitness: expert domain knowledge, generation of personalized insights and recommendations and prediction of self-reported sleep quality from longitudinal data. PH-LLM achieved scores that exceeded those of a sample of human experts on multiple-choice examinations in sleep medicine (79% versus 76%) and fitness (88% versus 71%). In a comprehensive evaluation involving 857 real-world case studies, PH-LLM performed similarly to human experts for fitness-related tasks and improved over the base Gemini model in providing personalized sleep insights. Finally, PH-LLM effectively predicted self-reported sleep quality using a multimodal encoding of wearable sensor data, further demonstrating its ability to contextualize wearable modalities. This work highlights the potential of LLMs to revolutionize personal health monitoring via tailored insights and predictions from wearable data and provides datasets, rubrics and benchmark performance to further accelerate personal health-related LLM research.
Main
Large language models (LLMs) have shown strong language generation performance across diverse domains. LLMs have achieved passing grades on examinations in the style of the US legal bar examination1 and the second-year medical school examination2,3,4. In medicine specifically, natural language generation may influence clinical practice5, education and research6. When enriched with healthcare-specific data, LLMs achieve impressive performance in medical question answering2,4, electronic health record analysis7, differential diagnosis from medical images8, psychiatric functional assessment9 and delivery of psychological interventions10,11,12.
Conventional clinical visits provide invaluable information but offer only periodic assessments of lifestyle features, including sleep, physical activity, stress and cardiometabolic health. These features, which have a profound impact on adverse health outcomes, morbidity, disability-adjusted life years and overall health13,14,15,16,17,18, can be measured passively and continuously by wearable devices19. However, these features are not yet widely used in clinical practice, nor are they incorporated into standard datasets used for medical question answering20,21, likely because these data are typically captured without context, are computationally demanding to store and analyze and can be difficult to interpret. Consequently, general foundation LLMs and even medically tuned LLMs may be unable to use these data effectively to reason about and recommend interventions based on personalized individual health behaviors.
Despite these hurdles, initial efforts to integrate sensor data with LLMs have occurred. Personal Health Insights Agent investigated tool use and code generation within an agent framework to accurately compute summary metrics of raw sensor data for use in question answering22. Similarly, Health-LLM evaluated LLMs on rating scale questions in consumer health framed as classification and regression problems23. Although valuable in demonstrating sensor data integration, these approaches do not assess the capability of LLMs to generate the long-form, nuanced and actionable advice characteristic of effective personal health coaching, leaving this critical application underexplored.
Here we provide new datasets, evaluation rubrics and benchmarks to assess LLM performance in the sleep and fitness domains. Although wearable devices measure more than just sleep and fitness, we focus on these domains specifically because the accuracy and suitability of commercial wearable measurements for these domains are well studied24,25; they have high engagement among device wearers; and practical recommendations can be offered without giving clinical advice. Additionally, the datasets introduced here can be used to improve performance of other LLMs, and the approach and rubrics are extensible to additional personal health domains.
Our main contributions are summarized in Fig. 1. We created Personal Health Large Language Model (PH-LLM), a version of Gemini finetuned to generate both insights about and recommendations to improve sleep and fitness patterns. We evaluated the performance of PH-LLM across three tasks in the sleep and fitness domains that span a breadth of abilities useful for a personal health coach. First, we assessed the amount of expert domain knowledge present in PH-LLM using multiple-choice examinations. Second, we evaluated the model’s ability to apply its expert knowledge and interpret aggregated sensor data to provide coaching recommendations from long-form case studies. Finally, for sleep only, we assessed the ability of PH-LLM to predict subjective patient-reported outcomes (PROs) in order to function as a coach that can incorporate individual assessments of sleep quality when generating recommendations.
a, Schematic showing the overall experiment design of the study. PH-LLM was evaluated on three aspects of personal health: (1) assessing its level of expert knowledge based on certification examination-style MCQs; (2) generating personalized insights and recommendations for user goals in the sleep and fitness domains; and (3) predicting PROs for sleep quality from aggregated daily-resolution numerical sensor data. b, Performance of PH-LLM, as contextualized with the responses of human experts, for professional examinations, coaching recommendations and PROs for sleep quality. Data are presented as mean accuracy (n = 629 sleep questions and n = 99 fitness examination questions), mean human expert rating (n = 4,265 sleep and n = 5,049 fitness case study ratings across principles and subsections from PH-LLM and n = 2,606 and n = 3,335 from human experts) and AUROC (n = 833 survey responses) bootstrapped over 1,000, 1,000 and 100 iterations, respectively. Error bars represent 95% confidence intervals. ‘*’ indicates a statistically significant difference between two groups using a two-sided t-test for examinations (P = 1.52 × 10−10 for fitness) and a two-sided Wilcoxon rank-sum test for recommendation ratings (P = 3.31 × 10−11 for sleep). ‘Naive performance’ is that achieved by a random classifier. HR, heart rate; HRV, heart rate variability.
Results
PH-LLM was developed from Gemini Ultra 1.0 (ref. 26) using a two-stage training process. First, we finetuned the entire model on the task of generating long-form case study responses for sleep and fitness. The training and evaluation data consist of textual representations of demographic data, up to 30 days of daily metrics, aggregated metrics and, for fitness case studies only, individual exercise logs and synthetically generated subjective assessments of workout readiness that mimic a user training log. The target case study responses were created by domain experts in sleep and fitness working in concert with Gemini Ultra 1.0 (Methods and Extended Data Fig. 1). After finetuning on case studies, we finetuned a multimodal adapter for PH-LLM prediction of PROs in sleep disturbance and sleep impairment27 from longitudinal passive sensor data containing daily sleep and activity metrics for at least 15 days28 (Methods). After both training stages were completed, we evaluated the performance of PH-LLM across three tasks in the sleep and fitness domains: multiple-choice examinations assessing expert knowledge, coaching recommendations in case study responses and classification performance on PRO prediction. To measure case study response quality, we designed a custom evaluation rubric that quantifies incorporation of user data, appropriate personalization based on user data, use of expert domain knowledge, evidence of confabulations or unwarranted assumptions, potential for harm, readability and overall quality.
PH-LLM exceeds expert performance on multiple-choice examinations
We compiled multiple-choice examinations in the style of the American Board of Internal Medicine (ABIM) Sleep Medicine Certification examination and the National Strength and Conditioning Association (NSCA) Certified Strength and Conditioning Specialists (CSCS) examination (Methods). PH-LLM correctly answered 79% of sleep medicine and 88% of fitness board examination questions tested (Table 1 and Fig. 1b), comfortably exceeding the approximate passing threshold (70%) required either to receive continuing medical education (CME) credit for sleep (https://www.boardvitals.com/sleep-medicine-moc-recertification) or to pass the practice examination for fitness (https://www.nsca.com/certification/cscs/certified-strength-and-conditioning-specialist-exam-description), and was competitive with leading external models, performing slightly worse on sleep questions but equivalently on fitness questions (Extended Data Fig. 2 and Supplementary Fig. 1). In sleep, PH-LLM scored 79%, whereas Gemini Ultra 1.0 scored 77%. In fitness, both PH-LLM and Gemini Ultra 1.0 scored 88%. Receiver operating characteristic (ROC) and precision-recall curves confirm strong model performance in both examinations (Supplementary Fig. 2). Furthermore, despite finetuning PH-LLM for sleep and fitness tasks, its performance on the general medical benchmarks PubMedQA29 and MedQA30 does not degrade (Supplementary Table 1).
The sleep medicine question bank contained additional metadata for each question, including the distribution of responses from human test takers, enabling performance comparisons stratified by empirical question difficulty. PH-LLM consistently improved slightly over Gemini Ultra 1.0, with larger performance improvements on more difficult questions, suggesting that finetuning on sleep case studies was relevant for these questions (Table 2). Notably, although this cannot be confirmed directly, the fact that accuracy tracked empirical question difficulty (that is, performance was stronger on questions that human test takers found easier) suggests that the overall strong performance is not due to memorization or the presence of the questions in the model pretraining corpus.
To contextualize the performance of PH-LLM, we recruited five sleep medicine experts (average experience: 25 years) with advanced degrees and five professional athletic trainers (average experience: 13.8 years) to take the respective examinations. The experts achieved an average score of 76% on a representative subset of the sleep medicine examination (N = 204) and 71% on the fitness examination. PH-LLM outperformed these experts on both question banks (Table 1). In sleep, when stratifying performance by empirical question difficulty for human test takers, we observed that PH-LLM performed similarly to both those test takers and the set of sleep medicine experts whom we recruited (Table 2).
We additionally performed ablation studies on the use of self-consistency31 (N = 5 rounds) and chain-of-thought (CoT) prompting32 (prompts in Supplementary Tables 2–4). Self-consistency improved performance on fitness questions for both CoT and non-CoT prompting techniques, whereas the performance from including CoT was mixed (Supplementary Table 5). Few-shot prompting with self-consistency improved performance further in both sleep and fitness examinations (Supplementary Table 6). Correct and incorrect answer examples are provided in Supplementary Tables 7 and 8.
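As an illustrative sketch of the self-consistency procedure (majority voting over several sampled answers), assuming a generic `generate_answer` function that returns one answer letter per call; this is not the exact evaluation harness used in this work:

```python
from collections import Counter

def self_consistency_answer(generate_answer, question, n_rounds=5):
    """Sample several answers for one MCQ and return the majority vote.

    `generate_answer` stands in for any stochastic LLM call that returns a single
    answer letter (for example, 'A'-'E'); it is not part of the published pipeline.
    """
    votes = Counter(generate_answer(question) for _ in range(n_rounds))
    answer, _ = votes.most_common(1)[0]
    return answer
```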
PH-LLM approaches expert performance on case studies
We next sought to evaluate the model’s ability to apply its expert knowledge and interpret sensor data. To do so, we created the first dataset of detailed personal health case studies in sleep and fitness (857 case studies containing 3,271 question–answer pairs) curated by multiple experts in the associated domains. The dataset contains individual wearable sensor data over a multi-week timeframe and corresponding long-form insights and recommendations (Fig. 2a,b). To assess model performance, we compiled responses to each case study by Gemini Ultra 1.0, PH-LLM and a human expert and had all responses rated by human experts.
a,b, Visual depiction of wearable sensor data used as input and corresponding expert analysis and recommendations for improving sleep quality (a) and for assessing workout readiness and fitness (b). For each case study, the experts considered demographic information and wearable sensor data for a period of up to 30 days over which the device was worn (sections (i)–(iii) of a and b). For all metrics considered, see Supplementary Tables 26 and 27 for sleep data and Supplementary Tables 28–36 for fitness data. Both human experts and PH-LLM used these inputs to create domain-specific responses: sleep case studies contained subsections for insights, possible etiologies and recommendations, and fitness case studies contained subsections for training load, sleep, health metrics, readiness assessment and recommendations. Abridged examples of human expert responses are shown in sections (iv) of a and b. c,d, Mean expert ratings across all subsection principles for case study responses generated by Gemini Ultra 1.0, PH-LLM and human experts in the sleep (c) and fitness (d) domains. Error bars represent 95% confidence intervals bootstrapped over 1,000 iterations. ‘*’ indicates a statistically significant difference (P < 0.05) between two response types using the two-sided Wilcoxon rank-sum test and multiple hypothesis testing correction. Within each bar, n denotes the number of principle ratings per conversation source, and circles show the proportion of scores at a given Likert rating. bpm, beats per minute; brpm, breaths per minute; hh, hours; HR, heart rate; HRV RMSSD, heart rate variability root mean square of successive differences; mm, minutes.
Evaluation rubric and rater consistency
We evaluated the aggregated performance of PH-LLM and human experts on the long-form case study responses, as rated by human experts using 15 questions on a grading scale of 1 (worst) to 5 (best), spanning topics such as using important domain knowledge, correctly referencing relevant user data and avoiding confabulations (Supplementary Table 9). Our rubric closely tracks prior human evaluation of LLMs in healthcare33. To enable inter-rater reliability assessments, we had multiple human experts rate responses (Methods). We conclude that inter-rater reliability was moderate, based on Gwet’s AC2 (refs. 34,35) values ranging from 0.699 to 0.956 (Extended Data Fig. 3) and contingency tables of inter-rater agreement (Extended Data Fig. 4).
Expert evaluation of PH-LLM and expert responses
For sleep case studies, PH-LLM responses received an average rating of 4.61 versus 4.75 for human expert responses (P = 3.3 × 10−11, N ≥ 2,606, Z (test statistic) = −6.63, r (effect size) = −0.08; Fig. 1b). Although the difference is statistically significant, the effect size is small, and PH-LLM responses are high quality as indicated by receiving the top rating 73% of the time. Finetuning PH-LLM on sleep case studies significantly improved its overall performance in this task as compared to Gemini Ultra 1.0 (average rating of 4.51 for Gemini Ultra 1.0 versus 4.61 for PH-LLM, P = 4.0 × 10−6, N ≥ 2,603, Z = 4.63, r = 0.06). For fitness case studies, PH-LLM aggregate performance was not statistically different from expert performance (P = 0.48, N ≥ 3,335, Z = −0.70, r = −0.01; Fig. 1b). Gemini Ultra 1.0 responses also were not statistically different from human expert responses (P = 0.92, N ≥ 3,161, Z = −0.10, r = −0.00).
Because the case studies contain multiple sections, we also analyzed ratings for each section separately (Fig. 2c,d). For sleep case studies, finetuning PH-LLM improved its ability to provide insights and etiologies (P = 6.65 × 10−7, N ≥ 800, Z = 5.18, r = 0.11 and P = 2.46 × 10−3, N ≥ 801, Z = 3.15, r = 0.07, respectively). Recommendations showed no statistically significant difference between Gemini Ultra 1.0 and PH-LLM (P = 0.45, N ≥ 801, Z = 0.76, r = 0.02). We further analyzed ratings by various rubric questions (Extended Data Fig. 5a). Finetuning PH-LLM improved its ability to reference important domain knowledge (P = 4.47 × 10−5, N ≥ 201, Z = 4.44, r = 0.19), important interpretations (P = 4.47 × 10−5, N ≥ 201, Z = 4.48, r = 0.19), important user data (P = 5.21 × 10−8, N ≥ 201, Z = 5.91, r = 0.26) and no unimportant interpretations (P = 4.31 × 10−2, N ≥ 201, Z = 2.53, r = 0.11).
For fitness case studies, PH-LLM had performance similar to both human experts and Gemini Ultra 1.0 (no statistically significant difference observed, N ≥ 768) on three out of four sections (Fig. 2d). Training load was the only section in which responses from human experts and Gemini Ultra 1.0 were rated higher than those from PH-LLM (P = 0.01, N ≥ 768, Z = 3.02, r = 0.07). When analyzing ratings by rubric questions, we observed no statistically significant differences in ratings between PH-LLM and human experts (Extended Data Fig. 5b).
We present examples of PH-LLM model responses for sleep (Supplementary Tables 10–12) and fitness (Supplementary Tables 13–17). To assess robustness to missing data, we examined performance on case studies stratified by data availability. We observed no bias in rating distributions based on data availability (Supplementary Fig. 3). To understand the effect of context changes on the PH-LLM response, we varied the input context provided to PH-LLM. We observed that PH-LLM appropriately adapted its responses depending on the length of the context available (Supplementary Fig. 4 and Supplementary Table 18). Additionally, PH-LLM effectively used different parts of the input data for distinct aspects of its response. For example, modifying training load statistics while keeping workout logs constant led the model to adjust its response accordingly (Supplementary Table 19). Similarly, the model adapted its output when faced with missing information, such as the absence of today’s sleep metrics (Supplementary Table 20). Finally, we note that some individuals were the subject of multiple fitness case studies from distinct days. Expert assessments of workout readiness can appropriately vary from one day to the next (Supplementary Fig. 5), and the relative performance of PH-LLM and human experts is consistent at a variety of target date overlaps allowed in the holdout set (Supplementary Fig. 6).
Automatic evaluation and ablation experiments
Expert human grading of case studies is laborious and time-consuming. To enable a wider set of ablation studies and additional experiments, we used the expert human ratings to train a separate LLM to automatically evaluate responses (‘AutoEval’) according to the rubric (Methods)36. Our best AutoEval models ranked case study response sources more accurately than an LLM not explicitly finetuned for AutoEval tasks (Supplementary Table 21) and similarly to human experts (Extended Data Fig. 6) in a fraction of the time (Supplementary Table 22).
When evaluated using the best AutoEval model, PH-LLM scored significantly higher than leading external models on sleep insights and etiologies sections and competitively on the other sections (Extended Data Fig. 7). Additionally, we used our AutoEval raters to explore the effect of finetuning data size on model performance. We observed that sleep insights and etiologies sections showed statistically significant rating improvements when using the whole dataset compared to training with only 25% of samples (Extended Data Fig. 8).
PH-LLM predicts PROs using sensor data
To assess whether PH-LLM could infer subjective user experience to better tailor coaching advice, we evaluated its ability to predict PROs in sleep disturbance and sleep impairment from daily-resolution numerical sensor data. To do so, we used the Google Digital Wellbeing (DWB) study dataset, which recruited thousands of individuals to use wearable devices and take validated survey instruments that measure wellbeing over a 4-week span28. We filtered the dataset to 4,163 individuals who had both daily measurements of 20 wearable sensor features for at least 15 days and complete responses for the sleep disturbance and impairment surveys at study completion (Methods). We first explored the PRO data by computing the correlation between the survey answers obtained from research participants. The 16 questions measure related but distinct sleep outcomes (Fig. 3a). We then investigated sensor features for possible confounding and observed similar distributions of sensor readings across devices and participant compliance (Supplementary Fig. 7). Next, we computed the importance of individual sensor features for predicting PROs by fitting linear models that used all sensor features as input to predict each survey question independently. We quantified importance as the magnitude of the regression coefficient. We observed that no single feature dominated predictive power across all PROs and that the overall predictive signal was spread across many sensors (Fig. 3b). However, we note that interpreting the relationships between sensor features and outcomes is complicated because sensor features may capture demographic effects.
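The per-question linear-model feature importance described above can be sketched as follows; standardizing features so that coefficient magnitudes are comparable is an assumption (the exact scaling used is not stated), and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def rank_sensor_features(X, y, feature_names):
    """Fit one linear model for a single survey question and rank features by |coefficient|.

    X: (individuals x 20) matrix of sensor features; y: responses to one survey question.
    """
    X_std = StandardScaler().fit_transform(X)     # put features on a common scale
    model = LinearRegression().fit(X_std, y)
    importance = np.abs(model.coef_)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]
```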
a, Correlations among survey responses for questions that measure related but distinct sleep outcomes from the PROMIS Sleep Disturbance and Sleep Impairment surveys. b, Feature importance for sensor features predicting survey responses in a linear regression model. The top two predictors for each survey question, measured based on the magnitude of the regression coefficient, are annotated with ‘*’. c, AUROC for the performance of PH-LLM with adapter, zero-shot and few-shot prompting approaches when predicting binary outcomes derived from survey responses in the test set (n = 833). The dashed vertical line denotes the AUROC of the random predictor. Data are presented as mean AUROC over 100 bootstrapping iterations, and error bars show 95% confidence intervals. Outcomes for which the confidence intervals of the difference in AUROC between PH-LLM with adapter and each of the zero-shot and few-shot approaches exclude 0 in 100 paired bootstrapping iterations are annotated with ‘*’. d, AUPRC for the performance of PH-LLM with adapter, zero-shot and few-shot prompting approaches when predicting binary outcomes derived from survey responses in the test set (n = 833). Outcome-specific prevalence bars are added to show the AUPRC of the random predictor. Survey response names are mapped to their corresponding questions in Supplementary Tables 39 and 40. SI, sleep impairment. Data are presented as mean AUPRC over 100 bootstrapping iterations, and error bars show 95% confidence intervals. Outcomes for which the confidence intervals of the difference in AUPRC between PH-LLM with adapter and each of the zero-shot and few-shot approaches exclude 0 in 100 paired bootstrapping iterations are annotated with ‘*’.
To enable PH-LLM to predict PROs from sensor features, we trained a multilayer perceptron (MLP) adapter to map summary statistics of the 20 sensor features into PH-LLM’s latent token space (Methods). We then provided the latent tokens to PH-LLM as context and prompted it to predict each binary outcome, analogous to previous work37. We compared the area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for each binary trait with zero-shot and few-shot text approaches (Fig. 3c,d) in a heldout test set. We note that objective measurements of sleep and sleep behaviors generally provide only modest predictive power for perceived sleep quality metrics. However, PH-LLM using the adapter significantly outperformed both prompt-based approaches in terms of AUROC and AUPRC (Extended Data Fig. 9 and Supplementary Tables 23 and 24). This performance increase is likely due to adapter-enabled LLMs being able to capture more signal from the training set than zero-shot and few-shot prompting, which see a very limited amount of training data37.
To assess how well PH-LLM performed compared to traditional machine learning approaches, we fit independent logistic regression models for each binary trait using the same encoded vector input. We found no statistically significant differences in performance between PH-LLM and logistic regression models for AUROC and only one statistically significant difference out of 16 outcomes for AUPRC (Extended Data Fig. 9 and Supplementary Tables 23 and 24). Furthermore, we hypothesized that, in a low-data regime, LLMs might leverage their prior knowledge to outperform logistic regression. However, our comparison of PH-LLM and logistic regression revealed no such advantage (Supplementary Fig. 8), likely due to the limited representation of wearable sensor data in LLM training corpora. Separately, to model nonlinear effects not captured by these logistic regression models, we also trained a convolutional neural network (CNN) on the full time-series sensor data to predict the same outcomes. However, due to the limited size of the existing dataset, the CNN model performed worse than the logistic regression model (Extended Data Fig. 9).
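A minimal sketch of the per-outcome logistic regression baseline, assuming the same aggregated sensor summaries used by the adapter as the feature matrix; hyperparameters here are illustrative defaults rather than the settings used in the study:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

def fit_logistic_baselines(X_train, Y_train, X_test, Y_test):
    """Fit one logistic regression per binary PRO outcome and report AUROC/AUPRC.

    X_*: encoded sensor summaries (for example, the 40-dimensional mean/variance vector);
    Y_*: (individuals x 16) binary outcome matrices.
    """
    results = {}
    for k in range(Y_train.shape[1]):                 # 16 binary PRO outcomes
        clf = LogisticRegression(max_iter=1000).fit(X_train, Y_train[:, k])
        scores = clf.predict_proba(X_test)[:, 1]
        results[k] = (roc_auc_score(Y_test[:, k], scores),
                      average_precision_score(Y_test[:, k], scores))
    return results
```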
We performed multiple ablation analyses to assess the effects of the sensor data context window length (number of days of sensor data), number of training data examples and data augmentation on PRO prediction performance. We observed that using only 5 days of sensor data led to lower AUROC and AUPRC metrics, whereas using 10 days or 15 days gave similar performance (Supplementary Fig. 9). Increasing the context window beyond 10 days is likely to provide little value because Patient-Reported Outcomes Measurement Information System (PROMIS) sleep survey questions refer specifically to the past 7 days. We also observed a significant performance improvement when increasing the number of training samples from 25% to 50% and a smaller improvement when increasing the number of training samples from 50% to 100% (Supplementary Fig. 10), indicating that PH-LLM’s performance may continue to improve with additional training data. Finally, we retrained PH-LLM after augmenting the adapter inputs with various levels of zero-mean Gaussian noise. Although the AUROC scores do not show consistent trends, the AUPRC scores from the model augmented with zero-mean Gaussian noise with 0.001 s.d. are generally higher, albeit with wide error bars (Supplementary Fig. 11), suggesting that data augmentation may modestly improve performance further.
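One simple way to realize the Gaussian noise augmentation described above is to append noisy copies of the standardized adapter inputs to the training set; whether the augmentation was applied as extra copies or as on-the-fly perturbation is not specified, so this is only a sketch:

```python
import numpy as np

def augment_with_noise(features, noise_std=0.001, n_copies=1, seed=0):
    """Append zero-mean Gaussian-perturbed copies of standardized adapter inputs.

    `features`: (samples x 40) standardized mean/variance summaries. Names and the
    copy-based strategy are illustrative, not the exact training-time augmentation.
    """
    rng = np.random.default_rng(seed)
    noisy = [features + rng.normal(0.0, noise_std, size=features.shape)
             for _ in range(n_copies)]
    return np.concatenate([features] + noisy, axis=0)
```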
Discussion
We developed an LLM finetuned from Gemini (PH-LLM) to perform a variety of tasks relevant to setting and achieving personal health goals in sleep and fitness. Our study showed that PH-LLM can answer technical multiple-choice questions (MCQs) in these domains with accuracy exceeding multiple experts. Furthermore, we showed that PH-LLM can integrate passively acquired objective data from wearable devices into personalized insights, potential causes for observed behaviors and recommendations to improve sleep hygiene and fitness outcomes. After finetuning from Gemini Ultra 1.0, which itself displays aggregate performance approaching that of experts in fitness, PH-LLM demonstrated significantly improved use of domain knowledge and personalization of relevant user information for sleep insights. Finally, we trained PH-LLM to use a multimodal encoder that natively integrates daily summaries of time-series health sensor data to predict subjective self-reported outcomes in sleep and achieved performance on par with specialized models to predict the same outcomes.
Multiple-choice examinations are commonly used benchmarks to assess general domain knowledge of LLMs1,2,3,4,26. Strong performance on MCQs can be considered a necessary, but not sufficient38, requirement for practical utility of LLMs in realistic settings. Through exhaustive evaluation of PH-LLM and comparison to multiple human experts and other models, we showed that, although leading external general-purpose LLMs exhibited slightly stronger performance on general sleep knowledge, PH-LLM comfortably demonstrates expert-level performance in both sleep and fitness MCQs, and finetuning toward these tasks did not degrade its performance on general medical benchmarks.
Open-ended long-form case studies represent key use cases for applications of LLMs to personal health features on wearable devices. In the present study, we created 857 case studies to assess sleep quality and fitness readiness for a workout and coupled the case studies with rigorous evaluation rubrics. We observed that the average performance over all case study responses was very high for all human experts, Gemini Ultra 1.0 and PH-LLM, underscoring the strong knowledge and reasoning capabilities of the Gemini model family.
However, finetuning Gemini Ultra 1.0 to create PH-LLM produced significant improvements in the model’s ability to provide insights and etiologies for sleep. For example, PH-LLM is better at using relevant domain knowledge, providing important interpretations of data and using important user data. The relatively lower performance gain observed for sleep recommendations may be driven in part by both the increased ambiguity of the task, which led to wider variety in training example content, and the increased challenge of requiring consistency across multiple model generations to provide an accurate recommendation that referenced real data.
In contrast, for fitness, both PH-LLM and Gemini Ultra 1.0 performed similarly. This pattern may be attributable to several factors. First, the base model’s stronger performance in fitness could reflect well-optimized prompting, minimizing potential gains from finetuning. Second, differences in training data could have influenced the performance. For example, our use of a standardized format for sleep analysis may have resulted in more specific training data, whereas the absence of a standardized guideline for fitness may have led to more generalized data that are more difficult to differentiate. Third, Gemini Ultra 1.0 may possess stronger baseline knowledge in fitness compared to sleep medicine.
PH-LLM was outperformed by Gemini Ultra 1.0 (and human experts) for a single section: the training load section of fitness case studies. This may be partially explained by the data generation process, in which multiple models were used to create candidate case study responses. Expert rater comments suggest that some lower-quality statements with incorrect use of domain and user data may be present in the training data, particularly with respect to rest periods. Moreover, because the fitness case studies incorporate sleep quality as one input, there is potential for further improvements in fitness by integrating full sleep case studies into the sleep section of fitness for a more detailed view of an individual’s rest status.
Additionally, we developed methods for automated evaluation of case studies and demonstrated their ability to be used as scalable proxy measures for expert human evaluation of LLM performance. Our best AutoEval models ranked study responses similarly to human experts, and these models obtained similar agreement measures with expert raters when compared using inter-rater concordance metrics. We used the best AutoEval models to compare PH-LLM to public external models and perform dataset ablation analyses. These analyses underscored the value of our expert-curated case studies, as PH-LLM significantly outperformed external models on the sleep insights and etiologies sections, and performance degraded when the model was trained on subsets of the data.
Beyond expert interpretation of wearable features, understanding the subjective experience of an individual can be useful for determining a comprehensive and personalized action plan. Indeed, subjective PROs are receiving greater attention in health management39,40,41. However, predicting subjective PROs from sensor information alone is difficult, with specialized models developed specifically for those tasks only reaching AUCs in the 0.55–0.75 range. We trained a multimodal adapter to enable PH-LLM to predict subjective PROs from aggregates of daily sensor information and showed that the native multimodal integration was necessary and sufficient to achieve performance on par with specialized discriminative models. Although for the standalone task of PRO prediction the simpler logistic regression model is much more efficient and interpretable, enabling PH-LLM to predict subjective wellbeing allows for integrating this capability directly into user-facing interactions. Additionally, by building the capability directly into PH-LLM, it may benefit from positive transfer learning toward additional out-of-distribution outcomes for which it was not trained, as shown previously8.
Our work has several limitations. First, expert evaluation of long-form text is inherently subjective for open-ended tasks such as coaching42, and the observed distribution of case study rubric ratings was skewed high, making differentiation across models and expert responses challenging. Although some case study sections and evaluation rubric principles showed significant differentiation, further training of expert raters to increase inter-rater reliability or adjudicating existing responses could increase the signal strength of model performance. Second, to minimize inter-rater variability, most case studies had a single expert rate all candidate responses. Although this made direct comparison of candidate responses straightforward, it introduced the potential for experts to identify expert versus model responses based on style or other non-material factors and, thus, introduce conscious or unconscious biases into ratings. Third, we observed that, despite improvements in referencing and integrating user data into insights, confabulations and incorrect referencing of user data still occasionally occurred. Preventing these issues will be critical to ensure the safe and effective deployment of these technologies into user-facing features. Fourth, the case studies were sampled broadly across demographics (sleep) or to identify common patterns in active individuals (fitness), and participants in the DWB study self-selected into the study, so these samples may not be representative of the population (Supplementary Table 25) nor exhaustively explore the sleep and fitness concerns and unmeasured contextual factors affecting individuals. In particular, race and ethnicity information was not collected for the case studies or the broader Fitbit research population, so these samples may be a biased subset of the US population. Furthermore, sampling to enrich for specific patterns of activity in fitness case studies skewed those case studies toward male participants and individuals aged 30–59 years, and the self-selection into the DWB study induced a skew toward female participants. Each of these distributional biases may affect generalizability of the results presented here for the entire US or global population. Fifth, we only considered textual representations of case study data input, owing to the absence of persisted raw time-series data in the dataset. Exploring the potentially richer temporal signals encoded in raw sensor data is an important area for future work. Sixth, our exploration of multimodal encoding of sensor data similarly covered only a small fraction of modeling possibilities, owing to the relatively small dataset with nearly complete sensor data and paired outcome data. Further exploration of self-supervised pretraining on raw waveforms and granularly aggregated sensor features may yield richer representations of individuals useful for personal health outcome predictions43 that expand beyond sleep metrics. Seventh, by using daily sensor measurements and imposing selection criteria for case studies, we largely avoided challenges of missing data that wearable sensor studies frequently face. Although our analyses stratified by data missingness and sensor data ablations suggest that PH-LLM is robust to missingness, further exploration is warranted. Eighth, a primary overarching goal for developing models specific to personal health is to improve long-term health outcomes through effective behavior change and maintenance of healthy habits, which this study did not directly evaluate.
Blinded user preference studies could provide important evidence of the model’s effectiveness and performance relative to expert coaching in the studied domains. Furthermore, we selected the three categories of tasks described herein as they represent important capabilities, but we recognize that there are many other dimensions that one could evaluate, and these remain important areas for future work. Finally, although the performance of PH-LLM on the tasks presented here is encouraging, we caution that much work remains to be done to ensure that LLMs are reliable, safe and equitable in personal health applications. Further reducing confabulations, considering an individual’s unique health circumstances not captured by sensor information alone and ensuring alignment of the training data with real-world distributions are a subset of important research areas that warrants further attention.
Despite these limitations, we have demonstrated here that Gemini-based models contain substantial health knowledge, and we can effectively finetune Gemini Ultra 1.0 to improve performance across multiple outcomes relevant for personal health. We provide a suite of benchmark tasks relevant for personal health coaching, contextualize performance through exhaustive comparison to human experts and highlight strengths and gaps in current LLMs for acting as coaching agents. The results from this study represent, in our estimation, an important step toward LLMs that deliver personalized information and recommendations that support individuals to achieve their health goals.
Methods
Owing to the absence of clearly defined language and multimodal datasets in the domain of personal health, we created datasets and associated tasks to evaluate different capabilities of PH-LLM. These datasets include professional examinations that test domain knowledge about sleep medicine and fitness, case studies about real-world coaching recommendations and PROs about sleep.
Professional examination dataset creation
Sleep medicine examinations: we compiled 629 MCQs from BoardVitals (https://www.boardvitals.com/) sleep medicine board review question banks. We used the American Medical Association (AMA) Physician’s Recognition Award (PRA) ‘Category 1 - Sleep Medicine’ question bank, which emulates examination content for the American Board of Internal Medicine (ABIM) Sleep Medicine Certification Exam. We also used the Sleep Medicine Maintenance of Certification (MOC) Exam and Longitudinal Knowledge Assessment Review question bank, which emulate examination content for the ABIM Sleep Medicine MOC Exam and the ABIM Longitudinal Knowledge Assessment. This compiled set of MCQs spanned a wide range of sleep-related topics: Normal Sleep and Variants (N = 127), Breathing Disorders (N = 84), Hypersomnolence (N = 60), Insomnias (N = 85), Movement Disorders (N = 23), Parasomnias (N = 57), Sleep in Other Disorders (N = 112) and Sleep-Wake Timing (N = 81).
A subset of 204 questions was selected for evaluation by sleep experts. Sampling was performed randomly to ensure similar distributions to the entire set of MCQs when stratified based on medical content categories (https://www.abim.org/Media/aypkdxpi/sleep-medicine.pdf) and their difficulty levels.
Fitness examinations: we compiled 99 MCQs sourced from multiple question banks that emulate examination content for the NSCA CSCS examination, based on preparation materials that the NSCA provides (https://www.nsca.com/certification/cscs/certified-strength-and-conditioning-specialist-exam-description). Specifically, we used the test examination questions from the NSCA-CSCS textbook Essentials of Strength Training and Conditioning.
Accuracy was used as the metric to evaluate the performance of our model in professional examinations, in line with previous work evaluating MedMCQA44. Each question presents up to five possible answers, with a single correct answer. MCQs were not used for model training. All samples were used in evaluation.
Coaching recommendations dataset creation
Each of the 857 case studies was sampled from anonymized Fitbit production data from individuals who provided consent for research purposes. Case studies were designed to interpret a range of physiological sensor information toward deriving insights, potential causes or recommendations for future behaviors.
Creation of case studies consisted of four main parts, each developed in close collaboration with domain experts: identifying the primary goals of the case studies, selecting case study input features, sampling individuals for inclusion in case studies and creating gold standard case study responses.
The domain experts in sleep and fitness were each divided into ‘primary’ and ‘secondary’ contributors to case study response creation and evaluation. This categorization was based on each expert’s general availability; ‘primary’ contributors had more involvement and higher workload volumes. The level of domain expertise was similar across the two groups. The sleep and fitness verticals were each led by a clinical expert in their respective fields. The clinical lead oversaw case study development and provided feedback and quality control to the set of domain experts.
Sleep case study dataset creation
We recruited six domain experts, all of whom possessed advanced degrees (MD, DO or PsyD) in sleep medicine and professional sleep medicine work experience ranging from 4 years to 46 years. All experts were trained to read and interpret wearable data and map outputs to their corresponding sleep medicine literature counterparts.
Primary goals: the sleep case studies aimed to enhance understanding of sleep patterns, identify causes of irregular sleep and offer actionable recommendations based on these findings, with the ultimate objective of enhancing the sleep quality of the individual.
Input feature selection: each sleep case study incorporated demographic information (age and self-reported gender), daily sleep metrics (for example, bedtimes, wake times and sleep stage durations) and aggregated sleep statistics (for example, average bedtime) (Fig. 2a). The daily sleep metrics contained up to 29 days of data across 16 metrics (Supplementary Table 26). The aggregated sleep statistics included multiple summary measures for each daily metric computed over all the days, the percentile of each aggregated metric relative to other individuals within the same demographic group and the 5th and 95th percentiles of that metric within the same demographic group. For some metrics, such as bedtime, the metrics were computed separately for all days, weekdays only and weekends only to understand weekday versus weekend patterns (Supplementary Table 27).
Sample selection: the dataset was sampled to enrich for non-missingness and achieve a representative group across age and self-reported gender (Extended Data Fig. 10a and Supplementary Table 25). We first limited the data to individuals with at least 28 sleep events within the 29-day window. We then sampled from 64 different demographic groups, determined by a combination of 32 age buckets (one bucket for ages 13–20 years, 2-year buckets spanning ages 20–80 years and one bucket for ages 80 years and above) and two gender buckets (male and female). Each case study (N = 507) was created from a unique individual. The distribution of the number of days for which data are available in the sleep case studies is presented in Supplementary Fig. 12a.
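For illustration, the 64 demographic sampling groups described above can be derived with a simple bucketing rule; the exact handling of boundary ages in the production pipeline may differ:

```python
def demographic_bucket(age, gender):
    """Assign an individual to one of the 64 sampling groups (32 age x 2 gender buckets).

    Bucket boundaries follow the description above (13-20 years, 2-year buckets from
    20-80 years and 80+ years).
    """
    if 13 <= age < 20:
        age_bucket = "13-20"
    elif 20 <= age < 80:
        lo = 20 + 2 * ((age - 20) // 2)
        age_bucket = f"{lo}-{lo + 2}"
    elif age >= 80:
        age_bucket = "80+"
    else:
        raise ValueError("age below study minimum")
    return f"{age_bucket}/{gender}"
```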
Gold standard response creation: experts were instructed to interpret the case study input data to craft guidance in the second-person narrative, fostering a direct and personalized dialogue with the user. The case study input data were presented to the experts in both graphical and tabular formats for ease of analysis (Fig. 2a and Supplementary Tables 26 and 27). The experts composed responses across the following sections:
Insights: implicitly, this section answered the question of ‘What are some sleep-related insights based on my data?’ The sleep medicine expert examined the data and provided an interpretation of whether a data point might represent an atypical sleep pattern. The experts were asked to systematically review each case to provide a holistic assessment of the user’s sleep patterns. To do so, Fitbit sleep metrics were assessed according to the validated Routine, Sleep Quality, Alertness, Timing, Efficiency and Duration (RU-SATED) framework to generate sleep insights45.
Etiology: implicitly, this section answered the question of ‘What are the possible underlying causes that could explain the observed data?’ The experts generally considered the contribution of circadian rhythm, homeostatic drive, psychophysiologic hyperarousal and extrinsic factors and indicated their likelihood.
Recommendations: implicitly, this section answered the question of ‘What can I do to improve my sleep?’ The experts were asked to provide personalized recommendations to the individual that can help them improve their sleep by addressing potential causes identified in the etiology section. The experts were instructed to create recommendations that are Specific, Measurable, Achievable, Relevant and Time-bound (SMART)46.
Fitness case study dataset creation
We recruited seven domain experts in fitness to analyze an individual’s quantitative fitness data. The seven fitness experts all possessed advanced degrees (MS, MA, MEd or DAT) related to the athletic training field and professional athletic training work experience ranging from 4 years to 25 years.
Primary goals: the fitness case studies were designed to provide a comprehensive analysis of an individual’s training load, sleep patterns and health metrics, with the ultimate objective of synthesizing the metrics into a data-driven assessment of the extent to which the individual is prepared for physical activity today and of providing associated recommendations.
Input feature selection: similar to the sleep case studies, fitness case study inputs spanned demographic information (age, self-reported gender, height, weight and body mass index (BMI)), wearable sensor daily values for cardiovascular training load, sleep and health metrics and aggregated statistics (Fig. 2b). The quantitative fitness data contain up to 30 days of data.
Sample selection: we enriched for samples containing sufficient activity for interesting training readiness analysis by considering individuals who logged more than 30 lifetime runs and had, within a contiguous 21-day window, mean active zone minutes of at least 45 minutes and, at most, 5 days of missing data for step counts, sleep, heart rate variability and resting heart rates. We further filtered to person-days with at least two logged exercises in the past 28 days and that contained noticeable recent changes in heart rate variability, resting heart rate, sleep, active zone minutes or run activity. Each case study (N = 350) targets a unique person-day combination, with N = 58 unique individuals in the dataset. The distribution of the number of available days for fitness case studies is presented in Supplementary Fig. 12b. The demographics of the individuals in fitness case studies are presented in Extended Data Fig. 10b and Supplementary Table 25 for age and gender and in Supplementary Fig. 13 for BMI, height and weight.
Gold standard response creation: experts were instructed to formulate insights, assessments and recommendations in the second-person narrative. The case study input data were presented to the experts in tabular, text and graphical formats. The experts composed responses across the following sections:
Demographics: the experts considered the demographics data (age, self-reported gender, height, weight and BMI) and commented on whether any precautions should be taken when recommending a fitness program.
Training load: the experts were provided with a detailed table capturing daily metrics over the past 30 days, including day of the week, date, minutes spent in fat-burn, cardio and peak zones, training impulse (TRIMP), which is a training load measure derived from heart rate and exercise duration, and number of steps (Supplementary Table 28). Daily TRIMP values were additionally visualized in a bar plot (Fig. 2b). Aggregate statistics were also provided, including means, ranges, acute TRIMP (the 7-day total training load), chronic TRIMP (the 28-day average of acute training load) and the acute:chronic workload ratio (ACWR), aiding the assessment of training stress; an illustrative computation of these load aggregates is sketched after the section descriptions below. Recent exercise logs and associated metrics were provided for each exercise entry (Supplementary Table 29).
Sleep metrics: the data included nine daily sleep metrics: day of the week, date, bedtime, wake time, sleep time, awake time, deep sleep duration, rapid eye movement (REM) sleep duration and the sleep score (Supplementary Table 30). Experts were also given aggregated metrics, including means, standard deviations and z-scores, indicating the difference in metrics between the most recent 3 days and the past 28 days to identify recent trends (Supplementary Table 31). Select sleep metrics were additionally visualized (Fig. 2b).
Health metrics: daily data on day of the week, date, resting heart rate, heart rate variability and respiratory rate were provided to assess recovery and stress (Supplementary Tables 32 and 33), along with graphical representations (Fig. 2b). The experts were also given aggregate metrics, such as means, standard deviations, ranges and z-scores, indicating the difference in metrics between the most recent day and the past 28 days (Supplementary Table 34), to gauge changes over time and assess recovery and stress levels effectively.
Assessment and recommendation: the experts integrated multiple data sources to evaluate workout readiness and provide personalized recommendations. They synthesized the most important insights from the previous sections along with simulated user feedback about their subjective state. The synthetically generated user feedback included a qualitative description of subjective readiness (for example, ‘feeling fatigued’) and muscle soreness (for example, ‘manageable soreness’) (Supplementary Tables 35 and 36). The synthetically generated user feedback was generated using Gemini 1.0 Pro (prompts are shown in Supplementary Tables 37 and 38). Based on all available information, the experts assessed the individual’s readiness to work out on a scale of 1 to 5 and provided specific fitness recommendations (Fig. 2b).
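As referenced in the training load description above, the acute and chronic load aggregates can be sketched from daily TRIMP values as follows, using one common convention for the rolling windows; the exact definitions used to build the case studies may differ in detail:

```python
import numpy as np

def training_load_summary(daily_trimp):
    """Compute acute load, chronic load and ACWR from daily TRIMP values.

    `daily_trimp`: daily TRIMP values, oldest first, covering at least 28 days.
    Acute load is taken as the most recent 7-day total and chronic load as the mean
    7-day total over the last 28 days.
    """
    daily_trimp = np.asarray(daily_trimp, dtype=float)
    acute = daily_trimp[-7:].sum()
    weekly_totals = np.convolve(daily_trimp[-28:], np.ones(7), mode="valid")  # rolling 7-day sums
    chronic = weekly_totals.mean()
    acwr = acute / chronic if chronic > 0 else float("nan")
    return {"acute_trimp": acute, "chronic_trimp": chronic, "acwr": acwr}
```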
Holistic view of case study creation
Sleep and fitness case studies were collected from distinct sets of individuals. For both the sleep and fitness verticals, we generated two sets of data: a dataset used for model training, validation and testing and a holdout dataset that was used only for final evaluation of the model by experts (Extended Data Fig. 1).
To generate the dataset used for training, validation and testing, we first prompted the Gemini family of models with the data for each section in order to generate baseline model (Gemini Ultra 1.0) responses (Extended Data Fig. 1a). The experts then reviewed and rewrote the responses as needed. The dataset also underwent multiple rounds of quality control, engaging the experts and clinical leads. Separately, to generate the holdout dataset, the experts wrote the responses from scratch (without any LLM assistance). This was done to ensure a clearer comparison between experts and the model during final evaluation.
In total, we created 507 case studies for sleep (457 case studies for the training, validation and test set and 50 case studies for the holdout set, all drawn from unique individuals) and 350 case studies for fitness (300 case studies drawn from 48 individuals for the training, validation and test set and 50 case studies drawn from 11 individuals for the holdout set). Data splits for fitness, which sample multiple case studies from single individuals, were created to ensure that person-months were unique between the train and holdout test set. Although a single individual contributes case studies with unique person-months to both the train and holdout splits (N = 5 and N = 3, respectively), excluding these holdout samples did not substantially impact holdout performance (Supplementary Fig. 6). Within each fitness split, case studies from a given individual were selected from distinct days such that target person-days were unique across samples.
Because each case study had multiple sections (three sections for sleep and five sections for fitness), in total we obtained 3,271 question–answer pairs (457 × 3 + 300 × 5 = 2,871 for train/validation/test and 50 × 3 + 50 × 5 = 400 for the holdout evaluation).
PRO dataset creation
To evaluate the ability of PH-LLM to predict PROs from longitudinal passive sensor data, we used a large institutional review board (IRB)-approved study in which wearable data were collected for a population of 10,099 consented individuals for a 4-week period28. At both intake and completion, participants were asked to complete two surveys: the PROMIS short-form Sleep Disturbance and Sleep Impairment surveys27, each containing eight questions with answers on a five-point Likert scale (Supplementary Tables 39 and 40). The study thus linked individuals’ perceived sleep quality and its impact on their functioning with longitudinal observed physiological (for example, heart rate and sleep duration) and behavioral (activity) measurements.
We used the completion survey responses for all 16 questions as the basis for prediction and, for each question, defined a binary outcome that compared the highest answer (for example, ‘strongly agree’) against all others (Supplementary Fig. 14). Features used to predict each binary outcome include 20 time-varying wearable measurements (Supplementary Table 41), each of which was collected from study participants over a 4-week span. Although most individuals have sensor data for over 21 days, distributions are heavily left-skewed (Supplementary Fig. 15). To obtain data points with complete labels and standardized input features, we filtered the dataset to retain only individuals who have all completion survey responses and at least 15 days of sensor data (N = 4,163 individuals). For each individual, we kept only the last 15 (possibly non-contiguous) days of sensor data, resulting in a 20 × 15 matrix that represents the wearable sensor data for each research participant over 15 days as well as a vector of 16 binary values as the prediction targets. We performed sensor data quality filtering by replacing values more than 4 s.d. from the population median with their nearest non-outlier values.
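A minimal sketch of the per-participant preprocessing described above (retaining the last 15 days of sensor data, handling outliers relative to population statistics and binarizing each survey answer); the names and the clipping-based outlier rule are illustrative approximations of the described pipeline:

```python
import numpy as np

def preprocess_participant(daily_sensors, answers, pop_median, pop_sd,
                           n_days=15, outlier_sd=4.0):
    """Build the 20 x 15 sensor matrix and 16 binary labels for one participant.

    daily_sensors: (days x 20) array of daily values, oldest first, with >= 15 rows.
    answers: 16 completion-survey responses on a 1-5 Likert scale.
    pop_median, pop_sd: per-sensor population statistics (length-20 arrays).
    Clipping to median +/- 4 s.d. approximates the described replacement of outliers
    with their nearest non-outlier values; reverse-coded items are not handled here.
    """
    x = np.asarray(daily_sensors, dtype=float)[-n_days:]   # last 15 (possibly non-contiguous) days
    x = np.clip(x, pop_median - outlier_sd * pop_sd, pop_median + outlier_sd * pop_sd)
    labels = (np.asarray(answers) == 5).astype(int)        # highest answer versus all others
    return x.T, labels                                     # (20 sensors x 15 days), (16,)
```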
In total, we created 4,163 PRO samples, randomly split into 2,914 (70%) training examples, 416 (10%) validation examples and 833 (20%) testing examples. Each sample corresponds to a unique individual and contains labels for 16 binary outcomes, resulting in 46,624 (=2,914 × 16) training, 6,656 validation and 13,328 testing data points with binary labels.
Base model selection
To train PH-LLM from the most capable base model, we performed automated evaluation of several Gemini candidate model sizes and a medical LLM on the professional examination questions. The candidate models were Gemini Nano 1.0, Gemini Pro 1.0, Gemini Ultra 1.0 (ref. 26) and Med-PaLM 2 (ref. 2).
Model prompting to create case study candidate responses
Because Gemini Ultra 1.0 was the most accurate Gemini family base model on professional examinations, suggesting that it has appropriate domain knowledge in the areas of sleep and fitness, we explored the performance of this model on case studies. We prompted Gemini Ultra 1.0 by summarizing guidelines given to the experts for dataset creation. For example, the sleep experts generally were asked to follow the RU-SATED format45 to generate sleep insights. To encourage Gemini Ultra 1.0 to generate high-quality case study responses, we similarly prompted it to follow the RU-SATED format and provided an explanation of what metrics should be used to assess each dimension (see Supplementary Tables 42–49 for details). We note that each case study consisted of multiple sections containing textual representations of aggregated sensor data that represent different queries and responses: three sections for sleep case studies (insights, etiology and recommendations) and five sections for fitness case studies (demographics, training load, sleep, health metrics and the assessment). Because each section represented a different aspect of the case study, we developed prompts specifically for each section (Supplementary Tables 42–44 for sleep; Supplementary Tables 45–49 for fitness). We represented aggregated sensor data as text to mimic observations of how the experts used and interpreted the case study data while creating gold standard responses. For sections that synthesized results from previous sections—that is, the etiology and recommendation sections in sleep case studies and the assessment section in fitness case studies—we substituted the model answers from previous sections into the prompt (Supplementary Table 49).
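The section-by-section prompting strategy, in which answers from earlier sections are substituted into the prompts for later, synthesizing sections, can be sketched as follows; the model interface, template fields and placeholders are illustrative rather than the exact prompts used (which are listed in Supplementary Tables 42–49):

```python
def generate_case_study(model, sensor_text, templates,
                        order=("insights", "etiology", "recommendations")):
    """Generate a multi-section case study, feeding earlier answers into later prompts.

    `model` is any callable LLM interface that maps a prompt string to a response;
    `templates` maps each section name to a prompt template with `{data}` and
    `{previous}` placeholders.
    """
    responses = {}
    for section in order:
        previous = "\n\n".join(f"{name}:\n{text}" for name, text in responses.items())
        prompt = templates[section].format(data=sensor_text, previous=previous)
        responses[section] = model(prompt)
    return responses
```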
Training PH-LLM on case studies
We finetuned Gemini Ultra 1.0 on the case studies to create PH-LLM. We used the case studies from the training, validation and test sets for model training and selection (457 case studies from 457 individuals for sleep and 300 case studies from 48 individuals for fitness). For each of the sleep and fitness domains, we randomly split the case studies into separate training, validation and test splits using a 70:15:15 ratio. We used the same text prompts that were given to the baseline model to form prompt–response pairs for model tuning. Because each section was treated as a separate example, this resulted in 1,371 prompt–response pairs for sleep and 1,500 prompt–response pairs for fitness across the training, validation and test sets (Extended Data Fig. 1a,b).
Typically, LLMs are trained on mixtures of tasks47. Here, we finetuned the model on a 1:1 mixture of sleep and fitness prompt–response pairs. Within the fitness prompt–response pairs, we upsampled higher-quality case studies by a 2:1 ratio, where higher-quality case studies were defined as those that underwent additional rounds of quality control by the fitness experts.
The Gemini Ultra 1.0 model, which is based on a transformer decoder26, was finetuned for 1,500 steps with a global batch size of four using linear warmup over 50 steps and cosine decay. Finetuning was performed by minimizing the cross-entropy loss between predicted and target token distributions. We used a learning rate of 2.5 × 10−7, a weight decay of 1 × 10−2 and a learning rate decay minimum ratio of 0.1. We saved model checkpoints every 50 steps. For our final model candidate, we chose the first checkpoint (step = 550) after the model had been trained for at least one epoch. At this point, the validation loss had started to plateau and stabilize (Supplementary Fig. 16).
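For reference, the stated schedule (linear warmup over 50 steps followed by cosine decay to a minimum ratio of 0.1 of the peak learning rate of 2.5 × 10−7 over 1,500 steps) corresponds to the following illustrative Python function; the actual tuning infrastructure is internal, so this is a sketch of the schedule only.

```python
import math

PEAK_LR = 2.5e-7
WARMUP_STEPS = 50
TOTAL_STEPS = 1_500
MIN_RATIO = 0.1  # learning rate decays to 10% of the peak value

def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to MIN_RATIO * PEAK_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_RATIO + (1.0 - MIN_RATIO) * cosine)
```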
Training PH-LLM for PROs
Using PH-LLM finetuned on case studies, we further finetuned an adapter for PH-LLM to predict PROs from wearable data. To do so, we used a subset of the Google DWB study dataset28, which collected both wearable data and self-reported outcomes, and we trained the model following the methodology developed in HeLM37. The wearable data of each individual consisted of 20 device measurements over 15 days and were passed to the model through both a textual prompt and the adapter. The textual prompt provided the age of the individual, listed the names of the 20 device measurements and their mean values across 15 days and asked the model to predict either ‘yes’ or ‘no’ for a specific binary outcome (Supplementary Table 50). For the adapter, we standardized the wearable data by z-scoring using the training data mean and s.d. for each wearable sensor. Then, for each individual, we computed the mean and variance of z-scored sensor values across 15 days for each measurement, yielding a standardized and aggregated 20 × 2 sensor data matrix. This matrix was projected into four ‘soft tokens’ in the PH-LLM token embedding space by an MLP adapter with an input layer of size 40; three hidden layers of sizes 1,024, 4,096 and 1,024, each followed by a rectified linear unit (ReLU) activation function; and an output layer of size 4 × 14,336 = 57,344, where 14,336 is the token embedding size of PH-LLM. Finally, these four ‘soft tokens’ were passed to PH-LLM as a prefix to the textual prompt.
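The adapter architecture can be summarized by the following PyTorch sketch. A reference implementation is provided in the code repository; this version is illustrative only and assumes PyTorch rather than the original training framework.

```python
import torch
import torch.nn as nn

NUM_SENSORS = 20
NUM_SOFT_TOKENS = 4
TOKEN_DIM = 14_336  # PH-LLM token embedding size

class SensorAdapter(nn.Module):
    """Projects aggregated wearable statistics into four soft tokens."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(NUM_SENSORS * 2, 1024), nn.ReLU(),
            nn.Linear(1024, 4096), nn.ReLU(),
            nn.Linear(4096, 1024), nn.ReLU(),
            nn.Linear(1024, NUM_SOFT_TOKENS * TOKEN_DIM),
        )

    def forward(self, sensor_stats: torch.Tensor) -> torch.Tensor:
        # sensor_stats: (batch, 20, 2) z-scored per-sensor means and variances.
        batch = sensor_stats.shape[0]
        tokens = self.mlp(sensor_stats.reshape(batch, -1))  # (batch, 57344)
        return tokens.reshape(batch, NUM_SOFT_TOKENS, TOKEN_DIM)
```

The resulting soft-token tensor is prepended to the embedded textual prompt before being passed to the frozen PH-LLM.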
The adapter was trained to minimize cross-entropy loss between the predicted and actual binary outcome value while keeping PH-LLM weights frozen. The adapter was trained for, at most, 30,000 steps using a global batch size of eight, a learning rate of 1 × 10−4 with linear warmup over 50 steps and cosine decay. The final checkpoint was selected as the one with the highest AUPRC (averaged over 16 binary outcomes) on the validation dataset.
We compared the predictions on the test dataset from PH-LLM with adapter to those from text-only PH-LLM using either zero-shot or few-shot prompting. For zero-shot prompting, the prompt was nearly identical to that for PH-LLM with the adapter, except that there was no adapter-based soft-token prefix and the prompt also listed the variance across 15 days for each raw measurement value. For few-shot prompting, the prompt additionally included three examples (the maximum number of complete examples that fit within the context window). For all three models, positive and negative outcomes were scored by computing the log likelihood of each outcome.
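A minimal sketch of the outcome scoring shared by all three configurations is shown below; `score_log_likelihood` is a hypothetical helper standing in for the model's scoring interface.

```python
import math

def predict_outcome(prompt: str, score_log_likelihood) -> tuple[str, float]:
    """Score the 'yes' and 'no' completions and return the predicted answer
    together with a normalized probability of a positive outcome."""
    log_yes = score_log_likelihood(prompt, "yes")
    log_no = score_log_likelihood(prompt, "no")
    p_yes = math.exp(log_yes) / (math.exp(log_yes) + math.exp(log_no))
    return ("yes" if p_yes >= 0.5 else "no"), p_yes
```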
We fit a logistic regression model separately for each binary outcome, using the same input as the PH-LLM adapter. We also trained a CNN using the z-scored wearable data of shape 20 × 15 (whose per-sensor means and variances are the input to the PH-LLM adapter and logistic regression models) as input. The CNN consists, in sequence, of two convolutional layers with kernel shapes (15, 1) and (1, 1), each followed by a batch normalization layer and a dropout layer with dropout probability 0.2; two feedforward layers with sizes 64 and 1; and a sigmoid activation at the end. Both convolutional layers have 16 filters, one-step strides and a ReLU activation.
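The following PyTorch sketch is consistent with the CNN baseline described above; the orientation of the 20 × 15 input (here, days as the first spatial dimension) and the ordering of activation, normalization and dropout are assumptions.

```python
import torch
import torch.nn as nn

class OutcomeCNN(nn.Module):
    """CNN baseline over z-scored wearable data (15 days x 20 sensors)."""

    def __init__(self, dropout: float = 0.2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(15, 1), stride=1), nn.ReLU(),
            nn.BatchNorm2d(16), nn.Dropout(dropout),
            nn.Conv2d(16, 16, kernel_size=(1, 1), stride=1), nn.ReLU(),
            nn.BatchNorm2d(16), nn.Dropout(dropout),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 1 * 20, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 15, 20) -> probability of the positive class, (batch, 1).
        return self.head(self.features(x))
```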
Evaluating PH-LLM on professional examinations
We evaluated model performance on the professional examination dataset using a combination of CoT prompting32 and self-consistency31. For our primary results, we generated five reasoning paths per question and selected the most common response as the final answer. For questions that did not have a unique modal response, we selected the answer option with the highest log probability as the final answer.
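This aggregation over sampled reasoning paths amounts to a majority vote with a log-probability tie-break, sketched below with illustrative variable names.

```python
from collections import Counter

def self_consistent_answer(samples: list[tuple[str, float]]) -> str:
    """samples: (answer_option, log_probability) pairs, one per sampled
    chain-of-thought. Returns the modal answer; ties are broken by the
    highest log probability."""
    counts = Counter(answer for answer, _ in samples)
    top_count = max(counts.values())
    modal = {answer for answer, count in counts.items() if count == top_count}
    if len(modal) == 1:
        return next(iter(modal))
    # No unique mode: fall back to the most confident sample among tied options.
    return max((s for s in samples if s[0] in modal), key=lambda s: s[1])[0]
```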
To assess the ability of PH-LLM to differentiate correct from incorrect answers based on confidence, we used a log probability analysis. For each multiple-choice examination question, we computed the model’s log probability for every answer option and applied softmax normalization to obtain a probability distribution over all answer options. For metric calculation, each answer option was treated as an individual prediction, comparing the model’s normalized log probability against a binary label (1 for a correct response, 0 otherwise). We then computed micro-averaged AUROC and micro-averaged AUPRC across the sleep and fitness examinations independently.
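A sketch of this calculation, assuming per-question arrays of option log probabilities, is shown below using scikit-learn metrics.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def micro_averaged_confidence_metrics(option_log_probs, correct_indices):
    """option_log_probs: list of 1-D arrays of per-option log probabilities,
    one per question. correct_indices: index of the correct option for each
    question. Returns micro-averaged (AUROC, AUPRC), treating every answer
    option as an individual prediction."""
    scores, labels = [], []
    for log_probs, correct in zip(option_log_probs, correct_indices):
        probs = np.exp(log_probs - np.max(log_probs))
        probs /= probs.sum()  # softmax normalization over answer options
        scores.extend(probs.tolist())
        labels.extend(int(i == correct) for i in range(len(log_probs)))
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```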
Expert grading of case study responses and consistency
Although evaluation against MCQs and PROs can be performed by comparing model predictions to gold standard structured responses and numerical values, respectively, the case studies involve longer-form outputs. To evaluate these model responses, the domain experts (including all experts involved in creating the case study responses) were asked to evaluate three responses written to each case study: one by Gemini Ultra 1.0, one by PH-LLM and one by a domain expert (Extended Data Fig. 1c). Of the 100 holdout case studies, 94 had all three responses rated by a single expert (45/50 in sleep and 49/50 in fitness); for the remainder, ratings were contributed by multiple experts. Each domain expert was randomly assigned to evaluate case studies for which they did not write the expert response and was blinded to the source of each response. The domain experts evaluated each case study response based on a custom rubric that quantifies incorporation of user data, appropriate personalization based on user data, use of expert domain knowledge, evidence of confabulations or unwarranted assumptions, potential for harm, readability and overall quality (Supplementary Table 9).
All holdout case studies were distributed across the primary group of experts; a portion was additionally assigned to the remaining available domain experts.
To assess inter-rater agreement and analyze differences between primary and secondary rater groups, we introduced overlap across case study rating assignments within each vertical, resulting in 78–1,428 paired ratings within a subset of raters. To evaluate inter-rater reliability, we employed several established metrics: raw counts of agreement, pairwise Spearman’s rank correlation, Kendall’s coefficient of concordance (Kendall’s W), weighted Cohen’s kappa and Gwet’s AC2. Spearman’s correlation and Kendall’s coefficient of concordance compare the ranks of expert ratings and may therefore be conservative in the presence of many ties, as in our data, in which ratings cluster around 4 and 5 (Extended Data Fig. 4). Weighted Cohen’s kappa measures agreement between raters adjusted for chance agreement; it ranges from −1 to 1, with 0 indicating agreement no better than chance. However, weighted Cohen’s kappa can exhibit paradoxical behavior (the ‘kappa paradox’), underestimating the true extent of agreement between raters34 in imbalanced datasets such as ours. Gwet’s AC1 and its weighted extension for ordinal data, Gwet’s AC2, were designed to handle class imbalance while adjusting for chance agreement34,35. Qualitative trends for all metrics were consistent across rater pairs in both verticals (Supplementary Figs. 17 and 18).
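Several of these agreement metrics are available in standard Python libraries; the toy example below illustrates the rank-based and kappa-based calculations on paired Likert ratings. The linear weighting scheme for Cohen’s kappa and the omission of Gwet’s AC1/AC2, which lack a scikit-learn implementation, are simplifications for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Toy paired Likert ratings from two raters over the same case study sections.
rater_a = np.array([5, 4, 5, 5, 3, 4, 5, 5])
rater_b = np.array([5, 5, 5, 4, 3, 4, 5, 5])

raw_agreement = float(np.mean(rater_a == rater_b))
rho, _ = spearmanr(rater_a, rater_b)
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"raw={raw_agreement:.2f}, spearman={rho:.2f}, weighted kappa={kappa:.2f}")
```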
Automatic evaluation of case study responses
Although expert grading of case study responses was our primary mechanism for assessing model performance, AutoEval allows us to obtain a quick, although potentially less accurate, signal by using secondary models to perform this grading task48.
AutoEval dataset: while exploring different modeling mechanisms, we performed an initial round of expert grading using the rubrics (Supplementary Table 9) and procedure described above for 50 expert-generated case studies from each vertical across three response sources: experts, an untuned Gemini Ultra 1.0 model and a finetuned Gemini Pro 1.0 model. We split these studies into vertical-specific training and validation splits containing approximately 80% (N = 38) and 20% (N = 12) of case studies, respectively. Splits were structured such that samples rated by a given expert were evenly distributed between sets. All ratings associated with a given case study were included in that split, resulting in N = 6,552 total ratings across case study sections and evaluation principles for sleep (N = 4,872 train and N = 1,596 validation) and N = 9,331 for fitness (N = 7,138 train and N = 2,193 validation).
AutoEval examples: using the rated case studies, we constructed LLM prompts and targets. Prompts included a description of the rating task objective for the given case study section, a summary of data describing the case study, the principle being assessed and the principle’s Likert scale options. Each target was the expert-generated rating followed by the rating’s Likert option text description (for example, for a ‘No incorrect domain knowledge’ principle rating of 5, the target is ‘5. No incorrect domain knowledge references exist’.) (Supplementary Table 51).
AutoEval model training: we created AutoEval rater models by finetuning the smaller Gemini Pro 1.0 model26 using LoRA49 across a variety of vertical-specific data mixtures, including all ratings for a vertical and all ratings from a single rater. All AutoEval modeling experiments used a fixed set of hyperparameters, varying only the training data mixture: a LoRA rank of 4 on attention heads, random Gaussian initialization for LoRA’s A matrix and zero initialization for LoRA’s B matrix. Models were tuned to minimize the cross-entropy loss between predicted and target token distributions using the Adafactor optimizer50 with a constant learning rate of 2 × 10−5, a β1 of 0.95, a decay of 0.995 and a gradient norm clipping threshold of 1.0. None of warmup, weight decay or quantization was used. Models were tuned for a maximum of 20 epochs with a global batch size of 32. Checkpoints were taken four times per epoch, and the best checkpoint was selected by minimizing validation set log perplexity. The total number of steps varied per model because each training mixture contained a different number of samples depending on the vertical and rater (Supplementary Table 52).
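Although the AutoEval raters were tuned with Gemini infrastructure, the stated LoRA configuration maps onto familiar open-source settings; the sketch below uses the Hugging Face PEFT library with placeholder attention-projection module names and is illustrative only.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                                  # LoRA rank on attention heads
    lora_alpha=4,                         # scaling factor (assumed)
    lora_dropout=0.0,
    init_lora_weights="gaussian",         # Gaussian init for A; B starts at zero
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholders
)
```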
AutoEval model selection: we evaluated four candidate AutoEval models for each vertical. An untuned Gemini Pro 1.0 model served as a baseline. The other three AutoEval candidates were trained identically using different data mixtures: ‘All’ contained all ratings in the vertical; ‘High variance’ contained ratings from only the expert rater with highest rating variance (raters ‘Sleep Primary C’ and ‘Fitness Primary C’); and ‘Low variance’ contained ratings from only the expert rater with lowest rating variance (raters ‘Sleep Primary D’ and ‘Fitness Primary B’). We generated model predictions by scoring the likelihood of each Likert option given the input prompt, converted these scores into five-class multinomial probabilities and chose the option with the largest probability score. We selected the best AutoEval model for each vertical using a combination of log perplexity loss and Spearman’s rank correlation between predictions and the ground truth ratings in the validation dataset (referred to as ‘best AutoEval’ models).
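The Likert scoring step can be sketched as follows; `score_log_likelihood` is a hypothetical helper returning the AutoEval model's log likelihood of a candidate option given the rating prompt.

```python
import numpy as np

def predict_likert_rating(prompt: str, options: list[str], score_log_likelihood):
    """Score each Likert option, normalize into five-class multinomial
    probabilities and return the 1-based index of the most probable option."""
    log_probs = np.array([score_log_likelihood(prompt, option) for option in options])
    probs = np.exp(log_probs - log_probs.max())
    probs /= probs.sum()
    return int(np.argmax(probs)) + 1, probs
```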
AutoEval inference: given case study candidate responses, we used the Likert scoring procedure described above to automatically rate outputs across case study sections and evaluation principles with our best AutoEval models.
Statistical analyses
Confidence intervals (95%) were determined via bootstrapping with 1,000 iterations. Statistical significance of expert ratings was determined using a two-sided Wilcoxon rank-sum test with false discovery rate (FDR) (Benjamini–Hochberg) correction when multiple sections or multiple evaluation principles were analyzed together. All P values refer to P values after FDR correction. For each P value, we report the test statistic Z and the effect size r = Z/√N, where N is the total sample size of the test.
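For clarity, this statistical procedure corresponds to the following sketch using SciPy and statsmodels; the inputs and grouping are illustrative.

```python
import numpy as np
from scipy.stats import ranksums
from statsmodels.stats.multitest import multipletests

def bootstrap_ci(values, n_iter=1000, seed=0):
    """95% confidence interval of the mean via bootstrapping."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_iter)]
    return np.percentile(means, [2.5, 97.5])

def compare_rating_groups(groups_a, groups_b):
    """Two-sided Wilcoxon rank-sum tests across sections/principles with
    Benjamini-Hochberg FDR correction and effect sizes r = Z / sqrt(N)."""
    z_stats, p_values, effect_sizes = [], [], []
    for a, b in zip(groups_a, groups_b):
        z, p = ranksums(a, b)
        z_stats.append(z)
        p_values.append(p)
        effect_sizes.append(z / np.sqrt(len(a) + len(b)))
    _, p_fdr, _, _ = multipletests(p_values, method="fdr_bh")
    return z_stats, list(p_fdr), effect_sizes
```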
Inclusion and ethics
All individuals who fulfilled the criteria for authorship required by Nature Portfolio journals have been included as authors. This work includes findings that are locally relevant, which have been determined in collaboration with local partners. Roles and responsibilities were agreed among collaborators ahead of the research. This research was not severely restricted or prohibited in the setting of the researchers. Ethics approval for the case studies was granted by the Western Institutional Review Board-Copernicus Group IRB, which classified the research as exempt under 45 CFR 46.104(d)(4). Ethics approval for the Digital Wellbeing Study was granted by the IRB of the University of Oregon. This research does not result in stigmatization, incrimination, discrimination or personal risk to participants. Local and regional research relevant to our study was taken into account in citations.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All input data and associated expert responses for all 857 case studies in sleep and fitness used for training and evaluation, along with all model responses to the 100 holdout case studies, are stored in JSON format and are publicly available at https://github.com/google-health/consumer-health-research/blob/main/phllm/data/sleep_case_studies.all.jsonl and https://github.com/google-health/consumer-health-research/blob/main/phllm/data/fitness_case_studies.all.jsonl, respectively. All input and target prediction data, which include bucketized age, daily sensor readings for 20 sensors over 15 days and 16 binary survey responses, for all 4,163 samples in the PRO task are stored in JSON format. To ensure privacy of participants, data used for the PRO task are available to approved researchers for non-commercial research after registration and attestation of a data use agreement. Requests should be made to google-digital-wellbeing-dataset@google.com and will be evaluated on a weekly basis. MCQs are available at the listed URLs for sleep (https://www.boardvitals.com) and fitness (https://www.nsca.com/certification/cscs/certified-strength-and-conditioning-specialist-exam-description). Source data are provided with this paper.
Code availability
All code provided with this paper is available in the phllm folder of https://github.com/google-health/consumer-health-research. This repository includes code for interaction with the case study datasets, automatic evaluation of model responses to case study inputs, query and evaluation of MCQ examinations, a reference implementation of the multimodal adapter for predicting PROs and analysis and data visualization tools. The text LLM underlying PH-LLM is Gemini Ultra 1.0.
References
Katz, D. M., Bommarito, M. J., Gao, S. & Arredondo, P. GPT-4 passes the bar exam. Philos. Trans. A Math. Phys. Sci. Eng. 382, 20230254 (2024).
Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
Nori, H. et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. Preprint at https://arxiv.org/abs/2311.16452 (2023).
Saab, K. et al. Capabilities of Gemini models in medicine. Preprint at https://arxiv.org/abs/2404.18416 (2024).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 642, 451–457 (2025).
Meyer, J. G. et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 16, 20 (2023).
Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit. Med. 6, 135 (2023).
Yang, L. et al. Advancing multimodal medical capabilities of Gemini. Preprint at https://arxiv.org/abs/2405.03162 (2024).
Galatzer-Levy, I., McDuff, D. & Malgaroli, M. The capability of large language models to measure and differentiate psychiatric conditions through 0-shot learning. Biol. Psychiatry 97, S62 (2025).
Sharma, A. et al. Cognitive reframing of negative thoughts through human-language model interaction. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Rogers, A. et al.) 9977–10000 (Association for Computational Linguistics, 2023).
Sharma, A., Rushton, K., Lin, I. W., Nguyen, T. & Althoff, T. Facilitating self-guided mental health interventions through human-language model interaction: a case study of cognitive restructuring. In Proc. CHI Conference on Human Factors in Computing Systems 1–29 (Association for Computing Machinery, 2024).
Lin, I. et al. IMBUE: Improving interpersonal effectiveness through simulation and just-in-time feedback with human-language model interaction. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics 810–840 (Association for Computational Linguistics, 2024).
Althoff, T. et al. Large-scale physical activity data reveal worldwide activity inequality. Nature 547, 336–339 (2017).
Chaput, J.-P. et al. Sleep timing, sleep consistency, and health in adults: a systematic review. Appl. Physiol. Nutr. Metab. 45, S232–S247 (2020).
Fogelholm, M. Physical activity, fitness and fatness: relations to mortality, morbidity and disease risk factors. A systematic review. Obes. Rev. 11, 202–221 (2010).
Wannamethee, S. G., Shaper, A. G. & Walker, M. Changes in physical activity, mortality, and incidence of coronary heart disease in older men. Lancet 351, 1603–1608 (1998).
Öhlin, B., Nilsson, P., Nilsson, J.-Å. & Berglund, G. Chronic psychosocial stress predicts long-term cardiovascular morbidity and mortality in middle-aged men. Eur. Heart J. 25, 867–873 (2004).
Zheng, N. S. et al. Sleep patterns and risk of chronic disease as measured by long-term monitoring with commercial wearable devices in the All of Us Research Program. Nat. Med. 30, 2648–2656 (2024).
Steinhubl, S. R., Muse, E. D. & Topol, E. J. The emerging field of mobile health. Sci. Transl. Med. 7, 283rv3 (2015).
Kao, C.-K. & Liebovitz, D. M. Consumer mobile health apps: current state, barriers, and future directions. PM. R. 9, S106–S115 (2017).
Gordon, W. J., Landman, A., Zhang, H. & Bates, D. W. Beyond validation: getting health apps into clinical practice. NPJ Digit. Med. 3, 14 (2020).
Merrill, M. A. et al. Transforming wearable data into health insights using large language model agents. Preprint at https://arxiv.org/abs/2406.06464 (2024).
Kim, Y., Xu, X., McDuff, D., Breazeal, C. & Park, H. W. Health-LLM: large language models for health prediction via wearable sensor data. Preprint at https://arxiv.org/abs/2401.06866 (2024).
De Zambotti, M. et al. State of the science and recommendations for using wearable technology in sleep and circadian research. Sleep 47, zsad325 (2024).
Xie, J. et al. Evaluating the validity of current mainstream wearable devices in fitness tracking under various physical activities: comparative study. JMIR Mhealth Uhealth 6, e9754 (2018).
Gemini Team Google. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).
Yu, L. et al. Development of short forms from the PROMIS™ sleep disturbance and sleep-related impairment item banks. Behav. Sleep Med. 10, 6–24 (2012).
McDuff, D. et al. The Google Health Digital Well-Being Study: protocol for a digital device use and well-being study. JMIR Res. Protoc. 13, e49189 (2024).
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. In Proc. 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (eds Inui, K. et al.) 2567–2577 (Association for Computational Linguistics, 2019).
Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).
Wang, X. et al. Self-consistency improves chain of thought reasoning in language models. In Eleventh International Conference on Learning Representations (ICLR, 2023); https://openreview.net/forum?id=1PL1NIMMrw
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. 36th International Conference on Neural Information Processing Systems 24824–24837 (Curran Associates, 2022).
Tam, T. Y. C. et al. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit. Med. 7, 258 (2024).
Gwet, K. L. Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 61, 29–48 (2008).
Gwet, K. L. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters (Advanced Analytics, LLC, 2014).
Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 46595–46623 (Curran Associates, Inc., 2023).
Belyaeva, A. et al. Multimodal LLMs for health grounded in individual-specific data. In Machine Learning for Multimodal Healthcare Data: First International Workshop https://doi.org/10.1007/978-3-031-47679-2_7 (Springer, 2023).
Fleming, S. L. et al. MedAlign: a clinician-generated dataset for instruction following with electronic medical records. In Proc. AAAI Conference on Artificial Intelligence 38, 22021–22030 (AAAI Press, 2024).
Marshall, S., Haywood, K. & Fitzpatrick, R. Impact of patient-reported outcome measures on routine practice: a structured review. J. Eval. Clin. Pract. 12, 559–568 (2006).
Dobrozsi, S. & Panepinto, J. Patient-reported outcomes in clinical practice. Hematology Am. Soc. Hematol. Educ. Program 2015, 501–506 (2015).
Lavallee, D. C. et al. Incorporating patient-reported outcomes into health care to engage patients and enhance care. Health Aff. 35, 575–582 (2016).
Liang, P. et al. Holistic evaluation of language models. In Transactions on Machine Learning Research (TMLR, 2023); https://openreview.net/forum?id=iO4LZibEqW
Abbaspourazad, S. et al. Large-scale training of foundation models for wearable biosignals. In Twelfth International Conference on Learning Representations (ICLR, 2024); https://openreview.net/forum?id=pC3WJHf51j
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Buysse, D. J. Sleep health: can we define it? Does it matter? Sleep 37, 9–17 (2014).
Doran, G. T. There’s a S.M.A.R.T. way to write management’s goals and objectives. Manage. Rev. 70, 35 (1981).
Wei, J. et al. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (ICLR, 2022); https://openreview.net/forum?id=gEZrGCozdqR
Chiang, C.-H. & Lee, H.-Y. Can large language models be an alternative to human evaluations? In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Rogers, A. et al.) 15607–15631 (Association for Computational Linguistics, 2023).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR, 2022); https://openreview.net/forum?id=nZeVKeeFYf9
Shazeer, N. & Stern, M. Adafactor: adaptive learning rates with sublinear memory cost. In Proc. 35th International Conference on Machine Learning (eds Dy, J. & Krause, A.) 4596–4604 (PMLR, 2018).
Acknowledgements
We thank the Fitbit research community participants for making this research possible. We thank the sleep and fitness experts who developed case study responses and evaluated candidate model responses for their dedication, effort and detailed feedback on multiple model iterations. Contributing sleep experts include B. Graef, T. Wong, T. Dang, S. Gorovoy, N. Krishnamurthy and M. Jonelis. Contributing fitness experts include J. Spraggins, A. Hetrick, J. Hannon, M. Knight, N. Dozier, L. Grissom and J. Leach. We thank H. Emir-Farinas, F. Hormozdiari and J. Barral for feedback and discussions that substantially improved the work. We also thank S. Lachgar, L. Winer, M. Shiels, L. Gardner, N. Tal, A. Um’rani, O. Adewunmi and A. Mathur for their valuable insights, technical support and feedback during our research.
Author information
Authors and Affiliations
Contributions
S. Patel, S.S., D.M. and C.Y.M. conceived the study. J.K., A.B., X.L., Z.Y., N.A.F., Y.P., J.C., L.D.S., T.A., D.M. and C.Y.M. had major involvement in designing various aspects of the study. J.K., A.B., X.L., Z.Y., N.A.F., C.L., E.S., Y.P., J.C., L.D.S., R.B., R.G.G., A.J., S.T., M.W., J.Y., D.M. and C.Y.M. had major involvement in data acquisition and ingestion. J.K., A.B., X.L., Z.Y., N.A.F., C.L., E.S., Y.P., J.C., L.D.S., R.G.G., S.T., M.W., D.M. and C.Y.M. generated synthetic data and initial case studies. J.K., A.B., X.L., Z.Y., N.A.F., C.L., Y.P., J.C., S.T., M.W., D.M. and C.Y.M. preprocessed data. J.K., A.B., X.L., Z.Y., N.A.F., C.L., Y.P., J.C., D.M. and C.Y.M. performed experiments. J.K., A.B., X.L., Z.Y., N.A.F., C.L., E.S., L.D.S., C.S., T.A., J.Z., S. Prabhakara, D.M. and C.Y.M. analyzed results. L.D.S., C.S., T.A., C.H. and J.H. provided clinical and domain expertise. E.S., A.J., R.L., Y.L., J.P. and J.Y. administrated the project. R.B., Y.L., J.K.R., M.M., L.S., Y.M., G.S.C., S. Patel, S.S., J.Z. and S. Prabhakara provided organizational support for the project. J.K., A.B., X.L., Z.Y., N.A.F., D.M. and C.Y.M. wrote and revised the initial version of the paper. All authors had access to the data and have edited, reviewed and approved the final version of the paper for publication. All authors accept responsibility for the accuracy and integrity of all aspects of the research.
Corresponding authors
Ethics declarations
Competing interests
This study was funded by Google LLC. All authors are employees of Alphabet and may own stock as part of the standard compensation package.
Peer review
Peer review information
Nature Medicine thanks Jessilyn Dunn, Yonghui Wu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Michael Basson, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Case study creation and curation, model training, and response evaluation workflow.
a, Case studies were selected from anonymized Fitbit production data from individuals who provided consent for research purposes. Two sets of case studies were generated: one set for model training, validation, and testing and a separate holdout set for final evaluation. To facilitate rapid development of high-quality answers, the train/validation/test set of case studies had candidate responses generated by Gemini, which were then edited and rewritten by domain experts. To enable comparison of human and model-derived responses, the holdout set had responses written solely by the domain experts. b, For model training, each case study was split into multiple prompt/answer pairs based on how many sections the case study had: N=3 for sleep with insights, etiology, and recommendations sections, N=5 for fitness with demographics, training load, sleep metrics, health metrics, and assessment sections (Methods). Gemini Ultra 1.0 underwent full fine-tuning using those examples to create PH-LLM. c, Expert evaluation was performed independently on the holdout dataset by the same set of domain experts responsible for generating the expert responses. For each case study in the holdout set, one or more experts who did not write the corresponding expert response graded the candidate responses (expert-written response, Gemini Ultra 1.0 response, and PH-LLM response), with 94 of the 100 case studies having all three candidate responses graded by a single expert.
Extended Data Fig. 2 Overall performance on sleep and fitness professional exams across PH-LLM, other Gemini models, GPT models, Claude 3 Opus, and Med-PaLM 2.
All Gemini model sizes are based on the Gemini 1.0 model family. The sleep and fitness exams comprised n = 629 and n = 99 questions, respectively. Data are presented as mean categorical accuracy over 1,000 bootstrapping iterations and error bars show 95% confidence intervals.
Extended Data Fig. 3 Pairwise Gwet’s AC2 measuring inter-rater reliability between primary and secondary raters.
Metrics were computed using all ratings for each principle and section across case studies rated by more than one rater in the sleep (a) and fitness (b) domains. The number of overlapping ratings is denoted by n. Mean metrics and 95% confidence intervals derived from 1,000 bootstrapping iterations are reported for each pair.
Extended Data Fig. 4 Contingency tables showing pairwise rating agreement between raters.
Counts are aggregated across all case studies, sections, and principles for each case study for which multiple ratings are available in the sleep (a) and fitness (b) domains. Blue, primary versus primary raters. Green, primary versus secondary raters. Yellow, secondary versus secondary raters.
Extended Data Fig. 5 Sleep and fitness case study human evaluation results by principle.
Mean ratings given by experts for different case study evaluation principles across all sections in the sleep (a) and fitness (b) domains. The principles are ordered according to the rubric presented in Supplementary Table 9. ‘*’ indicates a statistically significant difference (p < 0.05) using the two-sided Wilcoxon rank-sum test and multiple hypothesis testing correction. Error bars represent 95% confidence intervals bootstrapped over 1,000 iterations. Within each bar, n denotes the number of principle ratings per conversation source and circles show the proportion of scores at a given Likert rating.
Extended Data Fig. 6 Contingency tables showing pairwise rating agreement between our best AutoRaters, their corresponding expert raters, and other experts.
Counts are aggregated across all case studies, sections, and principles for each case study for which at least one rating from the AutoEval training rater is available in the sleep (a) and fitness (b) domains. Blue, the primary expert rater versus other raters. Green, the AutoEval model trained on primary expert ratings versus other raters. Yellow, the primary expert rater versus the corresponding AutoEval model.
Extended Data Fig. 7 Automatic evaluation of coaching recommendations across PH-LLM, baseline models, and human experts.
Mean ratings were generated using our best AutoEval models for the holdout case study subsections in the sleep (a) and fitness (b) domains. Within each section, a ‘*’ indicates a statistically significant difference (p < 0.05) from the top rated response type using the two-sided Wilcoxon rank-sum test and multiple hypothesis testing correction. Error bars represent 95% confidence intervals bootstrapped over 1,000 iterations. Within each bar, n denotes the number of principle ratings per conversation source and circles show the proportion of scores at a given Likert rating.
Extended Data Fig. 8 Effect of fine-tuning data scale on model performance in coaching recommendations.
Mean ratings were generated using our best AutoEval models for the holdout case study subsections in the sleep (a) and fitness (b) domains. ‘PH-LLM’ denotes standard performance while ‘Subsampled 25%’ and ‘Subsampled 50%’ denote responses from models trained on 25% and 50% of the training dataset, respectively. ‘Gemini Ultra’ denotes untuned baseline performance (that is, Gemini Ultra 1.0 trained on 0% of the training dataset). Within each section, a ‘*’ indicates a statistically significant difference (p < 0.05) from the top rated response type using the two-sided Wilcoxon rank-sum test and multiple hypothesis testing correction. Error bars represent 95% confidence intervals bootstrapped over 1,000 iterations. Within each bar, n denotes the number of principle ratings per conversation source and circles show the proportion of scores at a given Likert rating.
Extended Data Fig. 9 Performance of PH-LLM and traditional ML models on patient-reported outcomes prediction.
We compared the ability of PH-LLM with and without a multimodal adapter, logistic regression, and a convolutional neural network (CNN) to infer subjective patient-reported outcomes in the test set (n=833). a, Area under the receiver operating characteristic curve (AUROC). b, Area under the precision-recall curve (AUPRC). Data are presented as mean performance measures over 100 bootstrapping iterations and error bars show 95% confidence intervals. The CNN underperforms logistic regression, likely due to the limited size of the dataset.
Extended Data Fig. 10 Distributions of age and gender in case studies.
a, Sleep (N=507 individuals). b, Fitness (N=58 individuals).
Supplementary information
Source data
Source Data Fig. 2
Statistical source data for panels c and d: human expert ratings across conversation sources. Panels a and b show example case study data and do not include any statistical source data.
Source Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data: human expert ratings across conversation sources.
Source Data Extended Data Fig. 4
Statistical source data: human expert ratings across conversation sources.
Source Data Extended Data Fig. 5
Statistical source data: human expert ratings across conversation sources.
Source Data Extended Data Fig. 6
Statistical source data: human expert and AutoEval ratings across conversation sources.
Source Data Extended Data Fig. 7
Statistical source data: AutoEval ratings across conversation sources.
Source Data Extended Data Fig. 8
Statistical source data: AutoEval ratings across subsampled PH-LLM conversation sources.
Source Data Extended Data Fig. 9
Statistical source data.
Source Data Extended Data Fig. 10
Statistical source data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Khasentino, J., Belyaeva, A., Liu, X. et al. A personal health large language model for sleep and fitness coaching. Nat Med (2025). https://doi.org/10.1038/s41591-025-03888-0