Abstract
Background: Social media timelines contain rich signals of users’ mental states but are too voluminous for direct clinical review. Although large language models (LLMs) demonstrate robust linguistic and summarization capabilities in general-purpose tasks, distilling clinically relevant insights demands deeper psychological analysis and sensitivity to each individual’s unique personality and context. Accurately capturing subtle, personalized affective and behavioral patterns remains a significant challenge for current models. A thorough, systematic evaluation of LLM-generated clinical summaries is therefore essential to understand their readiness for real-world mental health monitoring.

Objective: This study evaluates the ability of an LLM-based pipeline to generate clinically meaningful summaries of social media timelines, compared with summaries written by human clinicians. The summaries are structured along 3 key clinical aspects: an overall mental health assessment, intrapersonal and interpersonal patterns, and mental state changes over time.

Methods: We use a recent state-of-the-art approach that combines a hierarchical variational autoencoder (VAE) with an LLM (Large Language Model-Meta AI 2 13-billion-parameter version; LLaMA2 13B). This method first summarizes the patient’s history using the VAE and then transforms this summary into a clinical narrative using the LLM. We also test both single-step and multistep LLM-prompting techniques and devise comprehensive clinical prompts. For 30 social media timelines, model outputs were evaluated against human-written summaries through human ratings and expert qualitative analysis. Linguistic diversity was automatically measured as a proxy for personalization.

Results: Human summaries scored highest for factual consistency (3.75) and general usefulness (3.63). The timeline-hierarchical variational autoencoder (TH-VAE) model outperformed LLaMA for factual consistency (3.35 vs 3.08) and general usefulness (3.28 vs 3.38). Both 2-step models were comparable to humans in describing interpersonal and intrapersonal patterns (3.45-3.48 vs 3.33) and changes over time (3.42 vs 3.35-3.30). The naive LLaMA baseline scored lower on all criteria except factual consistency. Furthermore, the qualitative analysis found that human summaries provided more accurate, deep, and personalized insights, while LLMs offered more exhaustive but generic descriptions. Quantitatively, linguistic diversity was higher in human summaries both at the semantic level (mean Cohen d=1.19) and at the surface level (mean Cohen d=1.31).

Conclusions: Medium-size LLMs can currently generate largely accurate and informative clinical summaries of social media timelines, and advanced prompting boosts performance modestly. However, they still underperform human clinicians in capturing subtle psychological nuances and individual idiosyncrasies. Future work should integrate domain-specific fine-tuning and enhanced context modeling to improve LLM clinical fidelity.
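The linguistic diversity results are reported as Cohen d effect sizes. As a reading aid, the sketch below shows one common way to compute Cohen d for two independent groups of scores, using the pooled sample standard deviation. The abstract does not state which variant of Cohen d or which diversity metric was used, so the pooled-SD formula, the function name `cohens_d`, and the toy scores are all assumptions made for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen d for two independent samples, using the pooled sample SD.

    Assumption: this is the classic pooled-variance form; the study may
    have used a different variant.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least 2 observations")

    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; Cohen d is undefined")

    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical diversity scores, invented purely for illustration.
human_scores = [2.0, 4.0, 6.0]
model_scores = [1.0, 3.0, 5.0]
effect = cohens_d(human_scores, model_scores)
print(f"Cohen d = {effect:.2f}")
```

A positive value means the first group scores higher on average. By the usual rule of thumb, values around 0.8 or above are read as large effects, so the reported means of 1.19 and 1.31 indicate a substantial gap in favor of the human summaries.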
| Original language | English |
|---|---|
| Article number | e71230 |
| Journal | JMIR Formative Research |
| Volume | 10 |
| State | Published - 2026 |
Keywords
- clinical summaries
- large language models
- monitoring mental health
- natural language processing
- social media
Title
Clinical Summaries of Social Media Timelines for Mental Health Monitoring: Human Versus Large Language Model Comparative Evaluation Study