
Automating PTSD Diagnostics in Clinical Interviews:
Leveraging Large Language Models for Trauma Assessments

Sichang Tu¹, Abigail Powers¹, Natalie Merrill¹, Negar Fani¹, Sierra Carter², Stephen Doogan³, Jinho D. Choi¹
¹Emory University, Atlanta, GA, USA
²Georgia State University, Atlanta, GA, USA
³Doogood Foundation, New York, NY, USA
{sichang.tu, abigail.d.powers, natalie.merrill, nfani, jinho.choi}@emory.edu
[email protected], [email protected]
Abstract

The shortage of the clinical workforce presents significant challenges in mental healthcare, limiting access to formal diagnostics and services. We aim to tackle this shortage by integrating a customized large language model (LLM) into the workflow, thus promoting equity in mental healthcare for the general population. Although LLMs have showcased their capability in clinical decision-making, their adaptation to severe conditions like Post-traumatic Stress Disorder (PTSD) remains largely unexplored. Therefore, we collect 411 clinician-administered diagnostic interviews and devise a novel approach to obtain high-quality data. Moreover, we build a comprehensive framework to automate PTSD diagnostic assessments based on interview contents by leveraging two state-of-the-art LLMs, GPT-4 and Llama-2, with potential for broader clinical diagnoses. Our results on this dataset illustrate strong promise for LLMs to aid clinicians in diagnostic validation. To the best of our knowledge, this is the first AI system that fully automates assessments for mental illness based on clinician-administered interviews.

1 Introduction

Mental health has become a vital element of overall well-being. However, the prevalence of mental illness poses a critical challenge to healthcare, underscoring the urgent need for an increased capacity of mental health services. Only 29% of people with psychosis receive formal care, leaving a significant portion completely untreated (WHO: World Health Organization (2021)). Aside from obstacles such as high costs, limited awareness, and stigma surrounding mental health, the shortage of the mental health workforce has been a major factor exacerbating this gap. According to the WHO, the average number of mental health workers per 100,000 population was 13, making it difficult for people to access reliable and readily administered mental health diagnostics, as well as subsequent support and interventions.

The emergence of Large Language Models (LLMs) has suggested innovative solutions to this challenge. Several studies have explored LLM applications in mental health for condition detection Zhang et al. (2022), support and counseling Ma et al. (2023b), as well as clinical decision-making Fu et al. (2023), and have shown the feasibility of LLMs enhancing the mental healthcare workforce Hua et al. (2024). By harnessing LLMs' ability to interpret language that requires high expertise, it is possible to mitigate the service gap in the healthcare ecosystem through the automation of condition detection and diagnosis, without the need to train a large number of professionals, which is both costly and time-consuming.

Despite these advancements, notable limitations persist in the current research on automatic diagnosis for mental health. Most studies have focused on prevalent conditions like stress Lamichhane (2023) and depression Qin et al. (2023), with scant attention to less common but more severe conditions like Post-traumatic Stress Disorder (PTSD). Moreover, while prior studies have leveraged data from social media, clinical notes, and electronic health records, very few have utilized clinical interviews, and even those rely on basic self-administered scales estimated in dialogues between computers and patients Galatzer-Levy et al. (2023). No work has employed systematically conducted diagnostic interviews between real clinicians and patients, resulting in a dearth of practical research on the automatic diagnosis of mental illness.

In this paper, we present an LLM-based system that listens to hours-long conversations between clinicians and patients and performs diagnostic assessments for PTSD. Our final model is evaluated by clinicians specialized in PTSD, suggesting a great potential for LLMs while highlighting certain limitations (Section 6). Our final model is publicly available through our open-source project at https://github.com/emorynlp/TraumaNLP. Our primary contributions are:

  • A new dataset comprising over 700 hours of interviews between clinicians and patients is created. Every interview consists of multiple diagnostic sections, featuring a series of questions and corresponding assessments from clinicians based on the interview contents (Section 3).

  • A novel and comprehensive pipeline is developed to process the interview dataset, so it can be used to build automatic assessment models on PTSD, which can be easily adapted to a broad range of diagnostic interviews (Section 4).

  • Assessment models achieving promising results are developed using two state-of-the-art LLMs, showcasing LLMs’ ability to answer diagnostic questions through information extraction and text summarization on the interviews (Section 5).

To the best of our knowledge, this is the first system designed to conduct diagnostic assessments on mental health while interpreting real-world interviews administered by clinicians. We believe that this work will foster clinical collaboration between human experts and Artificial Intelligence, thus promoting equitable access to appropriate care for all populations affected by mental illness.

2 Related Work

Pre-trained language models have been widely applied to many healthcare tasks Englhardt et al. (2023); Hu et al. (2023); Peng et al. (2023); Ma et al. (2023a); Liu et al. (2023a). The emergence of LLMs has introduced new capabilities and innovations to this domain (Nori et al., 2023; Cascella et al., 2023). This section reviews related research on LLMs and their applications in healthcare, particularly in mental health.

2.1 LLMs in Mental Health

The advent of LLMs like GPT (OpenAI, 2023), Llama (Touvron et al., 2023), and PaLM (Chowdhery et al., 2022) has sparked research into their applications in mental health (Ji et al., 2023). One key area is using conversational agents for mental health support and counseling, where LLMs excel at generating empathetic responses (Lai et al., 2023; Ma et al., 2023b; Loh and Raamkumar, 2023), highlighting their potential as digital companions or on-demand service providers. Additionally, the research on decision-support systems for novice counselors underscores their potential to enhance mental healthcare provision (Fu et al., 2023).

Research has also explored LLMs in disease detection and diagnosis (Zhang et al., 2022), focusing on issues like depression (Qin et al., 2023), stress (Lamichhane, 2023), and suicidality (Bhaumik et al., 2023). Closer to our work, Bartal et al. (2023) use text-based narratives from new mothers to assess childbirth-related PTSD with GPT and neural network models. Although GPT showed moderate performance, it holds promise for clinical diagnosis with further refinement. These studies typically use zero/few-shot prompting for binary or multi-label classification, demonstrating LLMs' capabilities in detecting mental health issues without fine-tuning, despite challenges like unstable responses, potential bias, and interpretation inaccuracies.

Some research has pivoted towards fine-tuning LLMs for domain-specific performance enhancement. Xu et al. (2023) present two fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperforming GPT-3.5 and GPT-4 in multiple mental health prediction tasks. Based on Llama-2, Yang et al. (2023) train MentaLLaMA on 105K social media data enhanced by GPT. The model performance is on par with other state-of-the-art methods, while providing interpretable analysis.

2.2 LLMs in Clinical Interview and Diagnosis

Research on applying LLMs to clinical interview data and diagnosis is limited. Wu et al. (2023) utilize GPT to augment the Extended Distress Analysis Interview Corpus by generating a new dataset from provided profiles and rephrasing existing data. The augmented data outperforms the original imbalanced data in PTSD diagnosis. Galatzer-Levy et al. (2023) adopt Med-PaLM-2 to predict Major Depressive Disorder (MDD) and PTSD from eight-item Patient Health Questionnaire and PTSD Checklist-Civilian version ratings.

Section | Questions | Variables | Example Question | Example Variable
LBI | 31 | 15 | What has been your primary source of income over the past month? | lbi_a1
THH | 39 | 20 | In the past, have you been treated for any emotional or mental health problems with therapy or hospitalization? | thh_tx_yesno
CRA | 17 | 20 | What would you say is the one that has been most impactful where you are still noticing it affecting you? | critaprobenotes
CAP | 241 | 92 | In the past month, have you had any unwanted memories of the [Event] while you were awake, so not counting dreams? | dsm5capscritb01trauma1_distress
Table 1: Statistics and examples for each of the four sections employed in this study.

3 PTSD Interview Data

This study utilizes data from diagnostic interviews administered as part of a larger study on risk and resilience for PTSD development in a population seeking medical care Gluck et al. (2021). Participants were recruited from waiting rooms in primary care, gynecology and obstetrics, and diabetes medical clinics at a publicly funded, safety-net hospital. Data were collected from 2012 to 2023, and inclusion criteria were ages between 18 and 65 with the capacity to provide informed consent. The parent study was conducted according to the latest version of the Declaration of Helsinki World Medical Association (2013), and consent from the participants was obtained after explaining the procedures. The informed consent was approved by our Institutional Review Board and Research Oversight Committee.

3.1 Participants

Participants were paid $60.00 for this interview and underwent semi-structured diagnostic interviews conducted by doctoral-level clinicians or doctoral students supervised by a licensed clinical psychologist on staff. A total of 411 interviews were conducted with 336 unique participants, some of whom had follow-up interviews after >1 month. 93.4% of the participants were women and 79.5% were Black or African American (mean age = 31.4); 38.7% had a high school education or less, and 57.9% reported a monthly household income of < $1,000.

3.2 Interview Procedures

The diagnostic interview begins with a section of the Longitudinal Interval Follow-Up Evaluation to assess global adaptive functioning across various psychosocial domains, including work, household, and relationship functioning, as well as general functioning and life satisfaction in the past month Keller et al. (1987). Videos of the interviews are recorded using online conferencing software such as Zoom and Microsoft Teams. Each interview lasts 1.5 hours on average, involving the participant and 1-2 interviewers.

3.3 Psychiatric Diagnoses and Treatment

A total of 10 sections are applied during the interview. Among them, 4 sections are administered to the majority of participants; thus, this study focuses on those 4 sections. The first two sections, the Life Base Interview (LBI) and the Treatment History & Health (THH), are internally designed to assess the history of psychiatric diagnoses and treatment, as well as the presence of suicidality. The other two sections, the Criterion A (CRA) and the Clinician-Administered PTSD Scale for DSM-5 (CAP), follow the standard diagnostic criteria for PTSD outlined in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; Weathers et al. (2018)). Every section is accompanied by a set of questions, linked to variables that store pertinent values derived from the corresponding answers. Table 1 shows statistics and examples for each of the 4 sections. Descriptions of all 10 sections are provided in Appendix A.

LBI

It assesses the participant’s functioning over the past month, addressing topics such as daily life, work, relationships with friends and family, and overall life satisfaction.

THH

It covers the participant’s treatment/health history, including past physical and mental conditions as well as treatments received, such as medication and therapeutic services.

CRA

It assesses whether the participant has been exposed to (threatened) death, serious injury, or sexual violence, with a focus on potential traumatic experiences the participant might have endured.

CAP

It centers on issues the participant may have encountered due to traumatic events, including distress, avoidance of trauma-related stimuli, negative thoughts and feelings, and trauma-related arousal.

4 Data Processing

Every video is converted into an MP3 audio file and transcribed by two automatic speech recognizers, whose results are aligned to produce a high-quality transcript. The transcript is segmented into multiple sections based on the relevant questions, and each question is paired with its assessment result.

4.1 Transcription

Two commercial tools, Rev AI (https://www.rev.ai) and Azure Speech-to-Text (https://bit.ly/42r24pA), and an open-source tool, OpenAI Whisper Radford et al. (2023), are tested for automatic speech recognition (ASR) on our dataset. Whisper gives the lowest Word Error Rate (WER; Klakow and Peters (2002)) of 0.13, compared to 0.21 and 0.16 from Rev AI and Azure, respectively. Whisper also exhibits better performance in handling noisy environments and numbers, which Azure often misses or inaccurately transcribes (Table 2). Despite its superior ASR performance, Whisper does not identify speakers, a feature found in the other tools. Thus, both Azure and Whisper are run on all audios, and their results are combined to obtain the best outcomes.

Tool | Examples
Azure | (1) I got 2020 on the 24 with three. Three will be 3 is turning 2116, one 15211. (2) They happened in 2017 and I’ll be 60 next month, so 5556 something like that.
Whisper | (1) I got two to be 20 on the 24th, well, three, three is turning 20, one 16, one 15, two 11. (2) That happened in 2017 and I’ll be 60 next month, so. 55, 56, something like that.
Table 2: Comparisons between Azure and Whisper transcripts, with equivalent tokens coded in matching colors.
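To make the ASR comparison above concrete, the following is a minimal sketch of transcribing one recording with Whisper and scoring it against a reference transcript with WER; it assumes the openai-whisper and jiwer packages and hypothetical file names, and is not the exact evaluation code used in this study.

import whisper  # openai-whisper
import jiwer    # Word Error Rate computation

# Transcribe one interview recording with Whisper (hypothetical file name).
model = whisper.load_model("medium")
hypothesis = model.transcribe("interview_001.mp3")["text"]

# Score against a manually corrected reference transcript (hypothetical file name).
reference = open("interview_001_reference.txt").read()
print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")  # Whisper averages 0.13 on our data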

4.2 Alignment

To map the speaker diarization (SD) output from Azure to the Whisper output, Align4D (https://github.com/emorynlp/align4d) is used such that the first and last words of every utterance in the Azure output are aligned to their corresponding words in the Whisper transcript, carrying the speaker information and forming a speaker turn that spans all words between those two words. Some words in the Whisper transcript may get left out from this mapping; these are combined with either the preceding or the following adjacent utterance using heuristics.
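The turn-forming step can be sketched as follows, assuming Align4D (or any word-level aligner) has already supplied, for each Azure utterance, the Whisper indices of its first and last words; the data structures and the merge heuristic are simplified illustrations rather than the exact implementation.

from typing import List, Tuple

def build_speaker_turns(
    whisper_words: List[str],
    azure_spans: List[Tuple[str, int, int]],  # (speaker, first word index, last word index)
) -> List[Tuple[str, str]]:
    """Project Azure speaker labels onto Whisper words to form speaker turns."""
    turns, covered_until = [], -1
    for speaker, first, last in sorted(azure_spans, key=lambda s: s[1]):
        start = max(first, covered_until + 1)
        # Heuristic: Whisper words left out between two turns are merged into the preceding turn.
        if turns and start > covered_until + 1:
            gap = " ".join(whisper_words[covered_until + 1:start])
            turns[-1] = (turns[-1][0], turns[-1][1] + " " + gap)
        turns.append((speaker, " ".join(whisper_words[start:last + 1])))
        covered_until = last
    return turns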

Text-based Diarization Error Rate (TDER; Gong et al. (2023)) is used for evaluating text-based SD, as it is more suitable than traditional metrics like WER or Diarization Error Rate (DER; Fiscus et al. (2006)). Transcripts from 29 audios produced by Microsoft Teams are used as the gold standard, since Teams identifies speakers via different audio channels with near-perfect SD. Our alignment method achieves a TDER of 0.56, a significant improvement over the TDER of 0.62 achieved by Azure alone.

4.3 Segmentation

Each interview is conducted through multiple sections comprising a series of questions (Section 3.3), yet recorded as one continuous video. It is crucial to segment the video into sections, each of which is split into sessions, where a session contains content relevant to a specific question. Here, a session is defined as a list of utterances where the first utterance includes the corresponding question, and it is followed by another session whose first utterance includes the next question (if it exists). Algorithm 1 describes how a section is matched in the transcript.

Input: U: a list of utterances, Q^c: a list of core questions.
Output: An ordered list of tuples comprising utterance IDs and their matching scores.
1: S ← similarity_matrix(U, Q^c)
2: T ← [max(S_{*,i}) : 1 ≤ i ≤ |Q^c|]
3: if average(T) > 0.6 and
4:    (|select(T, 0.8)| ≥ 3 or |select(T, 0.9)| ≥ 2) then return sequence_alignment(S)
5: return ∅
Algorithm 1: section_match(U, Q^c)

Let U be a list of utterances, and Q^c a list of core questions for a specific section (core questions are required for retrieving essential information, while optional questions depend on the answers to the core questions and so are often skipped during the interview). S ∈ ℝ^{|U|×|Q^c|} is created, where S_{i,j} is a similarity score between u_i ∈ U and q_j ∈ Q^c (L1). T ∈ ℝ^{|Q^c|} is then created by selecting the maximum similarity score for every question (L2). Given a function select(T, s) that returns a list of scores in T greater than s, the section is matched if T's average score is > 0.6 (L3) and if there exist at least 3 or 2 questions whose matching scores are > 0.8 or 0.9, respectively (L4). If the section is matched, Gong et al. (2023)'s sequence alignment algorithm is applied to S, which returns an ordered list of utterance IDs and their matching scores for questions in Q^c; otherwise, an empty list is returned (L5). In our case, Sentence Transformer is used to create embeddings for utterances and questions Reimers and Gurevych (2019), and cosine similarity is used to estimate the scores.
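The following is a condensed Python sketch of Algorithm 1; it assumes the sentence-transformers package with an arbitrary encoder choice, and abstracts the sequence-alignment step of Gong et al. (2023) behind a callback.

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def section_match(utterances, core_questions, sequence_alignment):
    """Return (utterance ID, score) pairs for the section if it is present, else an empty list."""
    U = encoder.encode(utterances, normalize_embeddings=True)
    Q = encoder.encode(core_questions, normalize_embeddings=True)
    S = U @ Q.T                  # cosine similarity matrix of shape |U| x |Qc| (L1)
    T = S.max(axis=0)            # best score per core question (L2)
    matched = (T.mean() > 0.6                                        # L3
               and ((T > 0.8).sum() >= 3 or (T > 0.9).sum() >= 2))   # L4
    return sequence_alignment(S) if matched else []                  # L5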

Overlap between the spans of two sections may occur due to incorrect matching. Algorithm 2 shows how to remove such overlaps (a Python sketch follows the listing below). Let Q^c_i be the list of core questions for the i'th section, and R_i = sm(U, Q^c_i) (sm: section_match). Given (R_1, R_2), R'_1 is created by taking the subset of R_1 whose utterance IDs exist in R_2 (L1), and R'_2 is created similarly (L2). If R'_1 contains more questions with scores > 0.6 than R'_2, implying that Q^c_1 is more likely matched to the overlapped span than Q^c_2, R'_2 is removed from R_2 (L4); otherwise, R'_1 is removed from R_1 (L5).

Input: R_1, R_2: ordered lists of tuples comprising utterance IDs and their matching scores for the first and second sections, respectively.
Output: (R_1, R_2): updated lists without overlaps.
1: R'_1 ← [(i, s) : ∀(i, s) ∈ R_1 ∧ (i, *) ∈ R_2]
2: R'_2 ← [(i, s) : ∀(i, s) ∈ R_2 ∧ (i, *) ∈ R_1]
3: if |select(R'_1, 0.6)| > |select(R'_2, 0.6)| then
4:    return (R_1, R_2 \ R'_2)
5: return (R_1 \ R'_1, R_2)
Algorithm 2: remove_overlap(R_1, R_2)
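A direct Python transcription of Algorithm 2 could look like the sketch below, assuming each R_i is represented as a list of (utterance ID, score) tuples.

def remove_overlap(r1, r2, threshold=0.6):
    """Drop overlapping utterances from the section that matches them less confidently."""
    ids1, ids2 = {i for i, _ in r1}, {i for i, _ in r2}
    o1 = [(i, s) for i, s in r1 if i in ids2]  # R'_1 (L1)
    o2 = [(i, s) for i, s in r2 if i in ids1]  # R'_2 (L2)
    if sum(s > threshold for _, s in o1) > sum(s > threshold for _, s in o2):  # L3
        return r1, [t for t in r2 if t not in o2]                              # L4
    return [t for t in r1 if t not in o1], r2                                  # L5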

Finally, Algorithm 3 shows how session spans are found for a specific section. C^e is a list of tuples comprising utterance IDs and their scores for the k'th section, created by Algorithms 1 and 2 (L1) (ro: remove_overlap). C^ℓ is created in the same manner, except adopting the Levenshtein Distance (LD) as the similarity metric Levenshtein (1966) (L2). sel(C, s) returns a list of tuples comprising utterance IDs and their matched question IDs whose scores are > s. last(U, Q^c_*) returns the first utterance ID of the (k+1)'th section if it exists; otherwise, it returns the last utterance ID of U. C is created by taking the intersection of C^e and C^ℓ whose scores are > 0.8 and 0.7, respectively, together with the last utterance ID (L3). Any section not matched by Algorithm 1 is considered absent.

For each span U′ of utterances between C_i and C_{i+1} (exclusive at both ends), a list Q′ of optional questions related to C_i is created (L5-7). T^e is a list of tuples comprising utterance IDs in U′ and their matched question IDs in Q′ with scores > 0.8, and T^ℓ is created using LD (L8-9). The intersection of T^e and T^ℓ is appended to a list O (L10), which is then merged with C and sorted to produce V (L11).

For each span U″ between V_i and V_{i+1}, a list Q″ of any questions that have not been matched in that span is created (L14). Bipartite matching between U″ and Q″ is performed to find matches optimizing several criteria in Appendix B.1 (L15); the results are accumulated, merged, and sorted to produce the final list (L16-17).

Input: U: a list of utterances, Q^{c|o}_{1..4}: lists of core|optional questions for the 1..4'th sections, k: the index of the section to segment into sessions.
Output: An ordered list of tuples comprising utterance IDs and their matched questions (session boundaries).
1: C^e ← ro(sm^e(U, Q^c_k), sm^e(U, Q^c_{∀j≠k}))
2: C^ℓ ← ro(sm^ℓ(U, Q^c_k), sm^ℓ(U, Q^c_{∀j≠k}))
3: C ← (sel(C^e, 0.8) ∩ sel(C^ℓ, 0.7)) ∪ last(U, Q^c_*)
4: O ← ∅
5: for i ← 1 to (|C| − 1) do
6:    U′ ← a list of utterances between C_i and C_{i+1}
7:    Q′ ← a list of questions in Q^o_k related to C_i
8:    T^e ← sel(sm^e(U′, Q′), 0.8)
9:    T^ℓ ← sel(sm^ℓ(U′, Q′), 0.7)
10:   O ← O ∪ (T^e ∩ T^ℓ)
11: (V, W) ← (sorted(C ∪ O), ∅)
12: for i ← 1 to (|V| − 1) do
13:   U″ ← a list of utterances between V_i and V_{i+1}
14:   Q″ ← a list of questions in Q^c_k ∪ Q^o_k that fall between V_i and V_{i+1}
15:   T ← the best bipartite matching between U″ and Q″ optimizing the criteria in Appendix B.1
16:   W ← W ∪ T
17: return sorted(V ∪ W)
Algorithm 3: session_match(U, Q^c_{1..4}, Q^o_{1..4}, k)

4.4 Assessment Pairing

Answers to the questions are used to determine the values of the variables (Table 1), resulting in many-to-many relations between questions and variables (many-questions to one-variable is the most common case). Our data comprises five variable types. (1) Scale assesses on an ordinal scale with ratings for intensity, severity, or likeness. (2) Category selects among binary choices or distinct class labels. (3) Measure captures various units such as duration, frequencies, and ages. (4) Notes are summarized texts documented by the interviewers. (5) Rule is calculated based on predefined rules derived from the other variable types. Table 3 shows the statistics of all variables for each section in our dataset.

Type | LBI | THH | CRA | CAP | Total | Count
Scale | 7 | 1 | 0 | 40 | 48 | 9,722
Category | 4 | 9 | 15 | 3 | 31 | 4,258
Measure | 2 | 0 | 1 | 24 | 27 | 3,482
Notes | 1 | 10 | 3 | 0 | 14 | 1,146
Rule | 1 | 0 | 1 | 25 | 27 | 6,326
Table 3: Statistics of the five types of variables. Examples of these variables are provided in Appendix B.2.
VT | Template
S&C | [INTRO]. Based on the patient’s interview history, please determine {keywords} that the patient {symptom}. [RETURN]. [REASON]. The "answer" should be in the range {range}.{attributes}
M | [INTRO]. Based on the patient’s interview history, please calculate {keywords} that the patient have {symptom}. [RETURN]. [REASON]. The "answer" should be {type}.
N | [INTRO]. Based on the formatted data from patient’s interview, please determine whether or not the formatted data includes this specified information {single_slot}. [RETURN]. The "reason" gives a brief explanation on whether the formatted data includes or omits the information. The "answer" should be either "yes" or "no", indicating the presence or absence of the information in formatted data.
Table 4: Instruction templates for Scale, Category, Measure, and Notes variables. VT: Variable type, [INTRO]: Imagine you are a professional clinician, [RETURN]: Return the answer as a JSON object with "reason" and "answer" as the keys, [REASON]: The "reason" should provide a brief justification or explanation for the answer.

5 Experiments

5.1 Dataset

The original data contains 411 interviews (Section 3). Whisper tends to generate irrelevant or repetitive sequences when prolonged silences occur, rendering approximately 20% of the resulting transcripts unusable. To address this issue, silence removal and noise cancellation techniques are applied, recovering approximately 80% of them. Among the 393 successful transcripts, 322 have human assessments (§4.4), which are used to evaluate our approach (Table 5).

Set | Audios | Hours | Turns | Tokens
Original | 411 | 703 | 116,501 | 6,035,027
Transcribed | 393 | 651 | 90,174 | 5,499,662
Evaluation | 322 | 515 | 71,412 | 4,335,977
Table 5: Statistics of our PTSD interview dataset.
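The silence-removal step mentioned above could be sketched as follows, assuming the pydub package; the threshold values are illustrative rather than the exact settings used.

from pydub import AudioSegment
from pydub.silence import split_on_silence

def strip_long_silences(in_path: str, out_path: str) -> None:
    """Remove prolonged silences that cause Whisper to produce repetitive or irrelevant text."""
    audio = AudioSegment.from_mp3(in_path)
    chunks = split_on_silence(
        audio,
        min_silence_len=2000,             # only cut silences longer than 2 seconds (assumed)
        silence_thresh=audio.dBFS - 16,   # silence threshold relative to average loudness (assumed)
        keep_silence=500,                 # keep a short pause at each chunk boundary
    )
    cleaned = sum(chunks, AudioSegment.empty())
    cleaned.export(out_path, format="mp3")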

Compared to other interview datasets (statistics of the comparison are provided in Appendix C.1), our dataset is the largest in the mental health domain. While existing datasets often involve human-machine dialogues or crowdworker simulations, ours consists of formal diagnostic interviews conducted entirely by clinicians, making it the first clinician-administered interview dataset. Additionally, our dataset aims to generate comprehensive diagnostic reports rather than just single scores, providing a more detailed resource for clinical practice.

5.2 Large Language Models (LLMs)

The state-of-the-art commercial and open-source large language models, GPT-4 and Llama-2 Touvron et al. (2023), are adapted for our experiments. (Specific versions, parameters, and costs for these large language models are provided in Appendix C.3 and C.4.) For each question, a model takes all sessions related to the variable to which the question pertains (§4.4), along with an instruction to provide the answer and explanation. Table 4 shows our templates, including replaceable patterns, used to generate the instruction for each variable type. For Scale, {keywords} can be replaced with "how severe", and {symptom} with "have unwanted dreams in the past month". For Category, {keywords} can be replaced with "which of the following categories best describes", and {symptom} with "usual employment status". To constrain the answer generated by the model, details such as the answer {range} for S&C and the value {type} for Measure are incorporated. Scale has a special pattern {attributes}, directing the model to return a particular score under certain conditions.
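To make the templates concrete, the sketch below fills the Scale template from Table 4 for the unwanted-dreams example above; the range description and attribute text are illustrative placeholders, not the exact strings used in our prompts.

# Scale template from Table 4, with [INTRO]/[RETURN]/[REASON] expanded per the table caption.
SCALE_TEMPLATE = (
    "Imagine you are a professional clinician. "
    "Based on the patient's interview history, please determine {keywords} "
    "that the patient {symptom}. "
    'Return the answer as a JSON object with "reason" and "answer" as the keys. '
    'The "reason" should provide a brief justification or explanation for the answer. '
    'The "answer" should be in the range {range}.{attributes}'
)

instruction = SCALE_TEMPLATE.format(
    keywords="how severe",
    symptom="have unwanted dreams in the past month",
    range="0 (None) to 4 (Extreme)",                                  # illustrative range description
    attributes=" If the symptom is denied, the answer should be 0.",  # illustrative attribute
)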

Assessing model performance for Notes poses a challenge, as the outputs must be compared against text summarized by interviewers. Given the complexity of this task, it is decomposed into multiple subtasks of binary classification, information extraction, and categorization by adapting Chain-of-Thought Wei et al. (2023). First, GPT is asked to generate a list of slots for each Notes variable, based on a batch of summary notes from interviewers. Because many of these slots have similar meanings, albeit varying in naming, GPT is again asked to cluster them. The clusters generated by GPT are manually refined, resulting in final grouped slots that cover 95+% of the initial generation. For each of these slots, an LLM is tasked with determining whether relevant content for the slot is present in the provided sessions. (Appendix C.5 gives slot examples for Notes variables.)

Type | Count | Accuracy (GPT-4) | Accuracy (Llama-2) | RMSE (GPT-4) | RMSE (Llama-2) | Bias (GPT-4) | Bias (Llama-2) | Recall (GPT-4) | Recall (Llama-2)
Scale | 9,722 | 58.9 | 46.7 | 1.10 | 1.63 | -0.04 | 0.51 | - | -
Scale_g | 9,722 | 67.3 | 59.0 | 0.85 | 1.01 | -0.04 | 0.51 | - | -
Category | 4,258 | 77.2 | 63.6 | - | - | - | - | - | -
Measure | 3,482 | 64.4 | 56.5 | - | - | -0.34 | -0.004 | - | -
Notes | 1,146 | - | - | - | - | - | - | 48.1 | 52.7
Rule | 6,326 | 68.4 | 59.8 | 0.80 | 0.92 | -0.15 | 0.44 | - | -
Table 6: Model performance on all variable types (§4.4) using four evaluation metrics (§5.4).

5.3 Zero-shot vs. Few-shot Settings

Zero-shot and few-shot settings are tested across all variable types (Appendix C.2 gives details on the zero/few-shot settings). For Scale, two few-shot settings are explored: one including an example for a single scale point, and the other covering examples for all scale points. For the GPT model, few-shot settings mostly outperform zero-shot settings in predicting Category, Measure, and Notes variables. For Scale, the few-shot setting with a single example results in the lowest performance, whereas the few-shot setting including examples for all scale points shows a slight improvement in model performance. Thus, few-shot settings are used for all experiments with GPT. In contrast, the Llama model consistently yields inferior outcomes with few-shot settings compared to zero-shot settings, leading us to adopt zero-shot settings for all Llama experiments.

5.4 Evaluation Metrics

Since each variable type is uniquely defined, different evaluation metrics are employed accordingly. Accuracy is computed for all types except Notes. For Notes, since the model identifies the presence of information in the provided sessions based on predefined slots, Recall is used as the primary metric to gauge the coverage of relevant information detected by the model. For Scale, the Root Mean Square Error (RMSE) and Bias evaluation are used. RMSE quantifies the magnitude of errors, whereas Bias evaluation calculates the proportion of positive and negative residuals, thereby revealing any directional bias in the model predictions.
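A minimal sketch of the four metrics as used here; in particular, the Bias formula below (the signed proportion of over- versus under-predictions, ranging from -1 to 1) is one direct reading of the description above rather than a verbatim reproduction of our evaluation code.

import numpy as np

def accuracy(gold, pred):
    return float(np.mean(np.asarray(gold) == np.asarray(pred)))

def rmse(gold, pred):
    g, p = np.asarray(gold, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((p - g) ** 2)))

def bias(gold, pred):
    # Signed proportion of residuals: 1 = always over-predicts, -1 = always under-predicts.
    g, p = np.asarray(gold, float), np.asarray(pred, float)
    return float(np.mean(p > g) - np.mean(p < g))

def recall(gold_present, pred_present):
    # For Notes: fraction of gold slots whose information the model marks as present.
    g, p = np.asarray(gold_present, bool), np.asarray(pred_present, bool)
    return float((g & p).sum() / g.sum())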

5.5 Results

Table 6 gives the results for each variable type. For Scale, an additional evaluation is conducted for CAP, whose original scaling ranges from 0 to 4, where 0 indicates the absence of symptoms, 1 denotes minimal symptoms, and 2+ are considered symptoms that meet or exceed the threshold for clinical significance. To reflect this clinical demarcation, scale points are categorized into three scale groups, 0, 1, and 2+, and evaluated as Scale_g. (Appendix C.6/C.7 presents results for each section/variable.)

GPT consistently shows significantly higher accuracy, averaging 10.5% more across all types than Llama, and reaches an accuracy of 68.4% for Rule, which accumulates the outcomes of the other types. Regarding RMSE, GPT exhibits an error rate of 0.8 for Rule using the results of Scale, implying that it is less than one scale point off from human judgment on average. In terms of Bias, ranging from -1 (completely biased toward negative) to 1 (completely biased toward positive), GPT displays a marginal negative bias for Scale, while Llama shows a strong positive bias, implying that GPT is a bit conservative in predicting higher scale points, whereas Llama tends to overestimate. GPT underestimates more than Llama for Measure, however, showing a slight negative bias of -0.15 for Rule. For Notes, Llama exhibits better performance, with a recall of 52.7%, than GPT, suggesting that Llama is more effective in retrieving relevant information. Considering that these models are not fine-tuned on our data, this level of performance is very promising, as a robust model for practical use could be achieved with further training.

6 Error Analysis

Type | History | Gold | Auto (GPT) | Auto (LM)
MR | Have you had any physical reactions when something reminded you of what happened? I had a horrible headache. How many times in the past month has that happened? Those two times. How long did it take you to sort of feel back to normal? I swear. It took me a minute. I got up. I got a glass of water. It took me about. I say two to three hours. So how bad was that Headache? Do you think there are any other symptoms? It was extremely. I never had. I had it like that. | 4 | 3 | 2
FN | can you think about like how often that might happen in the last month about? I feel like about like five times a week. | 5 | 20 | 20
EI | when did those start for you? So, since around age 12, at least yeah yeah because it took me a long time to really trust my stepfather. | 480 | NA | 108
TE | how satisfied and fulfilled have you felt about your life, with zero being like not at all, couldn’t have a worse life, and 10 being perfect, couldn’t have a better life? I would say a C, because it’s a lot more things that I want to do to be at a 10. | 2 | 3 | 3
SM | So how many times in the past month would you say some things made you upset that reminded you of it? Rarely, maybe like two, three times? Very rarely. | 2 | 1 | 1
CR | thinking about your work in the past month, how have you been doing? It’s a normal, consistent, um, it’s a normal, consistent routine where I do the same thing, do the same thing every day. | 40 | NA | 40
Table 7: Examples of the six error types. MR: Misaligned Reasoning, FN: False Negative, EI: External Information, TE: Transcription Error, SM: Session Mismatching, CR: Commonsense Reasoning. Gold: clinician’s answers, Auto: model-predicted answers. Ext: errors caused by external factors, not LLMs. NA: the model predicts None. Clinician’s questions are highlighted in blue. Patient’s key information to the questions is highlighted in red.

A thorough error analysis is conducted by proportionally sampling 100+ examples per variable type. Six types of major errors are identified (Table 7), with only two attributed to LLMs and the remainder caused by external factors, implying that the true LLM performance may be even higher.

Misaligned Reasoning

One predominant error type occurs when models deviate from instructions of the rating scheme, presenting seemingly logical reasoning, although it ultimately leads to incorrect conclusions. In Table 7, both models fail to align the key term provided by the participant, extremely, with the definition of score 4 - “Extreme, dramatic physical reactivity”. Llama tends to deviate further than GPT, resulting in a higher RMSE.

False Negatives

This is a major error type caused by:

  1. Inaccurate assessments by clinicians. In Table 7, the participant reports five times a week, yet the clinician incorrectly records the monthly frequency as 5, when it should have been 20 times a month.

  2. Ambiguity in Scale, where answers may fall between two scale points, resulting in potentially valid model predictions being marked incorrect.

  3. The model's inability to recognize paraphrased information in Notes, mistakenly indicating the absence of slot information. This issue particularly affects GPT's performance due to its strict interpretation of wording variations.

External Information

One common issue is the absence of external information, such as prior knowledge about the patient (e.g., medical history, demographics) or the content of previous interview questions. In Table 7, although both models see the onset of symptoms at age 12, they fail to provide an accurate response for the total symptom duration in months because the patient's current age (52) is not provided in the transcript. In this case, GPT tends to generate a None answer, while Llama tends to hallucinate the patient's age and thus produces an answer based on an arbitrary assumption.

Transcription Error

Transcription errors from automatic speech recognizers often cause LLMs to incorrectly interpret the answers, especially with short responses (e.g., yes, no, single digits like 6), medical terminologies, or non-verbal cues such as nodding. In Table 7, the number ‘6’ is incorrectly transcribed as ‘C’ in the participant’s response.

Session Mismatching

A question can be mismatched with the transcript, especially when the clinician extensively paraphrases it. In such cases, the segmented session may or may not contain all the necessary information to answer the question. In Table 7, both models correctly answer based on the patient’s response (1: Minimal). However, due to the mismatch, the session is missing a part where the patient also indicates 2 (clearly present but still manageable), which is recorded as gold.

Commonsense Reasoning

The models’ limitations extend to inferring basic human experiences. Unable to deduce standard working hours from a normal, consistent routine in Table 7, the models fall short of clinician-like assumptions of a typical 40-hour workweek, showcasing a gap in applying commonsense logic to the assessment.

7 Conclusion

In this study, we undertake the task of automating PTSD diagnostics using 411 clinician-administered interviews. To ensure data quality, we develop an end-to-end pipeline streamlining transcription, alignment, segmentation, and assessment pairing. We also construct a pioneering framework for this task by leveraging two state-of-the-art LLMs. Our findings reveal the substantial potential of LLMs in assisting clinicians with diagnostic validation and decision-making processes. Our error analysis suggests future directions for improvement, such as incorporating external information or commonsense knowledge to engineer more comprehensive instructions. We envision that this framework holds promise for addressing a broader spectrum of mental health conditions and offers novel insights into LLM applications within the mental health domain. We plan to collect more data and train a custom LLM to better preserve patients' privacy, and to develop a dialogue system to conduct the interviews.

Limitations

Although the experimental results demonstrate the capability of LLMs to automate PTSD diagnosis, their application in real-world, unsupervised clinical settings remains premature. To avoid the possible negative influence of model errors on patients, we recommend using this framework as a supportive tool for clinicians in diagnostics and decision-making.

It should be noted that the clinician-annotated gold assessment data is not perfect, which may affect evaluation accuracy. However, this framework makes it easier to identify and refine inaccuracies in the gold assessment data and thus improve its overall validity. We leave this data augmentation as the next step of our future work.

In addition, the experiments in this paper utilize LLMs without fine-tuning. One limitation is that we have little control over the model predictions; the models, especially Llama-2, sometimes generate unexpected outputs that violate the instructions. Furthermore, data privacy concerns restrict the use of models like GPT for clinical data. To address these issues and enhance framework adaptability, future work will focus on developing more controllable, open-source models that guarantee data protection in line with clinical domain restrictions.

Due to strict Institutional Review Board (IRB) regulations concerning the confidentiality of real patient information, we are unable to release the dataset, even in an anonymized format. However, recognizing the importance of contributing to the research community, we are pleased to announce that we will release the framework utilized in our study. This, we believe, will facilitate further research and innovation, as our methodology is versatile and can be adapted to a wide array of mental health conditions, provided the requisite interview question sets and video/transcripts are available.

Ethical Considerations

The diagnostic interview data used in this paper was collected with informed consent approved by the Institutional Review Board (IRB) and Research Oversight Committee. The authors and clinicians involved in the research have passed Research, Ethics, Compliance, and Safety Training through the Collaborative Institutional Training Initiative (CITI Program, https://about.citiprogram.org). For the use of LLMs, this study exclusively employs anonymized interviews, ensuring the confidentiality and privacy of all participants. All practices in this research adhere to the ACL Code of Ethics.

References

Appendix A Section Details

Tables 8-11 give examples for the 4 core sections. Each example includes the standard interview Question, the Variable that the question belongs to, and the example Sessions between the Clinician and the Participant.

The Mini International Neuropsychiatric Interview (MINI) is a brief, structured diagnostic interview for diagnosing 17 major psychiatric disorders (Sheehan et al., 1998). We adopt 6 modules from MINI to assess conditions such as Major Depressive Episode (MDE), Mania & Hypomania (MH), PTSD (past incidents), Psychosis Symptoms (PS), Substance Use Disorder (SUD), and Alcohol Use Disorder (AUD). Table 12 provides an example from the MDE module.

Q: What has been your primary source of income over the past month?
V: lbi_a1
S: C: You got to do it all over again. Are you working full time?
   P: Yes.
Q: How would you rate your overall satisfaction on a scale of 1 to 10, with 1 being the best and 10 being the worst?
V: lbi_e1
S: C: In the past month, like how satisfied have you felt with your life? If we were doing like a scale of one to 10, one is like, it’s the worst. This is the worst I’ve ever had in my life. 10 being like, this is, I’m living my best life. Living my life like it’s golden.
   P: I actually feel like that now. I actually do. Cause until January 1st of this year, I had been unemployed the last two years.
Table 8: Two examples of the LBI section.
Q: Do you have any current physical health conditions?
V: thh_medicalcond
S: C: OK, so now we’re going to move on to talking about your health and treatment history. Do you currently have, do you have any current physical health conditions? Did you say no? OK, I couldn’t hear what you were saying. Go ahead.
   P: I have a skin condition called eczema.
Q: In the past, have you been treated for any emotional/mental health problems with therapy or hospitalization?
V: thh_tx_yesno
S: C: In the past, have you been treated for any emotional or mental health problem with therapy or hospitalization?
   P: No. Yes.
Table 9: Two examples of the THH section.
Q: Tell me a little bit more about what happened.
V: trauma1whathappened
S: C: OK, and what would that be?
   P: My mom worked at the airport here in xxx. It was the food catering place. They put the food, made the food for the planes. When I was a child, every year, they would sponsor a day at xxx. They would go out there and barbecue. We took over the whole picnic area. You had free entrance to the park, plus tickets to do all the little fair games and all that good stuff. Having a good time. My mom asked my stepfather to go with us because he had a car. He said he didn’t wanna go and he wasn’t going nowhere. So my mom put us all on the bus. We drove the bus out there. When we came home, it was like 11 o’clock. Of course, we living in xxx. You know that bus ride was long. It was dark, dark when we got home and she had all three of her children with her. My mom unlocked the door, closed that door, the house was pitch black. That man shot down them steps at my mama and all three of her children five times.
Table 10: An example of the CRA section.
Q: Tell me a little bit more about what happened.
V: dsm5capscritb01trauma1_distress
S: C: To this day, let’s say over the past month. So since like the beginning of April, end of March, have you had unwanted memories of this event? Does it randomly pop into your mind at all? Like while you’re awake?
   P: Well, actually my daughter’s in an abusive relationship. So yes, I do think about it a lot. Every time I see her, all I think about is my mom. How she endured it.
Q: How often in the past month?
V: dsm5capscritc02trauma1_num
S: C: So in the last month, thinking about the things that you have tried to avoid, how often would you say you’ve done that?
   P: I guess every day. I don’t know. I just, the most I’ve done is just, and me avoiding stuff is me just sitting here smoking and playing my video game. That avoids me from thinking about anything negative in my life. And I just try to avoid that.
Table 11: Two examples of the CAP section.
Q: For the past two weeks, were you depressed or down, or felt sad, empty or hopeless most of the day, nearly every day?
V: miniv7_mde_c_a1
S: C: I’m going to ask you some different questions. We’re going to focus on the past two weeks right now. So for the past two weeks, did you feel depressed, down, sad, empty or hopeless for most of the day, almost every day the past two weeks?
   P: Um, no.
Table 12: An example of the MINI section.

Appendix B Data Preprocessing Details

B.1 Final Matching Criteria

The best bipartite matching result should satisfy the following criteria; a scoring sketch follows the coefficient list below.

  • All matching IDs need to be ascending.

  • Only edges whose embedding cosine similarity > 0.4 are kept.

  • Maximize y = Σ_{i=1}^{n} a_i · x_i, subject to x_i ≥ 0, for i = 1, …, n.

In our case, let n = 9, with the following variables:

  • x_1: the sum of Sentence Transformer (ST) cosine similarity scores of all edges

  • x_2: the sum of Levenshtein Distance (LD) similarity scores of all edges

  • x_3: the average ST cosine similarity score of all matched questions

  • x_4: the average LD similarity score of all matched questions

  • x_5: the total number of matched core questions

  • x_6: the total number of matched questions that take the maximum ST cosine similarity result

  • x_7: the total number of matched questions that take the maximum LD similarity result

  • x_8: the total number of matched core questions that take the maximum ST cosine similarity result

  • x_9: the total number of matched core questions that take the maximum LD similarity result

And the coefficients are set as:

  • a_1 = a_2 = 1

  • a_3 = a_4 = 1

  • a_5 = a_6 = a_7 = 0.1

  • a_8 = a_9 = 0.2
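The scoring side of this objective can be sketched as below; the Edge representation and feature extraction are simplified assumptions, and the search over candidate matchings (subject to the ascending-ID and 0.4-similarity constraints) is omitted.

from dataclasses import dataclass
from typing import List

@dataclass
class Edge:                 # one matched (utterance, question) pair
    st_score: float         # Sentence Transformer cosine similarity
    ld_score: float         # Levenshtein-distance similarity
    is_core: bool           # core vs. optional question
    st_is_max: bool         # this edge takes the maximum ST result for its question
    ld_is_max: bool         # this edge takes the maximum LD result for its question

def score_matching(edges: List[Edge]) -> float:
    """Weighted objective y = sum(a_i * x_i) over the nine features listed above."""
    n = len(edges) or 1
    x = [
        sum(e.st_score for e in edges),                 # x1
        sum(e.ld_score for e in edges),                 # x2
        sum(e.st_score for e in edges) / n,             # x3
        sum(e.ld_score for e in edges) / n,             # x4
        sum(e.is_core for e in edges),                  # x5
        sum(e.st_is_max for e in edges),                # x6
        sum(e.ld_is_max for e in edges),                # x7
        sum(e.is_core and e.st_is_max for e in edges),  # x8
        sum(e.is_core and e.ld_is_max for e in edges),  # x9
    ]
    a = [1, 1, 1, 1, 0.1, 0.1, 0.1, 0.2, 0.2]
    return sum(ai * xi for ai, xi in zip(a, x))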

B.2 Variable Examples

Tables 13-17 show examples for each variable type. Every example includes the Variable name, replaceable Patterns for prompt generation (Section 5), answer Range, and covered Questions. Note that Measure, Notes, and Rule variables do not have a predefined range, and Rule variables are calculated from the results of their Related Variables.

V dsm5capscritb01trauma1_distress
P {keywords}: how intense in the past month
{symptom}: unwanted memories of the traumatic event while awake
{attributes}:
- If the symptom only exists in dreams, the answer should be 0.
- If the symptom is not perceived as involuntary and intrusive, the answer should be 0.
R 0: None,
1: Minimal, minimal distress or disruption of activities
2: Clearly Present, distress clearly presented but still manageable, some disruption of activities
3: Pronounced, considerable distress, difficulty dismissing memories, marked disruption of activities
4: Extreme, incapacitating distress, cannot dismiss memories, unable to continue activities
Q In the past month, have you had any unwanted memories of it while you were awake, so not counting dreams?
- How does it happen that you start remembering it?
–Are these unwanted memories, or are you thinking about it on purpose?
- How much do these memories bother you?
- Are you able to put them out of your mind and think about something else?
– Overall, how much of a problem is this for you?
— How so?
Table 13: An example of the Scale variable. Questions starting with '-' are optional questions that might be skipped based on the participant's response.
V lbi_a1
P {keywords}: which of the following categories best describes,
{symptom}: usual employment status
R 1: Full-Time Gainful Employment
2: Part-Time Gainful Employment (30 hours or less/week)
3: Unemployed But Expected by Self or Others
4: Unemployed But Not Expected by Self or Others (e.g., physically disabled)
5: Retired
6: Homemaker
7: Student (Includes Part-Time)
8: Leave of Absence Due to Medical Reasons (e.g., holding job; plans to return)
9: Volunteer Work - Full Time
10: Volunteer Work - Part Time
11: Other
888: N/A
Q What has been your primary source of income over the past month?
Table 14: An example of the Category variable.
V dsm5capscritb01trauma1_num
P {keywords}: how intense in the past month
{symptom}: unwanted memories of the traumatic event while awake
{type}: an integer representing the frequency of the symptom in the past month
Q - How often have you had these memories in the past month?
Table 15: An example of the Measure variable. The corresponding question for this variable is optional and might be skipped if the participant denies the presence of the symptom.
V critaprobenotes
P {slots}:
- trauma_reactions
- trauma_details
- life_changes
- coping_and_changes
- worldview_changes
- health_concerns
- family_and_social_context
- nightmare_details
- intrusive_experiences
- trauma_cognition
- trust_and_safety
- impact_assessment
- age_and_time_factors
- substance_use
- therapy_and_progress
- eating_disorders
Q You discussed a number of traumas in the last visit with our team members.
What would you say is the one that has been most impactful where you are still noticing it affecting you?
-* How much do you think about what happened to this day?
-* How often do you have nightmares about what happened?
-* How much did it change the way you think about yourself and the world?
- In the past month, which of these have you thought about more often or had nightmares about or find yourself purposely avoiding thinking about?
– Are there any other stressors that you find yourself thinking about when you don’t want to or avoiding?
Table 16: An example of the Notes variable. Questions starting with '-' are optional questions which might be skipped based on the participant's response. Questions starting with '*' are recurrent questions which might be asked multiple times during the interview.
V dsm5capscritb01trauma1
R 0: Absent
1: Mild/subthreshold
2: Moderate/threshold
3: Severe/markedly elevated
4: Extreme/incapacitating
RV dsm5capscritb01trauma1_distress
dsm5capscritb01trauma1_num
Table 17: An example of the Rule variable.

Appendix C Experiments Details

C.1 Dataset Comparison

Table 18 compares our dataset with related datasets in the mental health domain.

Dataset | Audios | Hours | Turns | Utterances
DAIC (Gratch et al., 2014) | 189 | 51 | - | -
AViD (Valstar et al., 2014) | 300 | 240 | - | -
EATD (Shen et al., 2022) | 162 | 2.26 | - | -
Psych8k (Liu et al., 2023b) | 260 | 260 | - | -
D4 (Yao et al., 2022) | - | - | 28,855 | 81,559
ESConv (Liu et al., 2021) | - | - | - | 31,410
Ours | 322 | 515 | 71,412 | 142,824
Table 18: Comparisons with existing mental health interview/dialogue datasets in terms of audio counts, total hours, total turns, and utterances.

C.2 Details on Zero-shot/Few-shot Settings

We randomly sampled 30 instances for each variable type and asked both models to predict under zero-shot and few-shot settings. For the GPT model, few-shot settings generally yield better performance. However, the Llama model consistently fails to follow instructions as the context length grows, leading to significant degradation with few-shot prompting. Additionally, we observed a 28% increase in the likelihood of generating an unexpected response format, such as deviating from the requested JSON format, when using few-shot settings.

Type | Zero-shot (GPT-4) | Zero-shot (Llama-2) | Few-shot (GPT-4) | Few-shot (Llama-2)
Scale | 60.0 | 50.0 | 63.3 | 36.7
Scale_1 | - | - | 56.7 | 40.0
Category | 43.3 | 40.0 | 46.7 | 33.3
Measure | 56.7 | 56.7 | 60.0 | 50.0
Notes | 41.0 | 42.7 | 43.6 | 34.9
Table 19: Model performance under zero-shot and few-shot settings. Scale_1 refers to the few-shot setting that includes only one example for a single scale point. Accuracy is the metric used for all variable types except Notes variables, which are evaluated using Recall.

C.3 Experiment Costs

GPT-4

The pricing of the GPT-4 Turbo model is $0.01/1K tokens for input and $0.03/1K tokens for output. We spend approximately $300 (upper bound) to complete GPT experiments in this paper.

Llama-2

We use a single NVIDIA H100 GPU for Llama inferences with a batch size of 1, taking roughly 10 seconds per request. Completing a full set of experiments on all samples requires ~3 days.

C.4 LLM Configurations

We utilize gpt-4-1106-preview, the latest GPT-4 Turbo model, and llama-2-70b-chat-hf, the largest Llama-2 model. For GPT, to enhance the stability and consistency of the model output, we set the temperature parameter to 0, which makes the model's responses more deterministic. In addition, we employ parameters exclusive to GPT-4 Turbo and GPT-3.5 Turbo, namely response_format and seed. Setting response_format to "json_object" constrains the model to generate parsable JSON strings, facilitating easier data handling and analysis. Despite ChatGPT's non-deterministic nature, the seed parameter enables users to obtain consistent outputs across multiple requests, as long as there are no changes at the system level.
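A sketch of the corresponding GPT request with the openai Python client (v1 interface); the model name and parameters mirror the configuration above, while the seed value and message content are placeholders.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    temperature=0,                            # more deterministic responses
    seed=42,                                  # consistent outputs across requests (placeholder value)
    response_format={"type": "json_object"},  # constrain output to parsable JSON
    messages=[
        {"role": "user", "content": "<instruction filled from the templates in Table 4>"},
    ],
)
answer = response.choices[0].message.content  # JSON string with "reason" and "answer"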

As for Llama, we conduct experiments with different temperature, top_p, and repetition_penalty values separately. The results indicate that the model performs best with a temperature of 0.3, a top_p of 0.9, and a repetition_penalty of 1.
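A comparable sketch of the Llama-2 inference setup with these parameters, assuming the Hugging Face transformers library; the generation budget is an assumed value.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = "<instruction filled from the templates in Table 4>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,        # assumed output budget
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    repetition_penalty=1.0,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))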

Step Template
NSG As a clinician who has conducted interviews with multiple patients, you are tasked with structuring the interview data into a more organized format. To achieve this, identify general "slots" from the interview question and answers. These slots should represent key themes or types of information that can be adapted to various responses from different patients.
For each identified slot, provide a brief explanation of why it has been chosen, focusing on its relevance and utility in categorizing interview data.
Your findings should be presented in a JSON format as a list, for example: [{"reason": "This slot captures the primary health concern of the patient, a common theme across all interviews", "slot": "primary_health_concern" }, {"reason": "This slot pertains to the patient’s lifestyle habits, which is crucial for understanding health context", "slot": "lifestyle_habits" } ].
Remember to ensure that the slots are broad enough to be applicable across different patient responses yet specific enough to offer meaningful categorization.
NSM Imagine you are a clinician who documents patient interviews in a structured, slot-filling manner. Sometimes, certain slots may have overlapping or similar content. Your task is to review a given list of slots and merge those that are similar. The merged results should be returned as a JSON object, where each key represents a merged slot, and the corresponding value is a list of the original slots that have been combined under this merged category.
For instance, if the list of slots is: ["daily_routine", "work_events", "daily_activity", "daytime_activities", "work_routine"], a possible merged result could be: {"daily_routine": ["daily_routine", "daily_activity", "daytime_activities"], "work": ["work_events", "work_routine"]}.
When you receive a list of slots, analyze and merge them accordingly, ensuring that the merged slots are logically grouped and accurately represent the original information categories.
NSF Imagine you are a professional clinician. Based on the patient’s interview history, please extract specific information and fill in the following slots: {slots}. If the interview history does not provide information for any of these slots, please enter an empty string ('') for that slot. Return the answer as a JSON object.
Table 20: Prompts used for Notes Variable Slot structure Generation, Merging, and Formatting.

C.5 Slot Examples for Notes Variable

Table 20 outlines the process for generating, merging, and formatting the slots in Notes variables (§5.2). Initially, we compile all clinician-summarized notes for each Notes variable and input them into the GPT model using the NSG prompt to produce a list of slots. Due to potential overlaps, the NSM prompt directs the model to consolidate these slots into clusters, ensuring both conciseness and comprehensiveness. Subsequently, the NSF prompt is used to format both the gold-standard summaries and the corresponding interview sessions, facilitating a straightforward comparison of the structured slot arrangements.

C.6 Model Performance by Sections

Table 21 presents model performance for each section. Note that the THH section lacks Measure and Rule variables, whereas the CRA section does not contain Scale variables. The grouped scale (Scaleg) is applied exclusively within the CAP section.

Type Count Accuracy (GPT-4 / Llama-2) RMSE (GPT-4 / Llama-2) Bias (GPT-4 / Llama-2) Recall (GPT-4 / Llama-2)

LBI

Scale 1,281 54.6 44.7 1.26 1.42 0.46 0.45 - -
Category 594 74.6 67.3 - - - - - -
Measure 99 68.7 66.7 - - -0.16 -0.09 - -
Notes 203 - - - - - - 42.0 50.8
Rule 215 43.3 37.7 0.94 0.98 0.44 0.43 - -

THH

Scale 29 55.2 51.7 1.20 1.25 0.23 0.43 - -
Category 1,527 92.6 85.9 - - - - - -
Notes 633 - - - - - - 52.5 59.8

CRA

Category 1,737 63.7 42.4 - - - - - -
Measure 143 63.6 55.9 - - -0.58 -0.36 - -
Notes 310 - - - - - - 47.2 43.5
Rule 146 91.8 71.9 0.38 0.97 0.83 -0.51 - -

CAP

Scale 8,412 59.6 47.0 1.07 1.66 -0.14 0.52 - -
Scaleg 8,412 69.3 61.2 0.77 0.93 -0.15 0.53 - -
Category 400 81.0 64.5 - - - - - -
Measure 3,240 64.2 56.3 - - -0.33 0.01 - -
Rule 5,965 68.8 60.4 0.81 0.92 -0.19 0.46 - -
Table 21: Model performance on the four sections (§4.4) using the four evaluation metrics (§5.4).

C.7 Model Performance by Variables

Table 22 lists results for each variable using the four evaluation metrics (§5.4).

Variable Count Accuracy (GPT-4 / Llama-2) RMSE (GPT-4 / Llama-2) Bias (GPT-4 / Llama-2) Recall (GPT-4 / Llama-2)
Scale Variable
lbi_a2b 199 53.8 37.7 1.31 2.05 0.37 0.68 - -
lbi_a3 201 56.7 52.7 0.96 0.99 0.31 0.47 - -
lbi_a4 63 50.8 55.6 0.90 1.02 0.23 0.29 - -
lbi_b1a_family 207 46.9 54.6 1.78 1.36 0.49 0.11 - -
lbi_b2 212 60.8 44.8 1.19 1.02 0.47 0.49 - -
lbi_d 194 52.1 38.7 1.29 1.29 0.53 0.53 - -
lbi_e1 205 58.5 36.1 0.91 1.66 0.65 0.42 - -
dx_understanding 29 55.2 51.7 1.20 1.25 0.23 0.43 - -
dsm5capscritb01trauma1_distress 257 59.5 53.3 0.89 1.67 -0.44 0.48 - -
dsm5capscritb02trauma1_distress 254 69.7 53.9 0.72 1.16 -0.56 0.32 - -
dsm5capscritb03trauma1_distress 249 67.9 51.8 0.73 1.35 -0.05 0.68 - -
dsm5capscritb04trauma1_distress 259 57.1 40.5 0.96 1.25 -0.35 0.47 - -
dsm5capscritb05trauma1_distress 243 63.8 56.8 0.90 1.04 -0.11 0.37 - -
dsm5capscritc01trauma1_distress 253 46.2 39.9 1.77 1.92 -0.34 0.53 - -
dsm5capscritc02trauma1_distress 243 58.0 45.7 0.99 1.20 -0.04 0.64 - -
dsm5capscritd01trauma1_distress 242 66.1 53.7 0.92 1.13 -0.10 0.30 - -
dsm5capscritd02trauma1_distress 256 56.6 36.7 0.85 1.31 -0.06 0.83 - -
caps5trauma1related_d02 164 57.9 55.5 0.97 0.85 -0.71 0.07 - -
dsm5capscritd03trauma1_distress 248 61.7 58.9 0.94 0.92 -0.56 0.24 - -
dsm5capscritd04trauma1_distress 252 56.0 49.2 0.93 1.13 -0.03 0.55 - -
caps5trauma1related_d04 160 63.8 54.4 0.89 0.84 -0.28 0.10 - -
dsm5capscritd05trauma1_distress 253 57.7 47.8 1.00 1.18 -0.08 0.53 - -
caps5trauma1related_d05 138 53.6 44.9 1.06 0.96 -0.56 0.21 - -
dsm5capscritd06trauma1_distress 255 53.5 47.5 1.01 1.23 0.09 0.66 - -
caps5trauma1related_d06 156 51.3 41.0 0.98 0.90 -0.47 0.35 - -
dsm5capscritd07trauma1_distress 257 59.5 45.5 0.88 1.22 0.04 0.67 - -
caps5trauma1related_d07 128 55.5 44.5 0.96 0.94 -0.16 0.35 - -
dsm5capscrite01trauma1_distress 257 60.3 46.7 0.79 1.13 0.33 0.78 - -
caps5trauma1related_e01 148 52.7 33.8 3.54 3.44 -0.74 0.06 - -
dsm5capscrite02trauma1_distress 251 67.3 61.0 0.71 1.11 0.02 0.31 - -
caps5trauma1related_e02 50 74.0 58.0 1.09 1.26 -0.38 0.43 - -
dsm5capscrite03trauma1_distress 255 51.4 47.1 1.09 1.20 0.32 0.54 - -
caps5trauma1related_e03 155 50.3 51.6 0.93 0.86 -0.40 0.17 - -
dsm5capscrite04trauma1_distress 252 63.1 52.8 0.85 1.05 -0.03 0.60 - -
caps5trauma1related_e04 117 50.4 53.0 0.99 0.88 -0.55 0.13 - -
dsm5capscrite05trauma1_distress 256 59.8 53.5 0.81 0.99 -0.13 0.58 - -
caps5trauma1related_e05 161 57.8 41.6 1.09 0.99 -0.79 0.51 - -
dsm5capscrite06trauma1_distress 256 53.5 52.7 1.02 1.06 0.09 0.37 - -
caps5trauma1related_e06 181 63.0 38.7 1.0 10.2 -0.67 0.37 - -
dsmiv_future_frequency_current 251 80.1 48.6 0.81 6.02 0.40 0.80 - -
dsmiv_future_intens_current 246 69.1 40.2 0.93 1.77 0.61 0.90 - -
dsm5capscritg_trauma1_distress 228 53.9 43.4 1.08 1.39 0.35 0.69 - -
dsm5capscritg_trauma1_impair 226 51.8 42.0 0.93 1.27 -0.28 0.57 - -
dsm5capscritg_trauma1_fx 205 54.1 29.3 1.10 1.56 -0.04 0.81 - -
dsm5depersonalization_sev 255 67.5 52.2 0.80 1.25 -0.08 0.49 - -
caps5trauma1related_diss01 76 53.9 31.6 1.16 1.26 0.31 0.19 - -
dsm5derealization_sev 249 63.1 30.9 0.98 1.74 0.20 0.88 - -
caps5trauma1related_diss02 70 55.7 27.1 1.11 1.25 -0.03 0.53 - -
Category Variable
lbi_a1 200 70.0 41.0 - - - - - -
lbi_student 201 95.0 89.1 - - - - - -
lbi_c1a 192 57.8 71.9 - - - - - -
lbi_c2 1 100 100 - - - - - -
thh_medicalcond 206 92.7 88.8 - - - - - -
thh_tx_curr_yesno 215 94.9 80.9 - - - - - -
thh_tx_yesno 233 89.7 87.6 - - - - - -
feedback_helpful 79 94.9 89.9 - - - - - -
thh_txneed_yesno 96 92.7 88.5 - - - - - -
thh_psychmed_curr_yesno 194 92.3 88.7 - - - - - -
thh_psychmed_yesno 198 95.5 93.4 - - - - - -
thh_suicide_yesno 236 90.7 77.1 - - - - - -
thh_suicide_pw_yesno 70 94.3 78.6 - - - - - -
trauma1lifeeventscl 146 61.6 12.3 - - - - - -
trauma1_exposure_type___1 146 77.4 67.1 - - - - - -
trauma1_exposure_type___2 146 77.4 43.2 - - - - - -
trauma1_exposure_type___3 146 67.8 28.1 - - - - - -
trauma1_exposure_type___4 146 65.8 22.6 - - - - - -
caps_e1_lt 145 62.1 45.5 - - - - - -
caps_e1_ltself 73 64.4 64.4 - - - - - -
caps_e1_ltother 74 41.9 44.6 - - - - - -
caps_e1_si 146 43.8 39.7 - - - - - -
caps_e1_siself 61 54.1 65.6 - - - - - -
caps_e1_siother 61 60.7 29.5 - - - - - -
caps_e1_tpi 146 54.1 52.7 - - - - - -
caps_e1_tpiself 79 84.8 75.9 - - - - - -
caps_e1_tpiother 77 49.4 26.0 - - - - - -
trauma1_nomemory 145 75.2 44.8 - - - - - -
dsm5caps_critf_cur1_yesno 202 78.7 41.6 - - - - - -
dsm5caps_critf_cur1_c 198 83.3 87.9 - - - - - -
Measure Variable
lbi_a2a 99 68.7 66.7 - - 41.9 45.5 - -
trauma1age 143 63.6 55.9 - - 21.2 31.7 - -
dsm5capscritb01trauma1_num 162 63.6 58.0 - - 37.3 52.9 - -
dsm5capscritb02trauma1_num 98 74.5 63.3 - - 28.0 52.8 - -
dsm5capscritb03trauma1_num 84 72.6 59.5 - - 47.8 76.5 - -
dsm5capscritb04trauma1_num 177 62.1 58.8 - - 17.9 42.5 - -
dsm5capscritb05trauma1_num 137 59.1 57.7 - - 28.6 50.0 - -
dsm5capscritc01trauma1_num 170 59.4 53.5 - - 31.9 54.4 - -
dsm5capscritc02trauma1_num 140 63.6 54.3 - - 27.5 54.7 - -
dsm5capscritd01trauma1_num 87 50.6 48.3 - - 32.6 82.2 - -
dsm5capscritd02trauma1_num 168 76.2 69.0 - - 47.5 59.6 - -
dsm5capscritd03trauma1_num 120 65.0 57.5 - - 23.8 43.1 - -
dsm5capscritd04trauma1_num 166 72.3 68.1 - - 39.1 45.3 - -
dsm5capscritd05trauma1_num 138 65.9 59.4 - - 42.6 46.4 - -
dsm5capscritd06trauma1_num 155 69.7 63.2 - - 27.7 40.4 - -
dsm5capscritd07trauma1_num 140 61.4 60.0 - - 40.7 53.6 - -
dsm5capscrite01trauma1_num 135 65.9 62.2 - - 21.7 54.9 - -
dsm5capscrite02trauma1_num 61 83.6 68.9 - - 80.0 89.5 - -
dsm5capscrite03trauma1_num 159 73.0 67.3 - - 37.2 34.6 - -
dsm5capscrite04trauma1_num 131 68.7 60.3 - - 24.4 50.0 - -
dsm5capscrite05trauma1_num 168 69.0 66.1 - - 21.2 35.1 - -
dsm5capscrite06trauma1_num 184 72.8 61.4 - - 40.0 31.0 - -
dsmcaps_critf_cur1_nummonths 191 49.7 22.5 - - 60.4 81.8 - -
dsm5caps_critf_cur1_b 181 35.7 22.0 - - 17.9 19.0 - -
dsm5depersonalization_num 84 59.5 51.2 - - 32.4 65.9 - -
dsm5derealization_num 3 100 33.3 - - 0.00 50.0 - -
Notes Variable
life_base_typicalday 203 - - - - - - 42.0 50.8
thh_medicalcond_desc 100 - - - - - - 56.8 80.1
thh_tx_curr_descr 59 - - - - - - 53.6 73.8
thh_tx_descr 135 - - - - - - 44.0 57.4
dx_knowledge 33 - - - - - - 59.4 48.5
dx_lackknowledge 20 - - - - - - 60.7 37.9
feedback_info 66 - - - - - - 75.1 48.1
thh_txneed_desc 45 - - - - - - 59.7 49.4
thh_psychmed_descr 89 - - - - - - 40.4 59.0
thh_suicide_desc 73 - - - - - - 56.9 67.0
thh_suicide_pw_desc 13 - - - - - - 62.2 62.3
critaprobenotes 143 - - - - - - 50.8 37.4
trauma1whathappened 143 - - - - - - 42.7 51.4
trauma1describe 24 - - - - - - 51.4 48.9
Rule Variable
lbi_e2 215 43.3 37.7 0.94 0.98 0.44 0.43 - -
caps_e1_crita 146 91.8 71.9 0.38 0.97 0.83 -0.51 - -
dsm5capscritb01trauma1 253 62.8 63.6 0.81 0.90 -0.51 0.28 - -
dsm5capscritb02trauma1 250 88.0 70.0 0.53 0.80 -0.47 0.47 - -
dsm5capscritb03trauma1 246 86.2 63.4 0.54 0.96 0.00 0.82 - -
dsm5capscritb04trauma1 255 67.5 60.0 0.86 0.95 -0.57 0.25 - -
dsm5capscritb05trauma1 241 74.7 69.7 0.73 0.75 -0.18 0.26 - -
dsm5capscritc01trauma1 250 55.2 54.8 0.94 0.97 -0.64 0.36 - -
dsm5capscritc02trauma1 242 71.9 61.2 0.83 0.93 -0.29 0.51 - -
dsm5capscritd01trauma1 239 81.2 66.5 0.68 1.02 0.24 0.60 - -
dsm5capscritd02trauma1 222 62.6 46.4 0.79 1.09 -0.16 0.83 - -
dsm5capscritd03trauma1 246 72.0 72.0 0.85 0.74 -0.48 0.36 - -
dsm5capscritd04trauma1 251 63.7 62.2 0.94 1.02 -0.08 0.35 - -
dsm5capscritd05trauma1 252 59.9 53.6 0.98 1.00 -0.19 0.42 - -
dsm5capscritd06trauma1 254 55.9 50.8 1.03 1.10 -0.07 0.57 - -
dsm5capscritd07trauma1 255 63.1 60.0 0.85 0.95 -0.17 0.59 - -
dsm5capscrite01trauma1 255 72.5 51.4 0.69 0.90 0.03 0.66 - -
dsm5capscrite02trauma1 250 90.4 76.8 0.38 0.73 0.75 0.90 - -
dsm5capscrite03trauma1 220 57.3 58.6 0.91 0.96 0.09 0.43 - -
dsm5capscrite04trauma1 250 75.6 72.0 0.71 0.79 -0.08 0.63 - -
dsm5capscrite05trauma1 254 65.7 67.7 0.80 0.77 -0.36 0.54 - -
dsm5capscrite06trauma1 254 55.9 52.8 1.05 0.96 0.05 0.27 - -
dsmcaps_critf_admin 28 75.0 100 0.50 0.00 -1.00 -1.00 - -
dsm5depersonalization 246 85.4 64.2 0.61 0.78 -0.06 0.70 - -
dsm5derealization 243 75.3 39.1 0.69 1.14 0.27 0.76 - -
dsm5capsglobalvalidtrauma1 255 63.5 63.5 0.84 0.84 -1.00 -1.00 - -
dsm5capsglobalsevtrauma1 254 44.1 42.9 0.91 0.97 0.21 0.45 - -
Table 22: Model performance on all variables (§4.4) using the four evaluation metrics (§5.4).