Doctor Recommendation in Online Health Forums via Expertise Learning
Abstract
Huge volumes of patient queries are generated daily on online health forums, rendering manual doctor allocation a labor-intensive task. To better help patients, this paper studies a novel task of doctor recommendation to enable automatic pairing of a patient to a doctor with relevant expertise. While most prior work in recommendation focuses on modeling target users from their past behavior, we can only rely on the limited words in a query to infer a patient's needs for privacy reasons. For doctor modeling, we study the joint effects of their profiles and previous dialogues with other patients and explore their interactions via self-learning. The learned doctor embeddings are further employed to estimate their capabilities of handling a patient query with a multi-head attention mechanism. For experiments, a large-scale dataset is collected from Chunyu Yisheng, a Chinese online health forum, where our model exhibits state-of-the-art results, outperforming baselines that only consider profiles or past dialogues to characterize a doctor. Our dataset and code are publicly available at: https://github.com/polyusmart/Doctor-Recommendation
1 Introduction
The growing popularity of health communities on social media has revolutionized the traditional face-to-face doctor consultancy paradigm. Massive numbers of patients are now turning to online health forums to seek professional help; meanwhile, popular healthcare platforms are able to recruit a large group of licensed doctors to provide online service (Liu et al., 2020b). In the COVID-19 crisis, social distancing policies have further boosted the use of these forums, where numerous patients query a diverse variety of health problems every day (Gong et al., 2020).

Nevertheless, in current practice (Cao et al., 2017), manual doctor allocation is adopted to handle each query, which largely limits the efficiency of helping patients in sheer quantities and makes the process extremely expensive. Under this circumstance, how can we automate and speed up the pairing of patients with doctors who are able to offer help?
In this paper, we present a novel task of doctor recommendation, whose goal is to automatically figure out a patient's needs from their query on online health forums and recommend a doctor with relevant expertise to help. The solution cannot be trivially found in mainstream recommendation approaches, because most recommender systems acquire the past behavior of target users (e.g., their purchase history) to capture their potential requirements (Wu et al., 2020; Huang et al., 2021), whereas our target users – the patients – should be anonymized to protect their privacy. Language features consequently play a crucial role in our task because only a few query words are accessible for models to make sense of how a patient feels and who can best help them.
To illustrate our task, Figure 3 shows a patient's query concerning insomnia and muscle aches, where it is hard to infer the cause of such symptoms from the short text, let alone recommend a suitable doctor for problem-solving. It is hence crucial to explore the semantic relations between patient queries and doctor expertise for recommendation. To characterize a doctor's expertise, modeling their profile (describing what they are good at) provides a straightforward alternative. Nevertheless, profiles are usually written in professional language, while a patient tends to query in layman's terms. For instance, the doctor who later solved this patient's problem is profiled with "neurological diseases", whose correlation with the symptom descriptions in the query is rather implicit. Therefore, we propose to adopt the previous dialogues held by a doctor with other patients (henceforth dialogues) to narrow the gap in language styles between doctor profiles and patient queries. Take the history dialogues of the doctor in Figure 3 as an example: words therein like "dizziness", "muscular atrophy", and "cyclopyrrolones" (treatments for insomnia) all help to bridge the doctor's expertise in neurological diseases with the patient's symptoms.
To capture how a doctor's profile is related to their dialogue history, we first construct a self-learning task to predict whether a profile and a dialogue are from the same doctor. It is designed to fine-tune a pre-trained BERT (Devlin et al., 2018) and align the profile writing and colloquial languages (used in patient queries and doctor responses) into the same semantic space to help model a doctor's expertise. Profiles and dialogues are then coupled with the query embeddings to explore how likely a doctor is qualified to help the patient. Here, multi-head attention aware of the doctor profile is put over the history dialogues to capture the essential content able to indicate a doctor's suitability from multiple aspects, e.g., the capabilities of the doctor in Figure 3 to handle both "insomnia" and "myopathy". Such a design reflects the intricate nature of health issues and potentially allows the model to focus on salient and relevant matters instead of being overwhelmed by the massive dialogues a doctor has engaged in, which may concern diverse points.
In comparison to other NLP studies concerning health forum dialogues (Xu et al., 2019; Zeng et al., 2020a), few attempt to spotlight the doctors in these dialogues and examine how their expertise is reflected by what they say. Different from them, we explore doctor expertise from profiles and history dialogues in order to fit a doctor's qualification to a patient's requests, which would advance the so far limited progress of doctor expertise modeling with NLP.
To the best of our knowledge, we are the first to study doctor recommendation to automate the pairing of doctors and patients in online health forums, where the joint effects of doctor profiles and their previous consultation dialogues are explored to learn what a doctor is good at and how they are able to help handle a patient's request.
For experiments, we also gather a dataset with 119K patient-doctor dialogues involving 359 doctors from 14 departments, collected from Chunyu Yisheng (chunyuyisheng.com), a popular Chinese health forum. The empirical results show that doctor profiles and dialogue history work together to well reflect a doctor's expertise and how they are able to help a patient. In the main comparison, our model achieves state-of-the-art results (e.g., 0.616 in P@1), outperforming all baselines and ablations without self-supervised learning or multi-head attention.
Moreover, we quantify the effects of doctor profiles, history dialogues, and patient queries in recommendation, and our model shows consistently superior performance in varying scenarios. Furthermore, we probe into the model outputs to examine what our model learns, with a discussion on multiple heads (in our attention map), a case study, and an error analysis, where the results reveal the potential of multi-head attention to capture various aspects of a doctor's expertise and point out future directions in distinguishing profile quality and leveraging data augmentation and medical knowledge.
2 Data Collection and Analysis
Despite the previous contributions of large-scale doctor-patient dialogue data (Zeng et al., 2020a), we note that some essential information for doctor modeling is missing, e.g., the profiles. In this work, we present a new dataset to study the characterization of doctor expertise on health forums from both profiles and dialogue history.
Data Collection.
We developed an HTML crawler to obtain the data from Chunyu Yisheng, one of the biggest online health forums in China. Then, seed dialogues involving 98 doctors were gathered from the “Featured QA” page. To ensure doctor coverage in varying departments, we also collected doctors from the “Find Doctors” page for each department, which results in the 359 doctors in our dataset. Finally, for each doctor, we crawled their “Favorable Dialogues” page and obtained the profile and history dialogues therein. All stop words were removed from each dialogue.
Data Analysis.
The statistics of our dataset are reported in Table 1. We observe that dialogues are in general much longer than profiles. We also observe that a doctor engages in over 300 dialogues on average. It indicates that rich information is contained in dialogues for learning doctor expertise, while presenting challenges in capturing the essential content therein for effective doctor embedding.
| Statistic | Value |
| --- | --- |
| # of dialogues | 119,128 |
| # of doctors | 359 |
| # of departments | 14 |
| # of tokens in vocabulary | 8,715 |
| Avg. # of dialogues per doctor | 331.83 |
| Avg. # of doctors per department | 25.64 |
| Avg. # of tokens in a query | 89.97 |
| Avg. # of tokens in a dialogue | 534.28 |
| Avg. # of tokens in a profile | 87.53 |
We further plot the distribution of dialogues a doctor engages in and the dialogue length distribution in Figure 2. It is observed that doctors contribute diverse amounts of dialogues, which reflects the wide range of doctor expertise and qualifications in practice. Nonetheless, a large proportion of doctors are involved in over 100 dialogues, and many dialogues are lengthy (over 200 tokens). We can hence envision that a doctor's expertise may exhibit diverse aspects and that dense information is available in history dialogues, whereas an effective mechanism should be adopted to capture salient content.

We finally examine doctors' language styles by counting the number of medical terms based on the THUOCL medical lexicon (github.com/thunlp/THUOCL/blob/master/data/THUOCL_medical.txt). Results show that medical terms account for 30.13% of the tokens in doctor profiles, while the number is 7.83% and 5.52% for patient and doctor turns in dialogues, respectively. It is probably because doctors tend to profile themselves with professional language while adopting layman's language to discuss with patients.
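The medical-term ratio above can be reproduced with a short script. The sketch below is a minimal version assuming the THUOCL medical lexicon has been downloaded to a local file and that texts are segmented with jieba; the file path and the `texts` argument are illustrative, not part of our released code.

```python
import jieba

def medical_term_ratio(texts, lexicon_path="THUOCL_medical.txt"):
    """Fraction of segmented tokens that appear in the THUOCL medical lexicon."""
    # Each lexicon line is "term<TAB>frequency"; keep only the term column.
    with open(lexicon_path, encoding="utf-8") as f:
        medical_terms = {line.split()[0] for line in f if line.strip()}
    for term in medical_terms:
        jieba.add_word(term)  # keep multi-character medical terms intact during segmentation
    total, medical = 0, 0
    for text in texts:
        for token in jieba.lcut(text):
            if token.strip():
                total += 1
                medical += token in medical_terms
    return medical / max(total, 1)

# e.g., compare medical_term_ratio(doctor_profiles) with medical_term_ratio(patient_turns)
```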
3 Doctor Recommendation Framework
We now introduce the proposed framework for our doctor recommendation task (overviewed in Figure 3). It contains three modules: a query encoder that encodes patient needs from queries, a doctor encoder that encodes doctor expertise from profiles and dialogues, and a prediction layer that couples the above outputs for recommendation prediction.

Model’s Input and Output.
The input of our model comes from three sources: a query $q$ from a patient, the profile $p$ of a doctor $d$, and the collection of $d$'s history dialogues $\{c_1, c_2, \dots, c_N\}$ ($N$ denotes the number of dialogues $d$ previously engaged in). For each given query $q$, we first pair it with each doctor $d$ from a candidate pool of doctors and output a matching score $f(d, q)$ to reflect how likely $d$ owns the expertise to handle the request of $q$. A recommendation is then made for $q$ by ranking all the doctor candidates based on these matching scores.
3.1 Doctor Encoder
Here we introduce how we encode a doctor into an embedding that reflects their expertise, starting with the embeddings of their profile and dialogues.
Profile and Dialogue Embedding.
Built upon the success of pre-trained models for language representation learning, we employ a pre-trained BERT (Devlin et al., 2018) to encode the profile $p$ and obtain its rudimentary embedding $\mathbf{e}_p$. Likewise, for a dialogue $c_j$, we convert it into a token sequence by linking its turns in chronological order and encode its semantic features with BERT, which yields the dialogue embedding $\mathbf{e}_{c_j}$.
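As a concrete illustration, the snippet below derives one vector per profile or dialogue by taking the [CLS] representation of a pre-trained Chinese BERT; the checkpoint name is a placeholder for MC-BERT (introduced in Section 4), and the toy dialogue text is made up.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # placeholder for MC-BERT
encoder = BertModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Encode a profile or a flattened dialogue into a single 768-d embedding."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    return encoder(**inputs).last_hidden_state[:, 0]  # [CLS] token, shape (1, 768)

# A dialogue is flattened by linking its turns in chronological order.
e_c = embed(" ".join(["患者: 最近总是失眠、肌肉酸痛", "医生: 这种情况持续多久了?"]))
```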
Self-Learning.
As analyzed in Section 2, doctor profiles are usually written in professional language while dialogue language tends to be in a layman's style. To marry the semantics of profiles and dialogues into the same space, we design a self-learning task to predict whether a profile and a dialogue come from the same doctor, where random profile-dialogue pairs are adopted as negative samples. The pre-trained BERT at the doctor encoder's embedding layer is then fine-tuned by tackling this self-learning task, shaping an initial understanding of how profiles relate to dialogues.
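One way to realize this self-learning step is as a binary sentence-pair classification, sketched below: positive pairs couple a doctor's profile with one of their own dialogues, while negatives pair the profile with a dialogue from a randomly chosen other doctor. The checkpoint name, the `doctors` data structure, and the single-example training loop are all illustrative simplifications.

```python
import random
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # placeholder for MC-BERT
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

doctors = [  # toy records; real ones come from the collected profiles and dialogues
    {"profile": "擅长神经内科疾病", "dialogues": ["患者: 失眠头晕……", "患者: 肌肉酸痛……"]},
    {"profile": "擅长皮肤科常见病", "dialogues": ["患者: 眼皮上长了个疙瘩……"]},
]

def make_pairs(doctors):
    """Build (profile, dialogue, label) pairs; label 1 = same doctor, 0 = random negative."""
    pairs = []
    for i, doc in enumerate(doctors):
        for dialogue in doc["dialogues"]:
            pairs.append((doc["profile"], dialogue, 1))
            other = random.choice([d for j, d in enumerate(doctors) if j != i])
            pairs.append((doc["profile"], random.choice(other["dialogues"]), 0))
    return pairs

for profile, dialogue, label in make_pairs(doctors):
    inputs = tokenizer(profile, dialogue, truncation=True, max_length=512, return_tensors="pt")
    loss = model(**inputs, labels=torch.tensor([label])).loss
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```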
Multi-head Attention.
We have shown in Figure 2 that a doctor may engage in massive amounts of dialogues, only part of which may be relevant to a given query. To allow the model to attend to the salient information in the dense content provided by history dialogues, we put a profile-aware attention mechanism over the dialogues. Here, multi-head attention is selected because of its capability of capturing multiple key points, which potentially reflects the complicated nature of doctor expertise: in practice it tends to exhibit multiple aspects.
Concretely, the profile embedding $\mathbf{e}_p$ serves as the attention query, while the dialogue embedding array $\mathbf{E}_C = [\mathbf{e}_{c_1}; \mathbf{e}_{c_2}; \dots; \mathbf{e}_{c_N}]$ serves as both the key and the value:

$$\mathbf{Q} = \mathbf{e}_p, \qquad \mathbf{K} = \mathbf{V} = \mathbf{E}_C \tag{1}$$

For the $h$-th head, these three arguments are then respectively transformed through linear perceptrons with learnable weight matrices $\mathbf{W}^{Q}_{h}$, $\mathbf{W}^{K}_{h}$, and $\mathbf{W}^{V}_{h}$ ($Q$ for query, $K$ for key, and $V$ for value). Their outputs jointly produce an intermediate doctor representation $\mathrm{head}_h$, which characterizes a doctor's expertise from one perspective:

$$\mathrm{head}_h = \mathrm{Attention}\big(\mathbf{Q}\mathbf{W}^{Q}_{h},\ \mathbf{K}\mathbf{W}^{K}_{h},\ \mathbf{V}\mathbf{W}^{V}_{h}\big) \tag{2}$$

where the $\mathrm{Attention}(\cdot)$ operation is defined as:

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V} \tag{3}$$

Here $d_k$ is the dimension of the key and value. The scaling factor $\sqrt{d_k}$ helps keep the softmax output away from regions with extremely small gradients.

Finally, to combine the learning results from multiple heads, the outputs are concatenated and transformed with a learnable matrix $\mathbf{W}^{O}$ to obtain the final doctor embedding $\mathbf{v}_d$:

$$\mathbf{v}_d = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\,\mathbf{W}^{O} \tag{4}$$

Here $H$ denotes the number of heads. The doctor embedding $\mathbf{v}_d$, carrying features indicating the expertise of doctor $d$, will then be coupled with the query encoder results for recommendation, as described in the next subsection.
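A minimal PyTorch sketch of this profile-aware doctor encoder (Eqs. 1-4) is given below, built on `nn.MultiheadAttention` with the profile embedding as the query and the stacked dialogue embeddings as key and value; the 768-d embedding size and 6 heads follow the settings reported in Section 4, and the random tensors stand in for real BERT outputs.

```python
import torch
from torch import nn

class DoctorEncoder(nn.Module):
    """Profile-aware multi-head attention over a doctor's dialogue embeddings."""

    def __init__(self, dim: int = 768, num_heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, profile_emb: torch.Tensor, dialogue_embs: torch.Tensor):
        # profile_emb: (batch, dim); dialogue_embs: (batch, N, dim)
        query = profile_emb.unsqueeze(1)              # Eq. (1): Q = e_p
        doctor_emb, weights = self.attn(              # Eqs. (2)-(4), incl. the W^O projection
            query, dialogue_embs, dialogue_embs)      # K = V = dialogue embedding array
        return doctor_emb.squeeze(1), weights

encoder = DoctorEncoder()
e_p = torch.randn(1, 768)          # profile embedding from BERT
E_c = torch.randn(1, 300, 768)     # ~300 dialogue embeddings for one doctor
v_d, attn_map = encoder(e_p, E_c)  # v_d: (1, 768) doctor embedding
```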
3.2 Query Encoder and Prediction Layer
We now describe how we measure the qualification of a doctor (embedded in $\mathbf{v}_d$) to handle a query $q$.
Query Embedding.
For anonymity reasons, only the linguistic signals in a query are available to encode a patient's request. Therefore, we adopt a strategy similar to the embedding of profiles and dialogues and customize the query encoder with a pre-trained BERT. The learned feature is denoted as the query embedding $\mathbf{e}_q$ to represent patient needs.
Recommendation Prediction.
Given a pair of doctor $d$ and query $q$, the embedding results of the doctor encoder and query encoder are coupled in the prediction layer for recommendation. We adopt an MLP architecture to measure the matching score of the $d$-$q$ pair, which indicates the likelihood that doctor $d$ is able to provide a suitable answer to query $q$ and is calculated as follows:

$$f(d, q) = \sigma\big(\mathbf{W}\,[\mathbf{v}_d; \mathbf{e}_q] + b\big) \tag{5}$$

Here $\sigma(\cdot)$ denotes the sigmoid activation function, $[\cdot;\cdot]$ denotes concatenation, and $\mathbf{W}$ (weights) and $b$ (bias) are trainable.
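A sketch of this prediction layer is shown below, assuming the doctor and query embeddings are concatenated before the MLP; the 256-unit hidden layer follows the settings in Section 4, and the other names are illustrative.

```python
import torch
from torch import nn

class PredictionLayer(nn.Module):
    """MLP producing the matching score f(d, q) of Eq. (5)."""

    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),   # input: concatenated [v_d ; e_q]
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, v_d: torch.Tensor, e_q: torch.Tensor) -> torch.Tensor:
        score = self.mlp(torch.cat([v_d, e_q], dim=-1))
        return torch.sigmoid(score).squeeze(-1)       # matching score in (0, 1)

predict = PredictionLayer()
scores = predict(torch.randn(4, 768), torch.randn(4, 768))  # scores for 4 doctor-query pairs
```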
3.3 Training Processes
Our framework is based on the pre-trained BERT and fine-tuned in the following two steps. The first is to fine-tune the embedding layer of the doctor encoder (as described in Section 3.1). For the second, we fine-tune the entire framework by optimizing the weighted binary cross-entropy loss introduced in Zeng et al. (2020b):

$$\mathcal{L} = -\sum_{(d,q) \in \mathcal{T}} \Big[\lambda\, y_{d,q} \log f(d, q) + (1 - y_{d,q}) \log\big(1 - f(d, q)\big)\Big] \tag{6}$$

Here $\mathcal{T}$ is the training set formed with doctor-query pairs and $y_{d,q}$ denotes the binary ground-truth label, with $y_{d,q}=1$ indicating $d$ later responded to $q$ and $y_{d,q}=0$ the opposite. $\lambda$ balances the weights of positive and negative samples in model training, where the model weighs more on positive $d$-$q$ pairs ($d$ indeed handled $q$) because negative samples may be less reliable and affected by many unpredictable factors, e.g., a doctor being too busy at some time. Intuitively, this training objective encourages the model to assign high matching scores to a doctor who actually helped with a query.
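For concreteness, the weighted objective of Eq. (6) can be written as below; the value of $\lambda$ is illustrative (10.0 here), since the tuned value is only described qualitatively above.

```python
import torch

def weighted_bce(scores: torch.Tensor, labels: torch.Tensor, lam: float = 10.0) -> torch.Tensor:
    """Eq. (6): positive pairs (label 1) weighted by lam, negative pairs by 1."""
    eps = 1e-8  # numerical stability for log
    positive = labels * torch.log(scores + eps)
    negative = (1 - labels) * torch.log(1 - scores + eps)
    return -(lam * positive + negative).sum()

loss = weighted_bce(torch.tensor([0.9, 0.2]), torch.tensor([1.0, 0.0]))
```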
4 Experimental Setup
We now describe the setup for our experiments.
Dataset Preprocessing and Split.
To pre-process the data for non-neural models, we employed the open-source toolkit jieba (github.com/fxsjy/jieba) for Chinese word segmentation. For neural models, texts were tokenized with the toolkit attached to MC-BERT (github.com/alibaba-research/ChineseBLUE), a pre-trained BERT for biomedical language understanding (Zhang et al., 2020a), so that they could be fed into BERT. In the experiments, we maintained a vocabulary without stop words for the dialogues' non-query turns while keeping stop words in queries and profiles, considering the high information density of the latter and the colloquial style of the former.
In terms of dataset split, 80% of the dialogues were randomly selected from each doctor to form the training set. For the remaining 20% of dialogues, we took their first turns (patient queries) to measure recommendation and split these queries into two random halves, one for validation and the other for testing. In the training stage, we adopted negative sampling with a sampling ratio of 10 to speed up the process, while for inference, doctor ranking is conducted over the 100 doctors handling the most queries.
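The sampling and ranking scheme just described can be sketched as follows; the 10:1 negative ratio and the 100-doctor candidate pool follow the text above, while the data structures and `score_fn` (the trained matching model) are illustrative.

```python
import random

def make_training_pairs(query, true_doctor, all_doctors, neg_ratio: int = 10):
    """One positive (query, responding doctor) pair plus neg_ratio sampled negatives."""
    pairs = [(query, true_doctor, 1)]
    negatives = random.sample([d for d in all_doctors if d != true_doctor], neg_ratio)
    pairs.extend((query, d, 0) for d in negatives)
    return pairs

def recommend(query, candidate_doctors, score_fn, top_k: int = 1):
    """Rank the candidate pool (the 100 most active doctors) by matching score f(d, q)."""
    ranked = sorted(candidate_doctors, key=lambda d: score_fn(d, query), reverse=True)
    return ranked[:top_k]
```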
Model Settings.
As discussed above, the pre-trained MC-BERT was employed to encode the queries, profiles, and dialogues; its parameters were first fine-tuned on the self-learning task, followed by a second fine-tuning step to tackle the doctor recommendation task together with the other neural modules. The maximum input length of BERT is 512, and the dimension of all text embeddings from the output of MC-BERT is 768. The hyper-parameters are tuned on validation results, with the following settings. The head number of multi-head attention is set to 6, and the tradeoff parameter $\lambda$ (Eq. 6) is set to weigh more on positive samples. The MLP at the output side contains one hidden layer of size 256. For training, we employ the Adam optimizer with an initial learning rate of 0.008 and a batch size of 256. The entire training procedure runs for 50 epochs, with an early stopping strategy adopted and the parameter set resulting in the lowest validation loss used for testing.
Baselines and Comparisons.
We first consider weak baselines that rank doctors (1) randomly (henceforth Random), (2) by the frequency of queries they handled, measured on the training dialogues (henceforth Frequency), (3) by referring to the doctors who responded to the $K$ nearest patient queries in the semantic space, with $K$ set to 20 in practice (henceforth KNN), (4) by the cosine similarity of profile and query embeddings yielded by the pre-trained MC-BERT (henceforth Cos-Sim (P+Q)), and its counterpart matching dialogues and queries (henceforth Cos-Sim (D+Q)). Then, a popular non-neural learning-to-rank baseline, GBDT (Friedman, 2001) with TF-IDF features, is adopted (henceforth GBDT).
For neural baselines, we compare with the MLP that simply matches query embeddings with profile embeddings (henceforth MLP (P+Q)), with dialogue embeddings (henceforth MLP (D+Q)), and with the average embeddings of profile and dialogues (henceforth MLP (P+D+Q)). (We also tested an alternative that concatenates profile and dialogue embeddings, yet it results in very poor performance. A possible reason is the diverse styles of profile and dialogue languages, which is consistent with the observations from Table 2, where concatenation operations tend to result in compromised performance. We discuss this further in Section 5.1.) We also consider Deep Structured Semantic Models (DSSM; Huang et al., 2013), a popular latent semantic model for semantic matching. In this work, the original bag-of-words encoding module in DSSM is replaced with BERT. The query embeddings are matched with profile embeddings (henceforth DSSM (BERT with P)) or the average embeddings of dialogues (henceforth DSSM (BERT with D)).
To further examine the effects of our attention design for doctor modeling in recommendation, we attend over a doctor's history dialogues, aware of their profile, with two popular alternatives – dot and concat attention (Luong et al., 2015) (henceforth Dot-Att and Cat-Att, respectively). Both go through fine-tuning with the self-learning task before the training for recommendation to gain an initial view of how profiles and dialogues are related to each other. For comparison, we also experiment on our ablation based on multi-head attention without this self-learning step (henceforth Mul-Att (w/o SL)).
Lastly, we examine two other ablations: one encoding profiles only with multi-head self-attention (henceforth Mul-Att (w/o D)) and its counterpart fed with dialogues only (henceforth Mul-Att (w/o P)). The full model is henceforth named Mul-Att (full).
For all models, we initialize with three random seeds and average the results over three runs for the experimental report below.
Evaluation Metrics.
Following common practice (Zeng et al., 2020b; Zhang et al., 2021), the doctor recommendation results are evaluated with popular information retrieval metrics: precision@$N$ (P@$N$), mean average precision (MAP), and ERR@$N$. In the experimental report, $N$ is set to 1 for P@$N$ and 5 for ERR@$N$, whereas similar trends hold for other possible values.
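For reference, P@1 and MAP can be computed as below under the assumption that each test query has exactly one ground-truth doctor (the one who actually responded); ERR follows its standard graded-relevance definition and is omitted from this sketch.

```python
def precision_at_k(ranked_doctors, true_doctor, k: int = 1) -> float:
    """1.0 if the responding doctor appears in the top-k ranking, else 0.0."""
    return float(true_doctor in ranked_doctors[:k])

def mean_average_precision(rankings, true_doctors) -> float:
    """With a single relevant doctor per query, AP reduces to 1 / rank of that doctor."""
    average_precisions = []
    for ranked, truth in zip(rankings, true_doctors):
        rank = ranked.index(truth) + 1 if truth in ranked else None
        average_precisions.append(1.0 / rank if rank else 0.0)
    return sum(average_precisions) / len(average_precisions)

# Toy example: two test queries whose candidate doctors are already ranked by the model.
rankings = [["doc_7", "doc_2", "doc_5"], ["doc_3", "doc_9", "doc_1"]]
truths = ["doc_2", "doc_3"]
p_at_1 = sum(precision_at_k(r, t) for r, t in zip(rankings, truths)) / len(truths)  # 0.5
map_score = mean_average_precision(rankings, truths)                                # 0.75
```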
5 Experimental Results
In this section, we first present the main comparison results in Section 5.1. Then, we quantify the model sensitivity to queries, profiles, and dialogues of varying lengths in Section 5.2. Finally, Section 5.3 analyzes the effect of the head number on validation performance, followed by a case study to interpret our superiority and an error analysis to provide insights for future work.
5.1 Main Comparison Results
Table 2 reports the comparison results across different models. We draw the following observations.
First, matching doctor expertise with patient needs may require deep semantics, making it infeasible to rely on heuristic rules (e.g., frequency or similarity) or shallow features (e.g., TF-IDF) to tackle the task well. Second, compared to profiles, dialogues may better indicate how likely a doctor can help a patient, probably because of their richer content and closer language style to a query (as analyzed in Section 2). Third, although profiles and dialogues may potentially collaborate to better characterize a doctor (than either alone), effective methods should be employed to couple their effects as their writing styles vary.
For models with multi-head attention, all of them yield better results than the other attention counterparts. This may imply that doctor expertise is multi-faceted and that multi-head attention works well to capture such a feature. We also notice that self multi-head attention over the profile alone performs much worse than the other ablations, probably because profile content is very dense and may challenge multi-head attention in distinguishing the various aspects therein.
In comparison to Mul-Att (w/o SL), the results of Mul-Att (w/o P) (modeling doctors with dialogues only) and of our full model are almost twice as good. This again demonstrates the challenges presented by the diverse wording patterns of profiles and dialogues, and that the self-learning step to fine-tune the pre-trained BERT largely helps in aligning them into the same semantic space.
| Models | P@1 | MAP | ERR@5 |
| --- | --- | --- | --- |
| Simple Baselines | | | |
| Random | 0.010 | 0.052 | 0.001 |
| Frequency | 0.005 | 0.032 | 0.001 |
| KNN | 0.082 | 0.151 | 0.008 |
| Cos-Sim (P+Q) | 0.049 | 0.122 | 0.005 |
| Cos-Sim (D+Q) | 0.056 | 0.136 | 0.006 |
| GBDT | 0.018 | 0.052 | 0.002 |
| Neural Comparisons | | | |
| MLP (P+Q) | 0.164 | 0.331 | 0.018 |
| MLP (D+Q) | 0.174 | 0.341 | 0.019 |
| MLP (P+D+Q) | 0.153 | 0.312 | 0.017 |
| DSSM (BERT with D) | 0.087 | 0.182 | 0.009 |
| DSSM (BERT with P) | 0.151 | 0.231 | 0.012 |
| Dot-Att | 0.219 | 0.380 | 0.021 |
| Cat-Att | 0.167 | 0.332 | 0.018 |
| Our Ablations | | | |
| Mul-Att (w/o SL) | 0.309 | 0.319 | 0.019 |
| Mul-Att (w/o D) | 0.198 | 0.217 | 0.013 |
| Mul-Att (w/o P) | 0.521 | 0.526 | 0.033 |
| Mul-Att (full) | 0.616 | 0.620 | 0.039 |
5.2 Quantitative Analyses
In Section 5.1, we have shown that our model achieves better performance compared to various baselines. In this section, we further quantify its performance over varying lengths of queries, dialogues, and profiles, and compare the full model's results with its two ablations Mul-Att (w/o P) and Mul-Att (w/o SL) – the first and second runners-up in Table 2. Afterwards, we compare model performance across different medical departments to examine the scenario where patients already know which department they should go to.
Sensitivity to Query Length.
Figure 4 shows the P@1 over varying lengths of patient queries. All models perform better for longer queries, owing to the additional content available to infer patient needs. Besides, our full model consistently outperforms its two ablations, while showing a relatively smaller performance gain for longer queries compared to Mul-Att (w/o P). A possible reason is that long queries simplify the matching with doctors, and dialogue content alone may be sufficient to handle recommendation, diminishing the profile effects.

Sensitivity to Dialogue Length.
We then study the model sensitivity to the length of dialogues for doctor modeling and show the results in Figure 5. Dialogue length exhibits effects similar to query length, possibly because they contribute homogeneous features for understanding a doctor-patient match. After all, other patients' queries are part of the dialogues and are involved in learning doctor expertise.

Sensitivity to Profile Length.
Furthermore, we quantify the profile length and display the models' P@1 in Figure 6. Profile length exhibits different effects compared to the query and dialogue lengths discussed above: models suffer a performance drop for very long profiles because the potential noise therein hinders the collaboration of profiles and dialogues. Nevertheless, the self-learning step enables the profiling language to blend into the colloquial embedding space of dialogues and queries, hence presenting more robust results.


Comparisons of Model Performance over Varying Departments.
In realistic practice, patients might already know which department they should turn to before seeking help from doctors. To better study doctor recommendation in this scenario, we examine the model performance within different medical departments in our data. We select the 4 models with the highest P@1 scores in the main experiment (Table 2) for comparison: Mul-Att (w/o SL), Mul-Att (w/o D), Mul-Att (w/o P), and Mul-Att (full). Their setups are described in Section 4.
Experimental results are shown in Figure 7. We observe that, out of all 14 departments, our model has the best performance in 13 and achieves results comparable with the best model for the remaining department (Otolaryngology). We also find all models exhibit varying performance when handling queries from different departments, which is related to the departments' characteristics. For example, all models obtain low scores for Internal Medicine because of its significant overlap with other departments and the challenge of understanding the needs in its queries. Another factor is the imbalanced training data scale across departments. For instance, the training samples for Oncology, Surgery, and Otolaryngology are much fewer than the average, resulting in worse model performance on them.
5.3 Further Discussions
Analysis of Head Number.
In Table 2, multi-head attention shows superiority in modeling doctors. We are hence interested in the effect of the head number and vary it on the validation set, with the results shown in Table 3. Model performance first increases and then decreases, with 6 heads achieving the best results. It indicates that the head number reasonably affects model performance because it controls the granularity of the aspects a model should capture to learn doctor expertise.
| Head Number | P@1 | MAP | ERR@5 |
| --- | --- | --- | --- |
| 2 | 0.601 | 0.605 | 0.038 |
| 4 | 0.609 | 0.613 | 0.038 |
| 6 (Ours) | 0.616 | 0.620 | 0.039 |
| 8 | 0.564 | 0.568 | 0.035 |
Case Study.
To interpret what is learned by multi-head attention, we take the example in Figure 3 and analyze the attention map produced by the 6 heads, where 4 of them attend to the same dialogue and the other 2 each highlight a different one; these dialogues each reflect a different aspect of doctor expertise. To further probe into the attended content, we rank the words by the sum of attention weights assigned to the dialogues they occur in and show the top 5 medical terms in Table 4. It is observed that the heads vary in their focus, while all relate to the queried symptoms of "insomnia" and "muscle ache" and further contribute to a correct recommendation of a neurological expert. This again demonstrates the intricacy of doctor expertise and the capability of multi-head attention to reflect such essence. More cases are shown in Appendix A to offer more insight into how our model recommends doctors.
| Head | Top 5 Keywords |
| --- | --- |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
Error Analysis.
We observe two major error types of our model, one resulting from doctor modeling and the other from the query.
For doctor modeling, we observe many errors come from the varying quality of profiles. As shown in Figure 6, not all content from profiles is helpful. For example, some doctors tend to profile themselves generally in terms of experience (e.g., how many years they have worked) instead of specific expertise (what they are good at). Future work should consider how to further distinguish profile quality when learning doctor expertise.
In the real world, some doctors have comprehensive skills while others are more specialized. This causes models to tend to recommend a "Jack of all trades" rather than a more relevant doctor, as the former usually engaged in more dialogues and it is safer to choose them. For example, for a query concerning "continuous eye blinking", the model recommends a doctor with 100 "eyes"-related dialogues instead of one specialized in "Hordeolum" and "Conjunctivitis" yet involved in only 30 dialogues. To mitigate such bias, it would be interesting to employ data augmentation (Zhang et al., 2020b) to "enrich" the history of doctors handling relatively fewer queries.
In terms of queries, many patients are observed to describe their symptoms with minutiae rather than focusing on the key points. The model, lacking professional knowledge, may consequently be trapped by these unimportant details. For instance, a patient queried about a "pimple" on the "eyelid"; the model wrongly attends to "eyelid" and thus recommends an ophthalmologist rather than a dermatologist to solve the "pimple" problem. A future direction to tackle this issue is to exploit knowledge from medical domains (Liu et al., 2020a) to allow a better understanding of patient needs.
6 Related Work
Our work is in the research line of recommender systems, which are widely studied because of their practical value in industry (Huang et al., 2021). For example, previous work explores users' chatting history to recommend conversations (Zeng et al., 2018, 2020b) and hashtags (Li et al., 2016; Zhang et al., 2021), browsing history to recommend news (Wu et al., 2019; Qi et al., 2021), and purchase history to recommend products (Guo et al., 2020). In contrast to most recommendation studies, which focus on modeling target users' personal interests from their history behavior, our work largely relies on the wording of a short query to figure out what is needed by a target user (patient), because patients are anonymized for privacy concerns.
Within the several branches of recommendation research, our task is conceptually similar to expert recommendation for question answering (Wang et al., 2018; Nikzad–Khasmakhi et al., 2019). In this field, many previous studies encode expertise knowledge in diverse domains, such as software engineering (Bhat et al., 2018) and social activities (Bok et al., 2021). Nevertheless, few of them attempt to model expertise with NLP methods. On the contrary, language representations play an important role in tackling our task: we substantially explore how semantic features help characterize doctor expertise, which has not been studied before.
Our work is also related to previous language understanding research over doctor-patient dialogues on online health forums (Zeng et al., 2020a), where various compelling applications are explored, such as information extraction (Ramponi et al., 2020; Du et al., 2019; Zhang et al., 2020c), question answering (Pampari et al., 2018; Xu et al., 2019), and medical report generation (Enarvi et al., 2020). In comparison with them, we concern ourselves with doctor expertise and characterize it from both doctor profiles and past patient-doctor dialogues, filling a gap left by previous work.
7 Conclusion
This paper has studied doctor recommendation in online health forums. We have explored the effects of doctor profiles and history dialogues in learning doctor expertise through a self-learning task and a multi-head attention mechanism. Extensive experiments on a large-scale Chinese dataset demonstrate the effectiveness of our method.
Ethical Considerations
It should be mentioned that all data, including doctors' profiles, patients' queries, and doctor-patient dialogues, are collected from the openly accessible online health forum Chunyu Yisheng, whose owners make such information visible to the public (while anonymizing patients). Our dataset is collected by a crawler within the constraints of the forum. Apart from the personal information de-identified by the forum officially, to prevent privacy leaks we manually reviewed the collected data and deleted sensitive messages. Additionally, we replaced each doctor's name with a randomly generated unique code to distinguish them while protecting their privacy. We ensure there is no identifiable or offensive information in the released dataset.
The dataset, approach, and model proposed in this paper are for research purposes only and are intended to facilitate studies of using NLP methods for doctor expertise learning and recommendation to allow a better user experience on online health forums. We also anticipate they could advance other NLP research, such as question answering (QA) in the biomedical domain.
Acknowledgements
This paper is substantially supported by NSFC Young Scientists Fund (62006203), a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. PolyU/25200821), PolyU internal funds (1-BE2W, 4-ZZKM, 1-ZVRH, and 1-TA27), CCF-Tencent Open Fund (R-ZDCJ), and CCF-Baidu Open Fund (No. 2021PP15002000). The authors would like to thank Yuji Zhang and the anonymous reviewers from ACL 2022 for their insightful suggestions on various aspects of this work.
References
- Bhat et al. (2018) Manoj Bhat, Klym Shumaiev, Kevin Koch, Uwe Hohenstein, Andreas Biesdorf, and Florian Matthes. 2018. An expert recommendation system for design decision making: Who should be involved in making a design decision? In 2018 IEEE International Conference on Software Architecture (ICSA), pages 85–8509. IEEE.
- Bok et al. (2021) Kyoungsoo Bok, Heesub Song, Dojin Choi, Jongtae Lim, Deukbae Park, and Jaesoo Yoo. 2021. Expert recommendation for answering questions on social media. Applied Sciences, 11(16).
- Cao et al. (2017) Xianye Cao, Yongmei Liu, Zhangxiang Zhu, Junhua Hu, and Xiaohong Chen. 2017. Online selection of a physician by patients: Empirical study from elaboration likelihood perspective. Computers in Human Behavior, 73:403–412.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Du et al. (2019) Nan Du, Kai Chen, Anjuli Kannan, Linh Tran, Yuhui Chen, and Izhak Shafran. 2019. Extracting symptoms and their status from clinical conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 915–925, Florence, Italy. Association for Computational Linguistics.
- Enarvi et al. (2020) Seppo Enarvi, Marilisa Amoia, Miguel Del-Agua Teba, Brian Delaney, Frank Diehl, Stefan Hahn, Kristina Harris, Liam McGrath, Yue Pan, Joel Pinto, Luca Rubini, Miguel Ruiz, Gagandeep Singh, Fabian Stemmer, Weiyi Sun, Paul Vozila, Thomas Lin, and Ranjani Ramamurthy. 2020. Generating medical reports from patient-doctor conversations using sequence-to-sequence models. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 22–30, Online. Association for Computational Linguistics.
- Friedman (2001) Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232.
- Gong et al. (2020) Kai Gong, Zhong Xu, Zhefeng Cai, Yuxiu Chen, and Zhanxiang Wang. 2020. Internet hospitals help prevent and control the epidemic of COVID-19 in China: Multicenter user profiling study. Journal of Medical Internet Research, 22(4):e18908.
- Guo et al. (2020) Mingming Guo, Nian Yan, Xiquan Cui, San He Wu, Unaiza Ahsan, Rebecca West, and Khalifeh Al Jadda. 2020. Deep learning-based online alternative product recommendations at scale. In Proceedings of The 3rd Workshop on e-Commerce and NLP, pages 19–23, Seattle, WA, USA. Association for Computational Linguistics.
- Huang et al. (2021) Chao Huang, Jiahui Chen, Lianghao Xia, Yong Xu, Peng Dai, Yanqing Chen, Liefeng Bo, Jiashu Zhao, and Jimmy Xiangji Huang. 2021. Graph-enhanced multi-task learning of multi-level transition dynamics for session-based recommendation. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5):4123–4130.
- Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 2333–2338.
- Li et al. (2016) Yang Li, Ting Liu, Jing Jiang, and Liang Zhang. 2016. Hashtag recommendation with topical attention-based LSTM. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3019–3029, Osaka, Japan. The COLING 2016 Organizing Committee.
- Liu et al. (2020a) Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020a. K-bert: Enabling language representation with knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 2901–2908.
- Liu et al. (2020b) Wenge Liu, Jianheng Tang, Jinghui Qin, Lin Xu, Zhen Li, and Xiaodan Liang. 2020b. MedDG: A large-scale medical consultation dataset for building medical dialogue system. arXiv preprint arXiv:2010.07497.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
- Nikzad–Khasmakhi et al. (2019) N. Nikzad–Khasmakhi, M.A. Balafar, and M. Reza Feizi–Derakhshi. 2019. The state-of-the-art in expert recommendation systems. Engineering Applications of Artificial Intelligence, 82:126–147.
- Pampari et al. (2018) Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, Brussels, Belgium. Association for Computational Linguistics.
- Qi et al. (2021) Tao Qi, Fangzhao Wu, Chuhan Wu, and Yongfeng Huang. 2021. PP-rec: News recommendation with personalized user interest and time-aware news popularity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5457–5467, Online. Association for Computational Linguistics.
- Ramponi et al. (2020) Alan Ramponi, Rob van der Goot, Rosario Lombardo, and Barbara Plank. 2020. Biomedical event extraction as sequence labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5357–5367, Online. Association for Computational Linguistics.
- Wang et al. (2018) Xianzhi Wang, Chaoran Huang, Lina Yao, Boualem Benatallah, and Manqing Dong. 2018. A survey on expert recommendation in community question answering. Journal of Computer Science and Technology, 33(4):625–653.
- Wu et al. (2019) Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang, and Xing Xie. 2019. Neural news recommendation with multi-head self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6389–6394, Hong Kong, China. Association for Computational Linguistics.
- Wu et al. (2020) Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2020. SentiRec: Sentiment diversity-aware neural news recommendation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 44–53, Suzhou, China. Association for Computational Linguistics.
- Xu et al. (2019) Yichong Xu, Xiaodong Liu, Chunyuan Li, Hoifung Poon, and Jianfeng Gao. 2019. DoubleTransfer at MEDIQA 2019: Multi-source transfer learning for natural language understanding in the medical domain. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 399–405, Florence, Italy. Association for Computational Linguistics.
- Zeng et al. (2020a) Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, Hongchao Fang, Penghui Zhu, Shu Chen, and Pengtao Xie. 2020a. MedDialog: Large-scale medical dialogue datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9241–9250, Online. Association for Computational Linguistics.
- Zeng et al. (2018) Xingshan Zeng, Jing Li, Lu Wang, Nicholas Beauchamp, Sarah Shugars, and Kam-Fai Wong. 2018. Microblog conversation recommendation via joint modeling of topics and discourse. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 375–385, New Orleans, Louisiana. Association for Computational Linguistics.
- Zeng et al. (2020b) Xingshan Zeng, Jing Li, Lu Wang, Zhiming Mao, and Kam-Fai Wong. 2020b. Dynamic online conversation recommendation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3331–3341, Online. Association for Computational Linguistics.
- Zhang et al. (2020a) Ningyu Zhang, Qianghuai Jia, Kangping Yin, Liang Dong, Feng Gao, and Nengwei Hua. 2020a. Conceptualized representation learning for Chinese biomedical text mining. arXiv preprint arXiv:2008.10813.
- Zhang et al. (2020b) Yi Zhang, Tao Ge, and Xu Sun. 2020b. Parallel data augmentation for formality style transfer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3221–3228, Online. Association for Computational Linguistics.
- Zhang et al. (2020c) Yuanzhe Zhang, Zhongtao Jiang, Tao Zhang, Shiwan Liu, Jiarun Cao, Kang Liu, Shengping Liu, and Jun Zhao. 2020c. MIE: A medical information extractor towards medical dialogues. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6460–6469, Online. Association for Computational Linguistics.
- Zhang et al. (2021) Yuji Zhang, Yubo Zhang, Chunpu Xu, Jing Li, Ziyan Jiang, and Baolin Peng. 2021. #HowYouTagTweets: Learning user hashtagging preferences via personalized topic attention. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7811–7820, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Appendix A More Case Study Results
To provide more insight into why our model exhibits superior performance, we further discuss two more cases to understand how the multi-head attention mechanism makes use of information from both the doctors' profiles and their history dialogues, in addition to the example case shown in Figure 3 and Table 4. Because dialogues are mostly lengthy (as shown in Table 1), we only show the dialogue snippets in English translation for better display (while the model is fed with the entire dialogues in the experiments).
We present in Table 5 a case sampled from the Department of Gynecology. As can be seen, the profile of the doctor is short, while the attended dialogues provide detailed information about the symptoms, treatments, and medicine. The top 5 keywords identified by the sum of attention weights for each head are shown in Table 5(b). While several heads attend to one or two specific tokens, for example the token "menstruation", we observe that each head has its own focus: it is reasonable to infer that one head concerns messages related to preparation for pregnancy, another irregular periods, and another the prognosis of abortion.
| Head | Top 5 Keywords |
| --- | --- |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
Table 6 shows another example sampled from the Department of Dermatology. In this case, the doctor's profile is more detailed yet generic. The top 5 keywords for each attention head are shown in Table 6(b). Similar to the observation from Table 5, the token "pruritus" occurs among the attended keywords of 5 heads, as it is one of the most common symptoms, whereas each head focuses on different aspects related to the query.
| Head | Top 5 Keywords |
| --- | --- |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |