
Modeling Tag Prediction based on Question Tagging Behavior Analysis of CommunityQA Platform Users

Kuntal Kumar Pal (Arizona State University, Tempe, Arizona, United States) and Michael Gamon, Nirupama Chandrasekaran, Silviu Cucerzan (Microsoft Research, Redmond, Washington, United States)
(2018)
Abstract.

In community question-answering platforms, tags play essential roles in effective information organization and retrieval, better question routing, faster response to questions, and assessment of topic popularity. Hence, automatic assistance for predicting and suggesting tags for posts is of high utility to users of such platforms. To develop better tag prediction across diverse communities and domains, we performed a thorough analysis of users’ tagging behavior in 17 StackExchange communities. We found various common inherent properties of this behavior on those diverse domains. We used the findings to develop a flexible neural tag prediction architecture, which predicts both popular tags and more granular tags for each question. Our extensive experiments and obtained performance show the effectiveness of our model.

Text mining, question tagging, community question answering, tag prediction, transformers, stack exchange, tagging behavior modeling
CCS Concepts: Information systems → Recommender systems; Information systems → Question answering; Information systems → Document topic models; Computing methodologies → Natural language generation; Computing methodologies → Information extraction

1. Introduction

Community Question Answering (CQA) platforms have become a very important online source of information for Web users. On these platforms, information seeking takes the form of questions and answers in communities formed around common domains of interest. StackExchange, Quora, AnswerBag, Question2Answer, Reddit, and Biostars (Parnell et al., 2011) are some of the most popular public CQA platforms (see stackexchange.com, quora.com, answerbag.com, question2answer.org, and reddit.com). Many enterprise entities offer similar private platforms for their employees. Over time, these communities have amassed large online information repositories, with high numbers of daily active users. Thus, there is a need to organize and retrieve information efficiently, as well as to facilitate question routing to interested and qualified experts in order to provide a seamless user experience and interaction. Semantic tagging of questions plays an important role in this context.

Most CQA platforms require users to assign tags to their questions. Tags are keywords representative of the topics covered by those questions. They help communities to (1) categorize and organize information; (2) retrieve existing answers for users looking for information, which in turn reduces duplicate question creation; (3) route questions to topic experts, which improves query response time and answer quality; (4) provide tag-based notifications, which allow knowledgeable community members to answer questions in their areas of expertise and gain reputation; and (5) assess the popularity of various areas and topics in the targeted domain.

Asking users to annotate their questions with tags without providing adequate support poses several challenges, in particular with respect to novice users and to the lack of knowledge about tag usage in a community, which may lead to the creation of various tags with the same meaning, as well as different orthographic forms of those tags. This makes question routing difficult (for tag-based subscription platforms), delays response time, and leads to poor information organization. In turn, addressing these issues would require community administrators to constantly work on identifying and merging near-duplicate tags. Additionally, lack of support in suggesting adequate tags may inhibit novice users from asking questions and/or lead to questions being mistagged and not answered. These challenges may become more severe in enterprise CQA platforms due to community size and topic sparsity.

Against this background, tag prediction becomes an extremely important yet challenging task for both public and private CQA platforms. In this investigation, our first goal was to understand the commonalities in the tagging behavior of users through a large-scale analysis of 17 diverse domains in StackExchange (Section 3). Our analysis revealed that while these domains are quite diverse in terms of the volume of questions, users, and tags, they share common distributional properties for tag and tag-pair usage. There is also a large lexical overlap between the tags and user texts in every domain, and the post coverage of tags is high in all domains. Finally, tags show positional stability, and tag pairs show particular ordering preferences that form a soft hierarchy among tags.

Figure 1. Community Diversity in terms of Volume

We incorporate these findings to develop a neural model with two tag-prediction heads: one trained to predict existing popular tags, such as the names of important topics in a domain (e.g. "harry-potter", "doctor-who", and "star-wars" in the scifi domain) and frequently used meta-tags (e.g. "video-games", "books", and "short-stories" in scifi), and another trained to generate finer-grained tags, which may have been used rarely on previous questions or are new. Typically, the former category of tags represents the main topic area of a question, while the latter helps in further scoping down and clarifying it. Both types of tags are equally important in identifying the question, so a tag prediction system needs to predict not only the main generic tags but also the refined ones.

Our experiments show that the proposed approach significantly outperforms baseline methods in prediction of both generic tags and finer-grained tags. We also investigate and show the effect of reducing the pre-defined vocabulary size, as well as the contributions of each prediction head. Our main contributions in this work are:

  • We present an in-depth analysis of the tagging behaviors of the users of a CQA platform (StackExchange) on 17 diverse domains. We present our findings of question tag analysis across four dimensions: tag space, tag co-occurrence, tag pair ordering, and tag positional stability.

  • We propose a tag prediction architecture for both predicting popular tags from a pre-defined vocabulary and generating refined tags not present in the vocabulary.

  • We perform comprehensive experiments on the 17 domains and show effects of each model component under various experimental settings.

2. Dataset Preparation

We collected data from 17 StackExchange communities that correspond to a diverse set of domains. We use the StackExchange data dump of 2021-03-01 (https://archive.org/details/stackexchange_20210301) for our analysis and model. We find that the Posts.xml file is sufficient for our tag analysis and predictions. We only consider posts that are either questions or answers (PostTypeId), and we reject posts with no owner (OwnerUserId, OwnerDisplayName). As imposed by StackExchange, each post has at least one and at most five tags, and all posts in this dataset are dated prior to March 2021. We chose several domains from each of the following StackExchange categories (https://stackexchange.com/sites#): Technology, Culture & recreation, Life & arts, Science, and Professional. Each selected domain has at least a decade of posts. We do not include the stackoverflow domain because of its enormous volume and because a random sample might not be representative of its full data. Instead, we consider askubuntu, which is also a representative community of the Technology category.

3. Tagging Behavior Analysis

To understand the user behavior of question tagging and to identify the inherent commonalities, we analyze ten years of data from these 17 domains.

Mathematical Notation: Without loss of generality, let $D$ denote one of the domains (out of 17) being investigated, $P$ the set of posts in the data for this domain, and $T=\{t_{1},t_{2},\dots,t_{|T|}\}$ the set of all tags used in domain $D$. Each post $p_{j}\in P$ has an associated sequence of tags $S(p_{j})=\big(t_{(1)},t_{(2)},\dots,t_{(l)}\big)$, $1\leq l\leq 5$, where $t_{(i)}$ denotes the tag at position $i$ in that sequence. We employ parentheses to distinguish between the positional information of a tag in a sequence and the indexes that identify elements $t_{i}$ of the tag set $T$ observed for domain $D$.

Figure 2. Tag Distribution Patterns. Y-axis: number of posts the tag appears in; X-axis: top-100 tags

3.1. Community Diversity

We observed a high degree of variability across the selected domains in terms of Question Volume, Tag Space, and Asker Volume. Figure 1 shows a visual comparison of this variability, while Table 1 shows general statistics for each domain. In terms of the amount of information created over a decade, only four domains have over 100K posted questions, while the politics and history domains have merely 12K. In terms of the number of unique tags (#T) created, the movies domain ranks highest, as new movie titles are added to the tag set on a weekly basis. To quantify tag reuse in each domain, we define posts-per-tag (PPT) as the average number of posts available for one tag. We observe that physics, askubuntu, and chemistry are the domains with the most tag reuse (PPT of roughly 100 or more), while the movies domain (PPT < 5) shows frequent new tags. The number of posts with more than 100 views (V>100) can be used to infer the popularity of posts in each domain. From the average number of tags per post (AvgT), we can infer the need for detailed tagging in each domain. In travel, physics, and money, AvgT > 3 indicates that users feel the need to assign more than three tags to clarify their questions. The movies domain has the lowest AvgT (2.09), showing that about two tags are sufficient on average. Some domains, such as aviation, philosophy, history, movies, and politics, are less popular (#A < 10K in a decade). More statistics are in Appendix Table 10.

Table 1. Community Diversity. #Q: Questions, #A: Askers, V: Views, PPT: Posts/Tag, QPA: #Q/#A, AvgT: Average #T per Q, #T: Unique Tags
Domain #Q #T PPT AvgT V>100 #A QPA
askubuntu 371800 3121 119.13 2.78 1093 201912 1.84
aviation 20345 1002 20.30 2.56 12 7066 2.88
biology 25671 739 34.74 2.58 11 12089 2.12
chemistry 37476 375 99.94 2.37 7 17202 2.18
cooking 24513 833 29.43 2.30 13 12413 1.97
electronics 152980 2226 68.72 2.77 36 61869 2.47
history 12562 813 15.45 2.84 19 5296 2.37
money 32648 995 32.81 3.11 37 18010 1.81
movies 20749 4348 4.77 2.09 30 6931 2.99
music 20925 512 40.87 2.52 5 10447 2.00
philosophy 15624 559 27.95 2.40 6 6640 2.35
physics 180166 893 201.75 3.17 131 59774 3.01
politics 12416 739 16.80 2.90 27 3970 3.13
rpg 42693 1195 35.73 2.91 56 11541 3.70
scifi 62987 3433 18.35 2.25 153 22717 2.77
serverfault 299895 3814 78.63 2.90 327 130214 2.30
travel 42201 1891 22.32 3.28 42 24895 1.70
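
The derived columns of Table 1 can be reproduced from the raw counts. The following is a minimal sketch, assuming PPT is computed as #Q/#T and QPA as #Q/#A (as the caption suggests); the function name and the returned dictionary layout are illustrative and not taken from the paper.

```python
def domain_stats(num_questions, num_tags, num_askers):
    """Derived Table 1 columns from raw counts (assumed definitions)."""
    return {
        "PPT": round(num_questions / num_tags, 2),    # posts per tag
        "QPA": round(num_questions / num_askers, 2),  # questions per asker
    }

# Raw counts for askubuntu, taken from Table 1.
stats = domain_stats(num_questions=371800, num_tags=3121, num_askers=201912)
print(stats)  # {'PPT': 119.13, 'QPA': 1.84}
```

The printed values agree with the askubuntu row of Table 1.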

3.2. Tag-Space Analysis

We analyzed each domain’s tag space along four dimensions: (1) General Tag Statistics, (2) Tag Distributions, (3) Tag-Post Coverage, and (4) Tag-Post Overlap.

General Tag Statistics: The shortest tag in every domain is merely 1-3 characters long (e.g., c, air, 3g), while the longest tag is 22-35 characters long (e.g., valerian-city-of-a-thousand-planets, neurodegenerative-disorders). askubuntu has the lowest average tag length (8.17), while movies has the highest (13.66). We believe this is because tags in askubuntu are short technical terms for subtopics, whereas movie titles tend to be quite long and are often used as part of a tag in the movies domain. Table 2 shows the distribution of tags by the number of words they contain. With the exception of movies, rpg, and scifi, the majority of tags in all the domains consist of three or fewer words. The shortest and longest tags for each domain are presented in Appendix Table 9.

Table 2. Tag % based on the Number of Words in the Tag
Domain 1 2 3 4 5 >5
askubuntu 80.83 18.73 0.37 0.07 0 0
aviation 49.74 43.86 6.34 0.05 0 0
biology 69.30 29.95 0.75 0 0 0
chemistry 47.17 50.36 2.31 0.16 0 0
cooking 78.53 21.11 0.36 0.01 0 0
electronics 74.23 23.9 1.33 0.54 0 0
history 56.86 36.1 7.01 0.03 0 0
money 50.00 45.51 4.05 0.45 0 0
movies 32.81 41.58 16.32 5.61 2.57 1.1
music 77.74 21.07 1.17 0.02 0 0
philosophy 69.42 14.02 16.29 0.27 0.01 0
physics 41.37 49.31 9.02 0.3 0 0
politics 51.26 45.05 3.59 0.08 0.02 0
rpg 42.43 51.39 4.82 1.11 0.16 0.09
scifi 31.04 49.23 13.19 3.23 2.1 1.2
serverfault 67.91 23.09 7.62 1.32 0.06 0
travel 65.78 26.5 6.87 0.85 0 0

Tag Distributions: There is a long tail in the distribution of tags in every domain (Figure 2). We observe that (1) most larger domains with high tag reuse, such as askubuntu, electronics, and biology, have smoother tag distributions, and (2) for some smaller domains, such as scifi, movies, and rpg, the most frequent tag dominates the distribution. The rest of the distributions are shown in Appendix Figure 12. Also, Table 3 shows that the 100 most frequent tags (100Tag%) constitute a very small portion of the tag space for large domains.

Table 3. Top-n Tag’s Post Coverage. #T:#distinct tags, 100Tag%:Frequent 100 tag % among whole tag-space.
Domain #T 100Tag% Top1 Top10 Top100
askubuntu 3121 3.20 5.67 40.21 82.68
aviation 1002 9.98 11.05 45.93 89.43
biology 739 13.53 9.22 55.05 91.76
chemistry 375 26.67 23.05 61.38 95.35
cooking 833 12.00 9.55 38.99 85.19
electronics 2226 4.49 4.94 32.81 81.98
history 813 12.30 10.86 45.91 89.95
money 995 10.05 37.04 68.52 94.18
movies 4348 2.30 36.93 66.84 85.88
music 512 19.53 14.93 58.04 94.54
philosophy 559 17.89 19.39 63.30 93.77
physics 893 11.20 12.70 55.10 91.68
politics 739 13.53 46.00 66.41 94.95
rpg 1195 8.37 42.50 79.75 92.66
scifi 3433 2.91 27.86 70.67 85.04
serverfault 3814 2.62 11.92 42.76 82.86
travel 1891 5.29 22.20 58.34 92.36
Figure 3. Tag-Post Overlap: % posts where at least one tag appears in user texts: Title(T), Body(B), Answers(A). EMS%: single word tag exact match, EMM%: single & multi-word tag exact match.

Post Coverage by Tags: We consider a tag to cover a post if it is present in the tag sequence of the post. Table 3 shows the percentage of total posts that can be covered by the top $n$ most frequent tags in each domain. We observe that the most frequent tag (Top1) covers at most 10% of posts in the electronics, askubuntu, cooking, and biology domains but more than 40% in the politics and rpg domains. More than 81% of all posts in each domain are covered by the 100 most frequent tags.
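
The coverage definition above can be sketched in a few lines. This is a minimal illustration on toy data, not the authors' code; the function name and the example posts are hypothetical.

```python
from collections import Counter

def top_n_coverage(post_tags, n):
    """Percentage of posts containing at least one of the n most frequent tags."""
    counts = Counter(tag for tags in post_tags for tag in tags)
    top = {tag for tag, _ in counts.most_common(n)}
    covered = sum(1 for tags in post_tags if top.intersection(tags))
    return 100.0 * covered / len(post_tags)

# Toy tag sequences (hypothetical data, not from the dumps).
posts = [
    ["dnd-5e", "spells"],
    ["dnd-5e", "magic-items"],
    ["pathfinder-1e"],
    ["dnd-5e"],
]
print(top_n_coverage(posts, 1))  # 75.0
```

Here the single most frequent tag (dnd-5e) appears in three of the four posts, which mirrors how a dominant tag yields a high Top1 value in Table 3.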

Tag-Post Overlap: Figure 3 shows whether the tags appear in user content (question title, question body, answers) using two metrics: (1) single-word tag exact match (EMS) and (2) single- and multi-word tag exact match (EMM). We observe that in 8/17 domains, tags appear in more than 50% of post titles. The movie domain has more multi-worded tags than single worded tags (9.49% compared to 34.51%). Two science domains, biology and chemistry, have the lowest tag overlap (<30%) with the question title (T-EMS). When we include the question body, we observe that in 9/17 domains, question tags appear in more than 70% of posts. Finally, if we include every answer for each question, all the domains (except chemistry and biology) have their tags appear in more than 70% of the posts. The three larger domains (askubuntu, serverfault, and electronics) have more than 90% overlap. The overlap is lowest (56%) for the chemistry and biology domains.
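
A rough sketch of how the two overlap metrics could be computed is given below. The paper does not specify the matching procedure, so several choices here are assumptions: text is lowercased and split on non-alphanumeric characters, hyphens in a tag separate its words, and a multi-word tag matches only when its words occur as a contiguous run in the text. All names and the example posts are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tag_in_text(tag, tokens, allow_multi_word):
    """Exact match of a tag's words as a contiguous run of tokens."""
    words = tokenize(tag)  # 'harry-potter' -> ['harry', 'potter']
    if not words or (len(words) > 1 and not allow_multi_word):
        return False
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def overlap_pct(posts, allow_multi_word):
    """Percentage of posts where at least one tag appears in the text."""
    hits = 0
    for text, tags in posts:
        tokens = tokenize(text)
        if any(tag_in_text(t, tokens, allow_multi_word) for t in tags):
            hits += 1
    return 100.0 * hits / len(posts)

# Toy (text, tags) pairs; hypothetical data.
posts = [
    ("Which spell did Harry Potter use?", ["harry-potter", "spells"]),
    ("Why does my bread collapse?", ["baking", "bread"]),
]
ems = overlap_pct(posts, allow_multi_word=False)  # single-word tags only
emm = overlap_pct(posts, allow_multi_word=True)   # single- and multi-word tags
print(ems, emm)  # 50.0 100.0
```

In this toy case the first post is matched only once multi-word tags are allowed, which is the kind of gap between EMS and EMM that Figure 3 reports.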

Table 4. Tag Pairs Post Coverage : % posts covered by top-k tag pairs. Single: % of posts with single tag
Domain Top-1 Top-3 Top-5 Top-10 Top-50 Top-100 Single
askubuntu 1.57 2.89 5.33 9.43 17.97 23.45 17.70
aviation 2.05 3.49 4.78 6.99 17.00 23.81 19.27
biology 2.85 4.90 7.41 11.39 25.85 33.34 20.67
chemistry 4.33 7.62 9.99 14.56 29.82 36.95 23.89
cooking 1.60 3.45 4.34 5.89 13.51 18.54 25.81
electronics 0.76 2.16 3.20 5.08 13.03 18.62 18.31
history 2.37 4.86 6.09 9.93 20.97 27.58 15.34
money 10.39 17.13 18.52 24.16 39.92 46.49 10.51
movies 2.50 6.28 7.81 10.93 20.29 25.30 21.98
music 2.48 5.32 7.56 13.52 31.20 38.17 20.49
philosophy 1.74 4.97 7.32 11.08 26.54 33.79 27.85
physics 2.32 5.29 7.10 11.07 28.54 37.46 11.39
politics 4.59 11.98 17.60 27.24 43.27 49.24 10.40
rpg 12.48 17.23 22.27 28.13 43.54 52.73 9.96
scifi 5.58 12.09 17.92 26.12 43.57 49.29 25.86
serverfault 1.09 2.84 4.16 6.23 16.07 22.29 13.03
travel 5.17 12.01 14.64 18.04 31.01 38.43 6.45

3.3. Tag Co-Occurrence Analysis

For a post $p_{k}$, we define tag co-occurrence $C_{ij}=\{\{t_{i},t_{j}\}:t_{i},t_{j}\in S(p_{k}),\ t_{i}\neq t_{j}\}$ as a pair of tags $\{t_{i},t_{j}\}$ appearing together in a post irrespective of their positions.

Soft Tag Hierarchy: From the tag co-occurrence analysis in the 17 domains, we find that there exists a soft hierarchy among the tag pairs. One of the tags indicates the main topic or area of the question, and the other tag is often finer-grained, making the question more specific. In the following examples, the second tag is a sub-category of the first: (baking, bread) in cooking, (dnd-5e, spells) in rpg, and (aircraft-design, wing) in aviation. In the science domains, similar examples of topic-subtopic relationships are (organic-chemistry, carbonyl-compounds) in chemistry and (hilbert-space, quantum-mechanics) in physics. The most frequently occurring tag pair for each domain is shown in Table 5; a more comprehensive set of the top-5 most frequent pairs per domain is shown in Appendix Table 11.

Figure 4. Tag Position Stability: $\delta=99\%$ (Left) and $\delta=90\%$ (Right)
Table 5. Most Frequently Co-Occurring Tag-Pairs
Domain Top Pair Post-Count
askubuntu (’boot’, ’grub2’) 5845
aviation (’aerodynamics’, ’aircraft-design’) 417
biology (’entomology’, ’species-identification’) 731
chemistry (’organic-chemistry’, ’reaction-mechanism’) 1621
cooking (’baking’, ’bread’) 393
electronics (’current’, ’voltage’) 1161
history (’nazi-germany’, ’world-war-two’) 298
money (’taxes’, ’united-states’) 3393
movies (’character’, ’plot-explanation’) 518
music (’chords’, ’theory’) 519
philosophy (’logic’, ’philosophy-of-mathematics’) 272
physics (’homework-and-exercises’, ’newtonian-mechanics’) 4182
politics (’donald-trump’, ’united-states’) 570
rpg (’dnd-5e’, ’spells’) 5330
scifi (’short-stories’, ’story-identification’) 3514
serverfault (’linux’, ’ubuntu’) 3261
travel (’uk’, ’visas’) 2181

Tag Pair Post Coverage: We consider a tag pair $\{t_{i},t_{j}\}$ to cover a post if both tags occur in the sequence of tags for that post, in any position. Table 4 shows the tag-pair post coverage across the domains. Between roughly 6% and 28% of posts (column Single) have only a single tag. The 100 most frequent pairs cover 18-53% of posts. Also, the most frequent tag pair alone covers more than 10% of posts in the money and rpg domains, which shows how essential this tag pair is for these two domains.
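
As an illustration of the co-occurrence and pair-coverage definitions, the sketch below counts unordered tag pairs and measures how many posts the top-k pairs cover. It is a minimal example on toy data; the function names and posts are hypothetical and not from the paper.

```python
from collections import Counter
from itertools import combinations

def pair_counts(post_tags):
    """Count unordered tag pairs that co-occur in the same post."""
    counts = Counter()
    for tags in post_tags:
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts

def top_k_pair_coverage(post_tags, k):
    """Percentage of posts containing at least one of the k most frequent pairs."""
    top = {pair for pair, _ in pair_counts(post_tags).most_common(k)}
    covered = 0
    for tags in post_tags:
        pairs = set(combinations(sorted(set(tags)), 2))
        if pairs & top:
            covered += 1
    return 100.0 * covered / len(post_tags)

# Toy tag sequences (hypothetical data).
posts = [
    ["dnd-5e", "spells"],
    ["spells", "dnd-5e", "wizard"],
    ["dnd-5e", "magic-items"],
    ["pathfinder-1e"],
]
print(pair_counts(posts)[("dnd-5e", "spells")])  # 2
print(top_k_pair_coverage(posts, 1))             # 50.0
```

Single-tag posts, such as the last one, can never be covered by any pair, which is why Table 4 reports them separately.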

Tag Pair Distribution: On analyzing the distribution of top-50 frequently occurring tag pairs in each domain, we observe three patterns: (1) Smooth Distribution (2) Spike in Top-1 and (3) Spikes in top few pairs. Larger domains (askubuntu, serverfault, electronics) have smooth distributions. In smaller domains (movies, scifi, travel) few tag pairs dominate the distributions, indicating their popularity. More Details are available in Appendix Section C and Figures 10 and 11.

3.4. Tag Pair Ordering

We analyze the top-10 most frequent tag pairs in each domain to identify users’ ordering preferences for tags. For a post $p_{k}$, $O_{ij}=(t_{(m)},t_{(n)})$ (and $O_{ji}$ for the reverse order) denotes the tag ordering for the tag pair $t_{i}$ and $t_{j}$, where $m$ and $n$ are the positions of $t_{i}$ and $t_{j}$, respectively, in the tag sequence $S(p_{k})$. By analyzing the occurrences of $O_{ij}$ and $O_{ji}$ in each domain, we find that community users tend to assign the more generic tags before the specific ones. For example, aircraft-design appears before wings in all 221 posts where they appear together in aviation, united-states appears before income-tax in 99.95% of the 3393 posts where they appear together in the money domain, and dnd-5e appears before magic-items in all 1367 posts where they appear together in rpg. More examples are in Appendix F.
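
The ordering analysis amounts to counting, for a given pair, how often each tag precedes the other. A minimal sketch follows; the function name and the toy posts are hypothetical.

```python
def ordering_preference(post_tags, tag_i, tag_j):
    """Return (count of tag_i before tag_j, count of tag_j before tag_i)."""
    i_first = j_first = 0
    for tags in post_tags:
        if tag_i in tags and tag_j in tags:
            if tags.index(tag_i) < tags.index(tag_j):
                i_first += 1
            else:
                j_first += 1
    return i_first, j_first

# Toy tag sequences (hypothetical data); order within each list is the tag order.
posts = [
    ["dnd-5e", "magic-items"],
    ["dnd-5e", "wizard", "magic-items"],
    ["magic-items", "dnd-5e"],
    ["dnd-5e"],
]
print(ordering_preference(posts, "dnd-5e", "magic-items"))  # (2, 1)
```

Dividing the first count by the total gives the kind of percentage quoted above for pairs such as united-states and income-tax.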

3.5. Tag Position Stability

Table 6. Sets of five randomly picked stable tags for positions 1,2 and five for positions 3, 4, 5, respectively, across 17 domains.
Domain Position 1, 2 Position 3, 4, 5
askubuntu [’software-installation’,’server’,’community’,’locoteams’,’10.04’] [’multiple-workstations’,’equalizer’,’speakers’,’workflow’,’flicker’]
aviation [’air-traffic-control’,’radio-communications’,’airspace’,’flight-planning’,’faa-regulations’] [’rotary-wing’,’rvsm’,’sfo’,’dash-8’,’special-vfr’]
biology [’biochemistry’,’immunology’,’cell-biology’,’dna’,’molecular-biology’] [’ribosome’,’binding-sites’,’exons’,’dendritic-spines’,’rna-interference’]
chemistry [’crystal-structure’,’equilibrium’,’organic-chemistry’,’thermodynamics’,’inorganic-chemistry’] [’nitro-compounds’,’bent-bond’,’phenols’,’organosulfur-compounds’,’reaction-coordinate’]
cooking [’baking’,’oven’,’eggs’,’substitutions’,’sauce’] [’oregano’,’condensed-milk’,’chopping’,’blind-baking’,’scottish-cuisine’]
electronics [’arduino’,’motor’,’soldering’,’ethernet’,’avr’] [’basic-stamp’,’debugwire’,’sinking’,’nxp’,’fuse-bits’]
history [’20th-century’,’world-war-one’,’language’,’china’,’political-history’] [’proof’,’dday’,’crusaders’,’templars’,’republic-of-ireland’]
money [’investing’,’united-states’,’canada’,’taxes’,’credit-card’] [’pension-plan’,’contractor’,’contribution’,’limits’,’debt-reduction’]
movies [’wedding-crashers’,’analysis’,’star-wars’,’comedy’,’the-pink-panther’] [’manichitrathazhu’,’chandramukhi’,’bhool-bhulaiyaa’,’clint-eastwood’,’for-a-few-dollars-more’]
music [’learning’,’voice’,’theory’,’tuning’,’scales’] [’stick-control’,’archeterie’,’instrumentation’,’rsi’,’rock-n-roll’]
philosophy [’epistemology’,’philosophy-of-mathematics’,’ethics’,’existentialism’,’logic’] [’dreams’,’plantinga’,’rationalism’,’rule-ethics’,’arithmetic’]
physics [’quantum-mechanics’,’particle-physics’,’string-theory’,’acoustics’,’experimental-physics’] [’action’,’faq’,’stability’,’wavefunction-collapse’,’coriolis-effect’]
politics [’election’,’political-theory’,’democracy’,’united-kingdom’,’israel’] [’first-past-the-post’,’checks-and-balances’,’redistricting’,’faithless-elector’,’puerto-rico’]
rpg [’pathfinder-1e’,’dnd-3.5e’,’game-recommendation’,’dungeons-and-dragons’,’dogs-in-the-vineyard’] [’feywild’,’group-scaling’,’round-robin-gming’,’romance’,’charmed’]
scifi [’novel’,’vorkosigan-saga’,’total-recall-2070’,’star-trek’,’the-road’] [’star-trek-data’,’3001-the-final-odyssey’,’rama-revealed’,’star-trek-emh’,’skylark-series’]
serverfault [’sql-server’,’backup’,’sql-server-2008’,’raid’,’windows’] [’tempdb’,’fakeraid’,’tuning’,’su’,’debian-etch’]
travel [’loyalty-programs’,’transportation’,’public-transport’,’sightseeing’,’safety’] [’amazon-river’,’amazon-jungle’,’singapore-airlines’,’sin’,’trans-siberian’]
Figure 5. Tag Stability (80%)

We study the positional stability of tags, i.e., whether some tags frequently appear in a particular position among the five allowed by StackExchange. We define $\phi_{x}(t)$ as the percentage of occurrences of a tag $t$ at position $x$, given by

(1) $\phi_{x}(t)=\frac{c(t_{(x)})}{\sum_{k=1}^{5}c(t_{(k)})}\,\%$

where $c(t_{(x)})$ denotes the count of tag $t$ in position $x$. We consider three stability thresholds ($\delta$): 80%, 90%, and 99% (Figures 4 and 5). For a tag $t$ and position $x$, $\phi_{x}(t)>\delta$ indicates that the tag is stable at that position.

(2) $Q_{X}=\{t\in T:\sum_{x\in X}\phi_{x}(t)\geq\delta\}$
(3) $ST_{X}=\frac{|Q_{X}|}{|T|}\,\%$

where $Q_{X}$ is the set of tags whose combined share of occurrences over the set of positions $X$ reaches the threshold $\delta$, and $ST_{X}$ is the percentage of tags in a domain that are stable at positions $X$. In Figure 4 (rpg domain), for $\delta=99\%$, we find $ST_{1,2}=13.81$, i.e., 13.81% of all tags in rpg are stable in positions 1 and 2 combined, and $ST_{3,4,5}=15.06$, i.e., 15.06% are stable in positions 3, 4, and 5 combined. The rest of the tags are unstable. Also, the stable tags appearing in positions 3, 4, and 5 ($Q_{3,4,5}$) are finer-grained (or refined) tags that support the stable tags present in positions 1 and 2 ($Q_{1,2}$).

The travel domain has the highest number of stable tags appearing in positions 3, 4, and 5 ($Q_{3,4,5}$) at all three thresholds ($\delta=80\%$, $90\%$, $99\%$), suggesting that more than one refined tag is needed to make a question specific in this domain. We find no conclusive evidence of stability at individual positions, neither within positions 1 and 2 (i.e., $Q_{1}$ and $Q_{2}$) nor within positions 3, 4, and 5 (i.e., $Q_{3}$, $Q_{4}$, and $Q_{5}$).

Table 6 shows five randomly selected examples of position-stable tags in 17 domains. These positions account for more than 99% of the occurrences of these tags in their respective domains.
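
The stability quantities of this section (the per-position share of a tag, the set of stable tags for a position group, and the percentage of stable tags) can be sketched as follows. This is an illustrative implementation on toy data under the definitions given in Equations 1-3; function names and example posts are hypothetical.

```python
from collections import defaultdict

def position_shares(post_tags, max_positions=5):
    """phi_x(t): percentage of a tag's occurrences at each position x (1-based)."""
    counts = defaultdict(lambda: [0] * max_positions)
    for tags in post_tags:
        for pos, tag in enumerate(tags[:max_positions]):
            counts[tag][pos] += 1
    return {
        tag: [100.0 * c / sum(per_pos) for c in per_pos]
        for tag, per_pos in counts.items()
    }

def stable_tags(phi, positions, delta):
    """Q_X: tags whose summed share over `positions` is at least delta."""
    return {
        tag for tag, shares in phi.items()
        if sum(shares[x - 1] for x in positions) >= delta
    }

def stability_pct(phi, positions, delta):
    """ST_X: percentage of all tags that are stable at `positions`."""
    return 100.0 * len(stable_tags(phi, positions, delta)) / len(phi)

# Toy tag sequences (hypothetical data).
posts = [
    ["dnd-5e", "spells", "wizard"],
    ["dnd-5e", "wizard"],
    ["spells", "dnd-5e", "feywild"],
    ["dnd-5e", "magic-items", "feywild"],
]
phi = position_shares(posts)
print(sorted(stable_tags(phi, (1, 2), 99)))     # ['dnd-5e', 'magic-items', 'spells']
print(sorted(stable_tags(phi, (3, 4, 5), 99)))  # ['feywild']
print(stability_pct(phi, (1, 2), 99))           # 60.0
```

In the toy data, wizard is split evenly between positions 2 and 3, so it is stable in neither group, which corresponds to the "unstable" tags described above.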

4. Modeling Tag Prediction

Based on the observations from our tagging behavior analysis (Section 3), we develop an automated generic tag prediction approach for CQA platforms that predicts both generic and refined tags. The inherent commonalities in community diversity influence our decision to develop a common tag generation framework. The long tail in tag-space analysis guided us to develop a predictive-generative hybrid model. Tag co-occurrence analysis, tag-pair ordering, and tag-positional analysis on these domains led us to generate $n$ tags from a common vocabulary of popular tags at certain positions and $m$ related granular tags at the remaining positions.

4.1. Majority Baseline

The five most frequent tags per domain in the training data are used, in order, as the Top1-Top5 predictions for every post in the test data (Hit@1 to Hit@5). We introduce this baseline because the top few tags cover a large number of posts in each domain (Table 3).
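
A minimal sketch of this baseline is shown below, assuming the predictions are simply the k most frequent training tags, reused for every test post. The function name and the toy training data are hypothetical.

```python
from collections import Counter

def majority_predictions(train_post_tags, k=5):
    """The k most frequent training tags, in order of decreasing frequency."""
    counts = Counter(tag for tags in train_post_tags for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]

# Toy training tag sequences (hypothetical data).
train = [
    ["taxes", "united-states"],
    ["taxes", "investing"],
    ["taxes", "united-states"],
    ["canada"],
]
print(majority_predictions(train, k=2))  # ['taxes', 'united-states']
```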

4.2. Feature-Based Models

We use linear multi-label classifiers with the one-vs-all strategy, trained on tf-idf and bag-of-words features respectively, as two baselines, since most feature-based tag prediction models use one of these feature types. We hypothesize that these models can leverage the high amount of tag-post overlap (Figure 3). We train the models for each domain, with classes corresponding to all the unique tags.

4.3. MetaTag Predictor Model (MP)

In this model (Figure 6), we first select a vocabulary (MetaTag) of tags based on a frequency analysis of the tag’s post coverage per domain. Here, we consider popular tags as meta tags. We formulate this multi-label classification task as a language model mask-filling task using pre-trained roberta-base (Liu et al., 2019) as the base of this model. We train separately for each domain.

Training: We tokenize the question title ($Q_{T}$) and body ($Q_{B}$) and hide the tags from the MetaTag vocabulary with a mask token, ⟨mask⟩. These are concatenated and provided as input to the model.

$Q_{T}$ + $Q_{B}$ + ⟨mask⟩ ⟨mask⟩

This model is trained to predict those masks by optimizing the prediction loss ($\mathcal{L}_{P}$) over all masked tokens, so that the total loss is $\mathcal{L}=\mathcal{L}_{P}$. The number of mask tokens varies with the post; the template above is only illustrative.

Inference: We tokenize $Q_{T}$ and $Q_{B}$ and append five ⟨mask⟩ tokens at the end, forcing the model to predict exactly five tags for the post (the most probable tag for each position), because StackExchange allows a maximum of five tags per question. This also ensures that the model predicts tags from the MetaTag vocabulary.

$Q_{T}$ + $Q_{B}$ + ⟨mask⟩ ⟨mask⟩ ⟨mask⟩ ⟨mask⟩ ⟨mask⟩
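
The input templates for the MP model can be illustrated with plain string construction. This is only a sketch of the layout described above: it assumes that each MetaTag is predicted at a single mask position, that pieces are joined with spaces, and that the mask token is written as the string `<mask>` (the mask token of RoBERTa tokenizers). The helper names are hypothetical, and the actual tokenization and prediction head are not shown.

```python
MASK = "<mask>"  # assumed textual form of the mask token

def mp_training_input(title, body, tags, meta_vocab):
    """Title + body followed by one mask per tag found in the MetaTag vocabulary.

    Returns the input text and the tags the masks should predict.
    """
    targets = [t for t in tags if t in meta_vocab]
    text = " ".join([title, body] + [MASK] * len(targets))
    return text, targets

def mp_inference_input(title, body, num_tags=5):
    """Title + body followed by a fixed number of masks (five on StackExchange)."""
    return " ".join([title, body] + [MASK] * num_tags)

# Hypothetical post and MetaTag vocabulary.
meta_vocab = {"harry-potter", "books"}
text, targets = mp_training_input(
    "Who taught Potions?", "In book six ...", ["harry-potter", "potions-class"], meta_vocab
)
print(text)     # Who taught Potions? In book six ... <mask>
print(targets)  # ['harry-potter']
print(mp_inference_input("Who taught Potions?", "In book six ..."))
```

The tag outside the vocabulary is simply dropped here; handling such tags is what the MRPG model adds.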

Figure 6. Meta-Refined Predictor Generator Model (MRPG), MetaTag Predictor (MP) model is highlighted in the dotted line having only the P-Head.

4.4. Meta Refined Tag Predictor Generator Model (MRPG)

This model (Figure 6) is similar to the MP model, with the additional ability to generate tags not present in the MetaTag vocabulary (OOV). In a more general sense, here the motivation is to develop a model capable of predicting tags from a predefined set and generating novel tags as well.

Training: As in the MP model, we tokenize $Q_{T}$ and $Q_{B}$ and replace the tags present in the MetaTag vocabulary with the ⟨mask⟩ token. The rest of the tags (out-of-vocabulary, or OOV) are tokenized, and each token is replaced with a separate mask token, ⟨maskref⟩. A ⟨tagsep⟩ token is added to mark the boundaries (start and end) of these OOV tag tokens. The model is trained on the joint loss ($\mathcal{L}$) of the meta tag prediction head loss ($\mathcal{L}_{P}$) and the refined tag generation head loss ($\mathcal{L}_{G}$), given by $\mathcal{L}=\mathcal{L}_{P}+\mathcal{L}_{G}$.

Inference: Our goal is to encourage the model to generate a combination of meta and refined tags. Based on our tag-stability analysis (Section 3.5), tag-pair ordering analysis (Section 3.4), and soft tag-hierarchy findings (Section 3.3), we train the MRPG model to predict the first two tags from the MetaTag vocabulary and to generate the remaining three tags based on the user texts. We append two ⟨mask⟩ tokens and a parameterized number of ⟨maskref⟩ tokens to the tokenized $Q_{T}$ and $Q_{B}$.

$Q_{T}$ + $Q_{B}$ + ⟨mask⟩ ⟨mask⟩ ⟨maskref⟩ ⟨maskref⟩

Tag Generation: For each ⟨maskref⟩ token, MRPG generates one token from the tokenizer vocabulary, greedily selecting the most probable token. We concatenate the tokens generated between two ⟨tagsep⟩ tokens to form a tag. We keep the three most probable generated refined tags, based on our earlier data analysis and the StackExchange tag limit. However, when applying this model to another CQA platform, this number can be raised or lowered through the above-mentioned parameter. Nothing in the model prevents it from generating tags with more than three words, but such tags are rare in most domains, as can be seen from Table 2. More details are in Appendix Section H.
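
The step of turning generated tokens into tags can be sketched as a simple grouping by separator. Two simplifications are assumed here and are not taken from the paper: generated tokens are plain sub-word strings that can be joined directly, and the first `max_tags` tags are kept in order rather than ranked by probability. The function name and token sequence are hypothetical.

```python
TAGSEP = "<tagsep>"  # assumed textual form of the tag boundary token

def decode_refined_tags(generated_tokens, max_tags=3):
    """Group tokens lying between consecutive <tagsep> markers into tags."""
    tags, current, inside = [], [], False
    for tok in generated_tokens:
        if tok == TAGSEP:
            if inside and current:
                tags.append("".join(current))
            current, inside = [], True
        elif inside:
            current.append(tok)
    return tags[:max_tags]

# Hypothetical greedy output for the refined-tag positions.
tokens = [TAGSEP, "carbon", "yl", "-", "compounds", TAGSEP, "ket", "ones", TAGSEP]
print(decode_refined_tags(tokens))  # ['carbonyl-compounds', 'ketones']
```

Tokens that appear before the first separator or after an unclosed one are ignored, so only complete tags are returned.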

5. Experiments

5.1. Settings

We split our dataset into train, dev, and test sets in the ratio 70:10:20 using a fixed random seed. In our experiments, we build our models on top of the base version (125M parameters) of the pre-trained RoBERTa language model. We remove HTML tags (which are unrelated to StackExchange tags) from the user content (question title and body) before tag prediction. We ran all experiments on 4 NVIDIA RTX A6000 GPUs (48GB GPU memory) with a batch size of 60 and an input length of 256. We use the AdamW optimizer (Loshchilov and Hutter, 2017), a linear warmup scheduler, and a learning rate of 5e-5.
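
A seeded 70:10:20 split can be sketched as below. The seed value 42 and the function name are placeholders chosen for illustration; the paper does not state the seed or the splitting code it used.

```python
import random

def split_dataset(items, seed=42, percents=(70, 10, 20)):
    """Shuffle with a fixed seed, then cut into train/dev/test by percentage."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_dev = len(items) * percents[1] // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to test
    return train, dev, test

train_set, dev_set, test_set = split_dataset(range(100))
print(len(train_set), len(dev_set), len(test_set))  # 70 10 20
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.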

5.2. Metrics

We define Hit@k (where $k=1,\dots,5$) as the percentage of posts for which at least one of the top $k$ predicted tags matches one of the actual tags. We generate at most 5 tag predictions, in line with StackExchange’s upper limit on tags. This metric aligns with our motivation of maximizing the probability that a user will find at least one suitable tag among a fixed number of recommended tags. Hence, we do not consider other metrics such as precision and recall.
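
The metric can be written in a few lines. The sketch below assumes each post comes with a ranked list of predicted tags and a list of actual tags; the function name and the toy data are hypothetical.

```python
def hit_at_k(predictions, gold_tags, k):
    """Percentage of posts whose first k predicted tags contain at least one gold tag."""
    hits = sum(
        1 for preds, gold in zip(predictions, gold_tags)
        if set(preds[:k]) & set(gold)
    )
    return 100.0 * hits / len(gold_tags)

# Toy ranked predictions and gold tags for four posts (hypothetical data).
predictions = [
    ["taxes", "investing"],
    ["canada", "taxes"],
    ["loans", "banking"],
    ["taxes"],
]
gold = [
    ["taxes"],
    ["taxes", "united-states"],
    ["credit-card"],
    ["taxes"],
]
print(hit_at_k(predictions, gold, 1))  # 50.0
print(hit_at_k(predictions, gold, 2))  # 75.0
```

Because a post counts as a hit as soon as one tag matches, Hit@k can only stay the same or grow as k increases.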

5.3. Performance Analysis

5.3.1. Baseline vs MP vs MRPG

In Table 7, we compare our models with the baselines (mean of five different runs). The feature-based models (bag-of-words and TF-IDF) achieve good performance in those domains where we found a high overlap between user texts and tags. Our MP model improves over the majority baseline and the feature-based models by a substantial margin (p-values < 0.05 on the Wilcoxon test) in Hit@5 performance. The MRPG model outperforms the other methods in almost all domains (significant improvements in 12 out of 17 domains), because it is able to generate tags outside the MetaTag vocabulary. In the biology domain, the MP model performs better than MRPG, which might be due to the high tag reuse in this domain. All the model performance numbers (Hit@k for $k=1\dots5$) are given in Appendix Table 20, where we observe that for Hit@1 the MRPG model is always better than the MP model.

Table 7. Performance Hit@5, 90% Tag-Post Coverage. Significant improvements (p-values < 0.05) of MRPG over MP are in bold. P-values are in Appendix J.
Domain Majority TF-IDF Bag-of-Words MP MRPG
askubuntu 24.84 59.76±0.06 71.25±0.56 80.44±0.11 82.94±0.15
aviation 35.05 55.12±0.29 65.58±0.64 77.09±0.44 77.63±0.56
biology 37.94 54.91±0.16 64.79±0.50 78.96±0.34 77.55±0.41
chemistry 48.89 58.76±0.17 68.09±0.46 77.66±0.10 79.17±0.45
cooking 29.04 70.28±0.19 71.69±0.34 80.86±0.42 85.18±0.29
electronics 20.68 57.80±0.11 70.12±0.13 77.51±0.26 81.30±0.53
history 34.67 58.93±0.32 59.29±0.36 80.45±0.09 81.23±1.00
money 55.96 75.54±0.19 79.70±0.30 84.15±0.23 87.94±0.42
movies 54.99 60.80±0.14 64.57±0.24 82.91±0.55 83.25±0.99
music 47.91 68.15±0.15 74.26±0.42 82.66±0.26 83.71±0.51
philosophy 48.93 62.71±0.10 64.06±0.34 79.45±0.20 79.49±0.56
physics 39.98 66.81±0.16 79.59±0.17 81.12±0.22 86.34±0.37
politics 64.16 81.50±0.21 83.37±0.73 86.29±0.25 90.98±0.46
rpg 76.66 75.79±0.23 82.71±0.24 83.31±0.33 89.09±0.16
scifi 62.24 80.48±0.10 85.88±0.21 85.91±0.11 91.53±0.32
serverfault 29.84 62.83±0.06 73.07±0.20 81.66±0.16 85.82±0.26
travel 48.31 76.82±0.48 83.73±0.27 83.96±0.12 89.50±0.30

5.3.2. Effects of Vocabulary Size Reduction

We build the MetaTag vocabulary with 85% post-coverage by tags (↓5%) and show the impact in Figure 7. We observe that the performance gap between MP and MRPG at 90% (Table 7) widens as the vocabulary size decreases by 5% (Figure 7) across all domains.

Figure 7. Effect of Vocab Size Reduction on MP vs MRPG for Hit@5 metric at 85% post-coverage by tags

This is because the MP model suffers the most (2-5%) from this reduction. This is expected, since MP’s performance (through the P-Head) depends on the size of the MetaTag vocabulary. The MRPG model, however, is robust to this vocabulary reduction: its performance (Hit@5) changes only in the range 0-1.13%, with the exception of the askubuntu domain (2.26%). Details are in Appendix Table 18. With the reduced vocabulary, the maximum performance difference is 9.12% (travel), since this domain has more refined tags (Section 3.5). The minimum difference is 1.06% (biology), where the MRPG model could not gain much over MP because of high tag reusability and fewer refined tags.

Figure 8. MRPG’s Heads Contributions (Hit@5)

5.3.3. Head Contribution of MRPG

Figure 8 shows the contribution of the P-Head and the G-Head to the prediction performance (Hit@5 for the 90% coverage vocabulary). We measure the percentage of posts for which (1) only the P-Head correctly predicted at least one tag and (2) only the G-Head correctly predicted at least one tag. The P-Head’s contributions were highest (45-74%), since the MetaTag vocabulary is created using popular tags in each domain. The G-Head was able to predict at least one tag correctly for an extra 4-13% of the posts. The effect of decreasing and increasing the MetaTag vocabulary size through a 5% change in tag-post coverage is shown in Appendix Table 19. We observe that the G-Head’s contribution increases by up to 4% (on vocabulary size decrease) and decreases by up to 5% (on vocabulary size increase). We also find that both heads combined were able to suggest some non-overlapping tags in up to 33% of the posts.
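
The per-head breakdown described above can be sketched as follows. The function name `head_contributions`, the result keys, and the input layout (separate prediction lists for the P-Head and the G-Head for each post) are assumptions for illustration and are not taken from the paper's code.

```python
from typing import Dict, Sequence, Set


def head_contributions(p_head_preds: Sequence[Sequence[str]],
                       g_head_preds: Sequence[Sequence[str]],
                       gold: Sequence[Set[str]]) -> Dict[str, float]:
    """Percentage of posts where only one head, both heads, or neither head hit."""
    n = len(gold)
    if not (len(p_head_preds) == len(g_head_preds) == n):
        raise ValueError("all inputs must have the same length")
    counts = {"only_p": 0, "only_g": 0, "both": 0, "neither": 0}
    for p_tags, g_tags, actual in zip(p_head_preds, g_head_preds, gold):
        p_hit = any(tag in actual for tag in p_tags)
        g_hit = any(tag in actual for tag in g_tags)
        if p_hit and g_hit:
            counts["both"] += 1
        elif p_hit:
            counts["only_p"] += 1
        elif g_hit:
            counts["only_g"] += 1
        else:
            counts["neither"] += 1
    return {key: (100.0 * value / n if n else 0.0) for key, value in counts.items()}
```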

5.3.4. Out-of-Vocabulary Tags Generation %

Table 8 shows MRPG’s performance in predicting tags outside the MetaTag vocabulary for 90% Tag-Post Coverage. % Posts shows the percentage of posts where MRPG correctly predicted at least one OOV tag. This percentage is lowest in two domains, movies (13.88%) and scifi (17.01%). % ALL Tags and % OOV Tags show that MRPG was able to correctly predict a considerable number of OOV tags because of the generative head.

Table 8. MRPG’s Out-of-Vocab (OOV) Tag Prediction Match on Test. % Posts: % total posts where MRPG correctly predicted at least one OOV tag. % ALL Tags: % correctly predicted OOV tags out of total #gold tags. % OOV Tags: % correctly predicted OOV tags out of total #OOV gold tags.
Domains askubuntu aviation biology chemistry cooking electronics history money movies music philosophy physics politics rpg scifi serverfault travel
% Posts 31.38 23.10 22.09 24.68 29.21 27.32 22.37 35.8 13.88 25.19 19.10 41.93 34.35 36.49 17.01 34.69 43.59
% ALL Tags 12.74 9.49 8.98 11.31 13.65 10.73 8.17 12.74 6.85 10.60 8.55 15.32 13.18 13.92 8.10 13.62 15.66
% OOV Tags 41.92 28.03 27.34 37.84 47.81 31.39 22.78 33.21 22.76 36.87 28.55 36.55 35.37 43.04 38.96 36.38 38.80
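
The three quantities defined in the caption of Table 8 can be computed as in this sketch. The function name `oov_match_stats`, the result keys, and the data layout are hypothetical; a gold tag counts as OOV here when it is absent from the MetaTag vocabulary passed in.

```python
from typing import Dict, Sequence, Set


def oov_match_stats(predictions: Sequence[Sequence[str]],
                    gold: Sequence[Set[str]],
                    meta_vocab: Set[str]) -> Dict[str, float]:
    """Compute % Posts, % ALL Tags, and % OOV Tags for correctly predicted OOV tags."""
    posts_with_hit = correct_oov = total_gold = total_oov_gold = 0
    for predicted, actual in zip(predictions, gold):
        oov_gold = actual - meta_vocab        # gold tags outside the vocabulary
        matched = oov_gold & set(predicted)   # OOV gold tags that were predicted
        posts_with_hit += 1 if matched else 0
        correct_oov += len(matched)
        total_gold += len(actual)
        total_oov_gold += len(oov_gold)

    def pct(numerator: int, denominator: int) -> float:
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "pct_posts": pct(posts_with_hit, len(gold)),
        "pct_all_tags": pct(correct_oov, total_gold),
        "pct_oov_tags": pct(correct_oov, total_oov_gold),
    }
```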
Figure 9. Case Studies

5.4. Case Studies

We compare the tag predictions of our methods in Figure 9. MRPG was able to generate two more refined tags than MP in the askubuntu domain and to predict four out of five tags in the physics domain. Included below are examples for five other domains.

Domain: Physics
Title: Does matter become energy at the speed of light?
Gold: special-relativity, speed-of-light, mass-energy, matter
MP: special-relativity, energy, speed-of-light, mass
MRPG: special-relativity, speed-of-light, mass-energy, matter

Domain: Travel
Title: Nigerian citizen (university student) was refused a UK visit visa due to lack of funds and connection to school - how to resolve?
Gold: UK, visa-refusals, nigerian-citizens
MP: visas, customs-and-immigration, visa-refusals, paperwork, standard-visitor-visas
MRPG: uk, visa-refusals, nigerian-citizens

Domain: Music
Title: Piano tuning just under the absolute pitch
Gold: piano, tuning
MP: piano, tuning, maintenance
MRPG: piano, tuning, alternative-tunings, pitch, relative-pitch

Domain: Biology
Title: Why aren’t all infections immune-system resistant?
Gold: evolution, microbiology, immunology, bacteriology
MP: evolution, microbiology, bacteriology, bacteriology, immune-system
MRPG: evolution, bacteriology, immunity, antibiotic-resistance

Domain: History
Title: Where to find a list of participants in The Crusades?
Gold: middle-ages, crusades
MP: middle-ages, middle-ages, europe, historiography
MRPG: middle-ages, sources, crusades

5.5. Adaptability of the MP & MRPG Architectures

Both the MP and MRPG models can be adapted for use in other domains or in different public and private CQA platforms with specific tag-space restrictions. This can help in efficiently routing questions to area experts for faster response times, especially in private CQA platforms where the community authority wants queries resolved quickly. Such adaptations can be done by customizing the MetaTag vocabulary based on prior behavioral analysis. Additionally, the number of meta and refined tags can be controlled through a parameter, based on the domain and platform requirements, without changes to the architecture. The MRPG model can also be used in platforms where a soft hierarchy of tags is known and routing requires the prediction of both top-level tags and leaf tags. In such a scenario, the MetaTag vocabulary could be populated with only top-level tags, allowing the model to generate lower-level tags (from the tail of the tag distribution) based on user texts. With the combination of both types of tags, a query can be routed to a specific sub-area expert without overwhelming all the experts on a topic.

6. Related Work

Community QA platform analysis: There have been several studies on Folksonomy (Vander Wal, 2007), the practice of associating custom tags to questions in a social environment. Some of the prior works are: a large-scale analysis of tags and their correlation with other tags (Fu et al., 2020), tag-distribution and tag-occurrence of 168 SE communities (Fu et al., 2020), quality analysis of SO (Singh et al., 2015). User behavior analysis was done on Quora (Wang et al., 2013), Yahoo Answers (Adamic et al., 2008), Google Answers (Chen et al., 2010) and StackOverflow (Anderson et al., 2013). However, here we perform a large-scale study of tags, tag occurrences, and tag relation for 17 domains to understand how they have some common properties in spite of being quite diverse, an observation similar to a prior work (Fu et al., 2020).

Community QA NLP Tasks: As the use of community QA platforms increased and with it the volume of community-created data, various NLP approaches were used to address some of the issues of each platform and also to understand behaviors of users. There have been various insights gathered through analysis of such communities. Similar Question Identification (Zhang et al., 2017, 2018; Vanam and Pulipati, 2021; Kumar and Chauhan, 2022), Similar Tag Identification (Beyer and Pinzger, 2015; Chen et al., 2019), Tag popularity prediction (Fu et al., 2017), Popular Question Prediction (Zhao et al., 2021), Tag predictions (Lipczak, 2008; Lipczak and Milios, 2010; Wang et al., 2015; Wu et al., 2016; Sonam et al., 2019; Tang et al., 2019; Wankerl et al., 2020; Venktesh et al., 2021), detecting anomalous tag combinations (Banerjee et al., 2019), CQA entity linking (Li et al., 2022), expert recommendation (Tondulkar et al., 2018; Lv et al., 2021; Menaha et al., 2021; Anandhan et al., 2022; Krishna and Antulov-Fantulin, 2022; Askari et al., 2022; Liu et al., 2022), question routing (Krishna et al., 2022), identifying unclear questions (Trienes and Balog, 2019), automatic identification of best answers (Burel et al., 2012) and tag-hierarchy predictions (Chen et al., 2019) are some of the interesting tasks. We perform a large-scale analysis with data spanning over 10 years and 17 diverse communities. We focus only on the tag-prediction NLP task for CQA platforms.

Text Tagging: There are some feature-based machine learning approaches (Wang et al., 2015; Charte et al., 2015; Sonam et al., 2019; Zangerle et al., 2011; Sigurbjörnsson and Van Zwol, 2008; Lipczak and Milios, 2010; Wu et al., 2016) and some deep learning approaches (Tang et al., 2019; Li et al., 2020; Wankerl et al., 2020) for tag prediction. Tagcombine (Wang et al., 2015) uses software object similarity, while TagStack (Sonam et al., 2019) uses tf-idf features with a Naive Bayes classifier on StackOverflow texts. QUINTA (Charte et al., 2015) works on 6 StackExchange domains using KNN, (Zangerle et al., 2011) on microblogging sites (Twitter) based on tweet similarity, Tag2word (Wu et al., 2016) on math and StackOverflow domains using an LDA variant, and (Lipczak, 2008; Lipczak and Milios, 2010) on BibSonomy and StackOverflow datasets based on tag co-occurrence and user preference. Among the deep learning methods, F2Tag (Wankerl et al., 2020) works on math domains based on visual and textual formula representation, ITAG (Tang et al., 2019) works on the math domain using an RNN, and TagDC (Li et al., 2020) is based on software object similarity using an LSTM. Unlike the above-mentioned methods, we predict a soft hierarchy of tags (both meta and fine-grained tags).

7. Conclusion

We perform an in-depth analysis of 17 domains in a popular CQA platform, StackExchange, focusing on various aspects of question tagging such as domain diversity analysis, tag-space analysis, tag co-occurrence analysis, tag order, and tag positional stability. We present multiple insights into user behavior in assigning tags to the questions they post. Based on these findings, we develop a tag prediction architecture that generates rarer and finer-grained tags in addition to popular tags from a pre-selected vocabulary. Our approach significantly outperforms feature-based baselines and also shows significant improvement in 12 domains when compared with the vocabulary-based approach.

8. Limitations

The analysis and its findings presented here are limited to 17 selected StackExchange domains considering their diversity. However, they may vary for the remaining 150 domains. Some of the findings (e.g. tag’s positional stability) may vary for other CQA platforms which do not have any bounds on the number of tags. We use roberta-base and a smaller input size (256 tokens) for our experiments. With larger models and more context, the performance is expected to increase since more context usually leads to better learning by larger parameterized models. We have ignored the answers in StackExchange for model training. We believe that indiscriminately selecting all answers as context for a question could be too noisy and if we were to select one or more appropriate answers, this would add complexity in choosing between the fastest answer, best answer, accepted answers, etc. We consider this as a separate area of research and future work. We randomly sampled the data for each domain to create the train and test split to show that our MRPG model is capable of both predicting and generating tags. Splitting with respect to timestamp would require tag temporal analysis and tag-evolution which we consider as a future area of research.

9. Ethical Statement

This work analyzes various aspects of aggregate tagging behavior of users on a popular community question-answering platform StackExchange. The data is publicly provided by StackExchange as an anonymized dump of all user-contributed content on the Stack Exchange network. The data is cc-by-sa 4.0 licensed, and intended to be shared and remixed. No specific user has been identified and no user-level information (user name etc.) has been used for this work. We only used the Post.xml extracted from the StackExchange dumps and do not use any user profile statistics. The aggregate user behavior has been analyzed with respect to tagging and user-generated questions. Based on these findings a tag predictor model has been developed. The data has not been modified or redistributed as part of this research.

References

  • Adamic et al. (2008) Lada A Adamic, Jun Zhang, Eytan Bakshy, and Mark S Ackerman. 2008. Knowledge sharing and yahoo answers: everyone knows something. In Proceedings of the 17th international conference on World Wide Web. 665–674.
  • Anandhan et al. (2022) Anitha Anandhan, Maizatul Akmar Ismail, and Liyana Shuib. 2022. EXPERT RECOMMENDATION THROUGH TAG RELATIONSHIP IN COMMUNITY QUESTION ANSWERING. Malaysian Journal of Computer Science 35, 3 (2022), 201–221.
  • Anderson et al. (2013) Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering user behavior with badges. In Proceedings of the 22nd international conference on World Wide Web. 95–106.
  • Askari et al. (2022) Arian Askari, Suzan Verberne, and Gabriella Pasi. 2022. Expert Finding in Legal Community Question Answering. In European Conference on Information Retrieval. Springer, 22–30.
  • Banerjee et al. (2019) Rohan Banerjee, Sailaja Rajanala, and Manish Singh. 2019. Evaluating the Choice of Tags in CQA Sites. In International Conference on Database Systems for Advanced Applications. Springer, 625–640.
  • Beyer and Pinzger (2015) Stefanie Beyer and Martin Pinzger. 2015. Synonym suggestion for tags on stack overflow. In 2015 IEEE 23rd International Conference on Program Comprehension. IEEE, 94–103.
  • Burel et al. (2012) Grégoire Burel, Yulan He, and Harith Alani. 2012. Automatic identification of best answers in online enquiry communities. In Extended Semantic Web Conference. Springer, 514–529.
  • Charte et al. (2015) Francisco Charte, Antonio J Rivera, María J del Jesus, and Francisco Herrera. 2015. QUINTA: A question tagging assistant to improve the answering ratio in electronic forums. In IEEE EUROCON 2015 - International Conference on Computer as a Tool (EUROCON). IEEE, 1–6.
  • Chen et al. (2019) Hui Chen, John Coogle, and Kostadin Damevski. 2019. Modeling stack overflow tags and topics as a hierarchy of concepts. Journal of Systems and Software 156 (2019), 283–299.
  • Chen et al. (2010) Yan Chen, Teck-Hua Ho, and Yong-mi Kim. 2010. Knowledge market design: A field experiment at Google Answers. Journal of Public Economic Theory 12, 4 (2010), 641–664.
  • Fu et al. (2017) Chenbo Fu, Yongli Zheng, Shidi Li, Qi Xuan, and Zhongyuan Ruan. 2017. Predicting the popularity of tags in StackExchange QA communities. In 2017 International Workshop on Complex Systems and Networks (IWCSN). IEEE, 90–95.
  • Fu et al. (2020) Xiang Fu, Shangdi Yu, and Austin R Benson. 2020. Modelling and analysis of tagging networks in Stack Exchange communities. Journal of Complex Networks 8, 5 (2020), cnz045.
  • Hollander et al. (2013) Myles Hollander, Douglas A Wolfe, and Eric Chicken. 2013. Nonparametric statistical methods. John Wiley & Sons.
  • Krishna and Antulov-Fantulin (2022) Vaibhav Krishna and Nino Antulov-Fantulin. 2022. Simplifying Sparse Expert Recommendation by Revisiting Graph Diffusion. arXiv preprint arXiv:2208.02438 (2022).
  • Krishna et al. (2022) Vaibhav Krishna, Vaiva Vasiliauskaite, and Nino Antulov-Fantulin. 2022. Topic Community Based Temporal Expertise for Question Routing. arXiv preprint arXiv:2207.01753 (2022).
  • Kumar and Chauhan (2022) Shobhan Kumar and Arun Chauhan. 2022. A Transformer Based Encodings for Detection of Semantically Equivalent Questions in cQA. Comput. J. (2022).
  • Li et al. (2020) Can Li, Ling Xu, Meng Yan, and Yan Lei. 2020. TagDC: A tag recommendation method for software information sites with a combination of deep learning and collaborative filtering. Journal of Systems and Software 170 (2020), 110783.
  • Li et al. (2022) Yuhan Li, Wei Shen, Jianbo Gao, and Yadong Wang. 2022. Community Question Answering Entity Linking via Leveraging Auxiliary Data. arXiv preprint arXiv:2205.11917 (2022).
  • Lipczak (2008) Marek Lipczak. 2008. Tag recommendation for folksonomies oriented towards individual users. ECML PKDD discovery challenge 84 (2008), 2008.
  • Lipczak and Milios (2010) Marek Lipczak and Evangelos Milios. 2010. Learning in efficient tag recommendation. In Proceedings of the fourth ACM conference on Recommender systems. 167–174.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu et al. (2022) Yue Liu, Weize Tang, Zitu Liu, Lin Ding, and Aihua Tang. 2022. High-quality domain expert finding method in CQA based on multi-granularity semantic analysis and interest drift. Information Sciences 596 (2022), 395–413.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
  • Lv et al. (2021) Xiaoqi Lv, Ke Ji, Zhenxiang Chen, Kun Ma, Jun Wu, Yidong Li, and Guandong Xu. 2021. Expert Recommendations with Temporal Dynamics of User Interest in CQA. In International Conference on Web Information Systems Engineering. Springer, 645–652.
  • Menaha et al. (2021) R Menaha, VE Jayanthi, N Krishnaraj, et al. 2021. A Cluster-based Approach for Finding Domain wise Experts in Community Question Answering System. In Journal of Physics: Conference Series, Vol. 1767. IOP Publishing, 012035.
  • Parnell et al. (2011) Laurence D Parnell, Pierre Lindenbaum, Khader Shameer, Giovanni Marco Dall’Olio, Daniel C Swan, Lars Juhl Jensen, Simon J Cockell, Brent S Pedersen, Mary E Mangan, Christopher A Miller, et al. 2011. BioStar: an online question & answer resource for the bioinformatics community. PLoS computational biology 7, 10 (2011), e1002216.
  • Sigurbjörnsson and Van Zwol (2008) Börkur Sigurbjörnsson and Roelof Van Zwol. 2008. Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th international conference on World Wide Web. 327–336.
  • Singh et al. (2015) Sanjay Singh et al. 2015. Is Stack Overflow Overflowing With Questions and Tags. arXiv preprint arXiv:1508.03601 (2015).
  • Sonam et al. (2019) Sonam Sonam, Ayushi Verma, Sangeeta Lal, and Neetu Sardana. 2019. TagStack: Automated system for predicting tags in stackoverflow. In 2019 International Conference on Signal Processing and Communication (ICSC). IEEE, 223–228.
  • Tang et al. (2019) Shijie Tang, Yuan Yao, Suwei Zhang, Feng Xu, Tianxiao Gu, Hanghang Tong, Xiaohui Yan, and Jian Lu. 2019. An integral tag recommendation model for textual content. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5109–5116.
  • Tondulkar et al. (2018) Rohan Tondulkar, Manisha Dubey, and Maunendra Sankar Desarkar. 2018. Get me the best: predicting best answerers in community question answering sites. In Proceedings of the 12th ACM Conference on Recommender Systems. 251–259.
  • Trienes and Balog (2019) Jan Trienes and Krisztian Balog. 2019. Identifying unclear questions in community question answering websites. In European conference on information retrieval. Springer, 276–289.
  • Vanam and Pulipati (2021) Divya Vanam and Venkateswara Rao Pulipati. 2021. Identifying Duplicate Questions in Community Question Answering Forums Using Machine Learning Approaches. In Machine Learning Technologies and Applications. Springer, 131–140.
  • Vander Wal (2007) Thomas Vander Wal. 2007. Folksonomy.
  • Venktesh et al. (2021) V Venktesh, Mukesh Mohania, and Vikram Goyal. 2021. TagRec: Automated Tagging of Questions with Hierarchical Learning Taxonomy. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 381–396.
  • Wang et al. (2013) Gang Wang, Konark Gill, Manish Mohanlal, Haitao Zheng, and Ben Y Zhao. 2013. Wisdom in the social crowd: an analysis of quora. In Proceedings of the 22nd international conference on World Wide Web. 1341–1352.
  • Wang et al. (2015) Xin-Yu Wang, Xin Xia, and David Lo. 2015. Tagcombine: Recommending tags to contents in software information sites. Journal of Computer Science and Technology 30, 5 (2015), 1017–1035.
  • Wankerl et al. (2020) Sebastian Wankerl, Gerhard Götz, and Andreas Hotho. 2020. f2tag—Can Tags be Predicted Using Formulas?. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 565–571.
  • Wu et al. (2016) Yong Wu, Yuan Yao, Feng Xu, Hanghang Tong, and Jian Lu. 2016. Tag2word: Using tags to generate words for content based tag recommendation. In Proceedings of the 25th ACM international on conference on information and knowledge management. 2287–2292.
  • Zangerle et al. (2011) Eva Zangerle, Wolfgang Gassler, and Günther Specht. 2011. Using tag recommendations to homogenize folksonomies in microblogging environments. In International conference on social informatics. Springer, 113–126.
  • Zhang et al. (2017) Wei Emma Zhang, Quan Z Sheng, Jey Han Lau, and Ermyas Abebe. 2017. Detecting duplicate posts in programming QA communities via latent semantics and association rules. In Proceedings of the 26th International Conference on World Wide Web. 1221–1229.
  • Zhang et al. (2018) Wei Emma Zhang, Quan Z Sheng, Jey Han Lau, Ermyas Abebe, and Wenjie Ruan. 2018. Duplicate detection in programming question answering communities. ACM Transactions on Internet Technology (TOIT) 18, 3 (2018), 1–21.
  • Zhao et al. (2021) Li Xian Zhao, Li Zhang, and Jing Jiang. 2021. Hot question prediction in Stack Overflow. IET Software 15, 1 (2021), 90–106.

Appendix A Domain Statistics

Table 10 shows more details about domain diversity beyond those mentioned in Section 3.1 of the main text. We can see that cooking and rpg are the domains with the fewest questions without answers (<5%), which indicates that the experts in these domains are very active. The science domains have more than 15% of questions without answers, which suggests that specialized knowledge is required to answer such questions. MAXVIEW and MAXANS show the maximum number of views and the maximum number of answers that a question has received. NO ACCEPT ANS shows the percentage of questions for which the asker has not accepted any answer. This gives an indication of whether askers are active and also whether the answers are satisfactory.

Table 9. Tag Statistics: AvgTLen - Average Tag Length
Domains Longest Tag Size (characters) Shortest Tag Size (characters) AvgTLen
askubuntu windows-subsystem-for-linux 27 c 1 8.17
aviation performance-based-navigation 28 cg 2 10.53
biology neurodegenerative-disorders 27 ph 2 10.97
chemistry differential-scanning-calorimetry 33 ph 2 12.75
cooking please-remove-this-tag 22 ue 2 8.56
electronics semiconductor-process-technology 32 c 1 8.80
history articles-of-confederation 25 art 3 9.83
money health-reimbursement-arrangement 32 w9 2 10.87
movies valerian-city-of-a-thousand-planets 35 m 1 13.66
music solid-body-electric-guitars 27 dj 2 9.62
philosophy philosophy-of-political-science 31 art 3 11.17
physics heisenberg-uncertainty-principle 32 air 3 13.39
politics immigration-customs-enforcement 31 alp 3 11.12
rpg werewolf-the-apocalypse-2nd-edition 35 e6 2 12.29
scifi the-hitchhikers-guide-to-the-galaxy 35 dc 2 13.31
serverfault google-cloud-internal-load-balancer 35 3g 2 8.87
travel new-zealand-permanent-resident 30 eu 2 9.39

Appendix B Tag Length Analysis

Table 9 shows the longest and shortest tags in each domain. We also see that the average tag lengths of the movies and physics domains are the highest. We find that movie names and physics topics are often longer than three words, which increases the average tag length.

Table 10. Domain Statistics
Domain Q T Q/T AVGT NOANS (%) NOSCORES (%) NO ACCEPT ANS (%) MAXANS MAXVIEW VIEWGT100 #ASKERS
askubuntu 371800 3121 119.13 2.78 23.47 37.21 66.99 82 5409384 1093 201912
aviation 20345 1002 20.3 2.56 7.02 9.82 46.66 18 219002 12 7066
biology 25671 739 34.74 2.58 20.73 15.16 56.13 11 445257 11 12089
chemistry 37476 375 99.94 2.37 19.36 16.86 59.12 11 1077991 7 17202
cooking 24513 833 29.43 2.3 4.6 11.79 50.57 85 1619295 13 12413
electronics 152980 2226 68.72 2.77 9.13 40.87 50.94 38 591616 36 61869
history 12562 813 15.45 2.84 9.85 4.6 49.94 34 994376 19 5296
money 32648 995 32.81 3.11 7.91 17.73 54.56 25 821144 37 18010
movies 20749 4348 4.77 2.09 9.48 2.96 39.08 19 1183407 30 6931
music 20925 512 40.87 2.52 3.24 10.96 49.62 25 611990 5 10447
philosophy 15624 559 27.95 2.4 11 15.27 63.73 31 250018 6 6640
physics 180166 893 201.75 3.17 17.49 29.54 57.08 49 847876 131 59774
politics 12416 739 16.8 2.9 6.81 5.55 48.72 27 833812 27 3970
rpg 42693 1195 35.73 2.91 4.41 2.26 32.39 44 865197 56 11541
scifi 62987 3433 18.35 2.25 10.62 2.45 42.87 34 1430390 153 22717
serverfault 299895 3814 78.63 2.9 11.68 37.05 51.93 160 2478923 327 130214
travel 42201 1891 22.32 3.28 11.2 8.19 59.48 30 430504 42 24895
Figure 10. Tag Co-Occurrence Distributions (Rest), Y-Axis: # posts the tag pairs appear in, X-Axis: Top-50 Tags

Appendix C Tag Co-Occurrence Distribution Analysis

Figure 11. Tag Co-Occurrence Distribution Patterns, Y-Axis: # posts the tag pairs appear in, X-Axis: Top-50 Tags

We analyzed the distribution of the top-50 most frequently occurring tag pairs in each domain (Figures 10 and 11). We observe three main patterns: (1) smooth distribution, (2) spike at the top-1 pair, and (3) spikes in the top few pairs. Larger domains like askubuntu, serverfault, electronics, and physics have smooth distributions. Some of the smaller domains like politics, philosophy, and music also show this behavior, which we believe is because the questions in these domains have fine-grained topics. In domains like rpg, money, history, aviation, biology, and chemistry, the tags in the most frequent tag pair, which appears in abundance, are generic in nature. Finally, in domains like movies, scifi, cooking, and travel, a few tag pairs dominate the distributions, indicating their popularity in such smaller domains.

Appendix D Tag Co-Occurrence Examples

Table 11 shows the most frequent tag pairs that appear in each domain.

Table 11. Top-5 Most Frequent Tag Pairs
Domain Top-5 Most Frequent Tag Pairs
askubuntu (boot, grub2), (boot, dual-boot), (dual-boot, grub2), (bash, command-line), (apt, package-management)
aviation (aerodynamics, aircraft-design), (aircraft-design, wing), (aerodynamics, wing), (aircraft-design, aircraft-performance), (air-traffic-control, faa-regulations)
biology (entomology, species-identification), (species-identification, zoology), (botany, species-identification), (neurophysiology, neuroscience), (biochemistry, molecular-biology)
chemistry (organic-chemistry, reaction-mechanism), (physical-chemistry, thermodynamics), (aromatic-compounds, organic-chemistry), (nomenclature, organic-chemistry), (carbonyl-compounds, organic-chemistry)
cooking (baking, bread), (baking, cake), (baking, cookies), (baking, substitutions), (bread, dough)
electronics (current, voltage), (pcb, pcb-design), (power, power-supply), (batteries, battery-charging), (microcontroller, pic)
history (nazi-germany, world-war-two), (united-states, world-war-two), (europe, middle-ages), (japan, world-war-two), (military, world-war-two)
money (taxes, united-states), (income-tax, united-states), (401k, united-states), (income-tax, taxes), (tax-deduction, united-states)
movies (character, plot-explanation), (marvel-cinematic-universe, plot-explanation), (game-of-thrones, plot-explanation), (analysis, plot-explanation), (avengers-infinity-war, marvel-cinematic-universe)
music (chords, theory), (chord-theory, chords), (harmony, theory), (scales, theory), (chord-theory, theory)
philosophy (logic, philosophy-of-mathematics), (epistemology, philosophy-of-science), (fallacies, logic), (logic, symbolic-logic), (metaphysics, ontology)
physics (homework-and-exercises, newtonian-mechanics), (forces, newtonian-mechanics), (hilbert-space, quantum-mechanics), (operators, quantum-mechanics), (quantum-mechanics, wavefunction)
politics (donald-trump, united-states), (president, united-states), (presidential-election, united-states), (congress, united-states), (election, united-states)
rpg (dnd-5e, spells), (dnd-5e, magic-items), (class-feature, dnd-5e), (dnd-5e, monsters), (pathfinder-1e, spells)
scifi (short-stories, story-identification), (marvel, marvel-cinematic-universe), (books, story-identification), (the-lord-of-the-rings, tolkiens-legendarium), (novel, story-identification)
serverfault (linux, ubuntu), (centos, linux), (amazon-ec2, amazon-web-services), (linux, networking), (apache-2.2, php)
travel (uk, visas), (schengen, visas), (usa, visas), (customs-and-immigration, usa), (indian-citizens, visas)

Appendix E Tag Distributions

Figure 12 shows the distribution of top-100 most frequent tags in each domain.

Figure 12. Tag Distributions, Y-Axis: # posts the tag appears in, X-Axis: Top-100 Tags

Appendix F Tag Ordering Example

Tables 12, 13, and 14 show the top-10 most frequently occurring tag pairs in each domain, together with how often each of the two possible orderings occurs. On manual analysis, we found that in most cases the meta tag appears before the refined tag.
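
Ordering statistics of this kind (a total count per tag pair plus the percentage of posts for each ordering) can be computed as in the sketch below. It assumes that "order" means the relative position of the two tags in a post's tag list and that Order-1 is the alphabetically sorted pair; the function name `tag_order_stats` and the row layout are illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]
Row = Tuple[int, Pair, float, Pair, float]


def tag_order_stats(posts: Sequence[Sequence[str]], top_n: int = 10) -> List[Row]:
    """Rows of (total, order-1 pair, order-1 %, order-2 pair, order-2 %)."""
    ordered: Counter = Counter()
    for tags in posts:
        # combinations() keeps list order, so `first` precedes `second` in the post
        for first, second in combinations(tags, 2):
            if first != second:
                ordered[(first, second)] += 1
    totals: Counter = Counter()
    for (first, second), count in ordered.items():
        totals[tuple(sorted((first, second)))] += count
    rows: List[Row] = []
    for (a, b), total in totals.most_common(top_n):
        order1 = ordered[(a, b)]
        rows.append((total, (a, b), 100.0 * order1 / total,
                     (b, a), 100.0 * (total - order1) / total))
    return rows
```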

Table 12. Tag Ordering Statistics - First 6 domains
Domain Total Order-1 % Order-2 %
askubuntu 5845 (boot,grub2) 99.93 (grub2,boot) 0.07
askubuntu 5174 (boot,dual-boot) 99.96 (dual-boot,boot) 0.04
askubuntu 5104 (dual-boot,grub2) 91.12 (grub2,dual-boot) 8.88
askubuntu 4552 (bash,command-line) 1.89 (command-line,bash) 98.11
askubuntu 4547 (apt,package-management) 98.53 (package-management,apt) 1.47
askubuntu 4304 (networking,wireless) 70.07 (wireless,networking) 29.93
askubuntu 4178 (dual-boot,partitioning) 97.75 (partitioning,dual-boot) 2.25
askubuntu 4128 (drivers,nvidia) 99.93 (nvidia,drivers) 0.07
askubuntu 3257 (networking,server) 97.97 (server,networking) 2.03
askubuntu 3003 (bash,scripts) 99.9 (scripts,bash) 0.1
aviation 417 (aerodynamics,aircraft-design) 1.68 (aircraft-design,aerodynamics) 98.32
aviation 221 (aircraft-design,wing) 100 (wing,aircraft-design) 0
aviation 221 (aerodynamics,wing) 100 (wing,aerodynamics) 0
aviation 183 (aircraft-design,aircraft-performance) 100 (aircraft-performance,aircraft-design) 0
aviation 138 (air-traffic-control,faa-regulations) 0 (faa-regulations,air-traffic-control) 100
aviation 136 (faa-regulations,instrument-flight-rules) 100 (instrument-flight-rules,faa-regulations) 0
aviation 127 (aerodynamics,lift) 100 (lift,aerodynamics) 0
aviation 125 (aerodynamics,airfoil) 100 (airfoil,aerodynamics) 0
aviation 124 (air-traffic-control,radio-communications) 100 (radio-communications,air-traffic-control) 0
aviation 124 (aerodynamics,aircraft-performance) 100 (aircraft-performance,aerodynamics) 0
biology 731 (entomology,species-identification) 10.81 (species-identification,entomology) 89.19
biology 361 (species-identification,zoology) 76.18 (zoology,species-identification) 23.82
biology 350 (botany,species-identification) 44.29 (species-identification,botany) 55.71
biology 322 (neurophysiology,neuroscience) 0 (neuroscience,neurophysiology) 100
biology 321 (biochemistry,molecular-biology) 99.69 (molecular-biology,biochemistry) 0.31
biology 274 (dna,genetics) 4.38 (genetics,dna) 95.62
biology 272 (evolution,genetics) 37.5 (genetics,evolution) 62.5
biology 256 (botany,plant-physiology) 98.05 (plant-physiology,botany) 1.95
biology 251 (entomology,zoology) 0.8 (zoology,entomology) 99.2
biology 247 (cell-biology,molecular-biology) 1.21 (molecular-biology,cell-biology) 98.79
chemistry 1621 (organic-chemistry,reaction-mechanism) 100 (reaction-mechanism,organic-chemistry) 0
chemistry 703 (physical-chemistry,thermodynamics) 99.43 (thermodynamics,physical-chemistry) 0.57
chemistry 648 (aromatic-compounds,organic-chemistry) 0 (organic-chemistry,aromatic-compounds) 100
chemistry 585 (nomenclature,organic-chemistry) 0 (organic-chemistry,nomenclature) 100
chemistry 529 (carbonyl-compounds,organic-chemistry) 0 (organic-chemistry,carbonyl-compounds) 100
chemistry 526 (acid-base,organic-chemistry) 0 (organic-chemistry,acid-base) 100
chemistry 457 (organic-chemistry,synthesis) 100 (synthesis,organic-chemistry) 0
chemistry 429 (organic-chemistry,stereochemistry) 100 (stereochemistry,organic-chemistry) 0
chemistry 420 (acid-base,ph) 100 (ph,acid-base) 0
chemistry 348 (acid-base,inorganic-chemistry) 0 (inorganic-chemistry,acid-base) 100
cooking 393 (baking,bread) 99.75 (bread,baking) 0.25
cooking 290 (baking,cake) 100 (cake,baking) 0
cooking 180 (baking,cookies) 100 (cookies,baking) 0
cooking 179 (baking,substitutions) 91.06 (substitutions,baking) 8.94
cooking 137 (bread,dough) 91.24 (dough,bread) 8.76
cooking 131 (bread,sourdough) 100 (sourdough,bread) 0
cooking 124 (baking,dough) 100 (dough,baking) 0
cooking 122 (bread,yeast) 96.72 (yeast,bread) 3.28
cooking 116 (baking,oven) 100 (oven,baking) 0
cooking 111 (dough,pizza) 91.89 (pizza,dough) 8.11
electronics 1161 (current,voltage) 0.6 (voltage,current) 99.4
electronics 1138 (pcb,pcb-design) 100 (pcb-design,pcb) 0
electronics 1043 (power,power-supply) 0.48 (power-supply,power) 99.52
electronics 844 (batteries,battery-charging) 100 (battery-charging,batteries) 0
electronics 775 (microcontroller,pic) 98.58 (pic,microcontroller) 1.42
electronics 620 (amplifier,operational-amplifier) 3.87 (operational-amplifier,amplifier) 96.13
electronics 619 (power-supply,switch-mode-power-supply) 100 (switch-mode-power-supply,power-supply) 0
electronics 612 (bjt,transistors) 0.49 (transistors,bjt) 99.51
electronics 598 (mosfet,transistors) 0.17 (transistors,mosfet) 99.83
electronics 587 (arduino,microcontroller) 86.03 (microcontroller,arduino) 13.97
Table 13. Tag Ordering Statistics - Second 6 domains
Domain Total Order-1 % Order-2 %
history 298 (nazi-germany,world-war-two) 0.67 (world-war-two,nazi-germany) 99.33
history 179 (united-states,world-war-two) 94.41 (world-war-two,united-states) 5.59
history 153 (europe,middle-ages) 0 (middle-ages,europe) 100
history 141 (japan,world-war-two) 0 (world-war-two,japan) 100
history 138 (military,world-war-two) 0.72 (world-war-two,military) 99.28
history 136 (19th-century,united-states) 0 (united-states,19th-century) 100
history 134 (soviet-union,world-war-two) 0.75 (world-war-two,soviet-union) 99.25
history 117 (20th-century,united-states) 8.55 (united-states,20th-century) 91.45
history 106 (ancient-rome,roman-empire) 84.91 (roman-empire,ancient-rome) 15.09
history 105 (ancient-history,ancient-rome) 99.05 (ancient-rome,ancient-history) 0.95
money 3393 (taxes,united-states) 0.03 (united-states,taxes) 99.97
money 2087 (income-tax,united-states) 0.05 (united-states,income-tax) 99.95
money 883 (401k,united-states) 0 (united-states,401k) 100
money 839 (income-tax,taxes) 3.81 (taxes,income-tax) 96.19
money 662 (tax-deduction,united-states) 0.15 (united-states,tax-deduction) 99.85
money 638 (investing,stocks) 16.3 (stocks,investing) 83.7
money 613 (ira,united-states) 0 (united-states,ira) 100
money 604 (investing,united-states) 0 (united-states,investing) 100
money 554 (mortgage,united-states) 0 (united-states,mortgage) 100
money 541 (roth-ira,united-states) 0 (united-states,roth-ira) 100
movies 518 (character,plot-explanation) 2.9 (plot-explanation,character) 97.1
movies 509 (marvel-cinematic-universe,plot-explanation) 0.2 (plot-explanation,marvel-cinematic-universe) 99.8
movies 367 (game-of-thrones,plot-explanation) 0.82 (plot-explanation,game-of-thrones) 99.18
movies 242 (analysis,plot-explanation) 7.85 (plot-explanation,analysis) 92.15
movies 233 (avengers-infinity-war,marvel-cinematic-universe) 0 (marvel-cinematic-universe,avengers-infinity-war) 100
movies 205 (character,marvel-cinematic-universe) 100 (marvel-cinematic-universe,character) 0
movies 199 (avengers-endgame,marvel-cinematic-universe) 0 (marvel-cinematic-universe,avengers-endgame) 100
movies 184 (analysis,character) 29.89 (character,analysis) 70.11
movies 179 (dialogue,plot-explanation) 5.59 (plot-explanation,dialogue) 94.41
movies 143 (ending,plot-explanation) 2.8 (plot-explanation,ending) 97.2
music 519 (chords,theory) 0 (theory,chords) 100
music 490 (chord-theory,chords) 0 (chords,chord-theory) 100
music 435 (harmony,theory) 0 (theory,harmony) 100
music 410 (scales,theory) 0 (theory,scales) 100
music 404 (chord-theory,theory) 0 (theory,chord-theory) 100
music 363 (electric-guitar,guitar) 0 (guitar,electric-guitar) 100
music 337 (notation,sheet-music) 99.41 (sheet-music,notation) 0.59
music 329 (chords,guitar) 0 (guitar,chords) 100
music 328 (chord-progressions,theory) 0 (theory,chord-progressions) 100
music 306 (notation,piano) 0 (piano,notation) 100
philosophy 272 (logic,philosophy-of-mathematics) 100 (philosophy-of-mathematics,logic) 0
philosophy 266 (epistemology,philosophy-of-science) 94.36 (philosophy-of-science,epistemology) 5.64
philosophy 246 (fallacies,logic) 0.41 (logic,fallacies) 99.59
philosophy 193 (logic,symbolic-logic) 100 (symbolic-logic,logic) 0
philosophy 186 (metaphysics,ontology) 100 (ontology,metaphysics) 0
philosophy 186 (logic,philosophy-of-logic) 100 (philosophy-of-logic,logic) 0
philosophy 183 (argumentation,logic) 0.55 (logic,argumentation) 99.45
philosophy 179 (epistemology,metaphysics) 100 (metaphysics,epistemology) 0
philosophy 179 (epistemology,logic) 1.68 (logic,epistemology) 98.32
philosophy 178 (logic,proof) 100 (proof,logic) 0
physics 4182 (homework-and-exercises,newtonian-mechanics) 99.74 (newtonian-mechanics,homework-and-exercises) 0.26
physics 3658 (forces,newtonian-mechanics) 0.52 (newtonian-mechanics,forces) 99.48
physics 2565 (hilbert-space,quantum-mechanics) 0 (quantum-mechanics,hilbert-space) 100
physics 2360 (operators,quantum-mechanics) 0 (quantum-mechanics,operators) 100
physics 2337 (quantum-mechanics,wavefunction) 100 (wavefunction,quantum-mechanics) 0
physics 2238 (electromagnetism,magnetic-fields) 99.82 (magnetic-fields,electromagnetism) 0.18
physics 2196 (homework-and-exercises,quantum-mechanics) 0 (quantum-mechanics,homework-and-exercises) 100
physics 1988 (newtonian-gravity,newtonian-mechanics) 0 (newtonian-mechanics,newtonian-gravity) 100
physics 1767 (quantum-mechanics,schroedinger-equation) 100 (schroedinger-equation,quantum-mechanics) 0
physics 1704 (black-holes,general-relativity) 0 (general-relativity,black-holes) 100
Table 14. Tag Ordering Statistics - Last 5 domains
Domain Total Order-1 % Order-2 %
politics 570 (donald-trump,united-states) 0 (united-states,donald-trump) 100
politics 557 (president,united-states) 0 (united-states,president) 100
politics 523 (presidential-election,united-states) 0 (united-states,presidential-election) 100
politics 478 (congress,united-states) 0 (united-states,congress) 100
politics 475 (election,united-states) 0.63 (united-states,election) 99.37
politics 467 (brexit,united-kingdom) 0 (united-kingdom,brexit) 100
politics 328 (constitution,united-states) 0 (united-states,constitution) 100
politics 282 (law,united-states) 0.35 (united-states,law) 99.65
politics 279 (senate,united-states) 0.36 (united-states,senate) 99.64
politics 254 (united-states,voting) 100 (voting,united-states) 0
rpg 5330 (dnd-5e,spells) 99.21 (spells,dnd-5e) 0.79
rpg 1367 (dnd-5e,magic-items) 100 (magic-items,dnd-5e) 0
rpg 1212 (class-feature,dnd-5e) 0 (dnd-5e,class-feature) 100
rpg 1204 (dnd-5e,monsters) 99.83 (monsters,dnd-5e) 0.17
rpg 1188 (pathfinder-1e,spells) 90.24 (spells,pathfinder-1e) 9.76
rpg 959 (dnd-3.5e,spells) 72.78 (spells,dnd-3.5e) 27.22
rpg 676 (dnd-5e,feats) 99.85 (feats,dnd-5e) 0.15
rpg 632 (dnd-5e,warlock) 100 (warlock,dnd-5e) 0
rpg 607 (balance,dnd-5e) 0.16 (dnd-5e,balance) 99.84
rpg 567 (combat,dnd-5e) 0.53 (dnd-5e,combat) 99.47
scifi 3514 (short-stories,story-identification) 1.05 (story-identification,short-stories) 98.95
scifi 2109 (marvel,marvel-cinematic-universe) 76.67 (marvel-cinematic-universe,marvel) 23.33
scifi 2029 (books,story-identification) 0.74 (story-identification,books) 99.26
scifi 1922 (the-lord-of-the-rings,tolkiens-legendarium) 52.76 (tolkiens-legendarium,the-lord-of-the-rings) 47.24
scifi 1859 (novel,story-identification) 1.02 (story-identification,novel) 98.98
scifi 1638 (movie,story-identification) 1.47 (story-identification,movie) 98.53
scifi 1497 (star-trek,star-trek-tng) 99.67 (star-trek-tng,star-trek) 0.33
scifi 1077 (aliens,story-identification) 2.04 (story-identification,aliens) 97.96
scifi 866 (a-song-of-ice-and-fire,game-of-thrones) 6.24 (game-of-thrones,a-song-of-ice-and-fire) 93.76
scifi 723 (star-wars,star-wars-legends) 100 (star-wars-legends,star-wars) 0
serverfault 3261 (linux,ubuntu) 98.13 (ubuntu,linux) 1.87
serverfault 2865 (centos,linux) 1.33 (linux,centos) 98.67
serverfault 2498 (amazon-ec2,amazon-web-services) 76.7 (amazon-web-services,amazon-ec2) 23.3
serverfault 2452 (linux,networking) 99.14 (networking,linux) 0.86
serverfault 1912 (apache-2.2,php) 86.72 (php,apache-2.2) 13.28
serverfault 1803 (debian,linux) 1.5 (linux,debian) 98.5
serverfault 1716 (linux,ssh) 98.19 (ssh,linux) 1.81
serverfault 1643 (apache-2.2,linux) 2.01 (linux,apache-2.2) 97.99
serverfault 1560 (iptables,linux) 1.15 (linux,iptables) 98.85
serverfault 1466 (apache-2.2,virtualhost) 96.18 (virtualhost,apache-2.2) 3.82
travel 2181 (uk,visas) 0.05 (visas,uk) 99.95
travel 1779 (schengen,visas) 0.06 (visas,schengen) 99.94
travel 1340 (usa,visas) 2.24 (visas,usa) 97.76
travel 871 (customs-and-immigration,usa) 0 (usa,customs-and-immigration) 100
travel 795 (indian-citizens,visas) 0 (visas,indian-citizens) 100
travel 727 (transit,visas) 0 (visas,transit) 100
travel 726 (customs-and-immigration,visas) 0 (visas,customs-and-immigration) 100
travel 643 (standard-visitor-visas,uk) 0 (uk,standard-visitor-visas) 100
travel 566 (uk,visa-refusals) 100 (visa-refusals,uk) 0
travel 511 (visa-refusals,visas) 0 (visas,visa-refusals) 100
Table 15. Tag-Post Overlap: % of posts in which at least one tag appears in the user text. EMS%: exact match of single-word tags; EMM%: exact match of single- and multi-word tags.
Domain Title Title+Body Title+Body+Answer
EMS EMM EMS EMM EMS EMM
askubuntu 56.94 71.64 77.29 88.67 81.46 91.15
aviation 29.09 49.63 47.53 66.11 58.98 75.76
biology 17.95 29.68 33.70 47.17 42.92 56.97
chemistry 19.72 29.26 32.81 46.44 40.99 56.20
cooking 53.51 71.04 72.75 82.92 80.87 88.40
electronics 55.24 71.11 75.21 86.69 80.6 89.99
history 22.23 44.34 43.76 67.23 59.66 79.51
money 34.47 57.89 56.32 78.66 66.62 86.23
movies 9.49 34.51 17.69 72.92 24.06 77.32
music 37.42 53.76 60.81 75.32 75.02 85.78
philosophy 23.85 43.24 44.85 63.77 59.81 75.54
physics 24.34 40.95 40.25 61.48 48.24 70.20
politics 28.37 54.11 52.64 77.52 67.41 88.21
rpg 23.69 41.14 44.72 65.13 57.52 76.20
scifi 17.76 38.31 28.38 60.49 34.63 70.64
serverfault 58.67 74.50 76.70 89.34 80.02 91.51
travel 45.93 63.66 65.88 79.76 75.11 86.38

Appendix G Tag-Post Overlap: Full Table

Table 15 shows the tag-post overlap in tabular form similar to Figure 3 in Section 3.
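
The overlap percentages can be approximated with a short script. The sketch below makes several assumptions that the text does not spell out: a single-word tag is one without a hyphen, a multi-word tag matches when its hyphens are read as spaces, and matching is case-insensitive on whole words. The names `tag_in_text` and `overlap_percentages` are hypothetical.

```python
import re

def tag_in_text(tag, text):
    """True if the tag, with hyphens read as spaces, occurs as whole words in text."""
    phrase = re.escape(tag.replace("-", " "))
    return re.search(r"(?<!\w)" + phrase + r"(?!\w)", text) is not None

def overlap_percentages(posts):
    """posts: list of (text, tags). Returns (EMS %, EMM %) over all posts."""
    ems_hits = emm_hits = 0
    for text, tags in posts:
        lowered = text.lower()
        # EMS: only single-word tags are allowed to match.
        if any("-" not in t and tag_in_text(t, lowered) for t in tags):
            ems_hits += 1
        # EMM: single- and multi-word tags are both allowed to match.
        if any(tag_in_text(t, lowered) for t in tags):
            emm_hits += 1
    n = len(posts)
    return round(100 * ems_hits / n, 2), round(100 * emm_hits / n, 2)

# Toy posts, not taken from the dataset.
posts = [
    ("How do I fix grub2 after a dual boot install?", ["grub2", "dual-boot"]),
    ("Dual boot problem", ["dual-boot"]),
    ("Screen flickers", ["nvidia", "drivers"]),
]
ems, emm = overlap_percentages(posts)
```

Passing the title, title plus body, or title plus body plus answers as `text` would give the three column groups of the table.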

Appendix H Decoding Phase of the MRPG Model

We allow the model to generate tag tokens up to the maximum output length, which is an input parameter, and then use a few heuristics to select appropriate tag tokens and choose the top-k tags. The heuristics encode prior knowledge about what a tag token should look like: (1) a tag cannot start or end with a '-'; (2) punctuation tokens are skipped; (3) adjacent repeated tags are ignored. We then combine the tag tokens between two $\langle$tagsep$\rangle$ tokens to form the final tag. We also calculate the top-k ($k = 1, \dots, 5$) most probable tags based on the combined probability scores of the tag tokens.
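
The decoding heuristics can be sketched as follows. Several details are assumptions, since the text does not specify them: the separator is written as the literal string `<tagsep>`, tag tokens are joined by plain concatenation, a candidate violating rule (1) is dropped rather than trimmed, a lone '-' token is kept as part of a tag, and the "combined probability score" is taken to be the sum of token log-probabilities. The function name `decode_tags` is hypothetical.

```python
import math
import string

TAGSEP = "<tagsep>"  # stand-in for the model's tag separator token (assumed spelling)

def decode_tags(tokens, probs, k=5):
    """Group generated tokens into tags and return the k highest-scoring tags."""
    candidates = []              # (tag, log-probability), in generation order
    pieces, logp = [], 0.0
    # A trailing separator guarantees that the last tag is flushed.
    for tok, p in zip(list(tokens) + [TAGSEP], list(probs) + [1.0]):
        if tok == TAGSEP:
            tag = "".join(pieces)
            # Rule 1: a tag cannot start or end with '-'.
            well_formed = bool(tag) and not tag.startswith("-") and not tag.endswith("-")
            # Rule 3: ignore a tag that repeats the previously kept tag.
            repeated = bool(candidates) and candidates[-1][0] == tag
            if well_formed and not repeated:
                candidates.append((tag, logp))
            pieces, logp = [], 0.0
        elif tok != "-" and all(ch in string.punctuation for ch in tok):
            # Rule 2: skip punctuation tokens (a lone '-' is kept as a joiner).
            continue
        else:
            pieces.append(tok)
            logp += math.log(p)
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]

# Toy generation, not real model output.
tokens = ["dual", "-", "boot", TAGSEP, "grub", "2", TAGSEP, "grub", "2", TAGSEP,
          "-", "boot", TAGSEP, ",", "apt"]
probs = [0.9, 0.9, 0.9, 1.0, 0.8, 0.9, 1.0, 0.8, 0.9, 1.0,
         0.9, 0.9, 1.0, 0.5, 0.6]
top3 = decode_tags(tokens, probs, k=3)
```

In this toy run the repeated `grub2` and the malformed `-boot` are discarded, and the stray comma does not affect the score of `apt`.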

Appendix I Feature-based Model Configurations:

For both the tf-idf and bag-of-words features, we consider unigram and bigram features with a minimum document frequency of 0.00009, and generate at most 200,000 features. We use log loss and search the hyper-parameter space alpha = [0.0001, 0.001, 0.00001] and penalty = [$l_1$, $l_2$] for the one-versus-rest classifier trained with Stochastic Gradient Descent. For both models, we find that the $l_2$ penalty with alpha = 0.00001 yields the best performance.
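
The configuration above amounts to a small search grid, written out below as plain Python data. The key names (`ngram_range`, `min_df`, `max_features`, `loss`, `alpha`, `penalty`) follow the parameter names of scikit-learn's text vectorizers and `SGDClassifier`. That the authors used scikit-learn is an assumption, and the spelling of the log-loss option varies between library versions.

```python
from itertools import product

# Feature extraction settings from the text (scikit-learn-style names, assumed).
FEATURES = {"ngram_range": (1, 2), "min_df": 0.00009, "max_features": 200_000}

# Classifier settings: log loss, with alpha and penalty searched over a grid.
FIXED = {"loss": "log_loss"}
GRID = {"alpha": [0.0001, 0.001, 0.00001], "penalty": ["l1", "l2"]}

def grid_points(grid):
    """Expand a dict of value lists into a list of concrete configurations."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

CANDIDATES = grid_points(GRID)                 # 3 alphas x 2 penalties
BEST = {"alpha": 0.00001, "penalty": "l2"}     # best setting reported in the text
```

Each entry of `CANDIDATES` would be merged with `FIXED` and passed to a one-versus-rest wrapper around the linear classifier, with one such search for the tf-idf features and one for the bag-of-words features.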

Appendix J P-values for Hit@5

Table 16 shows the p-values obtained when the MRPG model's Hit@5 is compared with that of the MP model. Significance is assessed with the one-sided Wilcoxon test (Hollander et al., 2013). For k = 1, 2, 3, 4, the MRPG model's Hit@k shows significant improvements over the MP model. The MRPG model also significantly outperforms all other baselines in the Hit@k metrics for each value of k.
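
Every p-value in Table 16 is a multiple of 1/32, which is what an exact one-sided Wilcoxon signed-rank test yields with five paired observations. The smallest attainable value is 1/32 = 0.03125, reached when all five differences favor MRPG. The sketch below illustrates this under the assumption of five paired runs per domain, which is an inference from the p-values rather than something stated in the text. The function name and the scores are hypothetical, and tied magnitudes are not handled.

```python
from itertools import product

def wilcoxon_one_sided_exact(x, y):
    """Exact one-sided signed-rank p-value for H1: x tends to exceed y (paired)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    # Rank absolute differences, 1 = smallest (assumes no tied magnitudes).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under H0 every sign pattern is equally likely: enumerate all 2^n of them
    # and count those with a positive-rank sum at least as large as observed.
    at_least = total = 0
    for signs in product((0, 1), repeat=len(diffs)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus:
            at_least += 1
    return at_least / total

# Hypothetical Hit@5 scores from five paired runs (not values from the paper).
mrpg_runs = [83.0, 83.2, 82.7, 82.9, 83.4]
mp_runs = [80.5, 80.3, 80.6, 80.1, 80.2]
p_value = wilcoxon_one_sided_exact(mrpg_runs, mp_runs)
```

With only five pairs, a domain can reach significance at the 0.05 level only if MRPG wins every run, which is consistent with 0.03125 being the only significant value in the table.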

Table 16. MRPG vs. MP: p-values for Hit@5 calculated by the one-sided Wilcoxon test (Hollander et al., 2013). The improvement of MRPG is considered significant if the p-value is less than 0.05.
Domains P-Values Is Significant
askubuntu 0.03125 Yes
aviation 0.15625 No
biology 1.00000 No
chemistry 0.03125 Yes
cooking 0.03125 Yes
electronics 0.03125 Yes
history 0.09375 No
money 0.03125 Yes
movies 0.40625 No
music 0.03125 Yes
philosophy 0.50000 No
physics 0.03125 Yes
politics 0.03125 Yes
rpg 0.03125 Yes
scifi 0.03125 Yes
serverfault 0.03125 Yes
travel 0.03125 Yes

Appendix K Detailed Tag-Post Coverage %

Table 17 shows detailed tag-post coverage.
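
One way to compute such coverage figures is sketched below. It assumes that the coverage of the top-n tags means the percentage of posts carrying at least one of the n most frequent tags; this definition and the function name `top_n_coverage` are assumptions, not taken from the text.

```python
from collections import Counter

def top_n_coverage(posts, n):
    """Percentage of posts that carry at least one of the n most frequent tags."""
    # Count each tag once per post.
    freq = Counter(tag for tags in posts for tag in set(tags))
    top = {tag for tag, _ in freq.most_common(n)}
    covered = sum(1 for tags in posts if top.intersection(tags))
    return round(100 * covered / len(posts), 2)

# Toy data, not taken from the dataset.
posts = [["a", "b"], ["a"], ["b", "c"], ["b"], ["d"]]
coverage_top1 = top_n_coverage(posts, 1)
coverage_top2 = top_n_coverage(posts, 2)
```

The 100T% column does not need the posts at all: it is 100 divided by the number of distinct tags, expressed as a percentage (for askubuntu, 100 / 3121 is about 3.2%).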

Table 17. Post coverage of the top-n tags. #T: number of distinct tags; 100T%: percentage of the whole tag space made up by the top-100 tags.
Domain #T Top1 Top3 Top5 Top10 Top50 Top100 100T%
askubuntu 3121 5.67 15.87 24.81 40.21 71.84 82.68 3.2
aviation 1002 11.05 25.81 33.87 45.93 79.13 89.43 9.98
biology 739 9.22 23.91 37.84 55.05 84.39 91.76 13.53
chemistry 375 23.05 42.61 48.62 61.38 87.69 95.35 26.67
cooking 833 9.55 22.45 29.55 38.99 71.45 85.19 12
electronics 2226 4.94 13.84 20.88 32.81 68.96 81.98 4.49
history 813 10.86 25.08 35.27 45.91 80.82 89.95 12.3
money 995 37.04 49.69 56.62 68.52 88.33 94.18 10.05
movies 4348 36.93 49.59 56.36 66.84 81.59 85.88 2.3
music 512 14.93 39.08 47.59 58.04 87.42 94.54 19.53
philosophy 559 19.39 37.1 48.56 63.3 87.29 93.77 17.89
physics 893 12.7 28.35 39.99 55.1 83.98 91.68 11.2
politics 739 46 59.16 63.64 66.41 89.63 94.95 13.53
rpg 1195 42.5 61.23 76.9 79.75 88.01 92.66 8.37
scifi 3433 27.86 47.75 62.03 70.67 81.32 85.04 2.91
serverfault 3814 11.92 22.16 29.97 42.76 72.8 82.86 2.62
travel 1891 22.2 36.03 48.34 58.34 84.39 92.36 5.29

Appendix L Effect of Using Answers

Answers can be used in domains or organizations where some answers have already been posted before the tag-prediction approach is deployed. The motivation comes directly from our Tag-Post Overlap analysis in Table 3: once answers are included, at least one tag appears in the text of at least 70% of posts in 15 of the 17 domains, the exceptions being chemistry and biology. Even in these two domains, the overlap increases by around 9-10 percentage points. In some domains, the overlap reaches 91%.

Table 18. Effect of Vocabulary Size Reduction on Individual Models Hit@5 Metric
Domain MP MRPG
90 85 Δ 90 85 Δ
askubuntu 80.42 75.73 -4.69 83.18 80.92 -2.26
aviation 77.12 73.21 -3.91 77.64 77.68 0.04
biology 79.31 76.35 -2.96 78.03 77.41 -0.62
chemistry 77.77 75.62 -2.15 79.51 79.63 0.12
cooking 80.42 76.81 -3.61 85.38 85.29 -0.09
electronics 77.92 73.69 -4.23 81.62 80.56 -1.06
history 80.57 77.59 -2.98 82.29 81.21 -1.08
money 84.46 80.38 -4.08 88.19 87.9 -0.29
movies 83.54 78.6 -4.94 82.77 82.8 0.03
music 82.72 78.73 -3.99 84.37 84.18 -0.19
philosophy 79.17 74.4 -4.77 79.58 79.1 -0.48
physics 81.49 77.3 -4.19 86.48 85.78 -0.7
politics 86.43 82.4 -4.03 91.38 90.74 -0.64
rpg 83.71 79.41 -4.3 89.23 88.1 -1.13
scifi 85.81 82.22 -3.59 91.55 90.72 -0.83
serverfault 81.87 77.26 -4.61 85.9 85.04 -0.86
travel 84.09 79.41 -4.68 89.47 88.53 -0.94
Table 19. Head Contributions for 90%, 85% and 95% Post Coverage by Tags. P: % of total correct predictions made by the P-Head only; G: % of total correct predictions made by the G-Head only. Values in parentheses are differences from the 90% setting.
Domain 90 85 95
P G P G P G
askubuntu 49.66 11.85 46.5(-3.16) 14.3(2.45) 59.27(9.61) 7.54(-4.31)
aviation 53.8 10.42 47.78(-6.02) 13.76(3.34) 61.54(7.74) 5.41(-5.01)
biology 53.86 9.88 49.32(-4.54) 11.47(1.59) 61.1(7.24) 5.8(-4.08)
chemistry 55.18 10.61 50.89(-4.29) 12.96(2.35) 63.87(8.69) 6.31(-4.3)
cooking 55.03 10.95 47.15(-7.88) 14.93(3.98) 64.45(9.42) 6.24(-4.71)
electronics 52.07 11.28 46.28(-5.79) 14.39(3.11) 59.52(7.45) 7.24(-4.04)
history 55.06 9.91 52.15(-2.91) 10.51(0.6) 65.25(10.19) 3.94(-5.97)
money 50.17 10.23 43.74(-6.43) 12.82(2.59) 61.49(11.32) 5.99(-4.24)
movies 68.51 4.55 58.63(-9.88) 7.4(2.85) 74.17(5.66) 1.61(-2.94)
music 56.46 10.18 50.32(-6.14) 13.19(3.01) 64.44(7.98) 6.69(-3.49)
philosophy 58.98 7.81 53.02(-5.96) 10.85(3.04) 64.9(5.92) 4.61(-3.2)
physics 45.01 12 39.84(-5.17) 14.73(2.73) 55.04(10.03) 8.41(-3.59)
politics 52.6 9.14 44.87(-7.73) 12.81(3.67) 67.38(14.78) 4.51(-4.63)
rpg 51.89 7.68 39.03(-12.86) 11.34(3.66) 68.68(16.79) 3.97(-3.71)
scifi 74.64 5.11 62.79(-11.85) 8.37(3.26) 84.69(10.05) 1.6(-3.51)
serverfault 50.63 11.9 42.96(-7.67) 15.97(4.07) 58.55(7.92) 8.45(-3.45)
travel 46.26 12.39 40.6(-5.66) 15.94(3.55) 53.59(7.33) 8.19(-4.2)
Table 20. Model Performance (Hit@k) for MP and MRPG models
Domain MP MRPG
Hit@1 Hit@2 Hit@3 Hit@4 Hit@5 Hit@1 Hit@2 Hit@3 Hit@4 Hit@5
askubuntu 31.59 ± 0.14 50.89 ± 0.21 64.85 ± 0.14 74.23 ± 0.31 80.44 ± 0.11 50.86 ± 0.13 72.72 ± 0.1 80.19 ± 0.09 81.71 ± 0.17 82.94 ± 0.15
aviation 30.72 ± 0.95 48.39 ± 0.93 62.05 ± 0.55 72.23 ± 0.66 77.09 ± 0.44 47.28 ± 0.19 67.37 ± 0.36 75.21 ± 0.47 76.02 ± 0.45 77.63 ± 0.56
biology 34.66 ± 0.64 50.81 ± 1.01 63.77 ± 0.31 73.96 ± 0.32 78.96 ± 0.34 49.67 ± 0.7 68.8 ± 0.37 75.7 ± 0.42 76.51 ± 0.44 77.55 ± 0.41
chemistry 38.83 ± 0.51 54.77 ± 0.36 65.84 ± 0.72 73.35 ± 0.34 77.66 ± 0.1 50.28 ± 0.15 69.81 ± 0.33 76.43 ± 0.3 77.23 ± 0.38 79.17 ± 0.45
cooking 35.46 ± 0.28 56.17 ± 0.79 67.2 ± 0.82 76.21 ± 0.95 80.86 ± 0.42 52.43 ± 0.68 75.61 ± 0.15 82.73 ± 0.25 83.74 ± 0.31 85.18 ± 0.29
electronics 28.67 ± 0.43 47.28 ± 0.73 61.78 ± 0.54 70.97 ± 0.24 77.51 ± 0.26 49.64 ± 0.63 71.4 ± 0.53 78.92 ± 0.46 80.06 ± 0.47 81.3 ± 0.53
history 34.18 ± 1.21 54.16 ± 1.24 66.47 ± 0.97 76.22 ± 0.53 80.45 ± 0.09 54.07 ± 0.1 73.56 ± 0.62 79.5 ± 0.82 80.29 ± 0.88 81.23 ± 1
money 51.05 ± 0.44 66.28 ± 0.98 75.43 ± 0.43 81.01 ± 0.24 84.15 ± 0.23 60.59 ± 0.47 78.89 ± 0.14 86.01 ± 0.4 86.75 ± 0.41 87.94 ± 0.42
movies 50.06 ± 0.5 64.28 ± 1.53 73.58 ± 0.88 79.48 ± 0.7 82.91 ± 0.55 57.33 ± 0.44 78.41 ± 0.28 82.03 ± 0.69 83.05 ± 0.9 83.25 ± 0.99
music 37.03 ± 0.41 57.17 ± 0.62 68.79 ± 0.27 77.76 ± 0.43 82.66 ± 0.26 53.17 ± 0.69 75.17 ± 0.35 81.46 ± 0.52 82.28 ± 0.49 83.71 ± 0.51
philosophy 34.9 ± 0.8 52.94 ± 0.26 66.03 ± 0.77 75.46 ± 0.53 79.45 ± 0.2 53.09 ± 0.95 72.01 ± 0.27 78.03 ± 0.59 78.76 ± 0.52 79.49 ± 0.56
physics 41.27 ± 0.46 60.63 ± 0.49 70.59 ± 0.18 77.39 ± 0.31 81.12 ± 0.22 57.96 ± 0.28 75.49 ± 0.28 83.5 ± 0.38 84.47 ± 0.41 86.34 ± 0.37
politics 65.26 ± 1.76 73.87 ± 1.18 78.43 ± 0.86 84.04 ± 0.42 86.29 ± 0.25 71.61 ± 0.5 82.92 ± 0.31 88.97 ± 0.46 89.91 ± 0.42 90.98 ± 0.46
rpg 68.68 ± 0.29 75.93 ± 0.31 79.43 ± 0.37 81.66 ± 0.36 83.31 ± 0.33 72.85 ± 0.4 81.91 ± 0.17 87.59 ± 0.12 88.28 ± 0.16 89.09 ± 0.16
scifi 76.65 ± 0.34 80.64 ± 0.42 82.69 ± 0.36 84.89 ± 0.15 85.91 ± 0.11 81.99 ± 0.12 86.2 ± 0.21 90.41 ± 0.23 90.85 ± 0.28 91.53 ± 0.32
serverfault 25.9 ± 0.12 50.99 ± 0.59 65.52 ± 0.25 75.24 ± 0.34 81.66 ± 0.16 53.05 ± 0.38 75.36 ± 0.18 82.56 ± 0.17 84.16 ± 0.29 85.82 ± 0.26
travel 45.04 ± 0.51 63.62 ± 0.14 72.52 ± 0.41 79.96 ± 0.24 83.96 ± 0.12 58.7 ± 0.34 78.57 ± 0.2 86.75 ± 0.25 87.78 ± 0.28 89.5 ± 0.3