Modeling Tag Prediction based on Question Tagging Behavior Analysis of CommunityQA Platform Users
Abstract.
In community question-answering platforms, tags play essential roles in effective information organization and retrieval, better question routing, faster response to questions, and assessment of topic popularity. Hence, automatic assistance for predicting and suggesting tags for posts is of high utility to users of such platforms. To develop better tag prediction across diverse communities and domains, we performed a thorough analysis of users’ tagging behavior in 17 StackExchange communities. We found various common inherent properties of this behavior on those diverse domains. We used the findings to develop a flexible neural tag prediction architecture, which predicts both popular tags and more granular tags for each question. Our extensive experiments and obtained performance show the effectiveness of our model.
1. Introduction
Community Question Answering (CQA) platforms have become a very important online source of information for Web users. On these platforms, information seeking takes the form of questions and answers in communities formed around common domains of interest. StackExchange, Quora, AnswerBag, Question2Answer, Reddit (stackexchange.com, quora.com, answerbag.com, question2answer.org, and reddit.com, respectively) and Biostars (Parnell et al., 2011) are some of the most popular public CQA platforms. Many enterprise entities offer similar private platforms for their employees. Over time, these communities have amassed large online information repositories, with high numbers of daily active users. Thus, there is a need to organize and retrieve information efficiently, as well as to facilitate question routing to interested and qualified experts in order to provide a seamless user experience and interaction. Semantic tagging of questions plays an important role in this context.
Most CQA platforms require users to assign tags to their questions. Tags are keywords representative of the topics covered by those questions. They help communities to (1) categorize and organize information; (2) retrieve existing answers for users looking for information, which in turn reduces duplicate question creation; (3) route questions to topic experts, which improves query response time and answer quality; (4) provide tag-based notifications, which allow knowledgeable community members to answer questions in their areas of expertise and gain reputation; and (5) assess the popularity of various areas and topics in the targeted domain.
Asking users to annotate their questions with tags without providing adequate support poses several challenges, in particular with respect to novice users and to the lack of knowledge about tag usage in a community, which may lead to the creation of various tags with the same meaning, as well as different orthographic forms of those tags. This makes question routing difficult (for tag-based subscription platforms), delays response time, and leads to poor information organization. In turn, addressing these issues would require community administrators to constantly work on identifying and merging near-duplicate tags. Additionally, lack of support in suggesting adequate tags may inhibit novice users from asking questions and/or lead to questions being mistagged and not answered. These challenges may become more severe in enterprise CQA platforms due to community size and topic sparsity.
Against this background, tag prediction becomes an extremely important yet challenging task for both public and private CQA platforms. In this investigation, our first goal was to understand the commonalities of the tagging behaviors of users through a large-scale analysis of 17 diverse domains in StackExchange (Section 3). Our analysis revealed that while these domains are quite diverse in terms of volume of questions, users and tags, they share common distributional properties for tag and tag pair usage. There is also a large lexical overlap between the tags and user texts in every domain, and post coverage of tags is high in all domains. Tags also show positional stability, and tag pairs show particular ordering preferences that form a soft hierarchy among tags.

We incorporate these findings to develop a neural model with two tag-prediction heads: one trained to predict existing popular tags, such as the names of important topics in a domain (e.g. "harry-potter", "doctor-who", and "star-wars" in the scifi domain) and frequently used meta-tags (e.g. "video-games", "books", and "short-stories" in scifi), and another trained to generate finer-grained tags, which may have been used rarely on previous questions or are new. Typically, the former category of tags represents the main topic area of a question, while the latter helps in further scoping down and clarifying it. Both types of tags are equally important in identifying the question, and hence tag prediction systems need to predict not only the main generic tags but also the refined ones.
Our experiments show that the proposed approach significantly outperforms baseline methods in prediction of both generic tags and finer-grained tags. We also investigate and show the effect of reducing the pre-defined vocabulary size, as well as the contributions of each prediction head. Our main contributions in this work are:
• We present an in-depth analysis of the tagging behaviors of the users of a CQA platform (StackExchange) on 17 diverse domains. We present our findings of question tag analysis across four dimensions: tag space, tag co-occurrence, tag pair ordering, and tag positional stability.
• We propose a tag prediction architecture for both predicting popular tags from a pre-defined vocabulary and generating refined tags not present in the vocabulary.
• We perform comprehensive experiments on the 17 domains and show effects of each model component under various experimental settings.
2. Dataset Preparation
We collected data from 17 communities of StackExchange that correspond to a diverse set of domains. We use the StackExchange data dump of 2021-03-01 (https://archive.org/details/stackexchange_20210301) for our analysis and model. We find that the Post.xml file is sufficient for our tag analysis and predictions. We only consider posts that are either questions or answers (PostTypeId), and we reject posts with no owners (OwnerUserId, OwnerDisplayName). As imposed by StackExchange, the minimum and maximum number of tags assigned to each post are one and five respectively, and all the posts in this dataset are dated prior to March 2021. We chose several domains from each of the following StackExchange categories (https://stackexchange.com/sites#): Technology, Culture & recreation, Life & arts, Science and Professional. Each selected domain has at least a decade of posts. We do not include the stackoverflow domain because of its enormous volume and because a random sample might not be representative of the full data of this domain. Instead, we consider askUbuntu, which is also a representative community of the Technology category.
3. Tagging Behavior Analysis
To understand the user behavior of question tagging and to identify the inherent commonalities, we analyze ten years of data from these 17 domains.
Mathematical Notation: Without loss of generality, let $d$ denote one of the domains (out of 17) being investigated, $P_d$ the set of posts in the data for this domain, and $T_d = \{t_1, t_2, \ldots\}$ the set of all tags used in domain $d$. Each post $p \in P_d$ has an associated sequence of tags $S_p = (t_{(1)}, \ldots, t_{(n)})$, $1 \leq n \leq 5$, where $t_{(j)}$ denotes the tag at position $j$ in that sequence. We employ parentheses to distinguish between the positional information of a tag in a sequence and the indexes that identify elements of the tag set observed for domain $d$.

3.1. Community Diversity
We observed a high degree of variability for the selected domains in terms of Question Volume, Tag Space and Asker Volume. Figure 1 shows a visual comparison of this variability, while Table 1 shows general statistics for each domain. In terms of the amount of information created over a decade, only four domains have over 100K posted questions (#Q), while the domains politics and history have merely 12K. If we consider the number of unique tags (#T) created, the domain movies ranks highest, as new movie titles are added to the tag set on a weekly basis. To quantify tag re-use in each domain, we define posts-per-tag (PPT) as the number of posts available for one tag. We observe that physics, askubuntu, and chemistry are the domains with the most tag re-use (PPT of about 100 or more), while the movies domain (PPT < 5) frequently introduces new tags. The number of posts having views over 100 (V100) can be used to infer the popularity of posts in each domain. From the average number of tags (AvgT) per post, we can infer the need for detailed tagging in each domain. In travel, physics, and money, AvgT > 3 indicates that users feel the need to assign more than 3 tags to clarify their questions. In contrast, the movies domain has the lowest AvgT (2.09), showing that two tags on average are sufficient. Some domains, such as aviation, philosophy, history, movies, and politics, are less popular, with fewer than 10K askers (#A) in a decade. More statistics are in Appendix Table 10.
Domain | #Q | #T | PPT | AvgT | V100 | #A | QPA |
---|---|---|---|---|---|---|---|
askubuntu | 371800 | 3121 | 119.13 | 2.78 | 1093 | 201912 | 1.84 |
aviation | 20345 | 1002 | 20.30 | 2.56 | 12 | 7066 | 2.88 |
biology | 25671 | 739 | 34.74 | 2.58 | 11 | 12089 | 2.12 |
chemistry | 37476 | 375 | 99.94 | 2.37 | 7 | 17202 | 2.18 |
cooking | 24513 | 833 | 29.43 | 2.30 | 13 | 12413 | 1.97 |
electronics | 152980 | 2226 | 68.72 | 2.77 | 36 | 61869 | 2.47 |
history | 12562 | 813 | 15.45 | 2.84 | 19 | 5296 | 2.37 |
money | 32648 | 995 | 32.81 | 3.11 | 37 | 18010 | 1.81 |
movies | 20749 | 4348 | 4.77 | 2.09 | 30 | 6931 | 2.99 |
music | 20925 | 512 | 40.87 | 2.52 | 5 | 10447 | 2.00 |
philosophy | 15624 | 559 | 27.95 | 2.40 | 6 | 6640 | 2.35 |
physics | 180166 | 893 | 201.75 | 3.17 | 131 | 59774 | 3.01 |
politics | 12416 | 739 | 16.80 | 2.90 | 27 | 3970 | 3.13 |
rpg | 42693 | 1195 | 35.73 | 2.91 | 56 | 11541 | 3.70 |
scifi | 62987 | 3433 | 18.35 | 2.25 | 153 | 22717 | 2.77 |
serverfault | 299895 | 3814 | 78.63 | 2.90 | 327 | 130214 | 2.30 |
travel | 42201 | 1891 | 22.32 | 3.28 | 42 | 24895 | 1.70 |
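The derived columns of Table 1 can be reproduced from the raw counts. The following is a minimal sketch, assuming that PPT is the ratio #Q / #T and QPA is the ratio #Q / #A (an inference from the table values, not a definition stated in the text); the function names are illustrative only.

```python
def posts_per_tag(num_questions: int, num_tags: int) -> float:
    """PPT: average number of questions available per unique tag (assumed #Q / #T)."""
    return num_questions / num_tags


def questions_per_asker(num_questions: int, num_askers: int) -> float:
    """QPA: average number of questions per asker (assumed #Q / #A)."""
    return num_questions / num_askers


# (#Q, #T, #A) copied from two rows of Table 1.
table1_rows = {
    "askubuntu": (371800, 3121, 201912),
    "movies": (20749, 4348, 6931),
}

for domain, (q, t, a) in table1_rows.items():
    ppt = round(posts_per_tag(q, t), 2)
    qpa = round(questions_per_asker(q, a), 2)
    print(f"{domain}: PPT={ppt}, QPA={qpa}")
```

Under these assumptions the rounded values agree with the PPT and QPA columns for both rows.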
3.2. Tag-Space Analysis
We analyzed each domain's tag space along four aspects: (1) General Tag Statistics, (2) Tag Distributions, (3) Tag-Post Coverage, and (4) Tag-Post Overlap.
General Tag Statistics: The shortest tag in every domain is merely 1-3 characters long (c, air, 3g) while the longest tag is 22-35 characters long (valerian-city-of-a-thousand-planets, neurodegenerative-disorders). askubuntu has the lowest average tag length (8.17) while movies has the highest (13.66). We believe that the tags in askubuntu are short technical terms of a subtopic but movie names tend to be quite long in comparison and are often used as a part of a tag in the movie domain. Table 2 shows the distribution based on the number of words of the tags. With the exception of movies, rpg, and scifi the majority of tags in all the domains consist of three or fewer words. The shortest and longest tags for each domain are presented in Appendix Table 9.
Domain | 1 | 2 | 3 | 4 | 5 | >5 |
---|---|---|---|---|---|---|
askubuntu | 80.83 | 18.73 | 0.37 | 0.07 | 0 | 0 |
aviation | 49.74 | 43.86 | 6.34 | 0.05 | 0 | 0 |
biology | 69.30 | 29.95 | 0.75 | 0 | 0 | 0 |
chemistry | 47.17 | 50.36 | 2.31 | 0.16 | 0 | 0 |
cooking | 78.53 | 21.11 | 0.36 | 0.01 | 0 | 0 |
electronics | 74.23 | 23.9 | 1.33 | 0.54 | 0 | 0 |
history | 56.86 | 36.1 | 7.01 | 0.03 | 0 | 0 |
money | 50.00 | 45.51 | 4.05 | 0.45 | 0 | 0 |
movies | 32.81 | 41.58 | 16.32 | 5.61 | 2.57 | 1.1 |
music | 77.74 | 21.07 | 1.17 | 0.02 | 0 | 0 |
philosophy | 69.42 | 14.02 | 16.29 | 0.27 | 0.01 | 0 |
physics | 41.37 | 49.31 | 9.02 | 0.3 | 0 | 0 |
politics | 51.26 | 45.05 | 3.59 | 0.08 | 0.02 | 0 |
rpg | 42.43 | 51.39 | 4.82 | 1.11 | 0.16 | 0.09 |
scifi | 31.04 | 49.23 | 13.19 | 3.23 | 2.1 | 1.2 |
serverfault | 67.91 | 23.09 | 7.62 | 1.32 | 0.06 | 0 |
travel | 65.78 | 26.5 | 6.87 | 0.85 | 0 | 0 |
Tag Distributions: There is a long tail in the distribution of tags in every domain (Figure 2). We observe that (1) most larger domains where tag re-use is high, such as askubuntu, electronics, and biology, have smoother tag distributions and (2) for some smaller domains like scifi, movies, and rpg, the most frequent tag dominates the distribution. The remaining distributions are shown in Appendix Figure 12. Also, Table 3 shows that the 100 most frequent tags (100Tag%) constitute a very small portion of the tag space for large domains.
Domain | #T | 100Tag% | Top1 | Top10 | Top100 |
---|---|---|---|---|---|
askubuntu | 3121 | 3.20 | 5.67 | 40.21 | 82.68 |
aviation | 1002 | 9.98 | 11.05 | 45.93 | 89.43 |
biology | 739 | 13.53 | 9.22 | 55.05 | 91.76 |
chemistry | 375 | 26.67 | 23.05 | 61.38 | 95.35 |
cooking | 833 | 12.00 | 9.55 | 38.99 | 85.19 |
electronics | 2226 | 4.49 | 4.94 | 32.81 | 81.98 |
history | 813 | 12.30 | 10.86 | 45.91 | 89.95 |
money | 995 | 10.05 | 37.04 | 68.52 | 94.18 |
movies | 4348 | 2.30 | 36.93 | 66.84 | 85.88 |
music | 512 | 19.53 | 14.93 | 58.04 | 94.54 |
philosophy | 559 | 17.89 | 19.39 | 63.30 | 93.77 |
physics | 893 | 11.20 | 12.70 | 55.10 | 91.68 |
politics | 739 | 13.53 | 46.00 | 66.41 | 94.95 |
rpg | 1195 | 8.37 | 42.50 | 79.75 | 92.66 |
scifi | 3433 | 2.91 | 27.86 | 70.67 | 85.04 |
serverfault | 3814 | 2.62 | 11.92 | 42.76 | 82.86 |
travel | 1891 | 5.29 | 22.20 | 58.34 | 92.36 |

Post Coverage by Tags: We consider a tag to cover a post if it is present in the tag sequence of the post. Table 3 shows the percentage of total posts that can be covered by the most frequent tags (Top1, Top10, Top100) in each domain. We observe that the most frequent tag (Top1) covers at most 10% of posts in the electronics, askubuntu, cooking, and biology domains but more than 40% in the politics and rpg domains. More than 81% of all posts in each domain are covered by the 100 most frequent tags.
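The coverage measure described above can be sketched in a few lines. This is an illustrative implementation, not the authors' code: it assumes each post is given as a list of unique tags and that "top-k" means the k tags with the highest post counts.

```python
from collections import Counter


def top_k_coverage(post_tags, k):
    """Percentage of posts that contain at least one of the k most frequent tags."""
    # Tags are assumed unique within a post, so occurrence count equals post count.
    counts = Counter(tag for tags in post_tags for tag in tags)
    top = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for tags in post_tags if top.intersection(tags))
    return 100.0 * covered / len(post_tags)


# Toy data (hypothetical posts, tag names borrowed from the rpg examples).
toy_posts = [
    ["dnd-5e", "spells"],
    ["dnd-5e", "magic-items"],
    ["pathfinder-1e"],
    ["dnd-5e"],
]

print(top_k_coverage(toy_posts, 1))
print(top_k_coverage(toy_posts, 4))
```

With a single dominant tag, as in this toy data, Top1 coverage is already high, which mirrors the behavior reported for domains such as rpg and politics.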
Tag-Post Overlap: Figure 3 shows whether the tags appear in user contents (question-title / question-body / answers) using two metrics: (1) single-word tag exact match (EMS) and (2) both single- and multi-word tag exact match (EMM). We observe that in 8/17 domains, tags appear in more than 50% of post titles. The movies domain has more multi-word tags than single-word tags (9.49% compared to 34.51%). Two science domains - biology and chemistry - have the lowest tag overlap (30%) with the question title (T-EMS). When we include the question body, we observe that in 9/17 domains, question tags appear in more than 70% of posts. Finally, if we include every answer for each question, all the domains (except chemistry and biology) have their tags appear in more than 70% of the posts. The three larger domains (askubuntu, serverfault, and electronics) have more than 90% overlap. The overlap is lowest (56%) for the chemistry and biology domains.
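The exact matching procedure behind EMS and EMM is not spelled out in the text, so the following is only a sketch under stated assumptions: matching is case-insensitive, a hyphen in a tag may appear as a hyphen or whitespace in the text, and a post counts as overlapping if at least one of its own tags is found. The helper names are hypothetical.

```python
import re


def tag_in_text(tag: str, text: str) -> bool:
    """Exact match of a tag in text; hyphens may match hyphens or spaces (assumption)."""
    words = [re.escape(word) for word in tag.lower().split("-")]
    pattern = r"(?<!\w)" + r"[\s-]+".join(words) + r"(?!\w)"
    return re.search(pattern, text.lower()) is not None


def overlap_rate(posts, single_word_only: bool) -> float:
    """Percentage of (text, tags) posts whose text contains at least one of its tags.

    single_word_only=True mimics EMS; False mimics EMM.
    """
    hits = 0
    for text, tags in posts:
        candidates = [t for t in tags if "-" not in t] if single_word_only else tags
        if any(tag_in_text(t, text) for t in candidates):
            hits += 1
    return 100.0 * hits / len(posts)


# Toy data (hypothetical question titles).
toy_titles = [
    ("Why does Star Wars have sound in space?", ["star-wars", "physics"]),
    ("How do I bake bread?", ["baking", "bread"]),
    ("Who shot first?", ["han-solo"]),
]

print(overlap_rate(toy_titles, single_word_only=True))
print(overlap_rate(toy_titles, single_word_only=False))
```

In this toy data the multi-word tag "star-wars" is only credited under the EMM-style setting, which is why the second rate is higher than the first.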
Domain | Top-1 | Top-3 | Top-5 | Top-10 | Top-50 | Top-100 | Single |
---|---|---|---|---|---|---|---|
askubuntu | 1.57 | 2.89 | 5.33 | 9.43 | 17.97 | 23.45 | 17.70 |
aviation | 2.05 | 3.49 | 4.78 | 6.99 | 17.00 | 23.81 | 19.27 |
biology | 2.85 | 4.90 | 7.41 | 11.39 | 25.85 | 33.34 | 20.67 |
chemistry | 4.33 | 7.62 | 9.99 | 14.56 | 29.82 | 36.95 | 23.89 |
cooking | 1.60 | 3.45 | 4.34 | 5.89 | 13.51 | 18.54 | 25.81 |
electronics | 0.76 | 2.16 | 3.20 | 5.08 | 13.03 | 18.62 | 18.31 |
history | 2.37 | 4.86 | 6.09 | 9.93 | 20.97 | 27.58 | 15.34 |
money | 10.39 | 17.13 | 18.52 | 24.16 | 39.92 | 46.49 | 10.51 |
movies | 2.50 | 6.28 | 7.81 | 10.93 | 20.29 | 25.30 | 21.98 |
music | 2.48 | 5.32 | 7.56 | 13.52 | 31.20 | 38.17 | 20.49 |
philosophy | 1.74 | 4.97 | 7.32 | 11.08 | 26.54 | 33.79 | 27.85 |
physics | 2.32 | 5.29 | 7.10 | 11.07 | 28.54 | 37.46 | 11.39 |
politics | 4.59 | 11.98 | 17.60 | 27.24 | 43.27 | 49.24 | 10.40 |
rpg | 12.48 | 17.23 | 22.27 | 28.13 | 43.54 | 52.73 | 9.96 |
scifi | 5.58 | 12.09 | 17.92 | 26.12 | 43.57 | 49.29 | 25.86 |
serverfault | 1.09 | 2.84 | 4.16 | 6.23 | 16.07 | 22.29 | 13.03 |
travel | 5.17 | 12.01 | 14.64 | 18.04 | 31.01 | 38.43 | 6.45 |
3.3. Tag Co-Occurrence Analysis
For a post $p$, we define tag co-occurrence as a pair of tags appearing in the post together, irrespective of their positions.
Soft Tag Hierarchy: From the tag co-occurrence analysis in the 17 domains, we find that there exists a soft hierarchy among the tag pairs. One of the tags indicates the main topic or area of the question, and the other tag is often fine-grained, which makes the question more specific. In the following examples, the second tag is a sub-category of the first: (baking, bread) in cooking, (dnd-5e, spells) in rpg and (aircraft-design, wing) in aviation. In the science domains, similar examples of topic-subtopic relationships are (organic-chemistry, carbonyl-compounds) in chemistry and (hilbert-space, quantum-mechanics) in physics. The most frequently occurring tag pair for each domain is shown in Table 5; a more comprehensive set of the top-5 most frequent pairs per domain is shown in Appendix Table 11.

Domain | Top Pair | Post-Count |
---|---|---|
askubuntu | (’boot’, ’grub2’) | 5845 |
aviation | (’aerodynamics’, ’aircraft-design’) | 417 |
biology | (’entomology’, ’species-identification’) | 731 |
chemistry | (’organic-chemistry’, ’reaction-mechanism’) | 1621 |
cooking | (’baking’, ’bread’) | 393 |
electronics | (’current’, ’voltage’) | 1161 |
history | (’nazi-germany’, ’world-war-two’) | 298 |
money | (’taxes’, ’united-states’) | 3393 |
movies | (’character’, ’plot-explanation’) | 518 |
music | (’chords’, ’theory’) | 519 |
philosophy | (’logic’, ’philosophy-of-mathematics’) | 272 |
physics | (’homework-and-exercises’, ’newtonian-mechanics’) | 4182 |
politics | (’donald-trump’, ’united-states’) | 570 |
rpg | (’dnd-5e’, ’spells’) | 5330 |
scifi | (’short-stories’, ’story-identification’) | 3514 |
serverfault | (’linux’, ’ubuntu’) | 3261 |
travel | (’uk’, ’visas’) | 2181 |
Tag Pair Post Coverage: We consider a tag pair $\{t_i, t_j\}$ to cover a post if both tags occur in the sequence of tags for that post, in any position. Table 4 shows the tag pair post coverage across the domains. We see that around 10-20% of posts have only a single tag. The 100 most frequent pairs cover 18-53% of posts. Also, the most frequent tag pair covers more than 10% of posts in the money and rpg domains, which shows that this tag pair is essential for these two domains.
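A minimal sketch of the pair counting and pair coverage described above, assuming pairs are unordered and that a post is covered when it contains at least one of the k most frequent pairs. The function names are illustrative, not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def post_pairs(tags):
    """All unordered tag pairs of one post, each in a canonical (sorted) form."""
    return set(combinations(sorted(set(tags)), 2))


def pair_counts(post_tags):
    """Number of posts in which each unordered tag pair co-occurs."""
    counts = Counter()
    for tags in post_tags:
        counts.update(post_pairs(tags))
    return counts


def top_k_pair_coverage(post_tags, k):
    """Percentage of posts containing at least one of the k most frequent pairs."""
    top = {pair for pair, _ in pair_counts(post_tags).most_common(k)}
    covered = sum(1 for tags in post_tags if post_pairs(tags) & top)
    return 100.0 * covered / len(post_tags)


# Toy data (hypothetical posts).
toy_posts = [
    ["dnd-5e", "spells"],
    ["spells", "dnd-5e", "wizard"],
    ["dnd-5e"],
    ["pathfinder-1e", "spells"],
]

print(pair_counts(toy_posts).most_common(1))
print(top_k_pair_coverage(toy_posts, 1))
```

Single-tag posts, such as the third toy post, contribute no pairs and can never be covered, which corresponds to the "Single" column of Table 4.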
Tag Pair Distribution: On analyzing the distribution of the 50 most frequently occurring tag pairs in each domain, we observe three patterns: (1) a smooth distribution, (2) a spike at the top-1 pair, and (3) spikes in the top few pairs. Larger domains (askubuntu, serverfault, electronics) have smooth distributions. In smaller domains (movies, scifi, travel), a few tag pairs dominate the distributions, indicating their popularity. More details are available in Appendix Section C and Figures 10 and 11.
3.4. Tag Pair Ordering
We analyze the top-10 most frequent tag pairs in each domain to identify users' ordering preferences for tags. For a post $p$, $a < b$ (and $b < a$) are the tag orderings for the tag pairs $(t_i, t_j)$ and $(t_j, t_i)$, where $a$ and $b$ are the positions of $t_i$ and $t_j$ respectively in the tag sequence of $p$. By analyzing the occurrences of $(t_i, t_j)$ and $(t_j, t_i)$ in each domain, we find that community users tend to assign the more generic tags before the specific ones. For example, aircraft-design always appears before wings in the 221 posts where they appear together in aviation; united-states appears before income-tax in 99.95% of the 3393 posts where they appear together in the money domain; and dnd-5e always appears before magic-items in the 1367 posts where they appear together in rpg. More examples are in Appendix F.
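The ordering preference for a given pair can be measured by counting, among posts that contain both tags, how often one precedes the other. The sketch below assumes each post's tags are stored in the order in which they appear in the tag sequence; the function name is hypothetical.

```python
def order_preference(post_tags, first, second):
    """Return (posts where `first` precedes `second`, posts containing both tags)."""
    before = together = 0
    for tags in post_tags:
        if first in tags and second in tags:
            together += 1
            if tags.index(first) < tags.index(second):
                before += 1
    return before, together


# Toy data (hypothetical posts).
toy_posts = [
    ["aircraft-design", "wings"],
    ["aerodynamics", "aircraft-design", "wings"],
    ["wings", "aircraft-design"],
    ["wings"],
]

before, together = order_preference(toy_posts, "aircraft-design", "wings")
print(f"aircraft-design precedes wings in {before} of {together} posts")
```

A ratio close to 1 for (generic, specific) pairs is the kind of evidence the analysis above relies on.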
3.5. Tag Position Stability
Domain | Position 1, 2 | Position 3, 4, 5 |
---|---|---|
askubuntu | [’software-installation’,’server’,’community’,’locoteams’,’10.04’] | [’multiple-workstations’,’equalizer’,’speakers’,’workflow’,’flicker’] |
aviation | [’air-traffic-control’,’radio-communications’,’airspace’,’flight-planning’,’faa-regulations’] | [’rotary-wing’,’rvsm’,’sfo’,’dash-8’,’special-vfr’] |
biology | [’biochemistry’,’immunology’,’cell-biology’,’dna’,’molecular-biology’] | [’ribosome’,’binding-sites’,’exons’,’dendritic-spines’,’rna-interference’] |
chemistry | [’crystal-structure’,’equilibrium’,’organic-chemistry’,’thermodynamics’,’inorganic-chemistry’] | [’nitro-compounds’,’bent-bond’,’phenols’,’organosulfur-compounds’,’reaction-coordinate’] |
cooking | [’baking’,’oven’,’eggs’,’substitutions’,’sauce’] | [’oregano’,’condensed-milk’,’chopping’,’blind-baking’,’scottish-cuisine’] |
electronics | [’arduino’,’motor’,’soldering’,’ethernet’,’avr’] | [’basic-stamp’,’debugwire’,’sinking’,’nxp’,’fuse-bits’] |
history | [’20th-century’,’world-war-one’,’language’,’china’,’political-history’] | [’proof’,’dday’,’crusaders’,’templars’,’republic-of-ireland’] |
money | [’investing’,’united-states’,’canada’,’taxes’,’credit-card’] | [’pension-plan’,’contractor’,’contribution’,’limits’,’debt-reduction’] |
movies | [’wedding-crashers’,’analysis’,’star-wars’,’comedy’,’the-pink-panther’] | [’manichitrathazhu’,’chandramukhi’,’bhool-bhulaiyaa’,’clint-eastwood’,’for-a-few-dollars-more’] |
music | [’learning’,’voice’,’theory’,’tuning’,’scales’] | [’stick-control’,’archeterie’,’instrumentation’,’rsi’,’rock-n-roll’] |
philosophy | [’epistemology’,’philosophy-of-mathematics’,’ethics’,’existentialism’,’logic’] | [’dreams’,’plantinga’,’rationalism’,’rule-ethics’,’arithmetic’] |
physics | [’quantum-mechanics’,’particle-physics’,’string-theory’,’acoustics’,’experimental-physics’] | [’action’,’faq’,’stability’,’wavefunction-collapse’,’coriolis-effect’] |
politics | [’election’,’political-theory’,’democracy’,’united-kingdom’,’israel’] | [’first-past-the-post’,’checks-and-balances’,’redistricting’,’faithless-elector’,’puerto-rico’] |
rpg | [’pathfinder-1e’,’dnd-3.5e’,’game-recommendation’,’dungeons-and-dragons’,’dogs-in-the-vineyard’] | [’feywild’,’group-scaling’,’round-robin-gming’,’romance’,’charmed’] |
scifi | [’novel’,’vorkosigan-saga’,’total-recall-2070’,’star-trek’,’the-road’] | [’star-trek-data’,’3001-the-final-odyssey’,’rama-revealed’,’star-trek-emh’,’skylark-series’] |
serverfault | [’sql-server’,’backup’,’sql-server-2008’,’raid’,’windows’] | [’tempdb’,’fakeraid’,’tuning’,’su’,’debian-etch’] |
travel | [’loyalty-programs’,’transportation’,’public-transport’,’sightseeing’,’safety’] | [’amazon-river’,’amazon-jungle’,’singapore-airlines’,’sin’,’trans-siberian’] |

We study the positional stability of tags, i.e., whether some tags frequently appear in a particular position among the five allowed by StackExchange. We consider $O(t_i, j)$ as the percentage of occurrences of a tag $t_i$ ($t_i \in T_d$) in position $j$, given by
$$O(t_i, j) = \frac{c(t_i, j)}{\sum_{k=1}^{5} c(t_i, k)} \times 100 \tag{1}$$
where $c(t_i, j)$ denotes the count of tag $t_i$ in position $j$. We consider three stability thresholds ($\theta$): 80%, 90%, and 99% (Figures 4 and 5). For a tag $t_i$ and position $j$, $O(t_i, j) \geq \theta$ indicates that the tag is stable at that position.
$$S_{\theta}^{J} = \Big\{ t_i \in T_d : \sum_{j \in J} O(t_i, j) \geq \theta \Big\} \tag{2}$$
$$R_{\theta}^{J} = \frac{|S_{\theta}^{J}|}{|T_d|} \times 100 \tag{3}$$
where $T_d$ is the set of all tags in the domain, $S_{\theta}^{J}$ is the set of tags whose share of occurrences in the set of positions $J$ reaches $\theta$, and $R_{\theta}^{J}$ is the percentage of tags in a domain that are stable at positions $J$. In Figure 4 (rpg domain), we find that 13.81% of all tags in rpg are stable in positions 1 and 2 combined ($R_{\theta}^{\{1,2\}}$), and a further share is stable in positions 3, 4 and 5 combined ($R_{\theta}^{\{3,4,5\}}$). The rest of the tags are unstable. Also, the stable tags appearing in positions 3, 4, and 5 ($S_{\theta}^{\{3,4,5\}}$) are finer-grained (or refined) tags that support the stable tags present in positions 1 and 2 ($S_{\theta}^{\{1,2\}}$).
The travel domain has the highest number of stable tags appearing in positions 3, 4, and 5 ($S_{\theta}^{\{3,4,5\}}$), showing that more than one refined tag is needed to make a question specific in this domain. We find no conclusive evidence of this stability within positions 1 and 2 individually (i.e. $J = \{1\}$ and $J = \{2\}$), nor within positions 3, 4 and 5 individually (i.e. $J = \{3\}$, $J = \{4\}$, and $J = \{5\}$).
Table 6 shows five randomly selected examples of position-stable tags in 17 domains. These positions account for more than 99% of the occurrences of these tags in their respective domains.
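The stability analysis of this section can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes a tag is stable for a set of positions when the combined share of its occurrences at those positions is at least the threshold, and that positions are 1-indexed with at most five tags per post.

```python
from collections import defaultdict


def position_shares(post_tags, max_positions=5):
    """For each tag, the percentage of its occurrences at positions 1..max_positions."""
    counts = defaultdict(lambda: [0] * max_positions)
    for tags in post_tags:
        for pos, tag in enumerate(tags[:max_positions]):
            counts[tag][pos] += 1
    return {tag: [100.0 * c / sum(row) for c in row] for tag, row in counts.items()}


def stable_tags(shares, positions, threshold):
    """Tags whose combined share over `positions` (1-indexed) is >= threshold percent."""
    return {
        tag for tag, row in shares.items()
        if sum(row[p - 1] for p in positions) >= threshold
    }


def stable_ratio(shares, positions, threshold):
    """Percentage of all tags that are stable for the given set of positions."""
    return 100.0 * len(stable_tags(shares, positions, threshold)) / len(shares)


# Toy data (hypothetical posts, tags in sequence order).
toy_posts = [
    ["dnd-5e", "spells", "wizard"],
    ["dnd-5e", "feywild"],
    ["dnd-5e", "spells", "feywild"],
    ["spells", "dnd-5e", "wizard"],
]

shares = position_shares(toy_posts)
print(stable_tags(shares, (1, 2), 99))
print(stable_tags(shares, (3, 4, 5), 99))
print(stable_ratio(shares, (1, 2), 99))
```

In the toy data, "feywild" is split evenly between positions 2 and 3, so it is stable for neither position set at the 99% threshold, which is what the text calls an unstable tag.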
4. Modeling Tag Prediction
Based on the observations from our tagging behavior analysis (Section 3), we develop an automated generic tag prediction approach for CQA platforms that predicts both generic and refined tags. The inherent commonalities in community diversity influence our decision to develop a common tag generation framework. The long tail in tag-space analysis guided us to develop a predictive-generative hybrid model. Tag co-occurrence analysis, tag-pair ordering, and tag-positional analysis on these domains led us to generate tags from a common vocabulary of popular tags at certain positions and related granular tags at the remaining positions.
4.1. Majority Baseline
The five most frequent tags per domain in the training data are used, in order, as the Top1-Top5 predictions for the test data (Hit@1 to Hit@5). We introduce this baseline because the top few tags cover a large number of posts in each domain (Table 3).
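The majority baseline is simple enough to state in code. The sketch below assumes training posts are given as lists of tags; the function name is illustrative.

```python
from collections import Counter


def majority_predictions(train_post_tags, k=5):
    """The k most frequent training tags in rank order; reused for every test post."""
    counts = Counter(tag for tags in train_post_tags for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]


# Toy training data (hypothetical posts).
toy_train = [
    ["dnd-5e", "spells"],
    ["dnd-5e", "magic-items"],
    ["dnd-5e", "spells"],
    ["pathfinder-1e"],
]

print(majority_predictions(toy_train, k=2))
```

Every test post receives the same ranked list, so this baseline is strong exactly in those domains where a few tags cover most posts.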
4.2. Feature-Based Models
We use linear multi-label classifiers using the one-vs-all strategy with tf-idf and bag-of-word features as two baselines since most of the feature-based tag prediction models use either of them as features. We hypothesize that these models can leverage the high amount of tag-post overlap (Figure 3). Here we train the models for each domain with classes corresponding to all the unique tags.
4.3. MetaTag Predictor Model (MP)
In this model (Figure 6), we first select a vocabulary (MetaTag) of tags based on a frequency analysis of the tag’s post coverage per domain. Here, we consider popular tags as meta tags. We formulate this multi-label classification task as a language model mask-filling task using pre-trained roberta-base (Liu et al., 2019) as the base of this model. We train separately for each domain.
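The text does not give the exact selection procedure for the MetaTag vocabulary. One plausible reading, sketched here purely as an assumption, is to take tags in decreasing order of frequency until a target percentage of posts (for example 90%) contains at least one selected tag. The function name and the stopping rule are hypothetical.

```python
from collections import Counter


def build_metatag_vocab(post_tags, target_coverage=90.0):
    """Shortest frequency-ranked prefix of tags covering >= target_coverage % of posts."""
    counts = Counter(tag for tags in post_tags for tag in tags)
    vocab, covered = [], set()
    for tag, _ in counts.most_common():
        if 100.0 * len(covered) / len(post_tags) >= target_coverage:
            break
        vocab.append(tag)
        # Mark every post that contains the newly added tag as covered.
        covered.update(i for i, tags in enumerate(post_tags) if tag in tags)
    return vocab


# Toy data (hypothetical posts).
toy_posts = [
    ["dnd-5e", "spells"],
    ["dnd-5e"],
    ["dnd-5e", "feywild"],
    ["pathfinder-1e"],
    ["pathfinder-1e", "spells"],
]

print(build_metatag_vocab(toy_posts, target_coverage=60.0))
print(build_metatag_vocab(toy_posts, target_coverage=100.0))
```

Lowering the target coverage shrinks the vocabulary, which pushes more tags into the out-of-vocabulary set.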
Training: We tokenize the question title ($Q_t$) and body ($Q_b$) and hide the tags from the MetaTag vocabulary with a mask token, <mask>. These are concatenated and provided as input to the model.
$Q_t$ + $Q_b$ + <mask> … <mask>
This model is trained to predict those masks by optimizing the prediction loss ($\ell_i$) over all masked tokens ($i = 1, \ldots, n$). Here the number of mask tokens $n$ may vary based on the post (shown as … above). $\mathcal{L}_P$, which aggregates the per-token losses, is the total loss.
Inference: We tokenize the question title ($Q_t$) and body ($Q_b$), and append five mask tokens at the end, forcing the model to predict exactly five tags for the post (the most probable tag for each position). This is because StackExchange allows a maximum of five tags to be associated with a question. This ensures that the model predicts the tags from the MetaTag vocabulary.
$Q_t$ + $Q_b$ + <mask> <mask> <mask> <mask> <mask>
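The input layouts for MP training and inference can be illustrated with plain string construction. This sketch makes several assumptions that the text does not confirm: the mask token string is RoBERTa's "<mask>", each MetaTag corresponds to exactly one mask position, and title, body and masks are joined with single spaces. It only builds the model input and labels; it does not run a language model.

```python
MASK = "<mask>"  # RoBERTa's mask token string


def mp_training_example(title, body, tags, metatag_vocab):
    """Input text with one mask per in-vocabulary tag, plus the labels to predict."""
    labels = [tag for tag in tags if tag in metatag_vocab]
    text = f"{title} {body} " + " ".join([MASK] * len(labels))
    return text, labels


def mp_inference_input(title, body, num_masks=5):
    """Input text with a fixed number of mask positions (five tags at most)."""
    return f"{title} {body} " + " ".join([MASK] * num_masks)


# Toy example (hypothetical question and vocabulary).
vocab = {"dnd-5e", "spells"}
text, labels = mp_training_example(
    "Can a wizard cast two spells?", "Asking about action economy.",
    ["dnd-5e", "spells", "action-economy"], vocab,
)
print(text)
print(labels)
print(mp_inference_input("Can a wizard cast two spells?", "Asking about action economy."))
```

In the training example, the out-of-vocabulary tag "action-economy" receives no mask, which is the limitation the MRPG model addresses.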

4.4. Meta Refined Tag Predictor Generator Model (MRPG)
This model (Figure 6) is similar to the MP model, with the additional ability to generate tags not present in the MetaTag vocabulary (OOV). In a more general sense, here the motivation is to develop a model capable of predicting tags from a predefined set and generating novel tags as well.
Training: Similar to the MP model, we tokenize the question title ($Q_t$) and body ($Q_b$), and replace the tags present in the MetaTag vocabulary with the <mask> token. The remaining (out-of-vocabulary, OOV) tags are tokenized and each token is replaced with a separate mask token, <maskref>. A <tagsep> token is added to mark the boundaries (start and end) of these OOV tag tokens. The model is trained on a joint loss ($\mathcal{L}$) that combines the meta tag prediction head loss ($\mathcal{L}_P$) and the refined tag generation head loss ($\mathcal{L}_G$).
Inference: Our goal is to encourage the model to generate a combination of meta and refined tags. Based on our tag-stability analysis (Section 3.5), tag pair ordering analysis (Section 3.4) and soft tag-hierarchy findings (Section 3.3), we train the MRPG model to predict the first two tags from the MetaTag vocabulary and to generate the remaining three tags based on the user texts. We append two <mask> tokens and a parameterized number of <maskref> tokens to the tokenized question title ($Q_t$) and body ($Q_b$).
$Q_t$ + $Q_b$ + <mask> <mask> <maskref> … <maskref>
Tag Generation: For each <maskref> token, MRPG generates one token from the tokenizer vocabulary following a greedy approach, selecting the most probable token. We concatenate the generated tokens between two <tagsep> tokens to form a tag. We choose the three most probable generated refined tags, based on our earlier data analysis and the StackExchange tag limit. However, when applying this model to any other CQA platform, this number can be increased or decreased through the above-mentioned parameter. Also, nothing in the model prevents it from generating tags with more than three words, but such tags are rare in most domains, as can be seen from Table 2. More details are in Appendix Section H.
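The grouping of generated tokens into refined tags can be sketched as below. Several details are assumptions made for illustration: the separator string "<tagsep>", scoring a tag by the mean log-probability of its tokens, and forming the tag by direct string concatenation (real subword tokens would need the tokenizer's own decoding). Tokens that are not enclosed by two separators are discarded.

```python
TAGSEP = "<tagsep>"  # hypothetical string for the tag boundary token


def tokens_to_tags(generated, max_tags=3):
    """Group (token, log_prob) pairs into tags and keep the max_tags best-scoring ones."""
    spans, current, inside = [], [], False
    for token, log_prob in generated:
        if token == TAGSEP:
            if inside and current:
                spans.append(current)  # a span closed by a second separator
            current, inside = [], True
        elif inside:
            current.append((token, log_prob))
    scored = []
    for span in spans:
        tag = "".join(tok for tok, _ in span)
        score = sum(lp for _, lp in span) / len(span)  # mean log-prob (assumption)
        scored.append((score, tag))
    scored.sort(reverse=True)
    return [tag for _, tag in scored[:max_tags]]


# Toy greedy output (hypothetical tokens and log-probabilities).
toy_generated = [
    (TAGSEP, -0.1), ("blind", -0.2), ("-baking", -0.4),
    (TAGSEP, -0.1), ("oven", -1.0),
    (TAGSEP, -0.2), ("stray", -3.0),
]

print(tokens_to_tags(toy_generated))
print(tokens_to_tags(toy_generated, max_tags=1))
```

The `max_tags` argument plays the role of the parameter mentioned above, so it can be raised or lowered for platforms with different tag limits.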
5. Experiments
5.1. Settings
We split our dataset into train-dev-test sets in the ratio 70:10:20 based on a random seed value. In our experiments we build our model on top of the base version (125M parameters) of the pre-trained RoBERTa language model. We remove HTML tags (since these tags are irrelevant to StackExchange tags) from the user contents (question title and body) before tag prediction. We ran all experiments on 4 NVIDIA RTX A6000 GPUs (48GB GPU memory) with a batch size of 60 and an input length of 256. We use the AdamW (Loshchilov and Hutter, 2017) optimizer, a linear warmup scheduler, and a learning rate of 5e-5.
5.2. Metrics
We define Hit@k (where $k \in \{1, \ldots, 5\}$) as the percentage of posts for which at least one of the top-$k$ predicted tags matches an actual tag. We generate at most 5 tag predictions, in line with StackExchange's upper limit of tags. This metric aligns with our motivation of maximizing the probability that a user will be able to find at least one suitable tag among the fixed number of recommended tags. Hence we do not consider other metrics like precision and recall.
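Hit@k as defined above can be computed directly from ranked predictions. This is a minimal sketch assuming predictions are ranked lists of tags and gold tags are given per post; the function name is illustrative.

```python
def hit_at_k(predictions, gold, k):
    """Percentage of posts with at least one gold tag among the top-k predictions."""
    hits = sum(
        1 for pred, true in zip(predictions, gold)
        if set(pred[:k]) & set(true)
    )
    return 100.0 * hits / len(gold)


# Toy data (hypothetical predictions and gold tags).
toy_predictions = [
    ["dnd-5e", "spells", "feywild"],
    ["pathfinder-1e", "dnd-5e"],
]
toy_gold = [
    ["spells"],
    ["dnd-3.5e"],
]

for k in (1, 2):
    print(k, hit_at_k(toy_predictions, toy_gold, k))
```

Because a single match is enough, Hit@k is non-decreasing in k, which is why Hit@5 is the most lenient of the reported settings.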
5.3. Performance Analysis
5.3.1. Baseline vs MP vs MRPG
In Table 7, we compare our models with the baselines (mean of five different runs). The feature-based models (bag-of-words and tf-idf) achieve good performance in those domains where we found a high overlap between user texts and tags. We find that our MP model improves over the majority baseline and the feature-based models by a substantial margin (p-values < 0.05 on the Wilcoxon test) in Hit@5 performance. The MRPG model outperforms the other methods in almost all the domains (significant improvements in 12 out of 17 domains). This is because it is able to generate tags outside the MetaTag vocabulary. In the biology domain, the MP model performs better than MRPG. This might be because of the high tag reuse in this domain. All the model performance numbers (Hit@k for $k \in \{1, \ldots, 5\}$) are presented in Appendix Table 20. In that table, we observe that for Hit@1 the MRPG model is always better than the MP model.
Domain | Majority | TF-IDF | Bag-of-Words | MP | MRPG |
---|---|---|---|---|---|
askubuntu | 24.84 | 59.76 ± 0.06 | 71.25 ± 0.56 | 80.44 ± 0.11 | 82.94 ± 0.15 |
aviation | 35.05 | 55.12 ± 0.29 | 65.58 ± 0.64 | 77.09 ± 0.44 | 77.63 ± 0.56 |
biology | 37.94 | 54.91 ± 0.16 | 64.79 ± 0.50 | 78.96 ± 0.34 | 77.55 ± 0.41 |
chemistry | 48.89 | 58.76 ± 0.17 | 68.09 ± 0.46 | 77.66 ± 0.10 | 79.17 ± 0.45 |
cooking | 29.04 | 70.28 ± 0.19 | 71.69 ± 0.34 | 80.86 ± 0.42 | 85.18 ± 0.29 |
electronics | 20.68 | 57.80 ± 0.11 | 70.12 ± 0.13 | 77.51 ± 0.26 | 81.30 ± 0.53 |
history | 34.67 | 58.93 ± 0.32 | 59.29 ± 0.36 | 80.45 ± 0.09 | 81.23 ± 1.00 |
money | 55.96 | 75.54 ± 0.19 | 79.70 ± 0.30 | 84.15 ± 0.23 | 87.94 ± 0.42 |
movies | 54.99 | 60.80 ± 0.14 | 64.57 ± 0.24 | 82.91 ± 0.55 | 83.25 ± 0.99 |
music | 47.91 | 68.15 ± 0.15 | 74.26 ± 0.42 | 82.66 ± 0.26 | 83.71 ± 0.51 |
philosophy | 48.93 | 62.71 ± 0.10 | 64.06 ± 0.34 | 79.45 ± 0.20 | 79.49 ± 0.56 |
physics | 39.98 | 66.81 ± 0.16 | 79.59 ± 0.17 | 81.12 ± 0.22 | 86.34 ± 0.37 |
politics | 64.16 | 81.50 ± 0.21 | 83.37 ± 0.73 | 86.29 ± 0.25 | 90.98 ± 0.46 |
rpg | 76.66 | 75.79 ± 0.23 | 82.71 ± 0.24 | 83.31 ± 0.33 | 89.09 ± 0.16 |
scifi | 62.24 | 80.48 ± 0.10 | 85.88 ± 0.21 | 85.91 ± 0.11 | 91.53 ± 0.32 |
serverfault | 29.84 | 62.83 ± 0.06 | 73.07 ± 0.20 | 81.66 ± 0.16 | 85.82 ± 0.26 |
travel | 48.31 | 76.82 ± 0.48 | 83.73 ± 0.27 | 83.96 ± 0.12 | 89.50 ± 0.30 |
5.3.2. Effects of Vocabulary Size Reduction
We build the MetaTag vocab with 85% post coverage by tags (a 5% decrease from the 90% setting) and show the impact in Figure 7. We observe that the performance gap between MP and MRPG at 90% (Table 7) widens as the vocab size decreases by 5% (Figure 7) across all domains.

This is because the MP model suffers the most (2-5%) from this reduction. This is expected, since MP’s performance (by P-Head) depends on how big the MetaTag vocabulary is. The MRPG model, however, is robust to this vocabulary reduction, i.e., its performance (Hit@5) only changes in the range 0-1.13%, with the exception of the askubuntu domain (2.26%). Details are in Appendix Table 18. With the reduced vocab, the maximum performance difference is 9.12% (travel), since this domain has more refined tags (Section 3.5). The minimum difference is 1.06% (biology). Here the MRPG model could not gain much advantage over MP because of high tag reusability and fewer refined tags.
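To make the notion of a coverage-based vocabulary concrete, the sketch below builds a tag vocabulary by adding tags in descending frequency until a target fraction of posts contains at least one selected tag. This greedy frequency-ordered rule is an assumption made for illustration; the function name `build_vocab_by_coverage` and the toy posts are hypothetical, and the actual MetaTag construction may differ in its details.

```python
from collections import Counter
from typing import List, Sequence, Set


def build_vocab_by_coverage(posts: Sequence[Set[str]], target: float) -> List[str]:
    """Select frequent tags until `target` fraction of posts is covered.

    A post is "covered" once it contains at least one selected tag.
    """
    if not posts:
        return []
    freq = Counter(tag for tags in posts for tag in tags)
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(freq, key=lambda t: (-freq[t], t))
    vocab: List[str] = []
    covered = [False] * len(posts)
    n_covered = 0
    for tag in ranked:
        if n_covered / len(posts) >= target:
            break
        vocab.append(tag)
        for i, tags in enumerate(posts):
            if not covered[i] and tag in tags:
                covered[i] = True
                n_covered += 1
    return vocab


# Toy posts for illustration only.
toy_posts = [{"boot", "grub2"}, {"boot"}, {"bash"}, {"apt"}, {"boot", "bash"}]
vocab_80 = build_vocab_by_coverage(toy_posts, target=0.8)
vocab_100 = build_vocab_by_coverage(toy_posts, target=1.0)
```

On the toy posts, lowering the target from 100% to 80% drops the tail tag `apt` from the vocabulary, which mirrors how a smaller coverage target leaves more tags to be produced outside the vocabulary.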

5.3.3. Head Contribution of MRPG
Figure 8 shows the contribution of the P-Head and the G-Head to the prediction performance (Hit@5 for 90% coverage vocab). We compute the percentage of posts for which (1) only the P-Head correctly predicted at least one tag and (2) only the G-Head correctly predicted at least one tag. The P-Head’s contributions were highest (45-74%), since the MetaTag vocabulary is created using popular tags in each domain. The G-Head was able to predict at least one tag correctly for an extra 4-13% of the posts. The effect of decreasing and increasing the MetaTag vocabulary size by a 5% change in tag-post coverage is shown in Appendix Table 19. We observe that the G-Head’s contribution increases by up to 4% (on vocab size decrease) and decreases by up to 5% (on vocab size increase). We also find that both heads combined were able to suggest some non-overlapping tags in up to 33% of the posts.
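The per-head breakdown can be sketched as follows, assuming that the predictions of each head are available as separate lists per post (already truncated to the evaluation cutoff). The function name `head_contributions` and the toy inputs are hypothetical.

```python
from typing import Dict, Sequence, Set


def head_contributions(p_preds: Sequence[Sequence[str]],
                       g_preds: Sequence[Sequence[str]],
                       golds: Sequence[Set[str]]) -> Dict[str, float]:
    """Percentage of posts where only P-Head, only G-Head, or both heads hit."""
    counts = {"only_p": 0, "only_g": 0, "both": 0}
    n = 0
    for p, g, gold in zip(p_preds, g_preds, golds):
        n += 1
        p_hit = bool(set(p) & gold)  # P-Head predicted at least one gold tag
        g_hit = bool(set(g) & gold)  # G-Head predicted at least one gold tag
        if p_hit and g_hit:
            counts["both"] += 1
        elif p_hit:
            counts["only_p"] += 1
        elif g_hit:
            counts["only_g"] += 1
    if n == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * value / n for key, value in counts.items()}


# Toy inputs for illustration only.
p_out = [["piano"], ["visas"], ["evolution"], ["europe"]]
g_out = [["tuning"], ["uk"], ["immunity"], ["sources"]]
gold = [{"piano", "tuning"}, {"uk"}, {"evolution"}, {"crusades"}]
shares = head_contributions(p_out, g_out, gold)
```

In the toy input, one post is solved by both heads, one only by the P-Head, one only by the G-Head, and one by neither, so each of the three shares is 25.0.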
5.3.4. Out-of-Vocabulary Tags Generation %
Table 8 shows MRPG’s performance in predicting tags outside the MetaTag vocabulary for 90% tag-post coverage. % Posts shows the percentage of posts where MRPG correctly predicted at least one OOV tag. It has the least contribution in two domains, movies (13.88%) and scifi (17.01%). % ALL Tags and % OOV Tags show that MRPG was able to correctly predict a considerable amount of OOV tags because of the generative head.
Domains | askubuntu | aviation | biology | chemistry | cooking | electronics | history | money | movies | music | philosophy | physics | politics | rpg | scifi | serverfault | travel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
% Posts | 31.38 | 23.10 | 22.09 | 24.68 | 29.21 | 27.32 | 22.37 | 35.8 | 13.88 | 25.19 | 19.10 | 41.93 | 34.35 | 36.49 | 17.01 | 34.69 | 43.59 |
% ALL Tags | 12.74 | 9.49 | 8.98 | 11.31 | 13.65 | 10.73 | 8.17 | 12.74 | 6.85 | 10.60 | 8.55 | 15.32 | 13.18 | 13.92 | 8.10 | 13.62 | 15.66 |
% OOV Tags | 41.92 | 28.03 | 27.34 | 37.84 | 47.81 | 31.39 | 22.78 | 33.21 | 22.76 | 36.87 | 28.55 | 36.55 | 35.37 | 43.04 | 38.96 | 36.38 | 38.80 |
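The three rows of Table 8 can be approximated with the sketch below. It rests on assumed definitions that are not spelled out in the text: % Posts counts posts with at least one correctly predicted OOV tag, % ALL Tags divides correctly predicted OOV tags by all gold tags, and % OOV Tags divides them by the gold tags that are outside the vocabulary. The function name `oov_stats` and the toy data are hypothetical.

```python
from typing import Dict, Sequence, Set


def oov_stats(predictions: Sequence[Sequence[str]],
              golds: Sequence[Set[str]],
              vocab: Set[str]) -> Dict[str, float]:
    """OOV prediction statistics under the assumed definitions above."""
    posts_with_hit = 0
    correct_oov = 0
    gold_total = 0
    gold_oov = 0
    n_posts = 0
    for pred, gold in zip(predictions, golds):
        n_posts += 1
        # Correct predictions that are not in the vocabulary.
        hits = {t for t in pred if t in gold and t not in vocab}
        posts_with_hit += 1 if hits else 0
        correct_oov += len(hits)
        gold_total += len(gold)
        gold_oov += len(gold - vocab)

    def pct(num: int, den: int) -> float:
        return 100.0 * num / den if den else 0.0

    return {
        "pct_posts": pct(posts_with_hit, n_posts),
        "pct_all_tags": pct(correct_oov, gold_total),
        "pct_oov_tags": pct(correct_oov, gold_oov),
    }


# Toy data for illustration only.
toy_vocab = {"visas", "piano"}
toy_preds = [["uk", "visa-refusals"], ["piano", "tuning"], ["visas"]]
toy_golds = [{"uk", "visa-refusals", "nigerian-citizens"}, {"piano", "tuning"}, {"visas"}]
stats = oov_stats(toy_preds, toy_golds, toy_vocab)
```

On the toy data, three of the four gold OOV tags are recovered, out of six gold tags overall, in two of the three posts.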

5.4. Case Studies
We compare the tag predictions of our methods in Figure 9. MRPG was able to generate two more refined tags than MP in the askubuntu domain and was able to predict four out of five tags in the physics domain. Included below are examples for five other domains.
Domain | Title | Gold | MP | MRPG |
---|---|---|---|---|
Physics | Does matter become energy at the speed of light? | special-relativity, speed-of-light, mass-energy, matter | special-relativity, energy, speed-of-light, mass | special-relativity, speed-of-light, mass-energy, matter |
Travel | Nigerian citizen (university student) was refused a UK visit visa due to lack of funds and connection to school - how to resolve? | UK, visa-refusals, nigerian-citizens | visas, customs-and-immigration, visa-refusals, paperwork, standard-visitor-visas | uk, visa-refusals, nigerian-citizens |
Music | Piano tuning just under the absolute pitch | piano, tuning | piano, tuning, maintenance | piano, tuning, alternative-tunings, pitch, relative-pitch |
Biology | Why aren’t all infections immune-system resistant? | evolution, microbiology, immunology, bacteriology | evolution, microbiology, bacteriology, bacteriology, immune-system | evolution, bacteriology, immunity, antibiotic-resistance |
History | Where to find a list of participants in The Crusades? | middle-ages, crusades | middle-ages, middle-ages, europe, historiography | middle-ages, sources, crusades |
5.5. Adaptability of the MP & MRPG Architectures
Both the MP and MRPG models can be adapted for use in other domains or in different public and private CQA platforms with specific tag-space restrictions. This can help in efficient question routing to area experts for faster response time, especially in private CQA platforms where the motivation of the community authority is to get queries resolved faster. Such adaptations can be done by customizing the MetaTag vocabulary based on prior behavioral analysis. Additionally, the number of meta and refined tags can be controlled based on the domain and platform requirements without changes in architecture (through a parameter). The MRPG model can also be used in platforms where a soft hierarchy of tags is known and routing requires the prediction of top-level tags and leaf tags. In such a scenario, the MetaTag vocabulary could be populated with only top-level tags, allowing the model to generate lower-level tags (from the tail of the tag distribution) based on user texts. With the combination of both types of tags, a query can be routed to a specific sub-area expert without overwhelming all the experts of a specific topic.
6. Related Work
Community QA platform analysis: There have been several studies on Folksonomy (Vander Wal, 2007), the practice of associating custom tags to questions in a social environment. Some of the prior works are: a large-scale analysis of tags and their correlation with other tags (Fu et al., 2020), tag-distribution and tag-occurrence of 168 SE communities (Fu et al., 2020), quality analysis of SO (Singh et al., 2015). User behavior analysis was done on Quora (Wang et al., 2013), Yahoo Answers (Adamic et al., 2008), Google Answers (Chen et al., 2010) and StackOverflow (Anderson et al., 2013). However, here we perform a large-scale study of tags, tag occurrences, and tag relation for 17 domains to understand how they have some common properties in spite of being quite diverse, an observation similar to a prior work (Fu et al., 2020).
Community QA NLP Tasks: As the use of community QA platforms increased, and with it the volume of community-created data, various NLP approaches were used to address some of the issues of each platform and also to understand the behaviors of users. There have been various insights gathered through analysis of such communities. Similar Question Identification (Zhang et al., 2017, 2018; Vanam and Pulipati, 2021; Kumar and Chauhan, 2022), Similar Tag Identification (Beyer and Pinzger, 2015; Chen et al., 2019), Tag popularity prediction (Fu et al., 2017), Popular Question Prediction (Zhao et al., 2021), Tag predictions (Lipczak, 2008; Lipczak and Milios, 2010; Wang et al., 2015; Wu et al., 2016; Sonam et al., 2019; Tang et al., 2019; Wankerl et al., 2020; Venktesh et al., 2021), detecting anomalous tag combinations (Banerjee et al., 2019), CQA entity linking (Li et al., 2022), expert recommendation (Tondulkar et al., 2018; Lv et al., 2021; Menaha et al., 2021; Anandhan et al., 2022; Krishna and Antulov-Fantulin, 2022; Askari et al., 2022; Liu et al., 2022), question routing (Krishna et al., 2022), identifying unclear questions (Trienes and Balog, 2019), automatic identification of best answers (Burel et al., 2012) and tag-hierarchy predictions (Chen et al., 2019) are some of the interesting tasks. We perform a large-scale analysis with data over 10 years and across 17 diverse communities. We focus only on the tag-prediction NLP task for CQA platforms.
Text Tagging: There are some feature-based machine learning approaches (Wang et al., 2015; Charte et al., 2015; Sonam et al., 2019; Zangerle et al., 2011; Sigurbjörnsson and Van Zwol, 2008; Lipczak and Milios, 2010; Wu et al., 2016) and some deep learning approaches (Tang et al., 2019; Li et al., 2020; Wankerl et al., 2020) for tag prediction. Tagcombine (Wang et al., 2015) uses software object similarity, while TagStack (Sonam et al., 2019) uses tf-idf features with a Naive Bayes classifier on StackOverflow texts. QUINTA (Charte et al., 2015) works on 6 StackExchange domains using KNN, (Zangerle et al., 2011) on microblogging sites (Twitter) based on tweet similarity, Tag2word (Wu et al., 2016) on math and StackOverflow domains using an LDA variant, and (Lipczak, 2008; Lipczak and Milios, 2010) on BibSonomy and StackOverflow datasets based on tag co-occurrence and user preference. Among the deep learning methods, F2Tag (Wankerl et al., 2020) works on math domains based on visual and textual formula representation, ITAG (Tang et al., 2019) works on the math domain using an RNN, and TagDC (Li et al., 2020) is based on software object similarity using an LSTM. Unlike the above-mentioned methods, we predict a soft hierarchy of tags (both meta and fine-grained tags).
7. Conclusion
We perform an in-depth analysis of 17 domains in a popular CQA platform, StackExchange, focusing on various aspects of question tagging such as domain diversity analysis, tag-space analysis, tag co-occurrence analysis, tag order, and tag positional stability. We present multiple insights into user behavior in assigning tags to the questions they post. Based on these findings, we develop a tag prediction architecture that generates rarer and finer-grained tags in addition to popular tags from a pre-selected vocabulary. Our approach significantly outperforms feature-based baselines and also shows significant improvement in 12 domains when compared with a vocabulary-based approach.
8. Limitations
The analysis and its findings presented here are limited to 17 selected StackExchange domains considering their diversity. However, they may vary for the remaining 150 domains. Some of the findings (e.g. tag’s positional stability) may vary for other CQA platforms which do not have any bounds on the number of tags. We use roberta-base and a smaller input size (256 tokens) for our experiments. With larger models and more context, the performance is expected to increase since more context usually leads to better learning by larger parameterized models. We have ignored the answers in StackExchange for model training. We believe that indiscriminately selecting all answers as context for a question could be too noisy and if we were to select one or more appropriate answers, this would add complexity in choosing between the fastest answer, best answer, accepted answers, etc. We consider this as a separate area of research and future work. We randomly sampled the data for each domain to create the train and test split to show that our MRPG model is capable of both predicting and generating tags. Splitting with respect to timestamp would require tag temporal analysis and tag-evolution which we consider as a future area of research.
9. Ethical Statement
This work analyzes various aspects of the aggregate tagging behavior of users on a popular community question-answering platform, StackExchange. The data is publicly provided by StackExchange as an anonymized dump of all user-contributed content on the Stack Exchange network. The data is CC BY-SA 4.0 licensed and intended to be shared and remixed. No specific user has been identified and no user-level information (user name etc.) has been used for this work. We only used the Posts.xml extracted from the StackExchange dumps and do not use any user profile statistics. The aggregate user behavior has been analyzed with respect to tagging and user-generated questions. Based on these findings, a tag predictor model has been developed. The data has not been modified or redistributed as part of this research.
References
- Adamic et al. (2008) Lada A Adamic, Jun Zhang, Eytan Bakshy, and Mark S Ackerman. 2008. Knowledge sharing and yahoo answers: everyone knows something. In Proceedings of the 17th international conference on World Wide Web. 665–674.
- Anandhan et al. (2022) Anitha Anandhan, Maizatul Akmar Ismail, and Liyana Shuib. 2022. Expert Recommendation Through Tag Relationship in Community Question Answering. Malaysian Journal of Computer Science 35, 3 (2022), 201–221.
- Anderson et al. (2013) Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering user behavior with badges. In Proceedings of the 22nd international conference on World Wide Web. 95–106.
- Askari et al. (2022) Arian Askari, Suzan Verberne, and Gabriella Pasi. 2022. Expert Finding in Legal Community Question Answering. In European Conference on Information Retrieval. Springer, 22–30.
- Banerjee et al. (2019) Rohan Banerjee, Sailaja Rajanala, and Manish Singh. 2019. Evaluating the Choice of Tags in CQA Sites. In International Conference on Database Systems for Advanced Applications. Springer, 625–640.
- Beyer and Pinzger (2015) Stefanie Beyer and Martin Pinzger. 2015. Synonym suggestion for tags on stack overflow. In 2015 IEEE 23rd International Conference on Program Comprehension. IEEE, 94–103.
- Burel et al. (2012) Grégoire Burel, Yulan He, and Harith Alani. 2012. Automatic identification of best answers in online enquiry communities. In Extended Semantic Web Conference. Springer, 514–529.
- Charte et al. (2015) Francisco Charte, Antonio J Rivera, María J del Jesus, and Francisco Herrera. 2015. QUINTA: A question tagging assistant to improve the answering ratio in electronic forums. In IEEE EUROCON 2015 - International Conference on Computer as a Tool (EUROCON). IEEE, 1–6.
- Chen et al. (2019) Hui Chen, John Coogle, and Kostadin Damevski. 2019. Modeling stack overflow tags and topics as a hierarchy of concepts. Journal of Systems and Software 156 (2019), 283–299.
- Chen et al. (2010) Yan Chen, Teck-Hua Ho, and Yong-mi Kim. 2010. Knowledge market design: A field experiment at Google Answers. Journal of Public Economic Theory 12, 4 (2010), 641–664.
- Fu et al. (2017) Chenbo Fu, Yongli Zheng, Shidi Li, Qi Xuan, and Zhongyuan Ruan. 2017. Predicting the popularity of tags in StackExchange QA communities. In 2017 International Workshop on Complex Systems and Networks (IWCSN). IEEE, 90–95.
- Fu et al. (2020) Xiang Fu, Shangdi Yu, and Austin R Benson. 2020. Modelling and analysis of tagging networks in Stack Exchange communities. Journal of Complex Networks 8, 5 (2020), cnz045.
- Hollander et al. (2013) Myles Hollander, Douglas A Wolfe, and Eric Chicken. 2013. Nonparametric statistical methods. John Wiley & Sons.
- Krishna and Antulov-Fantulin (2022) Vaibhav Krishna and Nino Antulov-Fantulin. 2022. Simplifying Sparse Expert Recommendation by Revisiting Graph Diffusion. arXiv preprint arXiv:2208.02438 (2022).
- Krishna et al. (2022) Vaibhav Krishna, Vaiva Vasiliauskaite, and Nino Antulov-Fantulin. 2022. Topic Community Based Temporal Expertise for Question Routing. arXiv preprint arXiv:2207.01753 (2022).
- Kumar and Chauhan (2022) Shobhan Kumar and Arun Chauhan. 2022. A Transformer Based Encodings for Detection of Semantically Equivalent Questions in cQA. Comput. J. (2022).
- Li et al. (2020) Can Li, Ling Xu, Meng Yan, and Yan Lei. 2020. TagDC: A tag recommendation method for software information sites with a combination of deep learning and collaborative filtering. Journal of Systems and Software 170 (2020), 110783.
- Li et al. (2022) Yuhan Li, Wei Shen, Jianbo Gao, and Yadong Wang. 2022. Community Question Answering Entity Linking via Leveraging Auxiliary Data. arXiv preprint arXiv:2205.11917 (2022).
- Lipczak (2008) Marek Lipczak. 2008. Tag recommendation for folksonomies oriented towards individual users. ECML PKDD discovery challenge 84 (2008), 2008.
- Lipczak and Milios (2010) Marek Lipczak and Evangelos Milios. 2010. Learning in efficient tag recommendation. In Proceedings of the fourth ACM conference on Recommender systems. 167–174.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Liu et al. (2022) Yue Liu, Weize Tang, Zitu Liu, Lin Ding, and Aihua Tang. 2022. High-quality domain expert finding method in CQA based on multi-granularity semantic analysis and interest drift. Information Sciences 596 (2022), 395–413.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
- Lv et al. (2021) Xiaoqi Lv, Ke Ji, Zhenxiang Chen, Kun Ma, Jun Wu, Yidong Li, and Guandong Xu. 2021. Expert Recommendations with Temporal Dynamics of User Interest in CQA. In International Conference on Web Information Systems Engineering. Springer, 645–652.
- Menaha et al. (2021) R Menaha, VE Jayanthi, N Krishnaraj, et al. 2021. A Cluster-based Approach for Finding Domain wise Experts in Community Question Answering System. In Journal of Physics: Conference Series, Vol. 1767. IOP Publishing, 012035.
- Parnell et al. (2011) Laurence D Parnell, Pierre Lindenbaum, Khader Shameer, Giovanni Marco Dall’Olio, Daniel C Swan, Lars Juhl Jensen, Simon J Cockell, Brent S Pedersen, Mary E Mangan, Christopher A Miller, et al. 2011. BioStar: an online question & answer resource for the bioinformatics community. PLoS computational biology 7, 10 (2011), e1002216.
- Sigurbjörnsson and Van Zwol (2008) Börkur Sigurbjörnsson and Roelof Van Zwol. 2008. Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th international conference on World Wide Web. 327–336.
- Singh et al. (2015) Sanjay Singh et al. 2015. Is Stack Overflow Overflowing With Questions and Tags. arXiv preprint arXiv:1508.03601 (2015).
- Sonam et al. (2019) Sonam Sonam, Ayushi Verma, Sangeeta Lal, and Neetu Sardana. 2019. TagStack: Automated system for predicting tags in stackoverflow. In 2019 International Conference on Signal Processing and Communication (ICSC). IEEE, 223–228.
- Tang et al. (2019) Shijie Tang, Yuan Yao, Suwei Zhang, Feng Xu, Tianxiao Gu, Hanghang Tong, Xiaohui Yan, and Jian Lu. 2019. An integral tag recommendation model for textual content. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5109–5116.
- Tondulkar et al. (2018) Rohan Tondulkar, Manisha Dubey, and Maunendra Sankar Desarkar. 2018. Get me the best: predicting best answerers in community question answering sites. In Proceedings of the 12th ACM Conference on Recommender Systems. 251–259.
- Trienes and Balog (2019) Jan Trienes and Krisztian Balog. 2019. Identifying unclear questions in community question answering websites. In European conference on information retrieval. Springer, 276–289.
- Vanam and Pulipati (2021) Divya Vanam and Venkateswara Rao Pulipati. 2021. Identifying Duplicate Questions in Community Question Answering Forums Using Machine Learning Approaches. In Machine Learning Technologies and Applications. Springer, 131–140.
- Vander Wal (2007) Thomas Vander Wal. 2007. Folksonomy.
- Venktesh et al. (2021) V Venktesh, Mukesh Mohania, and Vikram Goyal. 2021. TagRec: Automated Tagging of Questions with Hierarchical Learning Taxonomy. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 381–396.
- Wang et al. (2013) Gang Wang, Konark Gill, Manish Mohanlal, Haitao Zheng, and Ben Y Zhao. 2013. Wisdom in the social crowd: an analysis of quora. In Proceedings of the 22nd international conference on World Wide Web. 1341–1352.
- Wang et al. (2015) Xin-Yu Wang, Xin Xia, and David Lo. 2015. Tagcombine: Recommending tags to contents in software information sites. Journal of Computer Science and Technology 30, 5 (2015), 1017–1035.
- Wankerl et al. (2020) Sebastian Wankerl, Gerhard Götz, and Andreas Hotho. 2020. f2tag—Can Tags be Predicted Using Formulas?. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 565–571.
- Wu et al. (2016) Yong Wu, Yuan Yao, Feng Xu, Hanghang Tong, and Jian Lu. 2016. Tag2word: Using tags to generate words for content based tag recommendation. In Proceedings of the 25th ACM international on conference on information and knowledge management. 2287–2292.
- Zangerle et al. (2011) Eva Zangerle, Wolfgang Gassler, and Günther Specht. 2011. Using tag recommendations to homogenize folksonomies in microblogging environments. In International conference on social informatics. Springer, 113–126.
- Zhang et al. (2017) Wei Emma Zhang, Quan Z Sheng, Jey Han Lau, and Ermyas Abebe. 2017. Detecting duplicate posts in programming QA communities via latent semantics and association rules. In Proceedings of the 26th International Conference on World Wide Web. 1221–1229.
- Zhang et al. (2018) Wei Emma Zhang, Quan Z Sheng, Jey Han Lau, Ermyas Abebe, and Wenjie Ruan. 2018. Duplicate detection in programming question answering communities. ACM Transactions on Internet Technology (TOIT) 18, 3 (2018), 1–21.
- Zhao et al. (2021) Li Xian Zhao, Li Zhang, and Jing Jiang. 2021. Hot question prediction in Stack Overflow. IET Software 15, 1 (2021), 90–106.
Appendix A Domain Statistics
Table 10 shows more details about domain diversity beyond those mentioned in Section 3.1. Cooking and rpg are among the domains with the fewest questions with no answers (under 5%), which indicates that the experts in these domains are very active. The science domains have more than 15% of questions with no answers, which suggests that special knowledge is required to answer such questions. MAXVIEW and MAXANS show the maximum number of users who viewed a question and the maximum number of answers that a question has. NO ACCEPT ANS shows the percentage of questions for which the asker has not accepted any answer. This gives an indication of whether askers are active and also whether the answers are satisfactory.
Domain | Longest Tag | Size (characters) | Shortest Tag | Size (characters) | AvgTLen |
---|---|---|---|---|---|
askubuntu | windows-subsystem-for-linux | 27 | c | 1 | 8.17 |
aviation | performance-based-navigation | 28 | cg | 2 | 10.53 |
biology | neurodegenerative-disorders | 27 | ph | 2 | 10.97 |
chemistry | differential-scanning-calorimetry | 33 | ph | 2 | 12.75 |
cooking | please-remove-this-tag | 22 | ue | 2 | 8.56 |
electronics | semiconductor-process-technology | 32 | c | 1 | 8.80 |
history | articles-of-confederation | 25 | art | 3 | 9.83 |
money | health-reimbursement-arrangement | 32 | w9 | 2 | 10.87 |
movies | valerian-city-of-a-thousand-planets | 35 | m | 1 | 13.66 |
music | solid-body-electric-guitars | 27 | dj | 2 | 9.62 |
philosophy | philosophy-of-political-science | 31 | art | 3 | 11.17 |
physics | heisenberg-uncertainty-principle | 32 | air | 3 | 13.39 |
politics | immigration-customs-enforcement | 31 | alp | 3 | 11.12 |
rpg | werewolf-the-apocalypse-2nd-edition | 35 | e6 | 2 | 12.29 |
scifi | the-hitchhikers-guide-to-the-galaxy | 35 | dc | 2 | 13.31 |
serverfault | google-cloud-internal-load-balancer | 35 | 3g | 2 | 8.87 |
travel | new-zealand-permanent-resident | 30 | eu | 2 | 9.39 |
Appendix B Tag Length Analysis
Table 9 shows the longest and shortest tags in each domain. We also see that the average tag lengths of the movies and physics domains are the highest. We find that movie names or physics topics are often longer than three words, leading to an increase in average tag length.
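The statistics in Table 9 can be computed with a few lines of code. The sketch below assumes that tag length is measured in characters (hyphens included) and that the average is taken over distinct tags rather than weighted by usage; the function name `tag_length_stats` is hypothetical.

```python
from typing import Dict, Iterable, Union


def tag_length_stats(tags: Iterable[str]) -> Dict[str, Union[str, int, float]]:
    """Longest tag, shortest tag, and mean length over distinct tags."""
    distinct = sorted(set(tags))  # sorted so ties resolve deterministically
    if not distinct:
        raise ValueError("at least one tag is required")
    longest = max(distinct, key=len)
    shortest = min(distinct, key=len)
    avg_len = sum(len(t) for t in distinct) / len(distinct)
    return {
        "longest": longest,
        "longest_size": len(longest),
        "shortest": shortest,
        "shortest_size": len(shortest),
        "avg_len": avg_len,
    }


# Three askubuntu tags, used only as a toy input.
example = tag_length_stats(["c", "windows-subsystem-for-linux", "grub2", "c"])
```

For the toy input, the longest tag has 27 characters, matching the askubuntu row of Table 9, while the average of 11.0 reflects only the three tags listed here.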
Domain | Q | T | Q/T | AVGT | NOANS (%) | NOSCORES (%) | NO ACCEPT ANS (%) | MAXANS | MAXVIEW | VIEWGT100 | #ASKERS |
---|---|---|---|---|---|---|---|---|---|---|---|
askubuntu | 371800 | 3121 | 119.13 | 2.78 | 23.47 | 37.21 | 66.99 | 82 | 5409384 | 1093 | 201912 |
aviation | 20345 | 1002 | 20.3 | 2.56 | 7.02 | 9.82 | 46.66 | 18 | 219002 | 12 | 7066 |
biology | 25671 | 739 | 34.74 | 2.58 | 20.73 | 15.16 | 56.13 | 11 | 445257 | 11 | 12089 |
chemistry | 37476 | 375 | 99.94 | 2.37 | 19.36 | 16.86 | 59.12 | 11 | 1077991 | 7 | 17202 |
cooking | 24513 | 833 | 29.43 | 2.3 | 4.6 | 11.79 | 50.57 | 85 | 1619295 | 13 | 12413 |
electronics | 152980 | 2226 | 68.72 | 2.77 | 9.13 | 40.87 | 50.94 | 38 | 591616 | 36 | 61869 |
history | 12562 | 813 | 15.45 | 2.84 | 9.85 | 4.6 | 49.94 | 34 | 994376 | 19 | 5296 |
money | 32648 | 995 | 32.81 | 3.11 | 7.91 | 17.73 | 54.56 | 25 | 821144 | 37 | 18010 |
movies | 20749 | 4348 | 4.77 | 2.09 | 9.48 | 2.96 | 39.08 | 19 | 1183407 | 30 | 6931 |
music | 20925 | 512 | 40.87 | 2.52 | 3.24 | 10.96 | 49.62 | 25 | 611990 | 5 | 10447 |
philosophy | 15624 | 559 | 27.95 | 2.4 | 11 | 15.27 | 63.73 | 31 | 250018 | 6 | 6640 |
physics | 180166 | 893 | 201.75 | 3.17 | 17.49 | 29.54 | 57.08 | 49 | 847876 | 131 | 59774 |
politics | 12416 | 739 | 16.8 | 2.9 | 6.81 | 5.55 | 48.72 | 27 | 833812 | 27 | 3970 |
rpg | 42693 | 1195 | 35.73 | 2.91 | 4.41 | 2.26 | 32.39 | 44 | 865197 | 56 | 11541 |
scifi | 62987 | 3433 | 18.35 | 2.25 | 10.62 | 2.45 | 42.87 | 34 | 1430390 | 153 | 22717 |
serverfault | 299895 | 3814 | 78.63 | 2.9 | 11.68 | 37.05 | 51.93 | 160 | 2478923 | 327 | 130214 |
travel | 42201 | 1891 | 22.32 | 3.28 | 11.2 | 8.19 | 59.48 | 30 | 430504 | 42 | 24895 |

Appendix C Tag Co-Occurrence Distribution Analysis

We analyzed the distribution of the top-50 most frequently occurring tag pairs in each domain (Figures 10 and 11). We observe three main patterns: (1) a smooth distribution, (2) a spike in the top-1 pair, and (3) spikes in the top few pairs. Larger domains like askubuntu, serverfault, electronics, and physics have smooth distributions. Some of the smaller domains like politics, philosophy, and music also show this behavior, which we believe is because the questions in these domains have fine-grained topics. In domains like rpg, money, history, aviation, biology, and chemistry, the tags of the most frequent tag pair, which appears in abundance, are generic in nature. Finally, in domains like movies, scifi, cooking, and travel, a few tag pairs dominate the distributions, indicating their popularity in such smaller domains.
Appendix D Tag Co-Occurrence Examples
Table 11 shows the most frequent tag pairs that appear in each domain.
Domain | Top-5 Most Frequent Tag Pairs |
---|---|
askubuntu | (boot, grub2), (boot, dual-boot), (dual-boot, grub2), (bash, command-line), (apt, package-management) |
aviation | (aerodynamics, aircraft-design), (aircraft-design, wing), (aerodynamics, wing), (aircraft-design, aircraft-performance), (air-traffic-control, faa-regulations) |
biology | (entomology, species-identification), (species-identification, zoology), (botany, species-identification), (neurophysiology, neuroscience), (biochemistry, molecular-biology) |
chemistry | (organic-chemistry, reaction-mechanism), (physical-chemistry, thermodynamics), (aromatic-compounds, organic-chemistry), (nomenclature, organic-chemistry), (carbonyl-compounds, organic-chemistry) |
cooking | (baking, bread), (baking, cake), (baking, cookies), (baking, substitutions), (bread, dough) |
electronics | (current, voltage), (pcb, pcb-design), (power, power-supply), (batteries, battery-charging), (microcontroller, pic) |
history | (nazi-germany, world-war-two), (united-states, world-war-two), (europe, middle-ages), (japan, world-war-two), (military, world-war-two) |
money | (taxes, united-states), (income-tax, united-states), (401k, united-states), (income-tax, taxes), (tax-deduction, united-states) |
movies | (character, plot-explanation), (marvel-cinematic-universe, plot-explanation), (game-of-thrones, plot-explanation), (analysis, plot-explanation), (avengers-infinity-war, marvel-cinematic-universe) |
music | (chords, theory), (chord-theory, chords), (harmony, theory), (scales, theory), (chord-theory, theory) |
philosophy | (logic, philosophy-of-mathematics), (epistemology, philosophy-of-science), (fallacies, logic), (logic, symbolic-logic), (metaphysics, ontology) |
physics | (homework-and-exercises, newtonian-mechanics), (forces, newtonian-mechanics), (hilbert-space, quantum-mechanics), (operators, quantum-mechanics), (quantum-mechanics, wavefunction) |
politics | (donald-trump, united-states), (president, united-states), (presidential-election, united-states), (congress, united-states), (election, united-states) |
rpg | (dnd-5e, spells), (dnd-5e, magic-items), (class-feature, dnd-5e), (dnd-5e, monsters), (pathfinder-1e, spells) |
scifi | (short-stories, story-identification), (marvel, marvel-cinematic-universe), (books, story-identification), (the-lord-of-the-rings, tolkiens-legendarium), (novel, story-identification) |
serverfault | (linux, ubuntu), (centos, linux), (amazon-ec2, amazon-web-services), (linux, networking), (apache-2.2, php) |
travel | (uk, visas), (schengen, visas), (usa, visas), (customs-and-immigration, usa), (indian-citizens, visas) |
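Pair frequencies such as those in Table 11 can be obtained by counting unordered tag pairs per question. The sketch below assumes each question is given as a list of its tags and treats a pair as unordered by sorting the tags alphabetically; the function name `top_tag_pairs` and the toy questions are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple


def top_tag_pairs(posts: Sequence[Sequence[str]],
                  n: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Return the n most frequent unordered tag pairs with their counts."""
    counts: Counter = Counter()
    for tags in posts:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts.most_common(n)


# Toy questions for illustration only.
toy_questions = [
    ["boot", "grub2"],
    ["grub2", "boot", "dual-boot"],
    ["bash", "command-line"],
]
top_pair = top_tag_pairs(toy_questions, n=1)
```

Here the pair (boot, grub2) occurs in two of the three toy questions and is therefore returned first.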
Appendix E Tag Distributions
Figure 12 shows the distribution of top-100 most frequent tags in each domain.

Appendix F Tag Ordering Example
Tables 12, 13, and 14 show the top-10 most frequently occurring tag pairs in each domain. On manual analysis, we found that in most cases the meta-tag appears before the refined tags.
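The Order-1 and Order-2 percentages for a given pair can be sketched as follows, assuming that each question's tags are stored in the order in which they appear on the post. The function name `pair_order_stats` and the toy questions are hypothetical.

```python
from typing import Sequence, Tuple


def pair_order_stats(posts: Sequence[Sequence[str]],
                     a: str, b: str) -> Tuple[int, float, float]:
    """Return (total, % with a before b, % with b before a) for one tag pair."""
    a_first = 0
    b_first = 0
    for tags in posts:
        if a in tags and b in tags:
            if tags.index(a) < tags.index(b):
                a_first += 1
            else:
                b_first += 1
    total = a_first + b_first
    if total == 0:
        return 0, 0.0, 0.0
    return total, 100.0 * a_first / total, 100.0 * b_first / total


# Toy questions for illustration only.
ordered_questions = [
    ["boot", "grub2"],
    ["boot", "kernel", "grub2"],
    ["boot", "grub2", "uefi"],
    ["grub2", "boot"],
    ["bash"],
]
total, order1_pct, order2_pct = pair_order_stats(ordered_questions, "boot", "grub2")
```

In the toy questions, the pair co-occurs four times and `boot` precedes `grub2` in three of them, giving 75.0% for Order-1 and 25.0% for Order-2.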
Domain | Total | Order-1 | % | Order-2 | % |
---|---|---|---|---|---|
askubuntu | 5845 | (boot,grub2) | 99.93 | (grub2,boot) | 0.07 |
askubuntu | 5174 | (boot,dual-boot) | 99.96 | (dual-boot,boot) | 0.04 |
askubuntu | 5104 | (dual-boot,grub2) | 91.12 | (grub2,dual-boot) | 8.88 |
askubuntu | 4552 | (bash,command-line) | 1.89 | (command-line,bash) | 98.11 |
askubuntu | 4547 | (apt,package-management) | 98.53 | (package-management,apt) | 1.47 |
askubuntu | 4304 | (networking,wireless) | 70.07 | (wireless,networking) | 29.93 |
askubuntu | 4178 | (dual-boot,partitioning) | 97.75 | (partitioning,dual-boot) | 2.25 |
askubuntu | 4128 | (drivers,nvidia) | 99.93 | (nvidia,drivers) | 0.07 |
askubuntu | 3257 | (networking,server) | 97.97 | (server,networking) | 2.03 |
askubuntu | 3003 | (bash,scripts) | 99.9 | (scripts,bash) | 0.1 |
aviation | 417 | (aerodynamics,aircraft-design) | 1.68 | (aircraft-design,aerodynamics) | 98.32 |
aviation | 221 | (aircraft-design,wing) | 100 | (wing,aircraft-design) | 0 |
aviation | 221 | (aerodynamics,wing) | 100 | (wing,aerodynamics) | 0 |
aviation | 183 | (aircraft-design,aircraft-performance) | 100 | (aircraft-performance,aircraft-design) | 0 |
aviation | 138 | (air-traffic-control,faa-regulations) | 0 | (faa-regulations,air-traffic-control) | 100 |
aviation | 136 | (faa-regulations,instrument-flight-rules) | 100 | (instrument-flight-rules,faa-regulations) | 0 |
aviation | 127 | (aerodynamics,lift) | 100 | (lift,aerodynamics) | 0 |
aviation | 125 | (aerodynamics,airfoil) | 100 | (airfoil,aerodynamics) | 0 |
aviation | 124 | (air-traffic-control,radio-communications) | 100 | (radio-communications,air-traffic-control) | 0 |
aviation | 124 | (aerodynamics,aircraft-performance) | 100 | (aircraft-performance,aerodynamics) | 0 |
biology | 731 | (entomology,species-identification) | 10.81 | (species-identification,entomology) | 89.19 |
biology | 361 | (species-identification,zoology) | 76.18 | (zoology,species-identification) | 23.82 |
biology | 350 | (botany,species-identification) | 44.29 | (species-identification,botany) | 55.71 |
biology | 322 | (neurophysiology,neuroscience) | 0 | (neuroscience,neurophysiology) | 100 |
biology | 321 | (biochemistry,molecular-biology) | 99.69 | (molecular-biology,biochemistry) | 0.31 |
biology | 274 | (dna,genetics) | 4.38 | (genetics,dna) | 95.62 |
biology | 272 | (evolution,genetics) | 37.5 | (genetics,evolution) | 62.5 |
biology | 256 | (botany,plant-physiology) | 98.05 | (plant-physiology,botany) | 1.95 |
biology | 251 | (entomology,zoology) | 0.8 | (zoology,entomology) | 99.2 |
biology | 247 | (cell-biology,molecular-biology) | 1.21 | (molecular-biology,cell-biology) | 98.79 |
chemistry | 1621 | (organic-chemistry,reaction-mechanism) | 100 | (reaction-mechanism,organic-chemistry) | 0 |
chemistry | 703 | (physical-chemistry,thermodynamics) | 99.43 | (thermodynamics,physical-chemistry) | 0.57 |
chemistry | 648 | (aromatic-compounds,organic-chemistry) | 0 | (organic-chemistry,aromatic-compounds) | 100 |
chemistry | 585 | (nomenclature,organic-chemistry) | 0 | (organic-chemistry,nomenclature) | 100 |
chemistry | 529 | (carbonyl-compounds,organic-chemistry) | 0 | (organic-chemistry,carbonyl-compounds) | 100 |
chemistry | 526 | (acid-base,organic-chemistry) | 0 | (organic-chemistry,acid-base) | 100 |
chemistry | 457 | (organic-chemistry,synthesis) | 100 | (synthesis,organic-chemistry) | 0 |
chemistry | 429 | (organic-chemistry,stereochemistry) | 100 | (stereochemistry,organic-chemistry) | 0 |
chemistry | 420 | (acid-base,ph) | 100 | (ph,acid-base) | 0 |
chemistry | 348 | (acid-base,inorganic-chemistry) | 0 | (inorganic-chemistry,acid-base) | 100 |
cooking | 393 | (baking,bread) | 99.75 | (bread,baking) | 0.25 |
cooking | 290 | (baking,cake) | 100 | (cake,baking) | 0 |
cooking | 180 | (baking,cookies) | 100 | (cookies,baking) | 0 |
cooking | 179 | (baking,substitutions) | 91.06 | (substitutions,baking) | 8.94 |
cooking | 137 | (bread,dough) | 91.24 | (dough,bread) | 8.76 |
cooking | 131 | (bread,sourdough) | 100 | (sourdough,bread) | 0 |
cooking | 124 | (baking,dough) | 100 | (dough,baking) | 0 |
cooking | 122 | (bread,yeast) | 96.72 | (yeast,bread) | 3.28 |
cooking | 116 | (baking,oven) | 100 | (oven,baking) | 0 |
cooking | 111 | (dough,pizza) | 91.89 | (pizza,dough) | 8.11 |
electronics | 1161 | (current,voltage) | 0.6 | (voltage,current) | 99.4 |
electronics | 1138 | (pcb,pcb-design) | 100 | (pcb-design,pcb) | 0 |
electronics | 1043 | (power,power-supply) | 0.48 | (power-supply,power) | 99.52 |
electronics | 844 | (batteries,battery-charging) | 100 | (battery-charging,batteries) | 0 |
electronics | 775 | (microcontroller,pic) | 98.58 | (pic,microcontroller) | 1.42 |
electronics | 620 | (amplifier,operational-amplifier) | 3.87 | (operational-amplifier,amplifier) | 96.13 |
electronics | 619 | (power-supply,switch-mode-power-supply) | 100 | (switch-mode-power-supply,power-supply) | 0 |
electronics | 612 | (bjt,transistors) | 0.49 | (transistors,bjt) | 99.51 |
electronics | 598 | (mosfet,transistors) | 0.17 | (transistors,mosfet) | 99.83 |
electronics | 587 | (arduino,microcontroller) | 86.03 | (microcontroller,arduino) | 13.97 |
history | 298 | (nazi-germany,world-war-two) | 0.67 | (world-war-two,nazi-germany) | 99.33 |
history | 179 | (united-states,world-war-two) | 94.41 | (world-war-two,united-states) | 5.59 |
history | 153 | (europe,middle-ages) | 0 | (middle-ages,europe) | 100 |
history | 141 | (japan,world-war-two) | 0 | (world-war-two,japan) | 100 |
history | 138 | (military,world-war-two) | 0.72 | (world-war-two,military) | 99.28 |
history | 136 | (19th-century,united-states) | 0 | (united-states,19th-century) | 100 |
history | 134 | (soviet-union,world-war-two) | 0.75 | (world-war-two,soviet-union) | 99.25 |
history | 117 | (20th-century,united-states) | 8.55 | (united-states,20th-century) | 91.45 |
history | 106 | (ancient-rome,roman-empire) | 84.91 | (roman-empire,ancient-rome) | 15.09 |
history | 105 | (ancient-history,ancient-rome) | 99.05 | (ancient-rome,ancient-history) | 0.95 |
money | 3393 | (taxes,united-states) | 0.03 | (united-states,taxes) | 99.97 |
money | 2087 | (income-tax,united-states) | 0.05 | (united-states,income-tax) | 99.95 |
money | 883 | (401k,united-states) | 0 | (united-states,401k) | 100 |
money | 839 | (income-tax,taxes) | 3.81 | (taxes,income-tax) | 96.19 |
money | 662 | (tax-deduction,united-states) | 0.15 | (united-states,tax-deduction) | 99.85 |
money | 638 | (investing,stocks) | 16.3 | (stocks,investing) | 83.7 |
money | 613 | (ira,united-states) | 0 | (united-states,ira) | 100 |
money | 604 | (investing,united-states) | 0 | (united-states,investing) | 100 |
money | 554 | (mortgage,united-states) | 0 | (united-states,mortgage) | 100 |
money | 541 | (roth-ira,united-states) | 0 | (united-states,roth-ira) | 100 |
movies | 518 | (character,plot-explanation) | 2.9 | (plot-explanation,character) | 97.1 |
movies | 509 | (marvel-cinematic-universe,plot-explanation) | 0.2 | (plot-explanation,marvel-cinematic-universe) | 99.8 |
movies | 367 | (game-of-thrones,plot-explanation) | 0.82 | (plot-explanation,game-of-thrones) | 99.18 |
movies | 242 | (analysis,plot-explanation) | 7.85 | (plot-explanation,analysis) | 92.15 |
movies | 233 | (avengers-infinity-war,marvel-cinematic-universe) | 0 | (marvel-cinematic-universe,avengers-infinity-war) | 100 |
movies | 205 | (character,marvel-cinematic-universe) | 100 | (marvel-cinematic-universe,character) | 0 |
movies | 199 | (avengers-endgame,marvel-cinematic-universe) | 0 | (marvel-cinematic-universe,avengers-endgame) | 100 |
movies | 184 | (analysis,character) | 29.89 | (character,analysis) | 70.11 |
movies | 179 | (dialogue,plot-explanation) | 5.59 | (plot-explanation,dialogue) | 94.41 |
movies | 143 | (ending,plot-explanation) | 2.8 | (plot-explanation,ending) | 97.2 |
music | 519 | (chords,theory) | 0 | (theory,chords) | 100 |
music | 490 | (chord-theory,chords) | 0 | (chords,chord-theory) | 100 |
music | 435 | (harmony,theory) | 0 | (theory,harmony) | 100 |
music | 410 | (scales,theory) | 0 | (theory,scales) | 100 |
music | 404 | (chord-theory,theory) | 0 | (theory,chord-theory) | 100 |
music | 363 | (electric-guitar,guitar) | 0 | (guitar,electric-guitar) | 100 |
music | 337 | (notation,sheet-music) | 99.41 | (sheet-music,notation) | 0.59 |
music | 329 | (chords,guitar) | 0 | (guitar,chords) | 100 |
music | 328 | (chord-progressions,theory) | 0 | (theory,chord-progressions) | 100 |
music | 306 | (notation,piano) | 0 | (piano,notation) | 100 |
philosophy | 272 | (logic,philosophy-of-mathematics) | 100 | (philosophy-of-mathematics,logic) | 0 |
philosophy | 266 | (epistemology,philosophy-of-science) | 94.36 | (philosophy-of-science,epistemology) | 5.64 |
philosophy | 246 | (fallacies,logic) | 0.41 | (logic,fallacies) | 99.59 |
philosophy | 193 | (logic,symbolic-logic) | 100 | (symbolic-logic,logic) | 0 |
philosophy | 186 | (metaphysics,ontology) | 100 | (ontology,metaphysics) | 0 |
philosophy | 186 | (logic,philosophy-of-logic) | 100 | (philosophy-of-logic,logic) | 0 |
philosophy | 183 | (argumentation,logic) | 0.55 | (logic,argumentation) | 99.45 |
philosophy | 179 | (epistemology,metaphysics) | 100 | (metaphysics,epistemology) | 0 |
philosophy | 179 | (epistemology,logic) | 1.68 | (logic,epistemology) | 98.32 |
philosophy | 178 | (logic,proof) | 100 | (proof,logic) | 0 |
physics | 4182 | (homework-and-exercises,newtonian-mechanics) | 99.74 | (newtonian-mechanics,homework-and-exercises) | 0.26 |
physics | 3658 | (forces,newtonian-mechanics) | 0.52 | (newtonian-mechanics,forces) | 99.48 |
physics | 2565 | (hilbert-space,quantum-mechanics) | 0 | (quantum-mechanics,hilbert-space) | 100 |
physics | 2360 | (operators,quantum-mechanics) | 0 | (quantum-mechanics,operators) | 100 |
physics | 2337 | (quantum-mechanics,wavefunction) | 100 | (wavefunction,quantum-mechanics) | 0 |
physics | 2238 | (electromagnetism,magnetic-fields) | 99.82 | (magnetic-fields,electromagnetism) | 0.18 |
physics | 2196 | (homework-and-exercises,quantum-mechanics) | 0 | (quantum-mechanics,homework-and-exercises) | 100 |
physics | 1988 | (newtonian-gravity,newtonian-mechanics) | 0 | (newtonian-mechanics,newtonian-gravity) | 100 |
physics | 1767 | (quantum-mechanics,schroedinger-equation) | 100 | (schroedinger-equation,quantum-mechanics) | 0 |
physics | 1704 | (black-holes,general-relativity) | 0 | (general-relativity,black-holes) | 100 |
politics | 570 | (donald-trump,united-states) | 0 | (united-states,donald-trump) | 100 |
politics | 557 | (president,united-states) | 0 | (united-states,president) | 100 |
politics | 523 | (presidential-election,united-states) | 0 | (united-states,presidential-election) | 100 |
politics | 478 | (congress,united-states) | 0 | (united-states,congress) | 100 |
politics | 475 | (election,united-states) | 0.63 | (united-states,election) | 99.37 |
politics | 467 | (brexit,united-kingdom) | 0 | (united-kingdom,brexit) | 100 |
politics | 328 | (constitution,united-states) | 0 | (united-states,constitution) | 100 |
politics | 282 | (law,united-states) | 0.35 | (united-states,law) | 99.65 |
politics | 279 | (senate,united-states) | 0.36 | (united-states,senate) | 99.64 |
politics | 254 | (united-states,voting) | 100 | (voting,united-states) | 0 |
rpg | 5330 | (dnd-5e,spells) | 99.21 | (spells,dnd-5e) | 0.79 |
rpg | 1367 | (dnd-5e,magic-items) | 100 | (magic-items,dnd-5e) | 0 |
rpg | 1212 | (class-feature,dnd-5e) | 0 | (dnd-5e,class-feature) | 100 |
rpg | 1204 | (dnd-5e,monsters) | 99.83 | (monsters,dnd-5e) | 0.17 |
rpg | 1188 | (pathfinder-1e,spells) | 90.24 | (spells,pathfinder-1e) | 9.76 |
rpg | 959 | (dnd-3.5e,spells) | 72.78 | (spells,dnd-3.5e) | 27.22 |
rpg | 676 | (dnd-5e,feats) | 99.85 | (feats,dnd-5e) | 0.15 |
rpg | 632 | (dnd-5e,warlock) | 100 | (warlock,dnd-5e) | 0 |
rpg | 607 | (balance,dnd-5e) | 0.16 | (dnd-5e,balance) | 99.84 |
rpg | 567 | (combat,dnd-5e) | 0.53 | (dnd-5e,combat) | 99.47 |
scifi | 3514 | (short-stories,story-identification) | 1.05 | (story-identification,short-stories) | 98.95 |
scifi | 2109 | (marvel,marvel-cinematic-universe) | 76.67 | (marvel-cinematic-universe,marvel) | 23.33 |
scifi | 2029 | (books,story-identification) | 0.74 | (story-identification,books) | 99.26 |
scifi | 1922 | (the-lord-of-the-rings,tolkiens-legendarium) | 52.76 | (tolkiens-legendarium,the-lord-of-the-rings) | 47.24 |
scifi | 1859 | (novel,story-identification) | 1.02 | (story-identification,novel) | 98.98 |
scifi | 1638 | (movie,story-identification) | 1.47 | (story-identification,movie) | 98.53 |
scifi | 1497 | (star-trek,star-trek-tng) | 99.67 | (star-trek-tng,star-trek) | 0.33 |
scifi | 1077 | (aliens,story-identification) | 2.04 | (story-identification,aliens) | 97.96 |
scifi | 866 | (a-song-of-ice-and-fire,game-of-thrones) | 6.24 | (game-of-thrones,a-song-of-ice-and-fire) | 93.76 |
scifi | 723 | (star-wars,star-wars-legends) | 100 | (star-wars-legends,star-wars) | 0 |
serverfault | 3261 | (linux,ubuntu) | 98.13 | (ubuntu,linux) | 1.87 |
serverfault | 2865 | (centos,linux) | 1.33 | (linux,centos) | 98.67 |
serverfault | 2498 | (amazon-ec2,amazon-web-services) | 76.7 | (amazon-web-services,amazon-ec2) | 23.3 |
serverfault | 2452 | (linux,networking) | 99.14 | (networking,linux) | 0.86 |
serverfault | 1912 | (apache-2.2,php) | 86.72 | (php,apache-2.2) | 13.28 |
serverfault | 1803 | (debian,linux) | 1.5 | (linux,debian) | 98.5 |
serverfault | 1716 | (linux,ssh) | 98.19 | (ssh,linux) | 1.81 |
serverfault | 1643 | (apache-2.2,linux) | 2.01 | (linux,apache-2.2) | 97.99 |
serverfault | 1560 | (iptables,linux) | 1.15 | (linux,iptables) | 98.85 |
serverfault | 1466 | (apache-2.2,virtualhost) | 96.18 | (virtualhost,apache-2.2) | 3.82 |
travel | 2181 | (uk,visas) | 0.05 | (visas,uk) | 99.95 |
travel | 1779 | (schengen,visas) | 0.06 | (visas,schengen) | 99.94 |
travel | 1340 | (usa,visas) | 2.24 | (visas,usa) | 97.76 |
travel | 871 | (customs-and-immigration,usa) | 0 | (usa,customs-and-immigration) | 100 |
travel | 795 | (indian-citizens,visas) | 0 | (visas,indian-citizens) | 100 |
travel | 727 | (transit,visas) | 0 | (visas,transit) | 100 |
travel | 726 | (customs-and-immigration,visas) | 0 | (visas,customs-and-immigration) | 100 |
travel | 643 | (standard-visitor-visas,uk) | 0 | (uk,standard-visitor-visas) | 100 |
travel | 566 | (uk,visa-refusals) | 100 | (visa-refusals,uk) | 0 |
travel | 511 | (visa-refusals,visas) | 0 | (visas,visa-refusals) | 100 |
Domain | Title EMS | Title EMM | Title+Body EMS | Title+Body EMM | Title+Body+Answer EMS | Title+Body+Answer EMM |
---|---|---|---|---|---|---|
askubuntu | 56.94 | 71.64 | 77.29 | 88.67 | 81.46 | 91.15 |
aviation | 29.09 | 49.63 | 47.53 | 66.11 | 58.98 | 75.76 |
biology | 17.95 | 29.68 | 33.70 | 47.17 | 42.92 | 56.97 |
chemistry | 19.72 | 29.26 | 32.81 | 46.44 | 40.99 | 56.20 |
cooking | 53.51 | 71.04 | 72.75 | 82.92 | 80.87 | 88.40 |
electronics | 55.24 | 71.11 | 75.21 | 86.69 | 80.6 | 89.99 |
history | 22.23 | 44.34 | 43.76 | 67.23 | 59.66 | 79.51 |
money | 34.47 | 57.89 | 56.32 | 78.66 | 66.62 | 86.23 |
movies | 9.49 | 34.51 | 17.69 | 72.92 | 24.06 | 77.32 |
music | 37.42 | 53.76 | 60.81 | 75.32 | 75.02 | 85.78 |
philosophy | 23.85 | 43.24 | 44.85 | 63.77 | 59.81 | 75.54 |
physics | 24.34 | 40.95 | 40.25 | 61.48 | 48.24 | 70.20 |
politics | 28.37 | 54.11 | 52.64 | 77.52 | 67.41 | 88.21 |
rpg | 23.69 | 41.14 | 44.72 | 65.13 | 57.52 | 76.20 |
scifi | 17.76 | 38.31 | 28.38 | 60.49 | 34.63 | 70.64 |
serverfault | 58.67 | 74.50 | 76.70 | 89.34 | 80.02 | 91.51 |
travel | 45.93 | 63.66 | 65.88 | 79.76 | 75.11 | 86.38 |
Appendix G Tag-Post Overlap: Full Table
Appendix H Decoding Phase of the MRPG Model
We let the model generate tags up to the maximum output length given as an input parameter, and then apply a few heuristics to select appropriate tag tokens and choose the top-k tags. The heuristics encode prior knowledge about what a tag token should look like: (1) a tag cannot start or end with a '-'; (2) punctuation tokens are skipped; (3) adjacent repeated tags are ignored. We then combine the tag tokens between two tagsep tokens to form the final tag. We also compute the top-k most probable tags based on the combined probability scores of their tag tokens.
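The filtering step can be illustrated with a minimal sketch. Several details here are assumptions, not taken from the paper: the separator is spelled `<tagsep>`, tag tokens are joined by plain concatenation, a leading or trailing '-' is stripped instead of discarding the tag, and the combined score is the product of the token probabilities. The function name `decode_tags` is hypothetical.

```python
import math
import string

TAG_SEP = "<tagsep>"                      # assumed spelling of the separator token
PUNCT = set(string.punctuation) - {"-"}   # hyphens may legitimately occur inside tags


def decode_tags(tokens, probs, k=5):
    """Turn a generated token sequence into at most k (tag, score) pairs."""
    tags = []
    pieces, logp = [], 0.0

    def flush():
        # Close the current tag and reset the accumulators.
        nonlocal pieces, logp
        tag = "".join(pieces).strip("-")              # heuristic 1: no leading/trailing '-'
        if tag and (not tags or tags[-1][0] != tag):  # heuristic 3: drop adjacent repeats
            tags.append((tag, math.exp(logp)))
        pieces, logp = [], 0.0

    for tok, p in zip(tokens, probs):
        if tok == TAG_SEP:
            flush()
        elif all(ch in PUNCT for ch in tok):          # heuristic 2: skip punctuation tokens
            continue
        else:
            pieces.append(tok)
            logp += math.log(max(p, 1e-12))           # assumed score: product of token probs
    flush()

    # Keep the best score per distinct tag, then return the k highest-scoring tags.
    best = {}
    for tag, score in tags:
        best[tag] = max(score, best.get(tag, 0.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

For example, the sequence `dnd - 5e <tagsep> dnd - 5e <tagsep> , spells - <tagsep> feats` would yield the tags `dnd-5e`, `spells`, and `feats`: the repeated `dnd-5e` is dropped, the comma is skipped, and the trailing hyphen of `spells-` is removed.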
Appendix I Feature-based Model Configurations:
For both the tf-idf and bag-of-words features, we consider unigrams and bigrams with a minimum document frequency of 0.00009, and we generate at most 200,000 features. We use log loss and search the hyper-parameter space alpha = [0.0001, 0.001, 0.00001] and penalty=[, ] for the one-versus-rest classifier trained with stochastic gradient descent. For both models, we find that penalty with alpha = 0.00001 yields the best performance.
Appendix J P-values for Hit@5
Table 16 shows the p-values obtained when the MRPG model's Hit@5 is compared with that of the MP model. Significance is assessed with a one-sided Wilcoxon test (Hollander et al., 2013). For k = 1, 2, 3, 4, the MRPG model's Hit@k improves significantly over the MP model. The MRPG model also significantly outperforms all other baselines on Hit@k for every value of k.
Domains | P-Values | Is Significant |
---|---|---|
askubuntu | 0.03125 | Yes |
aviation | 0.15625 | No |
biology | 1.00000 | No |
chemistry | 0.03125 | Yes |
cooking | 0.03125 | Yes |
electronics | 0.03125 | Yes |
history | 0.09375 | No |
money | 0.03125 | Yes |
movies | 0.40625 | No |
music | 0.03125 | Yes |
philosophy | 0.50000 | No |
physics | 0.03125 | Yes |
politics | 0.03125 | Yes |
rpg | 0.03125 | Yes |
scifi | 0.03125 | Yes |
serverfault | 0.03125 | Yes |
travel | 0.03125 | Yes |
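Every p-value in Table 16 is a multiple of 1/32, which is what an exact one-sided Wilcoxon signed-rank test produces with five paired observations (2^5 = 32 equally likely sign patterns under the null hypothesis). That the test was run over five paired runs per domain is our reading of the numbers, not something the text states. The following sketch, with the hypothetical helper `wilcoxon_one_sided`, shows how such a p-value arises.

```python
from itertools import product


def wilcoxon_one_sided(x, y):
    """Exact p-value for the alternative that x tends to exceed y (paired samples)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1

    # Test statistic: sum of the ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null hypothesis, each of the 2^n sign patterns is equally likely.
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return hits / 2 ** n
```

With five pairs in which the first sample always wins, the function returns 1/32 = 0.03125, the smallest attainable value and the one reported for most domains. With five pairs in which it always loses, it returns 1.0, as reported for biology. In practice, a library routine such as SciPy's `scipy.stats.wilcoxon` with `alternative="greater"` would be the more usual choice.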
Appendix K Detailed Tag-Post Coverage %
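Table 17 shows the detailed tag-post coverage. #T is the number of distinct tags in the domain, the TopN columns give the coverage (%) of the N most frequent tags, and 100T% is the share of all tags that the top 100 represent (100/#T, in percent).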
Domain | #T | Top1 | Top3 | Top5 | Top10 | Top50 | Top100 | 100T% |
---|---|---|---|---|---|---|---|---|
askubuntu | 3121 | 5.67 | 15.87 | 24.81 | 40.21 | 71.84 | 82.68 | 3.2 |
aviation | 1002 | 11.05 | 25.81 | 33.87 | 45.93 | 79.13 | 89.43 | 9.98 |
biology | 739 | 9.22 | 23.91 | 37.84 | 55.05 | 84.39 | 91.76 | 13.53 |
chemistry | 375 | 23.05 | 42.61 | 48.62 | 61.38 | 87.69 | 95.35 | 26.67 |
cooking | 833 | 9.55 | 22.45 | 29.55 | 38.99 | 71.45 | 85.19 | 12 |
electronics | 2226 | 4.94 | 13.84 | 20.88 | 32.81 | 68.96 | 81.98 | 4.49 |
history | 813 | 10.86 | 25.08 | 35.27 | 45.91 | 80.82 | 89.95 | 12.3 |
money | 995 | 37.04 | 49.69 | 56.62 | 68.52 | 88.33 | 94.18 | 10.05 |
movies | 4348 | 36.93 | 49.59 | 56.36 | 66.84 | 81.59 | 85.88 | 2.3 |
music | 512 | 14.93 | 39.08 | 47.59 | 58.04 | 87.42 | 94.54 | 19.53 |
philosophy | 559 | 19.39 | 37.1 | 48.56 | 63.3 | 87.29 | 93.77 | 17.89 |
physics | 893 | 12.7 | 28.35 | 39.99 | 55.1 | 83.98 | 91.68 | 11.2 |
politics | 739 | 46 | 59.16 | 63.64 | 66.41 | 89.63 | 94.95 | 13.53 |
rpg | 1195 | 42.5 | 61.23 | 76.9 | 79.75 | 88.01 | 92.66 | 8.37 |
scifi | 3433 | 27.86 | 47.75 | 62.03 | 70.67 | 81.32 | 85.04 | 2.91 |
serverfault | 3814 | 11.92 | 22.16 | 29.97 | 42.76 | 72.8 | 82.86 | 2.62 |
travel | 1891 | 22.2 | 36.03 | 48.34 | 58.34 | 84.39 | 92.36 | 5.29 |
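The coverage figures can be sketched as follows. We assume here that Top-N coverage means the percentage of posts carrying at least one of the N most frequent tags; the text does not spell out the definition, so this is an interpretation. The function names are hypothetical, and `posts` is assumed to be a non-empty list of tag lists.

```python
from collections import Counter


def tag_post_coverage(posts, n):
    """Percentage of posts carrying at least one of the n most frequent tags."""
    # Count each tag once per post.
    counts = Counter(tag for tags in posts for tag in set(tags))
    top = {tag for tag, _ in counts.most_common(n)}
    covered = sum(1 for tags in posts if top.intersection(tags))
    return 100.0 * covered / len(posts)


def top_share(n, num_tags):
    """Share (%) of the tag vocabulary that the top n tags represent."""
    return 100.0 * min(n, num_tags) / num_tags
```

For instance, `top_share(100, 375)` rounds to 26.67, which matches the 100T% entry for chemistry in Table 17.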
Appendix L Effect of Using Answers
Answers can be used in domains or organizations where some answers have already been posted before the tag-prediction approach is deployed. The motivation comes directly from our Tag-Post Overlap analysis in Table 3, where tags overlap with the post text in at least 70% of posts in 15 of the 17 domains; the exceptions are chemistry and biology. Even in these two domains, the overlap increases by around 9-10% when answers are included. In some domains, the overlap rises to 91%.
Domain | MP (90) | MP (85) | MRPG (90) | MRPG (85) |
---|---|---|---|---|
askubuntu | 80.42 | 75.73 (-4.69) | 83.18 | 80.92 (-2.26) |
aviation | 77.12 | 73.21 (-3.91) | 77.64 | 77.68 (0.04) |
biology | 79.31 | 76.35 (-2.96) | 78.03 | 77.41 (-0.62) |
chemistry | 77.77 | 75.62 (-2.15) | 79.51 | 79.63 (0.12) |
cooking | 80.42 | 76.81 (-3.61) | 85.38 | 85.29 (-0.09) |
electronics | 77.92 | 73.69 (-4.23) | 81.62 | 80.56 (-1.06) |
history | 80.57 | 77.59 (-2.98) | 82.29 | 81.21 (-1.08) |
money | 84.46 | 80.38 (-4.08) | 88.19 | 87.9 (-0.29) |
movies | 83.54 | 78.6 (-4.94) | 82.77 | 82.8 (0.03) |
music | 82.72 | 78.73 (-3.99) | 84.37 | 84.18 (-0.19) |
philosophy | 79.17 | 74.4 (-4.77) | 79.58 | 79.1 (-0.48) |
physics | 81.49 | 77.3 (-4.19) | 86.48 | 85.78 (-0.7) |
politics | 86.43 | 82.4 (-4.03) | 91.38 | 90.74 (-0.64) |
rpg | 83.71 | 79.41 (-4.3) | 89.23 | 88.1 (-1.13) |
scifi | 85.81 | 82.22 (-3.59) | 91.55 | 90.72 (-0.83) |
serverfault | 81.87 | 77.26 (-4.61) | 85.9 | 85.04 (-0.86) |
travel | 84.09 | 79.41 (-4.68) | 89.47 | 88.53 (-0.94) |
Domain | P (90) | G (90) | P (85) | G (85) | P (95) | G (95) |
---|---|---|---|---|---|---|
askubuntu | 49.66 | 11.85 | 46.5(-3.16) | 14.3(2.45) | 59.27(9.61) | 7.54(-4.31) |
aviation | 53.8 | 10.42 | 47.78(-6.02) | 13.76(3.34) | 61.54(7.74) | 5.41(-5.01) |
biology | 53.86 | 9.88 | 49.32(-4.54) | 11.47(1.59) | 61.1(7.24) | 5.8(-4.08) |
chemistry | 55.18 | 10.61 | 50.89(-4.29) | 12.96(2.35) | 63.87(8.69) | 6.31(-4.3) |
cooking | 55.03 | 10.95 | 47.15(-7.88) | 14.93(3.98) | 64.45(9.42) | 6.24(-4.71) |
electronics | 52.07 | 11.28 | 46.28(-5.79) | 14.39(3.11) | 59.52(7.45) | 7.24(-4.04) |
history | 55.06 | 9.91 | 52.15(-2.91) | 10.51(0.6) | 65.25(10.19) | 3.94(-5.97) |
money | 50.17 | 10.23 | 43.74(-6.43) | 12.82(2.59) | 61.49(11.32) | 5.99(-4.24) |
movies | 68.51 | 4.55 | 58.63(-9.88) | 7.4(2.85) | 74.17(5.66) | 1.61(-2.94) |
music | 56.46 | 10.18 | 50.32(-6.14) | 13.19(3.01) | 64.44(7.98) | 6.69(-3.49) |
philosophy | 58.98 | 7.81 | 53.02(-5.96) | 10.85(3.04) | 64.9(5.92) | 4.61(-3.2) |
physics | 45.01 | 12 | 39.84(-5.17) | 14.73(2.73) | 55.04(10.03) | 8.41(-3.59) |
politics | 52.6 | 9.14 | 44.87(-7.73) | 12.81(3.67) | 67.38(14.78) | 4.51(-4.63) |
rpg | 51.89 | 7.68 | 39.03(-12.86) | 11.34(3.66) | 68.68(16.79) | 3.97(-3.71) |
scifi | 74.64 | 5.11 | 62.79(-11.85) | 8.37(3.26) | 84.69(10.05) | 1.6(-3.51) |
serverfault | 50.63 | 11.9 | 42.96(-7.67) | 15.97(4.07) | 58.55(7.92) | 8.45(-3.45) |
travel | 46.26 | 12.39 | 40.6(-5.66) | 15.94(3.55) | 53.59(7.33) | 8.19(-4.2) |
Domain | MP Hit@1 | MP Hit@2 | MP Hit@3 | MP Hit@4 | MP Hit@5 | MRPG Hit@1 | MRPG Hit@2 | MRPG Hit@3 | MRPG Hit@4 | MRPG Hit@5 |
---|---|---|---|---|---|---|---|---|---|---|
askubuntu | 31.59 ± 0.14 | 50.89 ± 0.21 | 64.85 ± 0.14 | 74.23 ± 0.31 | 80.44 ± 0.11 | 50.86 ± 0.13 | 72.72 ± 0.1 | 80.19 ± 0.09 | 81.71 ± 0.17 | 82.94 ± 0.15 |
aviation | 30.72 ± 0.95 | 48.39 ± 0.93 | 62.05 ± 0.55 | 72.23 ± 0.66 | 77.09 ± 0.44 | 47.28 ± 0.19 | 67.37 ± 0.36 | 75.21 ± 0.47 | 76.02 ± 0.45 | 77.63 ± 0.56 |
biology | 34.66 ± 0.64 | 50.81 ± 1.01 | 63.77 ± 0.31 | 73.96 ± 0.32 | 78.96 ± 0.34 | 49.67 ± 0.7 | 68.8 ± 0.37 | 75.7 ± 0.42 | 76.51 ± 0.44 | 77.55 ± 0.41 |
chemistry | 38.83 ± 0.51 | 54.77 ± 0.36 | 65.84 ± 0.72 | 73.35 ± 0.34 | 77.66 ± 0.1 | 50.28 ± 0.15 | 69.81 ± 0.33 | 76.43 ± 0.3 | 77.23 ± 0.38 | 79.17 ± 0.45 |
cooking | 35.46 ± 0.28 | 56.17 ± 0.79 | 67.2 ± 0.82 | 76.21 ± 0.95 | 80.86 ± 0.42 | 52.43 ± 0.68 | 75.61 ± 0.15 | 82.73 ± 0.25 | 83.74 ± 0.31 | 85.18 ± 0.29 |
electronics | 28.67 ± 0.43 | 47.28 ± 0.73 | 61.78 ± 0.54 | 70.97 ± 0.24 | 77.51 ± 0.26 | 49.64 ± 0.63 | 71.4 ± 0.53 | 78.92 ± 0.46 | 80.06 ± 0.47 | 81.3 ± 0.53 |
history | 34.18 ± 1.21 | 54.16 ± 1.24 | 66.47 ± 0.97 | 76.22 ± 0.53 | 80.45 ± 0.09 | 54.07 ± 0.1 | 73.56 ± 0.62 | 79.5 ± 0.82 | 80.29 ± 0.88 | 81.23 ± 1 |
money | 51.05 ± 0.44 | 66.28 ± 0.98 | 75.43 ± 0.43 | 81.01 ± 0.24 | 84.15 ± 0.23 | 60.59 ± 0.47 | 78.89 ± 0.14 | 86.01 ± 0.4 | 86.75 ± 0.41 | 87.94 ± 0.42 |
movies | 50.06 ± 0.5 | 64.28 ± 1.53 | 73.58 ± 0.88 | 79.48 ± 0.7 | 82.91 ± 0.55 | 57.33 ± 0.44 | 78.41 ± 0.28 | 82.03 ± 0.69 | 83.05 ± 0.9 | 83.25 ± 0.99 |
music | 37.03 ± 0.41 | 57.17 ± 0.62 | 68.79 ± 0.27 | 77.76 ± 0.43 | 82.66 ± 0.26 | 53.17 ± 0.69 | 75.17 ± 0.35 | 81.46 ± 0.52 | 82.28 ± 0.49 | 83.71 ± 0.51 |
philosophy | 34.9 ± 0.8 | 52.94 ± 0.26 | 66.03 ± 0.77 | 75.46 ± 0.53 | 79.45 ± 0.2 | 53.09 ± 0.95 | 72.01 ± 0.27 | 78.03 ± 0.59 | 78.76 ± 0.52 | 79.49 ± 0.56 |
physics | 41.27 ± 0.46 | 60.63 ± 0.49 | 70.59 ± 0.18 | 77.39 ± 0.31 | 81.12 ± 0.22 | 57.96 ± 0.28 | 75.49 ± 0.28 | 83.5 ± 0.38 | 84.47 ± 0.41 | 86.34 ± 0.37 |
politics | 65.26 ± 1.76 | 73.87 ± 1.18 | 78.43 ± 0.86 | 84.04 ± 0.42 | 86.29 ± 0.25 | 71.61 ± 0.5 | 82.92 ± 0.31 | 88.97 ± 0.46 | 89.91 ± 0.42 | 90.98 ± 0.46 |
rpg | 68.68 ± 0.29 | 75.93 ± 0.31 | 79.43 ± 0.37 | 81.66 ± 0.36 | 83.31 ± 0.33 | 72.85 ± 0.4 | 81.91 ± 0.17 | 87.59 ± 0.12 | 88.28 ± 0.16 | 89.09 ± 0.16 |
scifi | 76.65 ± 0.34 | 80.64 ± 0.42 | 82.69 ± 0.36 | 84.89 ± 0.15 | 85.91 ± 0.11 | 81.99 ± 0.12 | 86.2 ± 0.21 | 90.41 ± 0.23 | 90.85 ± 0.28 | 91.53 ± 0.32 |
serverfault | 25.9 ± 0.12 | 50.99 ± 0.59 | 65.52 ± 0.25 | 75.24 ± 0.34 | 81.66 ± 0.16 | 53.05 ± 0.38 | 75.36 ± 0.18 | 82.56 ± 0.17 | 84.16 ± 0.29 | 85.82 ± 0.26 |
travel | 45.04 ± 0.51 | 63.62 ± 0.14 | 72.52 ± 0.41 | 79.96 ± 0.24 | 83.96 ± 0.12 | 58.7 ± 0.34 | 78.57 ± 0.2 | 86.75 ± 0.25 | 87.78 ± 0.28 | 89.5 ± 0.3 |