Two-Faced Humans on Twitter and Facebook:
Harvesting Social Multimedia for Human Personality Profiling

Qi Yang (ITMO University, Russia), Aleksandr Farseev (ITMO University, SUSS School of Business, SoMin.ai Research; Russia, Singapore), and Andrey Filchenkov (ITMO University, Russia)
(2021)
Abstract.

Human personality traits are key drivers behind our decision-making and influence our life path on a daily basis. Inference of personality traits, such as the Myers-Briggs Personality Type, as well as an understanding of the dependencies between personality traits and users’ behavior on various social media platforms, is of crucial importance to modern research and industry applications. The emergence of diverse and cross-purpose social media avenues makes it possible to perform user personality profiling automatically and efficiently based on data represented across multiple data modalities. However, research efforts on personality profiling from multi-source multi-modal social media data are relatively sparse, and the impact of different social network data on machine learning performance has yet to be comprehensively evaluated. Furthermore, the research community has no such dataset to benchmark against. This study is one of the first attempts at bridging this important research gap. Specifically, in this work we infer the Myers-Briggs Personality Type indicators by applying a novel multi-view fusion framework, called “PERS”, and compare the performance results not just across data modalities but also with respect to different social network data sources. Our experimental results demonstrate PERS’s ability to learn from multi-view data for personality profiling by efficiently leveraging the significantly different data arriving from diverse social multimedia sources. We have also found that the selection of a machine learning approach is of crucial importance when choosing social network data sources and that people tend to reveal multiple facets of their personality in different social media avenues. Our released social multimedia dataset facilitates future research in this direction.

User Profiling, Multimedia Retrieval, Machine Learning
Published in: Proceedings of the 2021 Workshop on Intelligent Cross-Data Analysis and Retrieval (ICDAR ’21), August 21–24, 2021, Taipei, Taiwan. Copyright: ACM, 2021. DOI: 10.1145/3463944.3469270. ISBN: 978-1-4503-8529-9/21/08. CCS concepts: Computing methodologies: Machine learning; Applied computing: Psychology.

1. Introduction

During the past decade, an increasing number of social media platforms have emerged, and such platforms now play a vital role in facilitating human interactions worldwide. Since 2012, the average daily social media screen time has increased from 60 minutes to 144 minutes (min, 2020). Furthermore, it has spiked even higher since the start of the COVID-19 outbreak (Farseev et al., 2020a), when people were locked at home and engaging with friends through social media was the only remaining option.

To maintain high user engagement rates, it is essential for social network conglomerates to position and recommend relevant content according to user interests and online behaviours. For example, extroverted people are more likely to use social media in general, as they tend to present themselves as enthusiastic and interactive and therefore form more social circles around themselves (Gil de Zúñiga et al., 2017). In contrast, introverts were found to spend significantly more time evaluating the value of each online service they use before a deeper user-service interaction may occur (Lu and Hsiao, 2010).

With such large and diverse data now available on social media, it is becoming practically impossible to manually distinguish social media users when attempting to provide them with more personalized online experiences (Farseev et al., 2018). Therefore, an automated approach to understanding human behaviour patterns on social media is in high demand (Farseev et al., 2020b). Unfortunately, personality profiling still depends heavily on manual procedures such as questionnaires and quizzes (Murray, 1990), and its cost therefore remains unacceptably high, limiting its usage in real-time online services such as social networking websites (Farseev et al., 2016).

However, automatic personality inference is also known to be a hard task (Forsey, 2020; Buraya et al., 2018), mainly due to the multi-facet nature of social media data. For example, Twitter is often used for casual daily interactions, while Facebook is nowadays perceived more as a private communication channel. As a result, Facebook’s audience demographics vary drastically from young to senior ages, while, for example, TikTok’s audience mostly consists of young individuals aged 18 to 34. Finally, social networks such as Pinterest might not just have a significant audience age shift but also tend to be largely populated by female users (Tankovska, 2021). Furthermore, one might also explore the drastic difference in behavioural traits that people exhibit across various social media avenues. For example, being one of the most open social media outlets, Twitter is known to concentrate on users’ expressions rather than their identity, encapsulating our “real me” from the broader public (Forsey, 2020). At the same time, on specialized personality-focused forums, such as PersonalityCafe (https://www.personalitycafe.com/), the communication might be more concentrated on the members’ behavioural habits, allowing for a deeper insight into one’s behaviour from the content they post. Considering such a multi-facet multi-source cross-demographic environment, the task of automatic personality profiling from social media data appears to be challenging and, not yet being widely tackled by the research community, requires a more in-depth analysis.

Despite the advantages of leveraging multiple data modalities and sources, several associated difficulties have been identified:

  • Data gathering. Data from modern social media platforms are often distributed across various Web resources and shielded behind privacy settings. It is therefore important to implement large-scale cross-source data collection techniques.

  • Data representation. As real-world social media data comes in different data modalities (e.g. text, image, video, location, etc.), the incorporation of such heterogeneous multi-modal data involves the creation of accurate and mutually compatible approaches to data representation (feature learning).

  • Data modelling. Effective data integration into a single machine learning model is a challenging task, as the data sources and data modalities often represent various aspects of human life and are therefore often very different in nature. Even worse, the high dimensionality of the multi-modal feature spaces might often lead to the so-called “curse of dimensionality” problem when processed directly, and therefore dimensionality balancing needs to be performed.

Inspired by the research gap and challenges above, in this work we raise the following three research questions. First, to establish a benchmark for multi-view personality profiling, it is important to understand: (RQ1) Is it possible to reliably and accurately infer user personality traits at a large scale in an automatic fashion? Second, to gain an understanding of the real-world applicability of our approach to modern social media scenarios, it is crucial to discover: (RQ2) Is it possible to improve personality profiling performance by leveraging multi-view social multimedia data? Third, to establish a clear path for future research on multi-source learning, it is crucial to know: (RQ3) What is the impact of social media data origin on personality profiling performance?

To answer our proposed research questions, in this study we introduce a novel multi-view personality profiling meta ensemble framework, called “PERS”, which is able to effectively profile social media user personality by leveraging multimodal multimedia data coming from multi-facet social networks. Furthermore, we introduce efficient data gathering and representation techniques, allowing for seamless processing of the data from Facebook, Twitter, and PersonalityCafe social media forums. Finally, we release the PERS dataset (PERS Multi-Source Multi-View Personality Dataset: https://pers.azurewebsites.net) to the research community, allowing for future extensive cross-disciplinary research.

The major contributions of this work are threefold. First and foremost, we have proposed a novel machine learning framework for multi-view user profiling and demonstrated that efficient personality profiling is possible and able to achieve industry-level performance for several personality attributes. Second, we have demonstrated that different social networks are different in nature, which impacts the personality profiling performance and therefore needs to be considered during the data modelling process. Third, we have released a new multi-source cross-social personality profiling dataset to be used by the research community in future studies in this direction.

2. Related works

In the past two decades, several studies have attempted to model human personality traits from a statistical perspective. First, inferred from a statistical analysis of the English lexicon, the Big Five model was proposed by Digman (Digman, 1990), where the author reveals the close relationship between human personality and written language. Inspired by this idea, Pennebaker et al. (Pennebaker and King, 1999) later laid the foundation of statistical personality profiling by introducing the LIWC word categorization scheme, which numerically bridged personality traits and written language utilization patterns.

Furthermore, several studies have been devoted to automatic personality profiling, where cross-disciplinary research groups utilized machine learning techniques for automatic human personality inference based on test-generated data (Mairesse et al., 2007; Argamon et al., 2005). It is worth noting that these studies were all based on relatively small datasets and are therefore limited in supporting large-scale observations, remaining far from being applicable in a real-world scenario. Moving forward, the “Small Data” problem was partially mitigated by the introduction of the “MyPersonality” project (Kosinski et al., 2015), the first large-scale personality-labelled dataset that includes user-generated data from Facebook, which immediately attracted the attention of the multimedia community and led to the first larger-scale studies in the social media personality profiling research direction (Kumar and Gavrilova, 2019; Gjurković and Šnajder, 2018; Tadesse et al., 2018).

These studies have made a giant leap in the field; however, most of them still lack a very important factor, which limits their real-world applicability: they are largely focused on a single data source (e.g. Facebook) or a single data modality (i.e. text), which keeps them from being applied to modern multi-source multi-view social multimedia data. Namely, the Linguistic Inquiry and Word Count (LIWC) works (Holtgraves, 2011; Sumner et al., 2012) are mostly focused on text-only data processing to predict personality by using personality-labelled word categories. At the same time, Arnoux et al. (Arnoux et al., 2017) and Tandera et al. (Tandera et al., 2017) instead utilized pre-trained Global Vectors for Word Representation (GloVe) embeddings of textual data, being the first to report the results of machine-learning-driven single-modal personality inference.

Finally, several studies have approached user profiling from a multi-modal data perspective. For example, Farseev et al. (Farseev et al., 2015) proposed a multi-modal ensemble model tackling the task of demographic profiling from multi-modal data. Furthermore, the authors extended the framework to leverage sensor data and multi-source multi-task learning for wellness profiling (Farseev and Chua, 2017a). Buraya et al. (Buraya et al., 2017) proposed to solve the problem of relationship status inference by applying “out of the box” machine learning to early-fused data from Twitter, Instagram, Facebook, and Foursquare, which achieved a significant 17% inference performance uplift compared to single-modal learning. Going further, Tsai et al. (Tsai et al., 2019) proposed a factorization method to model the intra-modal and inter-modal relationships within multi-modal data inputs, which proved the crucial role of multi-modal data incorporation in improving user profiling performance, while Buraya et al. (Buraya et al., 2018) instead leveraged the temporal component of the multi-modal data, being the first to apply deep learning to multi-view personality profiling. While being a significant contribution to the field of multi-view profiling, the above works still lack multi-source cross-social network data processing (Farseev, 2017), which limits their applicability in the majority of real-world scenarios.

As can be seen, there is significant evidence that the incorporation of multi-modal data for automatic user profiling helps achieve better prediction performance. However, when it comes to evaluating the role of social network choice for user profile learning, the existing research results remain relatively sparse. At the same time, it is reasonable to assume that various social media sources, often serving different needs of an individual, might provide data that is very diverse in nature, and therefore a more comprehensive study on the roles of different data sources for personality profiling is necessary.

3. ITMO Multi-Source Multi-View Personality Dataset

Figure 1. Target user from Twitter (a), Facebook (b), PersonalityCafe (c).
Table 1. Dataset statistics
Statistic       Twitter   Facebook  PerCafe
#User           21305     11730     3800
#Posts          8114568   2838141   621482
#Images         1865562   597164    -
#Extroversion   5013      6243      981
#Introversion   16292     5487      2819
#Sensing        2799      2300      610
#Intuition      18506     9430      3190
#Thinking       6743      3040      1666
#Feeling        14562     8690      2134
#Judging        9800      5528      1613
#Perceiving     11505     6202      2187
Table 2. Proportion of the 4 personality categories in the Twitter, Facebook, and PersonalityCafe datasets.
Category        Twitter   Facebook  PerCafe
Extroversion    23.53%    53.22%    25.82%
Introversion    76.47%    46.78%    74.18%
Sensing         13.14%    19.61%    16.05%
Intuition       86.86%    80.39%    83.95%
Thinking        31.65%    25.92%    43.84%
Feeling         68.35%    74.08%    56.16%
Judging         46.00%    47.13%    42.45%
Perceiving      54.00%    52.87%    57.55%
Table 3. Personality traits distribution
Type    PerCafe   Twitter   Facebook
INFP    713       5334      1665
INFJ    664       4177      1498
INTP    508       1121      814
INTJ    487       3544      521
ENFP    353       3496      2381
ENTP    256       122       671
ISFP    137       413       161
ISTP    127       508       131
ENTJ    113       389       412
ISTJ    98        739       162
ENFJ    96        323       1468
ISFJ    85        456       535
ESTP    50        200       63
ESFJ    43        52        666
ESFP    43        311       316
ESTJ    27        120       266

3.1. Data Acquisition

To represent human personality, in this work we use the Myers-Briggs Type Indicator (MBTI) (Myers, 1998), which has been widely adopted by the research community (Buraya et al., 2018, 2017) and splits one’s personality into 16 types, each formed by the following four binary dimensions:

  • Extroversion and Introversion (EI): this dimension determines how an individual focuses her energies and interest, whether she is influenced externally by the opinion and interpretation of others (Extroverts) or motivated by her inner thoughts (Introverts).

  • Sensing and iNtuition (SN): this aspect demonstrates how people interpret knowledge. Sensing personalities make decisions based on their five senses and solid observation, whereas Intuitive individuals favour imagination over constancy.

  • Thinking and Feeling (TF): a person with the Thinking aspect always exhibits logical behaviour in their decisions, while Feeling individuals are empathic and give priority to emotions over logic.

  • Judging and Perceiving (JP): this dichotomy describes an individual’s approach towards work, decision-making, and planning. Judging individuals are highly organized in their thoughts, while Perceivers behave more spontaneously.
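
As an illustration of how the four binary dimensions above combine into the 16 MBTI types, the following minimal sketch splits a four-letter type into its binary labels. The helper name `mbti_to_binary` and the dictionary keys are hypothetical and are not part of the released dataset or code.

```python
# Hypothetical helper: split a four-letter MBTI type into its binary dimensions.
DIMENSIONS = (("E", "I"), ("S", "N"), ("T", "F"), ("J", "P"))


def mbti_to_binary(mbti_type: str) -> dict:
    """Map e.g. 'ENTP' to {'EI': 'E', 'SN': 'N', 'TF': 'T', 'JP': 'P'}."""
    letters = mbti_type.strip().upper()
    if len(letters) != 4:
        raise ValueError(f"expected four letters, got {mbti_type!r}")
    labels = {}
    for letter, (first, second) in zip(letters, DIMENSIONS):
        if letter not in (first, second):
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
        labels[first + second] = letter
    return labels


# Enumerating one letter per dimension yields the 16 personality types.
ALL_TYPES = [a + b + c + d for a in "EI" for b in "SN" for c in "TF" for d in "JP"]
```

Each user then carries four independent binary labels, one per dimension.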

The data was collected from the Twitter, Facebook, and PersonalityCafe (https://www.personalitycafe.com/) social networks during the time interval from 1st Jan 2018 to 1st Jan 2021, via the following three steps:

1) Ground truth collection. To obtain personality ground truth from Twitter, we downloaded all tweets that contain self-reported personality-related keywords/phrases such as “I’m an ENTP” or “I am an ENTP” and extracted the personality trait from those phrases to serve as the ground truth for each user (see Figure 1(a) for an example). To harvest the Facebook ground truth, we monitored Facebook comments under personality test results released on the 16personalities portal (see Figure 1(b) for an example). Likewise, to obtain the personality-related ground truth from the PersonalityCafe forum, we downloaded users’ publicly available self-reported personality traits from their profile pages (see Figure 1(c) for an example).
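
The self-report matching on Twitter could be sketched as follows. The exact keyword list used for the dataset is not given here, so the regular expression below is an assumption that only covers the two phrasings quoted above, and `extract_self_reported_type` is a hypothetical name.

```python
import re

# Assumed pattern covering "I'm an ENTP" and "I am an ENTP" (case-insensitive).
SELF_REPORT = re.compile(
    r"\bI(?:['’]m| am)\s+an?\s+([EI][SN][TF][JP])\b", re.IGNORECASE
)


def extract_self_reported_type(tweet: str):
    """Return the self-reported MBTI type in upper case, or None if absent."""
    match = SELF_REPORT.search(tweet)
    return match.group(1).upper() if match else None
```

A tweet that merely mentions a type without a first-person statement (for example, about another person) is deliberately left unlabelled by this sketch.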

2) User-generated content (UGC) collection. To establish UGC collection from Twitter and Facebook, we have downloaded user timelines through the Twitter REST API (https://developer.twitter.com/) and the Facebook Graph API (https://developers.facebook.com/), respectively. At the same time, to collect UGC from the PersonalityCafe forum, we downloaded posts from the MBTI forum thread.

3) Data pre-processing. Since social network data might exhibit significant noise levels and often contains grammatical errors, it is necessary to perform data preprocessing prior to the data modelling stage. At the same time, it is necessary to remove direct personality mentions from the text content so that the model cannot rely on personality abbreviations in the post content at the inference stage. To mitigate the above two problems, we pre-processed our dataset via the following three steps: 1) Data filtering: to ensure a sufficient amount of data per user for training and inference, we filtered out users with fewer than 10 tweets available; 2) In-line label replacement: for all personality traits, the personality type name was replaced with the “<type>” placeholder (e.g. “ENTJ” is replaced with “<type>”); 3) Social indicator replacement: similarly to Nguyen et al. (Nguyen et al., 2020), we further converted emojis into the corresponding descriptive textual strings, removed all non-ASCII words, and normalized the text by replacing user mentions, URLs, hashtags, and date-time expressions with the corresponding placeholders: @USER, HTTPURL, HASHTAG, DATETIME.
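
A minimal sketch of these pre-processing steps is given below. The regular expressions, the case-insensitive type matching, and the function names are assumptions made for illustration; emoji conversion and DATETIME replacement are omitted because they need resources that are not described here.

```python
import re

# Assumed patterns; the actual rules used for the dataset may differ.
MBTI_TYPE = re.compile(r"\b[EI][SN][TF][JP]s?\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def keep_user(posts, min_posts=10):
    """Data filtering: keep only users with at least `min_posts` posts."""
    return len(posts) >= min_posts


def normalize_post(text: str) -> str:
    """Apply placeholder replacement and drop non-ASCII words."""
    text = URL.sub("HTTPURL", text)
    text = MENTION.sub("@USER", text)
    text = HASHTAG.sub("HASHTAG", text)
    text = MBTI_TYPE.sub("<type>", text)  # in-line label replacement
    words = [word for word in text.split() if word.isascii()]
    return " ".join(words)
```

URLs are replaced first so that characters inside a link are not mistaken for mentions or hashtags.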

Table 1 highlights the statistics of our dataset across binary personality labels, while Figure 2 visually reflects the personality label distributions. From Table 3 and Figure 2, it can be seen that INFP, INFJ, INTJ, ENTP, and INTP are the Top-5 most popular personality labels found in both the Facebook and Twitter datasets, which shows the consistency of the data distributions across general social networks and reduces the risk of falling into source-dependent bias during the data modelling stage. At the same time, it is important to note that in the PersonalityCafe forum data, the ENFP, INFP, INFJ, and ENFJ labels dominate the rest. The latter observation shows that personality-related data sources might have a distribution shift towards individuals of certain personality types (different from a general distribution) who tend to participate in such specific personality-related discussions. Therefore, the evaluation based on the PersonalityCafe dataset must be accomplished independently.

Figure 2. Proportion of personality traits in three data sources

3.2. On MBTI Personality Categorization

The MBTI personality categorization scheme defines each of the 4 binary MBTI categories to represent a different aspect of human personality. However, when combined into 16 personality types, the scheme is known to suffer from a major shortcoming: the overlap between “neighbour categories” (e.g. INTJ and INTP). Given the noisy nature of social media content, it may be preferable to predict individual binary MBTI personality traits instead of modelling the overlapping 16-category scenario. Therefore, in this work, we have adopted the binary personality categorization scheme.

3.3. Data Representation

To facilitate an effective data modelling process, the data needs to be properly represented in the form of feature vectors. Following the best practices described in the literature on user profiling (Khan et al., 2020; Rangel Pardo et al., 2018; Farseev and Chua, 2017c), we have chosen the following data representation approaches:

1) Textual Features: First, to represent the textual data at a user level, all posts of each user were concatenated into a corresponding user-specific “document”. Second, term frequency-inverse document frequency (TF-IDF) features were extracted to form the document-term matrix. Finally, we applied Latent Semantic Analysis (LSA (Halko et al., 2011)), as such a transformation has previously shown a sizable performance uplift (Daneshvar and Inkpen, 2018) when applied to user profiling. The final dimension of the compressed textual feature vector was set to 100, a value found empirically during a grid search.
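
The first two steps of the textual representation can be sketched in plain Python as follows. The weighting `tf * log(N / df)` is one common TF-IDF variant and is an assumption, as the exact variant and tokenizer are not specified; the function names are hypothetical. The subsequent LSA step (a truncated SVD of the resulting matrix down to 100 dimensions) is not shown.

```python
import math
from collections import Counter


def user_documents(posts_by_user):
    """Concatenate each user's posts into one user-level 'document'."""
    return {user: " ".join(posts) for user, posts in posts_by_user.items()}


def tfidf(documents):
    """Return {user: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {user: doc.lower().split() for user, doc in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many user documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights = {}
    for user, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights[user] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights
```

Terms that occur in every user document receive a weight of zero under this variant, which is why very common words contribute little to the representation.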

2) Visual Features: To represent visual data, we automatically mapped each photo into a distribution over 1000 ImageNet (Deng et al., 2009) image concepts via the pre-trained ResNet-101 model (He et al., 2016). We then summed up the predicted concept occurrence likelihoods for each user and normalized the obtained vector element-wise by the total number of images available from the user. In such a way, for each user, we obtained a 1000-sized image concept distribution vector. Similarly to the text modality, Principal Component Analysis (PCA (Jolliffe, 2011)) was further applied to reduce the dimensionality of the visual feature space to 200.
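
The per-user aggregation of image concept likelihoods described above can be sketched as follows. The function name `user_concept_vector` is hypothetical, and the per-image likelihood vectors are assumed to have been produced beforehand (for instance by a pre-trained ResNet-101); the PCA step is not shown.

```python
def user_concept_vector(image_probs):
    """Sum per-image concept likelihoods and divide by the number of images.

    image_probs: one equal-length list of concept likelihoods per image
    (1000 entries each in the setting above; shortened here for illustration).
    """
    if not image_probs:
        raise ValueError("user has no images")
    n_concepts = len(image_probs[0])
    summed = [0.0] * n_concepts
    for probs in image_probs:
        if len(probs) != n_concepts:
            raise ValueError("all images must share the same concept space")
        for k, p in enumerate(probs):
            summed[k] += p
    return [s / len(image_probs) for s in summed]
```

Because each per-image vector sums to one, the averaged user-level vector is again a distribution over concepts.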

4. Problem definition

4.1. Social media data

Given a user $i$, we denote the multi-view data associated with the user as a set $X$:

(1) $X=\{(y_{i},T_{i},I_{i}),\ i=1,2,\ldots,l\}$

where $l$ represents the number of samples in the dataset, $T_{i}$ is the collection of the $i$-th user’s text content represented as Textual Features, $I_{i}$ is the collection of the $i$-th user’s image content represented as Image Features, and $y_{i}$ represents one of the four binary personality trait ground truth labels of the $i$-th user.

In such a way, we can now formulate personality profiling as a user-level multi-view document classification task, where users play the role of “documents” when rephrased in standard terms.

Figure 3. Overview of the PERS framework

4.2. PERS Framework

We can now define the PERS framework as a two-step stacked generalization ensemble approach. The architecture of the PERS framework is illustrated in Figure 3.

First-Step: Given the training set $X=\{(y_{i},x_{i}),\ i=1,2,\ldots,I\}$, where $y_{i}$ is the personality label and $x_{i}$ represents the feature vector, we shuffle and uniformly split $X$ into $K$ equal subsets. In such a way, $X_{k}$ and $X^{(-k)}=X-X_{k}$ are the test and training sets, respectively, for the $k$-th fold in K-fold cross-validation.

Now, we can specify a $J$-sized list of “base” classifiers and train the $j$-th base classifier on the training set $X^{(-k)}$. We denote by $z_{ji}$ the prediction of the $j$-th classifier on $x_{i}$ for each sample in the test set of the $k$-th cross-validation fold. In such a way, immediately after the cross-validation process, we can define a new dataset based on the outputs of the $J$ “base” classifiers:

(2) $Z_{first}=\{(y_{i},z_{1i},\ldots,z_{Ji}),\ i=1,\ldots,I\}$

For each data modality, we compute $Z_{Tfirst}$ and $Z_{Ifirst}$ separately, and we then form the input $H$ for the next stage of data processing via column-wise concatenation of $Z_{Tfirst}$ and $Z_{Ifirst}$. The corresponding dimension of $H$ is therefore $n\times 2J$, where $n=I$ is the number of samples.
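
The First-Step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the PERS implementation: all function names are hypothetical, and the toy threshold learner `fit_mean_threshold` merely stands in for the actual base classifiers.

```python
import random


def kfold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def out_of_fold(features, labels, fit, k=5, seed=0):
    """Return one prediction per sample, made by a model that never saw it."""
    z = [None] * len(features)
    for fold in kfold_indices(len(features), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(features)) if i not in held_out]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        for i in fold:
            z[i] = predict(features[i])
    return z


def fit_mean_threshold(xs, ys):
    """Toy base learner: threshold halfway between class means of feature 0."""
    pos = [x[0] for x, y in zip(xs, ys) if y == 1]
    neg = [x[0] for x, y in zip(xs, ys) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > cut else 0


def build_h(text_feats, image_feats, labels, base_fits, k=5):
    """Concatenate per-modality out-of-fold predictions column-wise (n x 2J)."""
    columns = [
        out_of_fold(feats, labels, fit, k)
        for feats in (text_feats, image_feats)  # text columns, then image columns
        for fit in base_fits
    ]
    return [list(row) for row in zip(*columns)]
```

With $J$ base learners and two modalities, every row of the returned matrix holds $2J$ predictions for one user, which matches the stated dimension of $H$.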

Second-Step: As in the first step, we perform K-fold cross-validation with each of the $J$ base classifiers, now using $H$ from the First-Step as input, to obtain the output $Z_{second}$, formally defined as follows:

(3) $Z_{second}=\{(y_{i},w_{1i},\ldots,w_{Ji}),\ i=1,\ldots,I\}$

where $w_{ji}$ represents the prediction for the $i$-th sample inferred by the $j$-th base classifier. The corresponding dimension of $Z_{second}$ is therefore $n\times J$.

Finally, to obtain the final model prediction, we train a meta classifier $f$ on $Z_{second}$, which can be formulated as:

(4) $y=f(Z_{second})$

Specifically, in this study we have chosen a Support Vector Machine with a linear kernel (LinearSVM) as our meta classifier.

4.3. Base Classifiers

To maximize the performance of the PERS framework, it is of crucial importance to choose a set of suitable machine learning algorithms as base classifiers. Previous studies (Amirhosseini and Kazemian, 2020; Qi et al., 2020; Farseev et al., 2015) suggest XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), and Random Forest to be the top-choice base models for user profiling, as their performance on social media data has been reported to beat state-of-the-art baselines, often including baselines from the deep learning community.

4.3.1. Random Forest

Random forest is an ensemble learning algorithm that integrates multiple decision trees to make a prediction. For classification problems, the prediction is the vote over the predictions of all decision trees. During training, bootstrap sampling is used to form the training set of each decision tree. When training each node of each decision tree, only a subset of the features, drawn from the entire feature vector, is used. By integrating multiple decision trees, each trained on sampled data and feature subsets, the variance of the model can be effectively reduced.

4.3.2. XGBoost

XGBoost is an effective and scalable gradient boosting machine that has been widely adopted in industry in recent years. It is an ensemble model containing a set of classification and regression trees (CART). Given training data $x_{i}$ and target $y_{i}$, the XGBoost model can be defined as:

(5) $\hat{y}_{i}=\sum_{k=1}^{K}f_{k}(x_{i}),\quad f_{k}\in F$

where $K$ is the total number of trees, $f_{k}$ is the function in the functional space $F$ corresponding to the $k$-th tree, and $F$ is the set of all possible CARTs.

4.3.3. LightGBM

LightGBM is an improved version of Gradient Boosting Machines that mitigates the “optimal division point search” problem, which arises from the increasing computational complexity on larger datasets. The problem is solved via the following two tricks, which reduce the training data size and the data dimensionality:

Gradient-based One-Side Sampling (GOSS): Excludes most of the samples with small gradients and only uses the remaining samples to calculate the information gain.

Exclusive Feature Bundling (EFB): Bundles mutually exclusive features, as they rarely take non-zero values at the same time.
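
To make the GOSS idea concrete, the sketch below keeps the samples with the largest absolute gradients and draws a random subset of the rest, re-weighting the latter by $(1-a)/b$ as described in the LightGBM paper (Ke et al., 2017). The function name, the default rates, and the fixed seed are assumptions for illustration and do not reflect LightGBM’s internal code.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (index, weight) pairs chosen by Gradient-based One-Side Sampling."""
    n = len(gradients)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Randomly keep a fraction of the small-gradient samples.
    sampled = random.Random(seed).sample(rest, min(n_other, len(rest)))
    # Amplify the sampled small-gradient instances to compensate for the drop.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]
```

With the default rates, only 30% of the samples take part in the information gain computation for a given tree.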

5. Evaluation

To answer our research questions, we have evaluated the performance of the PERS framework trained across different data sources and data modality combinations. For all experiments, the dataset was uniformly split into a training set and a test set with a ratio of 85:15, maintaining the original personality label distributions. To understand the impact of different modalities, data sources, and fusion strategies on the final performance of the model, we have selected the following community-adopted personality profiling baselines (Buraya et al., 2017; Farseev et al., 2015; Rangel et al., 2015):

  • Independently-trained base classifiers (see descriptions in Section 4.3) with respect to each data modality;

  • Early fusion. Base classifiers trained on the early-fused data modality representations (concatenated feature vectors);

  • Early Fusion (PCA 200). Base classifiers trained on the early-fused data modality representations with PCA applied after the vector concatenation (the PCA dimension of 200 has been selected empirically via a grid search).
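
The 85:15 split that maintains the label distribution, mentioned at the start of this section, can be sketched as a per-label (stratified) split. The function name and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Split sample indices into train/test while keeping each label's share."""
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every label for the test set.
        n_test = round(test_fraction * len(indices))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Splitting each label separately keeps the minority class (for example, Sensing on Twitter) represented in the test set in the same proportion as in the full data.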

5.1. Evaluation Metrics

Due to the imbalanced distribution of personality labels in our datasets (see Section 3.1 for details on the data distributions), we have adopted the $F_{1macro}$ metric (Farseev et al., 2015) for performance evaluation: the harmonic mean of precision and recall is computed per label and then averaged across all labels. The $F_{1macro}$ metric is formally defined as:

(6) $F_{1macro}=\frac{1}{Q}\sum_{j=1}^{Q}\frac{2\times p_{j}\times r_{j}}{p_{j}+r_{j}}$

where $Q$ is the number of labels, and $p_{j}$ and $r_{j}$ are the precision and recall for the label $\lambda_{j}$, computed from the predictions $\lambda_{j}\in h(x_{i})$ and the ground truth $\lambda_{j}\in y_{i}$.
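
A direct transcription of Equation (6) into Python is sketched below. Setting a label’s score to zero when its precision and recall are both zero is an assumed convention that the definition above does not state.

```python
def f1_macro(y_true, y_pred):
    """Average the per-label F1 scores over all labels (Equation 6)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # assumed convention for undefined F1
    return sum(scores) / len(scores)
```

A classifier that always predicts the majority label scores zero on the minority label, so its macro average stays low even when its accuracy is high, which is why the metric suits imbalanced labels.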

We have further adopted the Matthews correlation coefficient metric (Matthews, 1975) (Mcor), as it incorporates both true and false positives and negatives and is generally regarded as a “balancing” measure that can be used even if the classes are of very different sizes. The Mcor metric is formally defined as:

(7) $Mcor=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(FN+TN)(FP+TN)(TP+FN)}}$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
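
Equation (7) can likewise be transcribed directly. Returning zero when the denominator vanishes is a commonly used convention and an assumption here, since the definition above leaves that case open.

```python
import math


def mcor(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts (Equation 7)."""
    denominator = math.sqrt((tp + fp) * (fn + tn) * (fp + tn) * (tp + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator
```

A perfect classifier reaches 1, while one that always predicts a single class falls under the zero-denominator convention and scores 0.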

We prioritize the $F_{1macro}$ score as our main evaluation metric, while the Mcor score plays an auxiliary role in deciding between models when the differences in $F_{1macro}$ are marginal.

Table 4. Evaluation of the “PERS” framework trained on independent modalities and modality combinations on Twitter and Facebook. Each cell reports $F_{1macro}$/Mcor. Green text indicates the best performance and red the worst.
Model     Twitter                                     Facebook
          EI         SN         TF         JP         EI         SN         TF         JP
Text
XGBoost   0.80/0.62  0.48/0.08  0.59/0.23  0.61/0.22  0.62/0.25  0.59/0.22  0.55/0.16  0.58/0.17
RF        0.78/0.56  0.56/0.14  0.61/0.22  0.62/0.25  0.61/0.21  0.62/0.26  0.58/0.17  0.58/0.16
LGBM      0.80/0.62  0.56/0.14  0.61/0.24  0.62/0.25  0.62/0.24  0.62/0.24  0.57/0.13  0.58/0.16
Image
XGBoost   0.46/0.01  0.47/0.03  0.51/0.01  0.54/0.09  0.59/0.19  0.47/0.05  0.49/0.07  0.56/0.14
RF        0.54/0.09  0.52/0.06  0.57/0.14  0.56/0.12  0.59/0.18  0.54/0.08  0.57/0.15  0.57/0.14
LGBM      0.52/0.05  0.52/0.04  0.56/0.11  0.57/0.13  0.59/0.18  0.55/0.11  0.57/0.14  0.56/0.12
Early Fusion
XGBoost   0.80/0.62  0.48/0.06  0.59/0.22  0.59/0.19  0.60/0.21  0.56/0.22  0.54/0.16  0.58/0.17
RF        0.78/0.56  0.54/0.10  0.62/0.24  0.62/0.24  0.64/0.27  0.62/0.24  0.60/0.20  0.57/0.15
LGBM      0.80/0.61  0.55/0.11  0.62/0.25  0.62/0.24  0.62/0.24  0.61/0.22  0.60/0.21  0.60/0.20
Early Fusion (PCA 200)
XGBoost   0.79/0.61  0.48/0.07  0.59/0.23  0.60/0.20  0.57/0.14  0.47/0.04  0.50/0.07  0.55/0.10
RF        0.78/0.56  0.57/0.14  0.63/0.26  0.62/0.24  0.60/0.21  0.54/0.08  0.58/0.16  0.56/0.13
LGBM      0.80/0.60  0.56/0.11  0.62/0.25  0.62/0.24  0.59/0.18  0.52/0.04  0.56/0.12  0.56/0.13
PERS Trained with Single Modality
Text      0.81/0.62  0.55/0.14  0.62/0.26  0.63/0.28  0.62/0.23  0.62/0.28  0.59/0.21  0.58/0.16
Image     0.53/0.11  0.47/0.05  0.57/0.16  0.59/0.17  0.59/0.17  0.50/0.10  0.56/0.17  0.56/0.13
PERS Trained with Dual Modalities
T+I       0.82/0.61  0.54/0.12  0.63/0.26  0.64/0.28  0.64/0.28  0.63/0.30  0.61/0.23  0.62/0.21

Table 5. Evaluation of the "PERS" framework trained on the text modality on PersonalityCafe.
Text ($F_{1macro}$/Mcor)
Model EI SN TF JP
LGBM 0.69/0.39 0.74/0.50 0.80/0.61 0.73/0.47
XGBoost 0.69/0.41 0.69/0.42 0.79/0.60 0.73/0.46
RF 0.65/0.29 0.67/0.33 0.74/0.53 0.69/0.36
PERS 0.71/0.43 0.74/0.51 0.81/0.61 0.74/0.49

5.2. Evaluation Across MBTI Categories

To judge the applicability of the PERS framework in a real-world scenario, we have evaluated the limits of PERS’s performance across the Twitter, Facebook, and PersonalityCafe datasets. The evaluation results are presented in Table 4 and Table 5.

From the table, it can be seen that, when trained on the multi-view data from Twitter, the PERS framework achieves an industry-level performance of 0.82 $F_{1macro}$ score when predicting the Extroversion-Introversion (EI) personality trait. While the performance obtained for the other three personality categories is significantly lower (by 0.18 $F_{1macro}$ score for Judging-Perceiving (JP) and up to 0.28 $F_{1macro}$ score for Sensing-Intuition (SN)), we believe that such promising performance for the EI label testifies to the potential of multi-view social media data for psychographic discovery and personality profiling. The superiority for the EI label can, at the same time, be explained by the natural difference between these two personality categories when it comes to user communication on social platforms: Extroverts are known to be much more open to others, while Introverts are the opposite, being more selective and making decisions at a slightly more conservative pace. Such encouraging results allow us to give a positive answer to our RQ1 and may open up a wide range of new research directions related to personality profiling and multi-view learning.
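The per-label gaps quoted above can be reproduced from the dual-modality (T+I) Twitter row of Table 4. The snippet is only a worked check of the arithmetic; the variable names are illustrative.

```python
# F1-macro of PERS trained on text + image (T+I) Twitter data, from Table 4.
twitter_dual = {"EI": 0.82, "SN": 0.54, "TF": 0.63, "JP": 0.64}

# Gap between the EI score and each of the other three labels.
gaps = {
    label: round(twitter_dual["EI"] - score, 2)
    for label, score in twitter_dual.items()
    if label != "EI"
}
print(gaps)
```

The smallest gap is for JP and the largest is for SN, which matches the range stated in the text.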

But what about the other two datasets? An interesting finding comes from the results presented in Table 5, where PERS demonstrates strong performance on the PersonalityCafe dataset, showing the best overall $F_{1macro}$ scores when predicting all 4 binary MBTI categories. Such a result can be explained by the specific nature of the PersonalityCafe dataset, where users purposely reveal their behavioural differences and are therefore often biased towards particular social behaviour concepts. Such results also reinforce our positive answer to RQ1 and allow us to conclude that the nature of a data source and the social network usage patterns are indeed of crucial importance when solving the multi-view cross-media personality profiling problem.

5.3. Evaluation Across Different Modalities

First, we have investigated the contribution of different data modalities to personality profiling performance and how well they can be integrated. An interesting observation comes from the cross-modal experimental results presented in Table 4: when trained on the Twitter and Facebook datasets, the PERS framework performed 2% better than the other single-source baselines for all binary labels except SN. Another interesting observation can be made from the modality combination results: when trained on both textual and image data, PERS outperforms by more than 1% not only the single-modal classifiers but also the early-fused baselines.

The above findings suggest that introducing multi-modality into user profiling can boost model performance. This observation could be explained by the richness of visual data in reflecting user preferences, which makes it a beneficial supplement to the textual data modality at the data modeling stage. This finding supports a positive answer to our RQ2 by emphasizing the important role of multi-modal learning in personality profiling applications.

Finally, let us also highlight an interesting observation that comes from the single-modal evaluation results (see Table 4). When learning from a single modality, "PERS" trained on the text modality performs best across all personality labels, with a margin over the image modality ranging from 0.02 to 0.28 $F_{1macro}$ score. This can be explained by the quantitative domination of textual data over the visual modality (see Table 1). Another potential reason behind this trend could be the high level of noise in user-generated visual data, where the images are less strict in terms of perspective and object positioning as compared to professional photos. Moreover, such visual content often includes objects that might not directly reflect the semantics of the data and therefore might not accurately represent an author’s personality. This hypothesis also aligns well with our chosen visual data representation approach, where the ImageNet concept distribution might simply be too general for personality profiling tasks, as opposed to, for example, demographic profiling (Farseev et al., 2015).
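The 0.02 to 0.28 range can be checked against the "PERS Trained with Single Modality" rows of Table 4. The snippet below is a worked check only; the dictionary layout is an illustrative choice.

```python
LABELS = ("EI", "SN", "TF", "JP")

# F1-macro of PERS trained on a single modality, from Table 4.
pers_text = {
    "Twitter": (0.81, 0.55, 0.62, 0.63),
    "Facebook": (0.62, 0.62, 0.59, 0.58),
}
pers_image = {
    "Twitter": (0.53, 0.47, 0.57, 0.59),
    "Facebook": (0.59, 0.50, 0.56, 0.56),
}

# Text-over-image margin for every (source, label) pair.
margins = {
    (source, label): round(text - image, 2)
    for source in pers_text
    for label, text, image in zip(LABELS, pers_text[source], pers_image[source])
}
smallest = min(margins.values())
largest = max(margins.values())
print(smallest, largest)
```

The largest margin comes from the EI label on Twitter and the smallest from the JP label on Facebook.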

5.4. Evaluation Across Different Sources

Lastly, let’s examine the impact of the social media data origin on user personality profiling performance, so that an industry guideline can be established for future research.

As the textual modality is present in all three data sources, let’s describe the PERS performance on textual data first. From Table 4 and Table 5 it can be noticed that the PERS framework trained on the Twitter dataset outperformed its counterparts trained on the Facebook and PersonalityCafe datasets by more than 0.19 $F_{1macro}$ score in predicting the EI label. In contrast, when it comes to the SN label, Twitter-trained PERS was not able to outperform the Facebook and PersonalityCafe data, staying behind by 0.2 and 0.11 $F_{1macro}$ score, respectively. Finally, the PERS performance on the TF and JP labels based on Twitter textual data was found to be better than on Facebook by 0.03 and 0.05 $F_{1macro}$ score, respectively, but considerably worse than on PersonalityCafe by 0.19 and 0.11 $F_{1macro}$ score, respectively.

The superiority of Twitter in predicting the EI label could be explained by the differences in the "energy" source for Extroverted and Introverted personality types. Specifically, according to Martin (Martin, 1997), Extroverts prefer to source their life energy from active involvement in events and engagement in different activities, while Introverts often prefer doing things alone, obtaining their energy from dealing with the ideas, pictures, memories, and reactions that are inside their mind. Similarly, in the digital world, it can be seen that on Twitter both personality types are able to express themselves, fulfilling both their enjoyment (ENJ) and observation/learning (LEN) needs, while on Facebook the ENJ factor is fulfilled for a proportionally smaller number of individuals, affecting the overall user base distribution (Syn and Oh, 2015). Correspondingly, Twitter and PersonalityCafe data are diverse enough to differentiate the EI personalities, achieving higher prediction scores as compared to Facebook-based prediction. The observation is also supported by our data distribution (see Section 3.1), where the Twitter and PersonalityCafe datasets are clearly skewed towards Introverts, providing more data for PERS to learn how these personality types direct their energy and make decisions. The latter aspect is important, as it is known that Extroverts might generate substantially more UGC than Introverts (Syn and Oh, 2015), and therefore a sufficient amount of Introvert-crafted content is crucial for mutually consistent and comprehensive learning from the data.

At the same time, an inverse picture can be noticed for the SN label results, and there is a "low-hanging fruit" explanation of the phenomenon: for both Twitter and Facebook, the SN personality is distributed with a clear shift towards the Intuitive personality type, while being short on Sensing individuals. Despite reflecting the real-life distribution, this data property also entails a possible technical issue of insufficient variance, limiting the model’s ability to differentiate between Sensing and Intuitive user personas. Considering that on Facebook and PersonalityCafe there were more Sensing personality types identified, it is reasonable to assume that this is also the reason why PERS performed better on these two sources than on Twitter. Finally, a more "sensing" Facebook can also be explained by the fact that Facebook is nowadays mainly treated as a communication tool, so that people land there to fulfill their daily communication needs, while Twitter more often serves as a source of inspiration, attracting even more Intuitive individuals into its nets (Syn and Oh, 2015).
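To illustrate why a strong class skew can cap the achievable scores, the sketch below evaluates a trivial classifier that always predicts the majority class. The 90/10 Intuitive/Sensing split is a hypothetical ratio chosen for illustration; the actual distributions are those reported in Section 3.1, and the helper functions are not part of PERS.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical skewed sample: 90 Intuitive (N) users and 10 Sensing (S) users.
y_true = ["N"] * 90 + ["S"] * 10
# A degenerate model that ignores its input and always answers "N".
y_pred = ["N"] * 100

baseline = f1_macro(y_true, y_pred)
print(f"majority-class F1-macro = {baseline:.3f}")
```

Under this assumed split the degenerate model scores about 0.47 $F_{1macro}$ while never identifying a single Sensing user, which shows how little headroom a skewed label leaves between a trivial predictor and a modest one.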

Now, it’s time to compare the visual modality performance. The first thing that might capture a reader’s attention is that the image data modality performed similarly on the TF and JP labels for both the Twitter and Facebook sources, while Facebook performed better on the EI and SN labels, with an uplift of 0.06 and 0.03 $F_{1macro}$ score, respectively. As described earlier, both of these personality categories are very different in the way they direct energy and perceive the external world (Martin, 1997), and therefore the data diversity introduced by incorporating the visual modality is of crucial importance for the PERS performance. As Twitter is a "less visual" data source as compared to Facebook, and its data distributions are also less balanced (as discussed above) for both personality labels, it is reasonable to assume that these two factors might explain the superiority of Facebook over Twitter on visual data in our particular case.

Finally, it is worth noting that PERS trained only on the textual data from the PersonalityCafe forum outperforms the results obtained from both Twitter and Facebook data by at least 0.1 $F_{1macro}$ score on the SN, TF, and JP labels, while Twitter remains ahead on the EI label, as discussed above. Such a finding can be explained by the precise focus of the PersonalityCafe forum on the personality topic, which provides additional meaningful data descriptors that PERS can use to improve its personality inference score.

Backed by all the observations above, we can now answer RQ3 by highlighting the drastic difference between social media data sources when they are used for automated personality profiling, which is dictated by the way different personalities engage in social network activities.

6. Limitations and Future Work

Although PERS outperforms baselines for all binary personality inference tasks, when the predicted labels are combined, they might often mismatch the user’s actual full MBTI personality type, and therefore only binary personality predictions (such as the EI prediction for the Twitter dataset) can be leveraged in real-world settings.

Therefore, it is evident that new data source-specific multi-view learning approaches need to be implemented (Farseev and Chua, 2017c; Farseev et al., 2017), where personality profiling models leverage more multi-view data representations (such as avatars (Gao et al., 2013), sensor data (Farseev and Chua, 2017b), etc.) and mitigate the specific issues arising from the difference in communication styles across different social avenues. The development of such models and their application to content generation or recommendation services will be the focus of our future research.
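The gap between per-label predictions and the full four-letter type noted in this section can be sketched as follows. The helper names are hypothetical, and the per-axis accuracy of 0.8 together with the independence assumption are illustrative simplifications, not figures or claims from our experiments.

```python
AXES = ("EI", "SN", "TF", "JP")


def assemble_type(binary_preds):
    """Join four binary predictions into one MBTI type string.

    binary_preds: dict mapping axis name -> predicted letter,
    for example {"EI": "I", "SN": "N", "TF": "T", "JP": "J"}.
    """
    return "".join(binary_preds[axis] for axis in AXES)


def exact_match_rate(true_types, predicted_types):
    """Share of users whose full four-letter type is predicted correctly."""
    hits = sum(t == p for t, p in zip(true_types, predicted_types))
    return hits / len(true_types)


# Assumption: if every axis were predicted correctly with probability 0.8,
# independently of the others, the full type would be correct this often.
per_axis_accuracy = 0.8
joint_accuracy = per_axis_accuracy ** len(AXES)
print(f"full-type accuracy under independence = {joint_accuracy:.4f}")
```

Even with a reasonably accurate classifier on each axis, fewer than half of the assembled types would be fully correct under these assumptions, which is why errors compound when the binary predictions are combined.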

7. Conclusions

In this work, we presented a study on automated human personality profiling across multiple data modalities and social networking sites, such as Facebook, Twitter, and PersonalityCafe. Our proposed personality profiling framework, called "PERS", has demonstrated superior, industry-level performance over single-source and multi-source baselines. Via our cross-social evaluation, we have also shown that different social networking platforms exhibit distinct user communication and usage patterns, which in turn affect user profiling model performance and need to be treated with care for datasets with skewed distributions. Finally, to facilitate future research in this direction, we have released our new large-scale cross-social multi-view personality profiling dataset and supplemented it with the corresponding statistics and analytics for community use.

8. Acknowledgement

This work is financially supported by the National Center for Cognitive Research of ITMO University. This work was supported by the Ministry of Science and Higher Education of the Russian Federation, research project no. 075-03-2020-139/2 (goszadanie no. 2019-1339).

This research is also supported by the Enterprise Singapore “Startup SG Tech POV” Grant Scheme under the SOMIN PTE LTD project.

References

  • min (2020) 2020. Daily time spent on social networking by internet users worldwide from 2012 to 2020. Retrieved Feb 1, 2021 from https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/
  • Amirhosseini and Kazemian (2020) Mohammad Hossein Amirhosseini and Hassan Kazemian. 2020. Machine Learning Approach to Personality Type Prediction Based on the Myers–Briggs Type Indicator®. Multimodal Technologies and Interaction 4, 1 (2020). https://doi.org/10.3390/mti4010009
  • Argamon et al. (2005) Shlomo Argamon, Sushant Dhawle, Moshe Koppel, and James W. Pennebaker. 2005. Lexical Predictors Of Personality Type. In Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of North America.
  • Arnoux et al. (2017) Pierre-Hadrien Arnoux, Anbang Xu, Neil Boyette, Jalal Mahmud, Rama Akkiraju, and Vibha Sinha. 2017. 25 Tweets to Know You: A New Model to Predict Personality with Social Media. Proceedings of the International AAAI Conference on Web and Social Media 11, 1 (May 2017). https://ojs.aaai.org/index.php/ICWSM/article/view/14963
  • Buraya et al. (2018) Kseniya Buraya, Aleksandr Farseev, and Andrey Filchenkov. 2018. Multi-view personality profiling based on longitudinal data. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 15–27.
  • Buraya et al. (2017) Kseniya Buraya, Aleksandr Farseev, Andrey Filchenkov, and Tat-Seng Chua. 2017. Towards user personality profiling from multiple social networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
  • Daneshvar and Inkpen (2018) Saman Daneshvar and Diana Inkpen. 2018. Gender Identification in Twitter using N-grams and LSA: Notebook for PAN at CLEF 2018. In CEUR Workshop Proceedings, Vol. 2125. http://ceur-ws.org/Vol-2125/paper_213.pdf
  • Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  • Digman (1990) J M Digman. 1990. Personality Structure: Emergence of the Five-Factor Model. Annual Review of Psychology 41, 1 (1990), 417–440. https://doi.org/10.1146/annurev.ps.41.020190.002221
  • Farseev (2017) Aleksandr Farseev. 2017. 360° user profile learning from multiple social networks for wellness and urban mobility applications. Ph.D. Dissertation. National University of Singapore (Singapore).
  • Farseev et al. (2020a) Aleksandr Farseev, Yu-Yi Chu-Farseeva, Yang Qi, and Daron Benjamin Loo. 2020a. Understanding economic and health factors impacting the spread of COVID-19 disease. medRxiv (2020).
  • Farseev and Chua (2017a) Aleksandr Farseev and Tat-Seng Chua. 2017a. Tweet can be Fit: Integrating Data from Wearable Sensors and Multiple Social Networks for Wellness Profile Learning. (2017).
  • Farseev and Chua (2017b) Aleksandr Farseev and Tat-Seng Chua. 2017b. Tweet can be fit: Integrating data from wearable sensors and multiple social networks for wellness profile learning. ACM Transactions on Information Systems (TOIS) 35, 4 (2017), 1–34.
  • Farseev and Chua (2017c) Aleksandr Farseev and Tat-Seng Chua. 2017c. TweetFit: Fusing Multiple Social Media and Sensor Data for Wellness Profile Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. AAAI.
  • Farseev et al. (2018) Aleksandr Farseev, Kirill Lepikhin, Hendrik Schwartz, Eu Khoon Ang, and Kenny Powar. 2018. SoMin. ai: Social multimedia influencer discovery marketplace. In Proceedings of the 26th ACM international conference on Multimedia. 1234–1236.
  • Farseev et al. (2015) Aleksandr Farseev, Liqiang Nie, Mohammad Akbari, and Tat-Seng Chua. 2015. Harvesting multiple sources for user profile learning: a big data study. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 235–242.
  • Farseev et al. (2016) Aleksandr Farseev, Ivan Samborskii, and Tat-Seng Chua. 2016. bBridge: A Big Data Platform for Social Multimedia Analytics. In Proceedings of the 24th ACM international conference on Multimedia. ACM.
  • Farseev et al. (2017) Aleksandr Farseev, Ivan Samborskii, Andrey Filchenkov, and Tat-Seng Chua. 2017. Cross-domain recommendation via clustering on multi-layer graphs. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 195–204.
  • Farseev et al. (2020b) Aleksandr Farseev, Qi Yang, Andrey Filchenkov, Kirill Lepikhin, Yu-Yi Chu-Farseeva, and Daron-Benjamin Loo. 2020b. SoMin.ai: Personality-Driven Content Generation Platform. (2020). https://doi.org/10.1145/3437963.3441714 arXiv:abs/2002.01726
  • Forsey (2020) Caroline Forsey. 2020. Twitter, Facebook, or Instagram? Which Platform(s) You Should Be On. Retrieved Feb 20, 2021 from https://blog.hubspot.com/marketing/twitter-vs-facebook
  • Gao et al. (2013) Rui Gao, Bibo Hao, Shuotian Bai, Lin Li, Ang Li, and Tingshao Zhu. 2013. Improving user profile with personality traits predicted from social media content. In Proceedings of the 7th ACM conference on recommender systems. 355–358.
  • Gil de Zúñiga et al. (2017) Homero Gil de Zúñiga, Trevor Diehl, Brigitte Huber, and James Liu. 2017. Personality Traits and Social Media Use in 20 Countries: How Personality Relates to Frequency of Social Media Use, Social Media News Use, and Social Media Use for Social Interaction. Cyberpsychology, Behavior, and Social Networking 20 (09 2017), 540–552. https://doi.org/10.1089/cyber.2017.0295
  • Gjurković and Šnajder (2018) Matej Gjurković and Jan Šnajder. 2018. Reddit: A Gold Mine for Personality Prediction. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media. Association for Computational Linguistics, New Orleans, Louisiana, USA, 87–97. https://doi.org/10.18653/v1/W18-1112
  • Halko et al. (2011) N. Halko, P. G. Martinsson, and J. A. Tropp. 2011. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Rev. 53, 2 (2011), 217–288. https://doi.org/10.1137/090771806
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
  • Holtgraves (2011) Thomas Holtgraves. 2011. Text messaging, personality, and the social context. Journal of Research in Personality 45, 1 (2011), 92–99. https://doi.org/10.1016/j.jrp.2010.11.015
  • Jolliffe (2011) Ian Jolliffe. 2011. Principal Component Analysis. Springer Berlin Heidelberg, Berlin, Heidelberg, 1094–1096. https://doi.org/10.1007/978-3-642-04898-2_455
  • Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
  • Khan et al. (2020) Alam Sher Khan, Hussain Ahmad, Muhammad Zubair Asghar, Furqan Khan Saddozai, Areeba Arif, and Hassan Ali Khalid. 2020. Personality Classification from Online Text using Machine Learning Approach. International Journal of Advanced Computer Science and Applications 11, 3 (2020). https://doi.org/10.14569/IJACSA.2020.0110358
  • Kosinski et al. (2015) Michal Kosinski, Sandra Matz, Samuel Gosling, Vesselin Popov, and David Stillwell. 2015. Facebook as a Research Tool for the Social Sciences. The American psychologist 70 (09 2015), 543–556. https://doi.org/10.1037/a0039210
  • Kumar and Gavrilova (2019) K. N. P. Kumar and M. L. Gavrilova. 2019. Personality Traits Classification on Twitter. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). 1–8. https://doi.org/10.1109/AVSS.2019.8909839
  • Lu and Hsiao (2010) Hsi-Peng Lu and Kuo-Lun Hsiao. 2010. The influence of extro/introversion on the intention to pay for social networking sites. Information and Management 47, 3 (2010), 150 – 157. https://doi.org/10.1016/j.im.2010.01.003
  • Mairesse et al. (2007) François Mairesse, Marilyn A. Walker, Matthias R. Mehl, and Roger K. Moore. 2007. Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. J. Artif. Int. Res. 30, 1 (Nov. 2007), 457–500.
  • Martin (1997) CR Martin. 1997. Looking at Type: The Fundamentals. Gainesville: Center for Application of Psychological Type.
  • Matthews (1975) B.W. Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405, 2 (1975), 442–451. https://doi.org/10.1016/0005-2795(75)90109-9
  • Murray (1990) John B. Murray. 1990. Review of Research on the Myers-Briggs Type Indicator. Perceptual and Motor Skills 70, 3_suppl (1990), 1187–1202. https://doi.org/10.2466/pms.1990.70.3c.1187
  • Myers (1998) I.B. Myers. 1998. MBTI Manual: A Guide to the Development and Use of the Myers-Briggs Type Indicator. Consulting Psychologists Press.
  • Nguyen et al. (2020) Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English Tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 9–14.
  • Pennebaker and King (1999) J. Pennebaker and L. King. 1999. Linguistic styles: language use as an individual difference. Journal of personality and social psychology 77 6 (1999), 1296–312.
  • Qi et al. (2020) Yang Qi, Farseev Aleksandr, and Filchenkov Andrey. 2020. I Know Where You Are Coming From: On the Impact of Social Media Sources on AI Model Performance (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence 34, 10 (Apr. 2020), 13971–13972. https://doi.org/10.1609/aaai.v34i10.7258
  • Rangel et al. (2015) Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd Author Profiling Task at PAN 2015. In CLEF. sn, 2015.
  • Rangel Pardo et al. (2018) Francisco Manuel Rangel Pardo, Manuel Montes-y-Gómez, Martin Potthast, and Benno Stein. 2018. Overview of the 6th Author Profiling Task at PAN 2018. In CLEF 2018 Evaluation Labs and Workshop. CEUR-WS.org. http://ceur-ws.org/Vol-2125/
  • Sumner et al. (2012) C. Sumner, A. Byers, R. Boochever, and G. J. Park. 2012. Predicting Dark Triad Personality Traits from Twitter Usage and a Linguistic Analysis of Tweets. In 2012 11th International Conference on Machine Learning and Applications, Vol. 2. 386–393. https://doi.org/10.1109/ICMLA.2012.218
  • Syn and Oh (2015) Sue Yeon Syn and Sanghee Oh. 2015. Why do social network site users share information on Facebook and Twitter? Journal of Information Science 41, 5 (2015), 553–569.
  • Tadesse et al. (2018) Michael M. Tadesse, Hongfei Lin, Bo Xu, and Liang Yang. 2018. Personality Predictions Based on User Behavior on the Facebook Social Media Platform. IEEE Access 6 (2018), 61959–61969. https://doi.org/10.1109/ACCESS.2018.2876502
  • Tandera et al. (2017) Tommy Tandera, Hendro, Derwin Suhartono, Rini Wongso, and Yen Lina Prasetio. 2017. Personality Prediction System from Facebook Users. Procedia Computer Science 116 (2017), 604–611. https://doi.org/10.1016/j.procs.2017.10.016 Discovery and innovation of computer science technology in artificial intelligence era: The 2nd International Conference on Computer Science and Computational Intelligence (ICCSCI 2017).
  • Tankovska (2021) H. Tankovska. 2021. Distribution of Pinterest users worldwide as of January 2021, by gender. Retrieved Feb 20, 2021 from https://www.statista.com/statistics/248168/gender-distribution-of-pinterest-users/
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Learning Factorized Multimodal Representations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=rygqqsA9KX