Adverse Media Mining for KYC and ESG Compliance
Abstract.
In recent years, institutions operating in the global market economy have faced growing exposure to non-financial risk factors, such as cyber, third-party, and reputational risks, which now outweigh the traditional risks of credit and liquidity. Adverse media (negative news) screening is crucial for identifying such non-financial risks. Typical screening tools are not real-time, involve manual searches, and require labor-intensive monitoring of information sources. Moreover, they are costly to keep up to date with complex regulatory requirements and the institution’s evolving risk appetite.
In this extended abstract, we present an automated system that conducts both real-time and batch searches of adverse media for users’ queries (person or organization entities) using news and other open-source, unstructured sources of information. Our scalable, machine-learning-driven approach to high-precision adverse news filtering is based on four perspectives: relevance to risk domains, search query (entity) relevance, adverse sentiment analysis, and risk encoding. Using model evaluations and case studies, we summarize the performance of our deployed application.
1. Introduction
In today’s uncertain geopolitical and social environment, global institutions face growing challenges to their risk management processes arising from Non-Financial Risk (NFR) factors. These non-financial risks include, but are not limited to, conduct, cyber, country, compliance, third-party, and ESG (Environmental, Social and Corporate Governance) risks. Inadequate compliance and screening controls cost top banking institutions and other non-financial firms millions of dollars in fines in 2018 and 2019 alone. For example, US Bancorp, fined for lax anti-money laundering controls in 2018 (Schroeder, 2018), agreed to a $613 million (USD) settlement with US regulators. It had failed to report suspicious banking activities carried out from 2011 to 2013 by a long-time customer, Scott Tucker, the owner of several payday lending businesses.
With an increased focus on ESG and other regulatory expectations, institutions have realized the importance of integrating adverse news monitoring into their frameworks for managing NFRs. Adverse media screening involves the inspection of news and other third-party data sources for potential indicators of negative news associated with an entity (person or company). Adverse media mining uses open-source indicators (publicly available information) as essential early warning signals. Recent studies (Barry et al., 2019; Ji et al., 2017) found that Wells Fargo’s reputation plummeted after regulators announced the bank’s financial fraud. However, Glassdoor reviews had signaled that the bank had a problem with corporate ethics before the fraud was made public.
The critical challenges for such a screening process include (1) the sheer diversity and volume of publicly available information, and (2) accurate entity matching and relevance to negative news. These challenges make manual monitoring inadequate and may cause lapses in timely access to NFR-related information. Motivated by this assessment, we outline an automated adverse news screening and monitoring solution. The contributions of our work are:
(1) A fast, automated adverse media mining application. We showcase a system that scales to high-volume, diverse unstructured data sources and provides both real-time and batch entity searches for negative news.
(2) A high-precision adverse news filtering pipeline. We develop a novel pipeline to assess the quality of filtered media by its relevance to the risk domains, target entity, and risk attributes (categories and stages of risk).
(3) A searchable database of adverse media profiles. We propose a representation of risk profiles to better characterize, search, and retrieve adverse entities.

2. System Overview
We briefly outline the key components of our adverse media mining system, as shown in Fig. 1.
a. News Retrieval & Ingest. For each new user query (entity), we query a Lucene-powered news database using full-text search to fetch all news articles containing the entity mention. For every query, we cache the articles and track the search period over which the query was issued. This lets the system determine which new news data must be fetched for repeat user queries. All news data is stored in an Elasticsearch database.
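The search-period tracking described above can be illustrated with a minimal sketch. It assumes that repeat queries only extend the search period forward in time; `QueryCache`, `missing_period`, and the `fetch` callback are hypothetical names introduced for illustration and are not part of the deployed system, and the in-memory dictionaries merely stand in for the article store.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, Dict, List, Optional, Tuple

Period = Tuple[date, date]  # inclusive (start, end) search period


@dataclass
class QueryCache:
    """Hypothetical per-entity cache of fetched articles and the period already searched."""
    covered: Dict[str, Period] = field(default_factory=dict)
    articles: Dict[str, List[str]] = field(default_factory=dict)

    def missing_period(self, entity: str, start: date, end: date) -> Optional[Period]:
        """Return the part of (start, end) not yet searched, or None if fully cached."""
        if entity not in self.covered:
            return (start, end)
        _, covered_end = self.covered[entity]
        if end <= covered_end:
            return None
        # Assumption: repeat queries only move the end date forward.
        return (covered_end + timedelta(days=1), end)

    def search(self, entity: str, start: date, end: date,
               fetch: Callable[[str, date, date], List[str]]) -> List[str]:
        """Fetch only the uncovered period, then return all cached articles."""
        gap = self.missing_period(entity, start, end)
        if gap is not None:
            self.articles.setdefault(entity, []).extend(fetch(entity, *gap))
            first_start = self.covered.get(entity, (start, end))[0]
            self.covered[entity] = (first_start, end)
        return self.articles.get(entity, [])
```

In this sketch, a second query for the same entity with a later end date triggers a fetch only for the days after the previously covered period, and a query that falls entirely inside the covered period is served from the cache.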
b. Distributed Computing Infrastructure. To process a search query, we use a Celery-based distributed computing system. We have implemented a “data-funnelling” pattern of tasks, where each task reduces the number of articles the next task in the pipeline receives. Each task operates on a single article and is scheduled using asynchronous work queues. We operate a cluster of 18 compute nodes with a master task scheduler that uses an in-memory key-value data store for bookkeeping.
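The funnelling idea can be sketched in a few lines. This is an in-process illustration only: plain function calls stand in for the asynchronous, queue-scheduled tasks, and the two stages (`risk_stage`, `entity_stage`) are hypothetical keyword checks standing in for the actual models.

```python
from typing import Callable, Dict, Iterable, List, Optional

Article = Dict[str, object]
# A stage inspects one article and returns it to keep it, or None to drop it.
Stage = Callable[[Article], Optional[Article]]


def run_funnel(articles: Iterable[Article], stages: List[Stage]) -> List[Article]:
    """Apply stages in order; each stage only sees articles that survived the previous one."""
    survivors = list(articles)
    for stage in stages:
        next_batch = []
        for article in survivors:  # each call operates on a single article
            kept = stage(article)
            if kept is not None:
                next_batch.append(kept)
        survivors = next_batch
    return survivors


def risk_stage(article: Article) -> Optional[Article]:
    """Hypothetical stand-in for a risk-relevance filter (toy keyword check)."""
    return article if "fraud" in str(article["text"]).lower() else None


def entity_stage(entity: str) -> Stage:
    """Hypothetical stand-in for an entity-relevance filter (toy substring check)."""
    def stage(article: Article) -> Optional[Article]:
        return article if entity.lower() in str(article["text"]).lower() else None
    return stage


if __name__ == "__main__":
    docs = [
        {"id": 1, "text": "Acme charged with fraud"},
        {"id": 2, "text": "Acme opens new office"},
        {"id": 3, "text": "Fraud case at Globex"},
    ]
    print([d["id"] for d in run_funnel(docs, [risk_stage, entity_stage("Acme")])])
```

Ordering cheap, high-rejection stages first means later and costlier stages receive fewer articles, which is the point of the pattern.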
c. Model Pipeline. There are five primary models in our pipeline.
(1) Risk Relevance: a binary relevance classifier, a support vector machine trained with supervision on 2,500 articles. This model classifies each article as in-domain (risk-related) or out-of-domain. We achieved an F1 score of 0.81 on an 80/20 train-test split.
(2) Adverse Scoring: a heuristic sentiment-scoring model that relies on the Loughran-McDonald (Loughran and McDonald, 2011) financial sentiment dictionary to compute an adversity score for each article. We additionally group, and weight differently, the subsets of keywords that are negative and related to the legal domain.
(3) Entity Relevance: an entity (person or organization) is considered relevant to an article if it is mentioned in connection with the risk domain (compliance). We manually tagged 1,200 articles to train a supervised binary logistic regression model on bag-of-words features from the contexts around entity mentions, which were extracted using FlairNLP (Akbik et al., 2018). This model achieved an F1 score of 0.8 on an 80/20 train-test split.
(4) Risk Categorization: this step infers the risk categories of an article, ahead of stage classification. As shown in Fig. 2, we curated a list of fine-grained, compliance-relevant categories across seven risk types. Using weakly supervised labels for over 9 million news documents, we trained a CNN-based (Kim, 2014) text classifier for the multi-label classification task. The initial set of 65 categories was expanded using sense2vec (Trask et al., 2015) to query our internal news database further. Our top-3 categorical accuracy was 0.86 on a 70/30 train-test split.
(5) Risk Stage Identification: we identified five different stages of criminal proceedings that typical compliance events might evolve through. Using a similar weak-supervision approach, we trained a multi-class classifier using XGBoost (Chen and Guestrin, 2016), with which we achieved an F1 score of 0.93 on a 70/30 train-test split.
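As an illustration of the Adverse Scoring heuristic in (2), the following minimal sketch computes a length-normalized, weighted keyword score. The word lists are toy placeholders and not the contents of the Loughran-McDonald dictionary; the group weights, the normalization by token count, and the function name `adversity_score` are all assumptions made for this example, since the exact formula is not specified above.

```python
import re
from typing import Set

# Toy placeholder word lists (assumption): stand-ins for dictionary-derived keyword groups.
NEGATIVE_GENERAL: Set[str] = {"fraud", "loss", "violation", "penalty"}
NEGATIVE_LEGAL: Set[str] = {"indicted", "lawsuit", "settlement", "plaintiff"}

# Assumed weights: legal-domain negative keywords count more than general ones.
WEIGHT_GENERAL = 1.0
WEIGHT_LEGAL = 2.0


def adversity_score(text: str) -> float:
    """Weighted count of negative keywords, normalized by the number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    general_hits = sum(1 for t in tokens if t in NEGATIVE_GENERAL)
    legal_hits = sum(1 for t in tokens if t in NEGATIVE_LEGAL)
    return (WEIGHT_GENERAL * general_hits + WEIGHT_LEGAL * legal_hits) / len(tokens)


if __name__ == "__main__":
    print(adversity_score("The firm was indicted for fraud"))
    print(adversity_score("The firm opened a new office"))
```

Articles could then be ranked or thresholded on this score; where such a threshold sits would depend on the institution's risk appetite.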


3. Discussion and Conclusion
Compliance-related risks are continuously evolving. Adverse news essentially provides a snapshot in time that can help profile such non-financial risk. For example, Fig. 3 shows a risk profile for MyPayRollHR, a US-based payroll company charged with fraud, visualized using compliance risk categories and stages. From such a profile, we can provide actionable insights matched to the user's risk appetite. This system is deployed in production in Compliance Catalyst (van Dijk, 2020), a KYC monitoring tool.
References
- Akbik et al. (2018) Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics. 1638–1649.
- Barry et al. (2019) Desiree Barry, Sean Brown, Carolyn Ann Geason, Ryan Melehan, Allyson MacDonald, Lauren Rosano, and Pedro Henrique Santos. 2019. Measuring Culture in Leading Companies. (2019).
- Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
- Ji et al. (2017) Yuan Ji, Oded Rozenbaum, and Kyle T Welch. 2017. Corporate culture and financial reporting risk: Looking through the glassdoor. Available at SSRN 2945745 (2017).
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
- Loughran and McDonald (2011) Tim Loughran and Bill McDonald. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance 66, 1 (2011), 35–65.
- Schroeder (2018) Pete Schroeder. 2018. U.S. Bancorp to pay $613 million for money-laundering violations. (2018). https://www.reuters.com/article/us-usa-usbancorp/u-s-bancorp-to-pay-613-million-for-money-laundering-violations-idUSKCN1FZ1YJ
- Trask et al. (2015) Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. arXiv preprint arXiv:1511.06388 (2015).
- van Dijk (2020) Bureau van Dijk. 2020. Compliance Catalyst: Risk Management Platform. https://www.bvdinfo.com/en-us/our-products/catalyst/compliance-catalyst