Temporal Knowledge Distillation for Time-Sensitive Financial Services Applications
Abstract.
Detecting anomalies has become an increasingly critical function in the financial services industry. Anomaly detection is frequently used in key compliance and risk functions such as financial crime detection, fraud, and cybersecurity. The dynamic nature of the underlying data patterns, especially in adversarial environments like fraud detection, poses serious challenges to machine learning models. Keeping up with the rapid changes by retraining the models on the latest data patterns creates pressure to balance historical and current patterns while managing the training data size. Furthermore, model retraining times raise problems in time-sensitive and high-volume deployment systems, where the retraining period directly impacts the model’s ability to respond to ongoing attacks in a timely manner. In this study, we propose a temporal knowledge distillation-based label augmentation approach (TKD), which utilizes the learning from older models to rapidly boost the latest model and effectively reduces model retraining times to achieve improved agility. Experimental results show that the proposed approach provides advantages in retraining times while improving model performance.
1. Introduction
Machine learning and artificial intelligence solutions have been widely used in the financial services industry. The applications of AI/ML range from trading to consumer and business credit decisions, financial crime detection, mobile banking, and authentication systems (Forum, 2020). These use cases also span almost all critical business functions from business operations to risk management (Heaton et al., 2016; Oprea et al., 2019). Overall, AI and machine learning solutions have a significant impact on the performance and operations of today’s financial services firms.
Financial firms frequently use machine learning to detect anomalies in cybersecurity applications, fraud detection, compliance, and financial crime detection (Anandakrishnan et al., 2018). Among these is a sizable list of mission-critical applications, each of which requires effective and timely detection of, and prompt response to, anomalous events (Anandakrishnan et al., 2017).
The number of models used in high-volume and time-sensitive applications is growing due to the recent trends in mobile payments and digital banking (Council, 2019). In the past few years, mobile payment systems have experienced unprecedented growth (Zelle, 2021). Zelle transaction volumes increased by 61% year over year as of 2021 (Heun, 2021), (Systems, 2019). As the high-speed, high-volume nature of digital transactions attracts financial crime organizations, digital payments fraud and cybercrime have rapidly become top challenges for financial firms. In digital payments, mobile P2P fraud increased by 733% from 2016 to 2019. Similarly, account takeover (ATO) fraud grew by 72% in a single year, from 2018 to 2019 (Reports, 2020), and rose over 200% in 2021 according to industry reports (Shein, 2021).
One of the grand challenges machine learning models face in these use cases is the dynamic and adversarial nature of the detection process. Unlike the relatively stable datasets used in other application areas of AI/ML, fraud and anomaly detection systems experience frequent and rapid pattern changes (Marfaing and Garcia, 2018), (Buehler, 2019). The pattern changes occur both in (i) normal events, as in changes in non-fraud transactions and customer behavior, and in (ii) anomalous events, where perpetrators implement new fraud tactics. As an example, in account takeover fraud (characterized by perpetrators gaining access to customers’ accounts and draining the funds across multiple channels (Council, 2019), (Harrison, 2020)), fraud tactics are known to change rapidly. In some instances, the patterns shift from one popular tactic to the next in a matter of hours. This raises serious concerns for industry-standard machine learning solutions that rely on supervised learning. In payment fraud detection systems, which process and score millions of transactions every day with millisecond response-time SLAs, the underlying modeling challenges become even more pronounced (Hasham et al., 2019), (Shepard et al., 2019), (Chatain et al., 2011).
In high-volume and dynamic application environments, such as payment fraud detection, models face serious dilemmas:
• Data Size & Balance: Retraining the models with the new patterns is a common approach to improve model performance on recent data. However, it often degrades performance on historical patterns that recur, while excluding the historical patterns causes retention challenges. Yet, continuously extending the training data set with additional data causes size and training time issues (A.Shah, 2020). This highlights the need to balance historical and recent data while preventing uncontrolled growth of the training data. Similarly, in anomaly detection cases, model retraining aims to (i) capture all anomalous patterns and (ii) balance the anomalous/normal event percentages in the training data to achieve the highest model performance. The combination of both goals often pushes machine learning models toward larger data sizes to deal with highly dynamic environments.
• Training Times & Model Agility: Similarly, the amount of time required to retrain the models is a serious challenge in the deployment of time-sensitive production models. Models need to be rapidly updated and moved back into production systems to prevent further financial and cybersecurity damage. In response, numerous techniques have been explored to reduce training times: hardware acceleration using graphics processing units (GPUs) (Zhang et al., 2017), (Shepovalov and Akella, 2020), field-programmable gate arrays (FPGAs) (Zhang et al., 2015), (T Wang et al., 2018), memory optimizations (Eleftheriou et al., 2019), vectorization, and other techniques (Daghaghi et al., 2021), (N et al., 2016), (Chen et al., 2018). While the hardware and system optimizations provide benefits, optimizing the models themselves for training time is equally important in improving the overall solution.
This paper explores a novel supervised learning approach to tackle these key challenges by providing a way to update models faster, while balancing the current and historical datasets in high-volume and time-sensitive financial services applications. The proposed technique, Temporal Knowledge Distillation (TKD), transfers knowledge from the historical data to boost the model performance through data label augmentation. It aims to balance the data to enhance the model’s robustness. Further, it improves the training time for agile response in adversarial use cases, such as fraud detection and account takeover. The paper explores a fraud detection use case to analyze the effectiveness of the proposed approach. However, the approach can be broadly applied to a wide range of financial services applications due to the prominence of high-volume and time-sensitive applications in the financial services systems.
The paper is organized as follows: Section 2 discusses the related work in knowledge distillation; Section 3 outlines the TKD temporal knowledge distillation approach; Section 4 describes the experimental analysis setup; Section 5 overviews the experimental results; finally Section 6 discusses the conclusions.
2. Related Work
The concept of Knowledge Distillation (KD) has been explored by a number of researchers (Buciluǎ et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015; Urban et al., 2016; Furlanello et al., 2018). Initially, the goal of KD was to produce a compact student model that retains the performance of a more complex teacher model that takes up more space and/or requires more computation to make predictions. Dark Knowledge (Hinton et al., 2015), the information carried in the softened softmax distribution of the teacher model, was first proposed to guide the student model.
Recently, the focus of this line of research has shifted from model compression to label augmentation, which can be considered a form of regularization using Dark Knowledge. In (Furlanello et al., 2018), a chain of retrained models parameterized identically to their teachers, the Born Again Network (BAN), was proposed. The final ensemble of all trained models can outperform the teacher network on computer vision and NLP tasks. Additionally, (Furlanello et al., 2018) investigated the importance of each loss term to quantify the contribution of Dark Knowledge to the success of KD.
Following this direction of research, self-distillation has emerged as a technique to improve the classification performance of the teacher model rather than merely mitigating the computational or deployment burden. Label Refinery (Bagherinezhad et al., 2018) iteratively updates the ground truth labels of cropped images across the entire dataset, generating a set of informative, collective, and dynamic labels from which a more robust model can be learned. In another related study, (Romero et al., 2014) aimed to compress models by approximating the mapping between hidden layers of the teacher and student models, using linear projection layers to train relatively narrower students.
KD has already shown success in many applications across a wide range of domains. In (Yang et al., 2019), the authors proposed distilling the essential knowledge of an ensemble of models into a single model that needs much less computation to deploy for building a speech recognition system. Similarly, KD has been utilized in the computer vision domain: object detection (Chen et al., 2017), deepfake detection (Kim et al., 2021), and image super-resolution (Zhang et al., 2021). (Sanh et al., 2019) proposed a method to pre-train a smaller general-purpose language representation model, called DistilBERT, based on Knowledge Distillation, which can reduce the size of a large BERT NLP model by 40%. In the emerging field of federated learning, KD has started to be used to tackle challenges such as user heterogeneity (Zhu et al., 2021) and expensive model payloads during communication (Seo et al., 2020). Besides the traditional response-based distillation, feature-based distillation (Chen et al., 2020) has recently gained attention from researchers, attempting to leverage knowledge from the intermediate layers in addition to the output layer (Gou et al., 2020). As KD shifts its focus from model compression to domain adaptation, it has shown its potential in handling heterogeneous data sources. Our work is motivated by this trend and shares some conceptual similarities with (Farhadi and Yang, 2019), as both attempt to distill knowledge temporally. However, (Farhadi and Yang, 2019) utilizes temporal correlations across video frames, while the work proposed in this paper mainly tackles the dynamic nature of the anomaly detection use case.
The proposed TKD label augmentation incorporates Dark Knowledge from previously trained models, which have been trained on different time ranges, to augment the labels of the latest dataset. This knowledge enables the transfer of learning from historical patterns extracted by these experienced experts. With the assistance of their expertise, the new model sees performance improvements without including the historical datasets in its training. This enables more effective detection of anomalous events and streamlines model retraining and deployment in time-sensitive scenarios.
3. Temporal Knowledge Distillation
Consider the classical classification setting with a sequence of training datasets $D_1, D_2, \ldots, D_t$ corresponding to different time frames, each consisting of feature vectors $X_i$ and labels $y_i$, i.e., $D_i = (X_i, y_i)$. For traditional supervised learning algorithms, a model $M_t$ is trained on the cumulative data $\bigcup_{i \le t} D_i$ for each time frame $t$. Naturally, the size of this training set increases as time passes. Instead of including the historical data in the training directly, TKD leverages the outputs generated by the models trained prior to each time frame. These outputs are used to augment the labels of the latest dataset and to construct a regularizer added to the conventional loss function. For time frame $t$, the training dataset is only $D_t$, and the general form of the loss function to optimize becomes:

$$\mathcal{L}_t = \alpha \, \mathcal{L}_{CE}\big(y_t, M_t(X_t)\big) + (1-\alpha)\, \mathcal{L}_{KL}\Big(F\big(M_1(X_t), \ldots, M_{t-1}(X_t)\big),\, M_t(X_t)\Big) \quad (1)$$

where $M_i(X_t)$ (for $i < t$) represents the output of a previously trained model on the data $X_t$ and $M_t(X_t)$ is the output of the model being trained at the current time frame. $\mathcal{L}_{CE}$ and $\mathcal{L}_{KL}$ denote the Cross-Entropy loss and the Kullback–Leibler (KL) divergence, respectively. With the second term in the loss function Eq. 1, the existing ground truth labels are augmented by the experienced experts. As is typical in KD, the KL divergence minimizes the discrepancy between the (softened) outputs of the teacher models and the student model.

$F(\cdot)$ serves as an aggregating function that collects a ‘vote’ from each experienced expert. There are many options for this aggregating function (e.g., the mean, the maximum, or a weighted combination of the experts’ outputs); in this work, the choice of $F$ was made after some empirical experiments. The coefficient $\alpha$ balances the label augmentation term and the cross-entropy on the new data. As $\alpha$ gets closer to one, the training of the target model depends more on the ground-truth labels of the new data, and the approach gradually converges to classic classification without any regularizer. For simplicity, $\alpha$ is set to 0.5 in this paper.

As the number of models increases over time, the older models, whose underlying training data patterns have changed, provide increasingly less meaningful information about the recent anomaly patterns. Thus, including them in the training may not provide further performance gains and may even deteriorate performance. To reduce the negative impact of this distribution shift, we use a parameter $k$ to determine which model to start with and truncate all models older than $M_{t-k}$. In this study, we used an empirical approach to determine $k$. With these specific setups, the loss function simplifies to:

$$\mathcal{L}_t = \tfrac{1}{2}\, \mathcal{L}_{CE}\big(y_t, M_t(X_t)\big) + \tfrac{1}{2}\, \mathcal{L}_{KL}\Big(F\big(M_{t-k}(X_t), \ldots, M_{t-1}(X_t)\big),\, M_t(X_t)\Big) \quad (2)$$
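To make Eq. 2 concrete, the following is a minimal PyTorch-style sketch of the TKD loss, assuming the mean is used as the aggregating function $F$, a softmax temperature of 1, and teacher outputs that are precomputed class probabilities; the function and variable names are illustrative, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F_nn  # aliased to avoid clashing with the aggregator F

def tkd_loss(student_logits, labels, teacher_probs_list, alpha=0.5):
    """Sketch of the TKD loss in Eq. 2.

    student_logits:     (N, C) raw outputs of the model being trained, M_t(X_t)
    labels:             (N,)   ground-truth labels y_t of the latest time frame
    teacher_probs_list: list of (N, C) probability outputs M_{t-k}(X_t), ..., M_{t-1}(X_t)
    alpha:              weight between ground truth and label augmentation (0.5 in the paper)
    """
    # Standard supervised term on the latest labels.
    ce = F_nn.cross_entropy(student_logits, labels)

    # Aggregate the historical experts' "votes" (mean used here as F).
    teacher_probs = torch.stack(teacher_probs_list, dim=0).mean(dim=0)

    # KL divergence between the aggregated teacher distribution and the student distribution.
    kl = F_nn.kl_div(F_nn.log_softmax(student_logits, dim=1),
                     teacher_probs, reduction="batchmean")

    return alpha * ce + (1.0 - alpha) * kl
```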
[Figure 1: Architecture of the TKD label augmentation approach; auxiliary soft labels from previous models are highlighted in orange.]
Fig. 1 illustrates the architecture of TKD. For the first time frame $t=1$, a model $M_1$ is trained on dataset $D_1$ alone. Then, for each of the following time frames (depending on the specific retraining schedule), a new, identically parameterized model is trained under the supervision of the previous models using Eq. 2. Auxiliary soft labels (outputs) from the previous models are highlighted in orange in Fig. 1. Note that the figure depicts the case in which all available historical models supervise the latest one. When the number of historical models becomes large, truncation can be applied based on prior knowledge of each historical model.
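The retraining procedure illustrated in Fig. 1 can then be outlined as a simple loop over time frames. This is an illustrative sketch that reuses the `tkd_loss` function above; the model constructor, optimizer choice, and truncation window `k` are assumptions rather than the authors' production setup.

```python
def train_tkd_sequence(datasets, build_model, k=3, epochs=10):
    """Train one model per time frame, distilling from up to k previous models.

    datasets:    list of (X_t, y_t) tensors, one pair per time frame
    build_model: callable returning a freshly initialized, identically parameterized model
    """
    history = []  # previously trained "experienced experts"
    for X_t, y_t in datasets:
        model = build_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

        # Teacher outputs are computed once per time frame (no gradients needed).
        with torch.no_grad():
            teacher_probs = [torch.softmax(m(X_t), dim=1) for m in history[-k:]]

        for _ in range(epochs):
            optimizer.zero_grad()
            logits = model(X_t)
            if teacher_probs:                       # later frames: TKD loss (Eq. 2)
                loss = tkd_loss(logits, y_t, teacher_probs, alpha=0.5)
            else:                                   # first frame: plain cross-entropy
                loss = F_nn.cross_entropy(logits, y_t)
            loss.backward()
            optimizer.step()

        history.append(model.eval())
    return history
```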
4. Experimental Analysis
4.1. Dataset
In order to evaluate the effectiveness of TKD in time-sensitive anomaly detection applications, we use a fraud detection use case. This section describes the experimental setup for TKD using an open-source anomaly detection dataset (IEEE Computational Intelligence Society, 2019) based on telecommunications industry card-not-present digital payment transactions. As in almost all anomaly detection problems, the fraud class in this data set makes up a very small portion of the total transactions. For the experimental analysis, we extracted 6 months of data with labels included. The first day of this data set is assumed to be November 1st, 2017 (Timeframe Analysis, 2019). The assumed start date is only used to facilitate data segmentation and does not impact the model performance. Details of the fraud/non-fraud sample distribution of the experimental dataset can be found in Table 1, and a sketch of the month-based segmentation follows the table.
Table 1. Monthly non-fraud and fraud transaction counts in the experimental dataset.

| Month | Nonfraud # | Fraud # |
|---|---|---|
| Nov-17 | 130,937 | 3,401 |
| Dec-17 | 88,821 | 3,689 |
| Jan-18 | 95,398 | 3,939 |
| Feb-18 | 84,785 | 3,571 |
| Mar-18 | 83,723 | 2,949 |
| Apr-18 | 83,577 | 2,995 |
| Total | 567,241 | 20,544 |
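For reproducibility, the month-based segmentation referenced above can be sketched as follows, assuming (per the cited Kaggle timeframe analysis) that TransactionDT is a second-level offset anchored to November 1st, 2017; the column and file names follow the public Kaggle dataset, and the snippet is illustrative rather than the authors' exact preprocessing code.

```python
import pandas as pd

# Load the labeled transactions of the IEEE-CIS fraud detection dataset.
df = pd.read_csv("train_transaction.csv")

# TransactionDT is an offset in seconds; anchor it to the assumed start date.
start = pd.Timestamp("2017-11-01")
df["timestamp"] = start + pd.to_timedelta(df["TransactionDT"], unit="s")

# Segment the data into monthly time frames (Nov-17 ... Apr-18).
df["month"] = df["timestamp"].dt.to_period("M")
monthly_frames = {str(m): g for m, g in df.groupby("month")}

# Per-month non-fraud / fraud counts for comparison with Table 1.
print(df.groupby("month")["isFraud"].value_counts())
```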
4.2. Experimental Setup
The model training period begins with November 2017 and gradually incorporates additional months to approximate an adversarial fraud detection environment. A one-month delay policy is assumed for data labeling; this accounts for the fraud claim submission process, which is a hybrid of digital and manual steps, as well as the labeling of the reported transactions. Accordingly, the testing period spans January through April 2018 for the first experiment period and shrinks by one month in each subsequent period. Table 2 shows further details of the experiment periods. In total, four experiment periods are used for the experimental analysis. For each period, TKD trains the model with the most recent month of data only (highlighted in bold in Table 2). Note that Period 0 does not have any pre-trained model to run TKD.
Table 2. Training, no-label, and testing periods for each experiment period (the most recent, bold month of each training period is the only data used for TKD training).

| Period # | Training Period | No-label Period | Testing Period |
|---|---|---|---|
| 0 | **Nov.** | Dec. | Jan. + Feb. + Mar. + Apr. |
| 1 | Nov. + **Dec.** | Jan. | Feb. + Mar. + Apr. |
| 2 | Nov. + Dec. + **Jan.** | Feb. | Mar. + Apr. |
| 3 | Nov. + Dec. + Jan. + **Feb.** | Mar. | Apr. |
Prior to model training, categorical features are encoded using one-hot encoding, and a log10 transformation is applied to continuous variables to limit their value ranges. Further details on data preprocessing can be found in Table 4 in the Appendix. For performance comparisons, the Area Under the Precision-Recall Curve (AUPRC or AUC-PR) was selected as the primary metric. AUPRC has been shown to be a stronger metric for performance and class separation than the traditional Area Under the Receiver Operating Characteristic curve (AUROC or AUC-ROC) in highly imbalanced binary classification problems such as anomaly detection use cases (Davis and Goadrich, 2006; Saito and Rehmsmeier, 2015).
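The preprocessing and evaluation steps described above can be sketched as follows. The column lists, the log10(1 + x) variant of the transform, and the scikit-learn average-precision estimate of AUPRC are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

def preprocess(df, categorical_cols, continuous_cols):
    """One-hot encode categorical features and log10-transform continuous ones."""
    out = pd.get_dummies(df[categorical_cols].fillna("NA"))
    for col in continuous_cols:
        # log10(1 + x) keeps zero-valued entries finite while limiting the range.
        out[col] = np.log10(1.0 + df[col].fillna(0.0))
    return out

def auprc(y_true, y_score):
    """Area under the precision-recall curve, estimated via average precision."""
    return average_precision_score(y_true, y_score)
```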
4.3. Algorithm Comparison
In order to cover the model types commonly used in fraud detection, both neural networks and decision tree-based algorithms were included in the experimental analysis (Shen and Kurshan, 2020). Similarly, ensemble techniques (Zhou, 2012) have been widely reported in many classification and fraud detection applications (Forough and Momtazi, 2021). Therefore, an ensemble of a neural network and a tree-based model was also incorporated in the experimental analysis. Altogether, the analysis compares the performance of the baseline neural network, tree-based, and ensemble models along with their corresponding TKD versions:
(i) MLP: A Multi-Layer Perceptron trained on labeled data to serve as the baseline. Implementation details of the MLP are provided in Table 5 in the Appendix.
(ii) XG: XGBoost, commonly used for its high efficiency and performance (Chen and Guestrin, 2016), has been widely applied to tabular data (from Kaggle competitions to industrial applications). The specific set of hyper-parameters for this study was determined using grid search and is provided in Table 6 in the Appendix.
(iii) MLP-XG: An ensemble of the baseline XGBoost and MLP models, obtained by averaging the outputs of both models.
(iv) MLP-TKD: Label-augmented version of the Multi-Layer Perceptron model through Temporal Knowledge Distillation (TKD) using historical ensemble models.
(v) XG-TKD: Label-augmented version of XGBoost through TKD using historical ensemble models.
(vi) MLP-XG-TKD: Label-augmented version of the MLP-XG ensemble approach through TKD using historical ensemble models.
It is worth highlighting that although most of the related work focuses on distilling knowledge from one neural network to another, having both MLP and XG allows the study to explore whether knowledge can also be transferred between two heterogeneous architectures (e.g., a neural network and a tree-based model). Moreover, with the historical (teacher) models being ensembles of heterogeneous architectures, the study examines whether the student network can still benefit from such a knowledge source.
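As a concrete illustration of item (iii) above, the MLP-XG ensemble simply averages the class probabilities produced by the two heterogeneous models; the averaged output of a historical ensemble is also what serves as the teacher signal for the *-TKD variants. The function below is a minimal sketch with illustrative names.

```python
import numpy as np

def ensemble_probs(mlp_probs, xg_probs):
    """MLP-XG ensemble: average the per-class probabilities of the two models.

    mlp_probs, xg_probs: (N, 2) arrays of class probabilities produced by the MLP
    and XGBoost models for the same transactions.
    """
    return (np.asarray(mlp_probs) + np.asarray(xg_probs)) / 2.0
```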
A supervised binary classification approach was used to train the fraud detection classifiers, and each algorithm was run 10 times for each training time period. The AUPRC value of each run was collected, and the average of the 10 values is reported as the final performance measure.
5. Experimental Results
This section presents the performance comparisons between the baseline machine learning models and their TKD counterparts. Results are reported both for a single test period and for individual consecutive testing months, as described in Section 4.2. Next, we analyze the model training time over the entire 6-month experiment period to show the training time characteristics of TKD compared to the baseline.
5.1. Model Performance
As shown in Table 2, experiment Period 3, which has the highest number of months in its training period and the most historical models for TKD, was used as the primary test period. Note that the focus is to evaluate TKD's effectiveness rather than to find the best-performing model. Therefore, the three baseline models (MLP, XG, MLP-XG) were paired with their TKD versions, and AUPRC comparisons were made between regular training and TKD training.
[Figure 2: AUPRC of each baseline model and its TKD counterpart over experiment Period 3.]
Table 3. Absolute AUPRC gain (ΔAUPRC) of the TKD versions over the baselines and the corresponding relative improvement (Period 3).

| Model | ΔAUPRC | Relative Improvement (%) |
|---|---|---|
| MLP | 0.0402 | 28.59% |
| XG | 0.3097 | 26.39% |
| MLP-XG | 0.0435 | 28.98% |
Fig. 2 shows the AUPRC for each model candidate listed in Section 4.3 over Period 3. For each pair, the TKD version produces a significantly higher AUPRC and hence higher fraud detection performance. The metric ΔAUPRC, defined as the absolute gain in AUPRC of the TKD version over the baseline, is presented in Table 3. In addition, the relative percentage improvement is also shown to better quantify the performance gain from TKD. Consistent with Fig. 2, the TKD method improves the model performance over the baseline regardless of the model type.
To assess TKD’s impact on model stability over time, Period 1 was used as the primary time period, as it provides three months of testing and one historical model (from Period 0) to enable TKD. It is important to note that AUPRC cannot be compared across datasets, since they might have different positive sample ratios. Hence, in this experiment, ΔAUPRC was used as the performance metric for each testing month from February to April 2018 in Fig. 3.
[Figure 3: ΔAUPRC of the TKD versions over the baselines for each testing month of Period 1 (Feb.–Apr. 2018).]
[Figure 4: ΔAUPRC of the TKD versions over the baselines for each testing month of Period 2.]
Overall, Fig. 3 shows that TKD produces the highest performance gain in April 2018, the last month of the testing period. This indicates that TKD provides durable robustness, which is of interest for stabilizing fraud detection model performance over time in a dynamic production environment. Interestingly, even though the ΔAUPRC values are quite close for the three model pairs, the ensemble of different model types appears to benefit the most. Similar observations can be made for Period 2, although it only has two months of testing; see Fig. 4 for the Period 2 results.
5.2. Model Training Time
As discussed in Section 1, reactive model retraining in dynamic environments introduces significant challenges in terms of data size and the time to retrain the model. The pressure to push the retrained model back into production to stop ongoing attacks and crime patterns translates into two critical goals: (i) reducing retraining times and (ii) using well-balanced data that covers historical and current patterns without overgrowing the training data size (which, in turn, affects goal (i) as well).
Due to the class imbalance, fraud detection models use over-sampling of fraud cases and under-sampling of non-fraud transactions to achieve a target fraud/non-fraud composition. In order to achieve the highest-performing model, all fraud cases are used in model training; this, in turn, dictates the number of non-fraud cases to be used. As a result, with emerging fraud patterns, training data sizes grow in practice under reactive retraining.
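A minimal sketch of this sampling step is shown below; the target fraud ratio and function name are hypothetical values for illustration, not taken from the paper.

```python
import pandas as pd

def balance_training_data(df, label_col="isFraud", target_fraud_ratio=0.10, seed=42):
    """Keep all fraud cases and under-sample non-fraud rows to a target composition."""
    fraud = df[df[label_col] == 1]  # all fraud cases are retained
    # Number of non-fraud rows implied by the target fraud/non-fraud composition.
    n_nonfraud = int(len(fraud) * (1 - target_fraud_ratio) / target_fraud_ratio)
    nonfraud = df[df[label_col] == 0].sample(n=n_nonfraud, random_state=seed)
    # Shuffle the combined training set.
    return pd.concat([fraud, nonfraud]).sample(frac=1.0, random_state=seed)
```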
Most modern machine learning algorithms (including neural networks and tree-based models) are trained in a batch optimization fashion, in which additional training data generally translates to improved performance (Devries and Taylor, 2017; Şahin and Steedman, 2018). Therefore, training times are typically longer for the higher-performing models that use larger datasets.
Based on the performance analysis, the MLP-XG ensemble and MLP-XG-TKD were identified as the highest-performing models. Fig. 5 shows the average model training time in seconds for MLP-XG and MLP-XG-TKD over 10 repeated runs from November 2017 to April 2018. A machine with an Intel(R) Core(TM) i7-6700HQ CPU at 2.6 GHz, 16 GB RAM, and an NVIDIA GTX 960M GPU was used for this comparison.
[Figure 5: Average training time in seconds for MLP-XG and MLP-XG-TKD, Nov-17 through Apr-18.]
MLP-XG was trained with cumulative time periods of data (as in Table 2), while TKD training only included the latest month without any historical data. An extended time range up to Apr-18 was used to better illustrate the training time benefits over time. Since TKD transfers the knowledge hidden in the historical models without explicitly training on historical data, its average training time depends only on the size of the latest dataset. In contrast, traditional supervised learning techniques, including both MLP and XG, require all historical data in their training, which leads to super-linear increases in training time. Fig. 5 shows the average training times for MLP-XG (blue) and MLP-XG-TKD (red). Both methods take the same time to run at the beginning because no historical models are yet available for TKD.
Gradually, as more data is used in MLP-XG training, its training time increases, while that of MLP-XG-TKD remains approximately constant. MLP-XG-TKD trains consistently faster from Dec-17 through Apr-18. Over the 6-month experimentation period, the average training time was reduced by 58.5%, with up to a 3.8x improvement in Apr-18.
It is important to note that the training time advantage of TKD shown in this experiment translates to significant impact in real-life implementations with large production data sets, reducing training time, cost, and resource usage. This, in turn, improves the computational cost and the agility of responses in adversarial, time-sensitive production environments. TKD provides the opportunity to boost the performance of the baseline models as well as the ensemble models.
6. Conclusions
This study proposes a label augmentation algorithm, Temporal Knowledge Distillation (TKD), for time-sensitive financial anomaly detection applications. The technique aims to provide a new way to boost model performance by incorporating a wider range of patterns, both older and newer, without causing unmanageable increases in the data size. Furthermore, it reduces model retraining times compared to the baseline models. In adversarial and time-critical use cases, such as cybersecurity and payment fraud detection applications, this yields significantly higher agility and more effective response capabilities to attacks. Despite the recent advancements in acceleration techniques, model retraining times remain a challenge in deploying AI/ML models in production systems. TKD delivers an alternative approach to model retraining in time-sensitive and high-volume applications, which provides key benefits to the overall success of the deployment systems.
References
- Anandakrishnan et al. (2017) Archana Anandakrishnan, Senthil Kumar, Alexander Statnikov, Tanveer Faruquie, and Di Xu. 2017. Anomaly Detection in Finance: Editors’ Introduction. Proceedings of Machine Learning Research 71:1–7, 2017 KDD 2017: Workshop on Anomaly Detection in Finance (2017).
- Anandakrishnan et al. (2018) Archana Anandakrishnan, Senthil Kumar, Alexander Statnikov, Tanveer Faruquie, and Di Xu. 2018. Anomaly Detection in Finance: Editors’ Introduction. In Proceedings of the KDD 2017: Workshop on Anomaly Detection in Finance (Proceedings of Machine Learning Research, Vol. 71). PMLR, 1–7.
- A.Shah (2020) A.Shah. 2020. Challenges Deploying Machine Learning Models to Production, MLOps: DevOps for Machine Learning. Towards DataScience (2020). https://towardsdatascience.com/challenges-deploying-machine-learning-models-to-production-ded3f9009cb3
- Ba and Caruana (2014) Jimmy Ba and Rich Caruana. 2014. Do Deep Nets Really Need to be Deep? In Advances in Neural Information Processing Systems 27. 2654–2662.
- Bagherinezhad et al. (2018) Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. 2018. Label Refinery: Improving ImageNet Classification through Label Progression. CoRR abs/1805.02641 (2018).
- Buciluǎ et al. (2006) Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model Compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Philadelphia, PA, USA) (KDD ’06). 535–541.
- Buehler (2019) K. Buehler. 2019. Transforming Approaches to AML and Financial Crime. McKinsey (2019).
- Chatain et al. (2011) P. L. Chatain, A. Zerzan, W. Noor, N. Dannaoui, and L. de Koker. 2011. Protecting Mobile Money against Financial Crimes Global Policy Challenges and Solutions. World Bank Report (2011).
- Chen et al. (2018) Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. 2018. AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training. 32nd AAAI conference on Artificial Intelligence (2018).
- Chen et al. (2020) Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. 2020. Cross-Layer Distillation with Semantic Calibration. CoRR abs/2012.03236 (2020).
- Chen et al. (2017) Guobin Chen, Wongun Choi, Xiang Yuan, Tony Han, and Manmohan Chandraker. 2017. Learning Efficient Object Detection Models with Knowledge Distillation. In Advances in Neural Information Processing Systems 30.
- Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. CoRR abs/1603.02754 (2016).
- Council (2019) European Payments Council. 2019. Payment Methods Report 2019: Innovations in the Way We Pay. E.U. Payments Council Report (2019).
- Daghaghi et al. (2021) Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, and Anshumali Shrivastava. 2021. Accelerating Slide Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations and More. MLSys (2021).
- Davis and Goadrich (2006) Jesse Davis and Mark Goadrich. 2006. The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06).
- Devries and Taylor (2017) Terrance Devries and Graham W. Taylor. 2017. Improved Regularization of Convolutional Neural Networks with Cutout. CoRR abs/1708.04552 (2017).
- Eleftheriou et al. (2019) E. Eleftheriou, M. Le Gallo, S. R. Nandakumar, C. Piveteau, I. Boybat, V. Joshi, R. Khaddam-Aljameh, M. Dazzi, I. Giannopoulos, G. Karunaratne, B. Kersting, M. Stanisavljevic, V. P. Jonnalagadda, N. Ioannou, K. Kourtis, P. A. Francese, and A. Sebastian. 2019. Deep learning acceleration based on in-memory computing. IBM Journal of Research and Development 63, 6 (Nov.-Dec. 2019), 7:1–7:16.
- Farhadi and Yang (2019) Mohammad Farhadi and Yezhou Yang. 2019. TKD: Temporal Knowledge Distillation for Active Perception. CoRR abs/1903.01522 (2019).
- Forough and Momtazi (2021) Javad Forough and Saeedeh Momtazi. 2021. Ensemble of deep sequential models for credit card fraud detection. Applied Soft Computing 99 (2021), 106883.
- Forum (2020) World Economic Forum. 2020. Transforming Paradigms: A Global AI in Financial Services Survey.
- Furlanello et al. (2018) Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born Again Neural Networks. CoRR abs/1805.04770 (2018).
- Gou et al. (2020) Jianping Gou, Baosheng Yu, Stephen John Maybank, and Dacheng Tao. 2020. Knowledge Distillation: A Survey. CoRR abs/2006.05525 (2020).
- Harrison (2020) P. Harrison. 2020. Ecommerce Account Takeover Fraud Jumps to 378% Since the Start of COVID-19 Pandemic. The Fintech Times (2020). https://thefintechtimes.com/ecommerce-account-takeover-fraud-jumps-to-378-since-the-start-of-covid-19-pandemic/
- Hasham et al. (2019) S. Hasham, S. Joshi, and D. Mikkelsen. 2019. Financial Crime and Fraud in the Age of Cybersecurity. McKinsey and Company Report (2019).
- Heaton et al. (2016) J. B. Heaton, Nicholas G. Polson, and J. H. Witte. 2016. Deep Learning in Finance. (2016). arXiv:1602.06561
- Heun (2021) David Heun. 2021. Small Businesses Fueling Zelle’s Growth. American Banker (2021). https://www.americanbanker.com/news/small-businesses-fueling-zelles-growth
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531 (2015).
- IEEE Computational Intelligence Society (2019) IEEE Computational Intelligence Society. 2019. Fraud Detection Competition. https://www.kaggle.com/c/ieee-fraud-detection.
- Kim et al. (2021) Minha Kim, Shahroz Tariq, and Simon S. Woo. 2021. FReTAL: Generalizing Deepfake Detection Using Knowledge Distillation and Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 1001–1012.
- Marfaing and Garcia (2018) Christelle Marfaing and Alexandre Garcia. 2018. Computer-Assisted Fraud Detection, From Active Learning to Reward Maximization. CoRR abs/1811.08212 (2018).
- N et al. (2016) Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. 2016. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2592–2600.
- Oprea et al. (2019) A. Oprea, A. Gal, I. Moulinier, J.Chen, M. Veloso, E.Kurshan, S. Kumar, and T. Faruquie. 2019. NeurIPS 2019 Workshop on Robust AI in Financial Services: Data, Fairness, Explainability, Trustworthiness, and Privacy.
- Reports (2020) Forbes Business Reports. 2020. Fraud Trends And Tectonics. Forbes (2020). https://www.forbes.com/sites/businessreporter/2020/06/08/fraud-trends-and-tectonics/?sh=3a422de06d12
- Romero et al. (2014) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014. FitNets: Hints for Thin Deep Nets. CoRR abs/1412.6550 (2014).
- Şahin and Steedman (2018) Gözde Gül Şahin and Mark Steedman. 2018. Data Augmentation via Dependency Tree Morphing for Low-Resource Languages. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 5004–5009.
- Saito and Rehmsmeier (2015) Takaya Saito and Marc Rehmsmeier. 2015. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE 10 (03 2015).
- Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019).
- Seo et al. (2020) Hyowoon Seo, Jihong Park, Seungeun Oh, Mehdi Bennis, and Seong-Lyun Kim. 2020. Federated Knowledge Distillation. CoRR abs/2011.02367 (2020).
- Shein (2021) E. Shein. 2021. Account Takeover Fraud Rates Skyrocketed 282% Over Last Year. TechRepublic (2021). https://www.techrepublic.com/article/account-takeover-fraud-rates-skyrocketed-282-over-last-year/
- Shen and Kurshan (2020) Hongda Shen and Eren Kurshan. 2020. Deep Q-Network-based Adaptive Alert Threshold Selection Policy for Payment Fraud Systems in Retail Banking. CoRR abs/2010.11062 (2020).
- Shepard et al. (2019) M. Shepard, T.Adams, A. Portilla, M. Ekberg, R. Wainwright, K. Jackson, T. Baumann, C. Bostock, A. Saleh, and P. Saplains Lagoss. 2019. The Global Framework for Fighting Financial Crime Enhancing Effectiveness & Improving Outcomes. Deloitte Report (2019).
- Shepovalov and Akella (2020) M Shepovalov and V Akella. 2020. FPGA and GPU-based acceleration of ML workloads on Amazon cloud-A case study using gradient boosted decision tree library. Elsevier, Integration Volume 70, January 2020, Pages 1-9 (2020).
- Systems (2019) EWS Early Warning Systems. 2019. Zelle Digital Adoption. EWS Reports (2019).
- T Wang et al. (2018) T. Wang, C. Wang, X. Zhou, and H. Chen. 2018. A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities. arXiv (2018).
- Timeframe Analysis (2019) Timeframe Analysis. 2019. Timeframe Analysis. https://www.kaggle.com/terrypham/transactiondt-timeframe-deduction.
- Urban et al. (2016) Gregor Urban, Krzysztof Geras, Samira Ebrahimi Kahou, Özlem Aslan, Shengjie Wang, Rich Caruana, Abdel rahman Mohamed, Matthai Philipose, and Matthew Richardson. 2016. Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? ArXiv abs/1603.05691 (2016).
- Yang et al. (2019) Zhenchuan Yang, Chun Zhang, Weibin Zhang, Jianxiu Jin, and Dongpeng Chen. 2019. Essence Knowledge Distillation for Speech Recognition. CoRR (2019).
- Zelle (2021) Zelle. 2021. Zelle® Closes 2020 with Record $307 Billion Sent on 1.2 Billion Transactions. Zelle Press Releases (2021). https://www.zellepay.com/press-releases/zeller-closes-2020-record-307-billion-sent-12-billion-transactions
- Zhang et al. (2015) Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp 161–170 (2015).
- Zhang et al. (2017) Huan Zhang, Si Si, and Cho-Jui Hsieh. 2017. GPU-acceleration for Large-scale Tree Boosting. Arxiv (2017).
- Zhang et al. (2021) Yiman Zhang, Hanting Chen, Xinghao Chen, Yiping Deng, Chunjing Xu, and Yunhe Wang. 2021. Data-Free Knowledge Distillation for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7852–7861.
- Zhou (2012) Zhi-Hua Zhou. 2012. Ensemble Methods: Foundations and Algorithms (1st ed.). Chapman & Hall/CRC.
- Zhu et al. (2021) Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. 2021. Data-Free Knowledge Distillation for Heterogeneous Federated Learning. CoRR abs/2105.10056 (2021).
Appendix
Table 4. Data preprocessing details.

| Raw feature | Type | Encoding | Null value | Notes |
|---|---|---|---|---|
| TransactionAmt | Continuous | - | - | - |
| dist1 | Continuous | - | - | - |
| dist2 | Continuous | - | - | - |
| ProductCD | Categorical | One hot | - | - |
| card4 | Categorical | One hot | NA | - |
| card6 | Categorical | One hot | NA | - |
| M1-M9 | Categorical | One hot | NA | - |
| device_name | Categorical | One hot | NA | “Others” if frequency < 200 |
| OS | Categorical | One hot | NA | - |
| Browser | Categorical | One hot | NA | “Others” if frequency < 200 |
| DeviceType | Categorical | One hot | NA | - |
Table 5. MLP implementation details.

| Layer | # Neurons | Activation function | Parameter |
|---|---|---|---|
| Dense | 400 | ReLU | - |
| BatchNormalization | - | - | - |
| Dropout | - | - | keep_prob = 0.5 |
| Dense | 400 | ReLU | - |
| Dropout | - | - | keep_prob = 0.5 |
| Dense (Output) | 2 | Softmax | - |
| Learning rate | - | - | 0.01 |
| Batch size | - | - | 512 |
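Reading Table 5 as a network definition, a minimal PyTorch sketch of the baseline MLP might look as follows; the input dimension is a placeholder, and the output softmax is left to the loss function or inference step rather than the module itself.

```python
import torch.nn as nn

def build_mlp(input_dim):
    """Baseline MLP per Table 5: 400-unit ReLU layers, batch normalization,
    50% dropout, and a 2-class output (fraud / non-fraud)."""
    return nn.Sequential(
        nn.Linear(input_dim, 400), nn.ReLU(),
        nn.BatchNorm1d(400),
        nn.Dropout(p=0.5),          # keep_prob = 0.5
        nn.Linear(400, 400), nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(400, 2),          # softmax applied in the loss / at inference
    )

# Table 5 also specifies a learning rate of 0.01 and a batch size of 512 for training.
```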
Table 6. XGBoost hyper-parameters.

| Name | Value |
|---|---|
| colsample_bytree | 0.8 |
| gamma | 0.9 |
| max_depth | 3 |
| min_child_weight | 2.89 |
| reg_alpha | 3 |
| reg_lambda | 40 |
| subsample | 0.94 |
| learning_rate | 0.1 |
| n_estimators | 200 |
n_estimators | 200 |