Using Signal Processing in Tandem With Adapted Mixture Models for Classifying Genomic Signals

Abstract

Genomic signal processing has been used successfully in bioinformatics to analyze biomolecular sequences and gain varied insights into DNA structure, gene organization, protein binding, sequence evolution, etc. But challenges remain in finding the appropriate spectral representation of a biomolecular sequence, especially when multiple variable-length sequences need to be handled consistently. In this study, we address this challenge in the context of the well-studied problem of classifying genomic sequences into different taxonomic units (strain, phyla, order, etc.). We propose a novel technique that employs signal processing in tandem with Gaussian mixture models to improve the spectral representation of a sequence and subsequently the taxonomic classification accuracies. The sequences are first transformed into spectra, and projected to a subspace, where sequences belonging to different taxons are better distinguishable. Our method outperforms a similar state-of-the-art method on established benchmark datasets by an absolute margin of 6.06% accuracy.

Index Terms— genomic sequence, genomic signal processing, machine learning, Gaussian mixture model

1 Introduction

Any sequence of symbols can be mapped to a time series or signal. Biomolecular or genomic sequences, such as DNA or protein sequences that encode information required for the functioning of a cell, can also be viewed as signals and processed for gaining insights about living organisms. This genomic signal processing (GSP) perspective has been appreciated since the early 2000s [1, 2], where space-varying signal across the length of the sequence has been utilized to understand properties of DNA, RNA, and protein sequences. For instance, GSP has been employed for varied tasks in molecular biology, including identifying protein-coding regions in DNA sequences, studying functional domains or binding sites in protein sequences, or more recently the classification of genome sequences into different taxonomic categories (such as species, genus, family, order, etc.).

Obtaining a systematic frequency or spectral representation of each biomolecular sequence is a key step in any of these GSP applications. Within the context of taxonomic classification of sequences, authors of a GSP study [3] for instance used dynamic time warping on variable-length sequence representations to classify genomic signals. But, the study is limited to DNA data from ten organisms and the length of the sequences did not vary much. Authors of two more recent studies [4, 5] utilized GSP along with machine learning to classify genomic sequences at different taxonomic levels, and more importantly, could handle sequences of widely-varying lengths. But they do so by length-normalization (i.e., truncating a long sequence or padding a short sequence to a fixed length), which can possibly lead to loss or distortion of information in the fixed-length spectral representation of variable-length sequences.

In this work¹¹1This work was supported by the Wellcome Trust/DBT India Alliance Intermediate Fellowship Grant IA/I/17/2/503323 awarded to M.N., we propose a novel feature extraction technique that employs GSP and Gaussian mixture models to get a fixed-length subspace representation for genome sequences of varying lengths corresponding to different species. Unlike the existing methods [4, 5] which rely on truncation/padding of DNA sequences, we take the entire sequence information into consideration to compute spectral features. More specifically, we use a sliding window of fixed length to highlight the spectral variations across the length of the sequence, and use a “Universal Background Model (UBM)” based Gaussian Mixture Model (GMM) to project these representations into a fixed-dimensional subspace. Our technique enables different mixtures to flexibly capture the varying properties of the sequences, and thereby obtain informative subspace representations, which can then be used for better taxonomic classification. We consider the earlier GSP study [4] as the state-of-the-art baseline to assess the benefits of our approach.

Refer to caption — Fig. 1: (a) Extraction of magnitude spectra from sliding windows. (b) Average of spectra computed over sequences of the Plant class shows discrimination between the two sub-classes (see Figure S1 to S6 in Suppl. Material for other classes). We employ UBM-GMM to improve upon these average representations.

2 Background

2.1 Genomic Signal Representation

Genomic (DNA) sequences can be converted to discrete signals using any numerical representation, where each of the bases is associated with a numerical value. Digital signal processing can then be applied to these discrete signals. We can assign some numerical values to each of the nucleotides in a DNA sequence without using any biological knowledge, or by using some physical or chemical properties of the nucleotides [6, 7, 8, 9]. In the baseline study [4], 13 different numerical representations were tried, and we also adopt a similar strategy; but we present results for one of the best-performing representations viz., the Purine-Pyrimidine representation (purines (A, G) = -1 and pyrimidines (T, C) = 1)).

2.2 Feature Extraction and Taxonomic Classification in the Baseline GSP Method

Digital signal processing enables the study of the spectral properties of sequences. Distinct spectral properties enable classification. In the baseline, study [4], the sequences of different lengths are normalized (i.e., truncated or “anti-symmetrically” padded) to the median length of all sequences. Discrete Fourier Transform (DFT) is then applied to the length-normalized sequences to extract the corresponding magnitude spectra. The fixed-length spectra were then used to compute pairwise distance matrices using both Pearson’s Correlation Coefficient (PCC) and Euclidean distance metrics. The corresponding columns of the distance matrix were considered as feature vectors for each of the sequences for the classification task.

2.3 Universal Background Model

A Universal Background Model (UBM) is a Gaussian mixture model (GMM) that has been widely used for speaker recognition tasks [10]. We use this model in our study to obtain a fixed-length subspace representation for genomic sequences. A GMM is a generative model trained on data belonging to one particular class. A huge amount of data is required to estimate the parameters of a GMM accurately, especially when the dimension of the feature vector is quite large. UBM-GMM is employed in such cases to tackle the issue of data availability. It is built on data pooled from different classes and hence it tries to capture the properties of all of these classes. Moreover, owing to the availability of a large amount of data, the parameters of UBM-GMM are robustly estimated. Then a particular class’s data can be used to adapt a GMM from the trained UBM-GMM using maximum-a-posteriori adaptation (MAP).

3 Proposed Work

3.1 Our UBM-GMM-based subspace representation

We propose a novel UBM-GMM-based approach to obtain a fixed-length subspace representation for variable-length sequences, which can then be used for the genome classification task. The main contribution of our approach is that we can handle sequences of widely-varying lengths, without suffering the likely loss of information associated with length normalization (padding/truncation) of sequences done in existing studies [4, 5]. This loss of information is a serious issue when the lengths of the sequences vary quite a lot (which is the case for many classes in our application (Tables 1,2)).

The key idea behind our approach is that we take into account information across the entire sequence, by extracting features from every frame (sliding window) of the sequence (Figure 1 (a)). This yields multiple spectra from a given sequence, and we pool all such spectra corresponding to all the sequences across all the classes to build a UBM-GMM (see Figure 2) and subsequently get a fixed-length representation.

To provide more detail, we slide a window (with overlap across windows) of fixed length across the sequence, where we extract spectral features for every segment of the sequence. As discussed in [2], we experimented with a sufficiently large window of sizes ranging from 351 to a few thousand nucleotides. We considered window shifts ranging from 100 to 300 nucleotides. The magnitude spectrum is computed for each windowed sequence, with the DFT order chosen appropriately (window size rounded up to the nearest power of two). These magnitude spectra were then used to build a UBM which projects the spectra to a fixed-dimensional subspace, while primarily ensuring that the variability of the sequences across the entire sequence length is captured in the representation. The number of mixture components for UBM-GMM was empirically chosen as ten.

When adapting a GMM from the UBM-GMM using sequence data, we do mean-only adaptation using MAP, i.e., only the means of the UBM-GMM are changed depending on the feature vectors of the given sequence. The covariance matrices of the mixture components are not adapted due to insufficient data points from a sequence. The means of the adapted GMM are stacked to get a mean-supervector (Figure 2), which acts as the fixed-length subspace representation vector for the entire (untruncated/unpadded) sequence.

3.2 Taxonomic classification using the UBM-GMM-based representation

Once the subspace representations of sequences are obtained using our novel UBM-GMM-based approach, we can use the rest of the setup from the baseline approach for the taxonomic classification task, in order to perform a fair comparison of our and baseline representations’ classification accuracy. Specifically, our subspace representations are used to calculate a pair-wise distance matrix using PCC (as this was found to be the better metric in [4]), and the distance matrix columns used as feature vectors for the classification task. We used four different machine learning classifiers — Linear Discriminant, Linear Support Vector Machine (SVM), Quadratic SVM, and K-Nearest Nighbours (KNN) — as in the baseline study [4].

4 Experimentation

4.1 Dataset

All the datasets used in this study are available from the National Center for Biotechnology Information (NCBI) and through links provided in the Suppl. material ²²2Supplementary material/codes: https://sites.google.com/view/genomic-signal-processing/. The datasets consist of mitochondrial and whole genome sequences corresponding to 8 classes. Each of these classes consists of two or more sub-classes, and classification is done at the sub-class level.

Table 1: Benchmark dataset details shown for certain classes here (see Table S1 in Suppl. Material for remaining classes)

Class

Subclasses (no. of seq.)

Statistics

Fungi

Basidiomycota (30),

Pezizomycotina (104),

Saccharomycotina (92)

Min: 1364; Max: 235849

Med: 39154

Plants

Chlorophyta (44),

Streptophyta (130)

Min: 12998; Max: 1999595

Med: 128211

4.2 Results and Analysis

We ran the experiments on the benchmark datasets using the baseline as well as our approach. We experimented with different numerical representations, distance metrics to compute pair-wise distance matrices, window sizes, and window shifts. Based on the performances, we used the Purine-Pyrimidine representation, median length normalization (for the baseline approach), Pearson’s correlation coefficient (PCC) as a distance metric, window size as 702, window shift as 300 and FFT Order as 1024. The average accuracies computed over earlier discussed four different classifiers are mentioned in Table 2.

Table 2: Comparison of the average accuracies over four different classifiers. For details on individual accuracies of each classifier, see Table S2 in the Suppl. material.

Class	Average Accuracy		Median seq. length	MAD
Class	Baseline	Our approach	Median seq. length	MAD
Primates	98.975	100	16554	50
Protists	78.775	95.75	35660	13944
Fungi	79.925	91.4	39154	34768
Plants	89.65	97.275	128211	290801
Insects	89.95	96.975	15529	768
Vertebrates	98.55	99.7	16616	697
Bacteria	93.825	97	70992	91794
Dengue	99.975	100	10676	174
Average	91.203	97.263

Our approach outperformed the baseline approach on the benchmark datasets especially when the Median Absolute Deviation (MAD) of the sequence lengths was high. It performed quite well even on unseen classes — Bacteria and Dengue — which were not used for building UBM-GMM. This suggests that UBM-GMM captures some important characteristics of the genomic data that results in discriminative properties of the adapted mixture models of a particular class. In order to interpret these results, we tried to visualize the activations per mixture components, for sequences of a given class, during the MAP adaptation process. Figure 3 (a) shows the active mixture components for Plants and Fungi classes along with their sub-classes. We can see from the figures that the active mixture components vary across sub-classes which provides the necessary discriminative properties that enable classification. Variations in active mixture components can also be observed across Plants and Fungi classes. Figure 3 (b) shows the means of the most active mixture components for two sequences belonging to the Chlorophyta and Streptophyta sub-classes of the Plant class each. The means of the active mixture components are clearly distinct across the two sub-classes.

5 Conclusion and Future Work

We proposed a novel UBM-GMM-based subspace representation to facilitate the genomic sequence classification task. In particular, instead of length-normalizing sequences, we use a sliding window of fixed length across the sequences. The features extracted from fixed-length frames are then used to build the UBM-GMM. This helps in ensuring that the properties that vary across a sequence are indeed captured by different activations of different mixture components, and allows the representation model to generalize to genomes of unseen species as well. The next step would be to study if the ordering of these features are important for classification, especially for Fungi, where the performance is comparatively poor. A detailed phylogenetic analysis at different taxonomic levels and its correlation to the proposed methods may lead to new insights.

References

[1] Dimitris Anastassiou, “Genomic ignal processing,” IEEE Signal Processing Magazine, vol. 18, no. 4, pp. 8–20, 2001.
[2] PP Vaidyanathan, “Genomics and proteomics: A signal processor’s tour,” IEEE Circuits and Systems Magazine, vol. 4, no. 4, pp. 6–29, 2004.
[3] Helena Skutkova, Martin Vitek, Petr Babula, Rene Kizek, and Ivo Provaznik, “Classification of genomic signals using dynamic time warping,” BMC Bioinformatics, vol. 14, no. 10, pp. 1–7, 2013.
[4] Gurjit S Randhawa, Kathleen A Hill, and Lila Kari, “ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels,” BMC Genomics, vol. 20, no. 1, pp. 1–21, 2019.
[5] Gurjit S Randhawa, Maximillian PM Soltysiak, Hadi El Roz, Camila PE de Souza, Kathleen A Hill, and Lila Kari, “Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study,” PLOS ONE, vol. 15, no. 4, pp. e0232391, 2020.
[6] Hon Keung Kwan and Swarna Bai Arniker, “Numerical representation of DNA sequences,” in 2009 IEEE International Conference on Electro/Information Technology. IEEE, 2009, pp. 307–310.
[7] Ernesto Borrayo, E Gerardo Mendizabal-Ruiz, Hugo Vélez-Pérez, Rebeca Romo-Vázquez, Adriana P Mendizabal, and J Alejandro Morales, “Genomic signal processing methods for computation of alignment-free distances from DNA sequences,” PLOS ONE, vol. 9, no. 11, pp. e110954, 2014.
[8] Emmanuel Adetiba and Oludayo O Olugbara, “Classification of eukaryotic organisms through cepstral analysis of mitochondrial DNA,” in International Conference on Image and Signal Processing. Springer, 2016, pp. 243–252.
[9] Emmanuel Adetiba, Oludayo O Olugbara, and Tunmike B Taiwo, “Identification of pathogenic viruses using genomic cepstral coefficients with radial basis function neural network,” in Advances in Nature and Biologically Inspired Computing, pp. 281–291. Springer, 2016.
[10] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.