
1 University of Florence, Dept. of Mathematics and Computer Science, Florence, Italy
2 University of Perugia, Dept. of Chemistry, Biology and Biotechnology, Perugia, Italy
3 University of Perugia, Dept. of Mathematics and Computer Science, Perugia, Italy

A new method for binary classification of proteins with Machine Learning

Damiano Perri (1), ORCID: 0000-0001-6815-6659
Marco Simonetti (1), ORCID: 0000-0003-2923-5519
Andrea Lombardi (2), ORCID: 0000-0002-7875-2697
Noelia Faginas-Lago (2), ORCID: 0000-0002-4056-3364
Osvaldo Gervasi (3), ORCID: 0000-0003-4327-520X
Abstract

In this work we set out to find a method to classify protein structures using a Deep Learning methodology. Our neural network has been trained to recognize complex biomolecule structures extracted from the Protein Data Bank (PDB) database and reprocessed as images. For this purpose, various tests have been conducted with pre-trained Convolutional Neural Networks, such as InceptionResNetV2 and InceptionV3, in order to extract significant features from these images and correctly classify the molecule. A comparative analysis of the performance of the networks is presented.

Keywords:
Machine Learning, Computational Chemistry, Protein Data Bank, Convolutional Neural Network, Image Processing, Orthogonal Axonometry

1 Introduction

The classification of the geometric structures of proteins and the identification of simple criteria on which to base their discrimination are complex tasks and a longstanding issue in the chemical sciences. To investigate the relationships between structure and activity, and for a satisfactory theoretical understanding of the protein folding process, the ability to assess the “correctness” and similarity of possible spatial arrangements of such macromolecules is a prerequisite.
The recent increased practicability of computational approaches based on Deep Learning and Neural Networks further motivates renewed efforts in this direction, since it permits approaches based on the search for hidden patterns and regularities across large sets of experimentally resolved protein structures.

In a recent paper [1], we developed an approach to the basic problem of classifying a protein as “real” given its amino acid sequence, using a Deep Learning approach based upon a Convolutional Neural Network (CNN) trained on a large set of data.
In the present paper, which is intended as a continuation of that work, a Convolutional Neural Network is again used to classify a given structure as “true” or “false”, but the CNN has been developed after significant improvements to the original approach for the recognition of the geometric structures. The idea was not to lose valuable spatial information regarding the shape of the protein to be examined, so we moved beyond the model of the molecule as a simple sequence of amino acids towards a more effective and realistic description that preserves spatial information. To this end we exploited the well-known suitability of Convolutional Neural Networks for image analysis, where they are widely used to recognize images of both two-dimensional and three-dimensional objects by extracting features that allow different kinds of objects, such as people or things, to be correctly classified.

In this article we illustrate our approach to the problem using a two-dimensional representation processed by 2D Convolutional Neural Networks: any given protein is mapped onto a two-dimensional grid of coloured pixels and then processed by the CNN in order to extract the relevant features and characteristic properties needed to carry out the protein classification. In order to train the neural network, as in the previous work, the set of protein structures was obtained from the Protein Data Bank (PDB)[2], an open access repository containing data about the structures of proteins and nucleic acids.

The paper is organized as follows. In Sec. 2 we briefly point out some key points about data extraction for protein classification. Sec. 3 illustrates the characteristics of our CNN and reports details about data extraction and processing, together with preliminary applications for training and validation. Conclusions and perspectives are given in Sec. 4.

2 Strategies for classification of complex molecular structures

In recent years, Machine Learning techniques have rapidly become more and more pervasive in Biology[3, 4] and Chemistry[5, 6, 7, 8], especially in the classification of the macro-molecules that are generally found in the modeling of protein and bio-molecular structures[9]. The identification and correct assignment of the protein attribute to a generic bio-molecule is of considerable importance both for genomic mapping and for the preparation of new and more specific groups of drugs[10, 11, 12].
Various methodologies are continuously being proposed that take into consideration different aspects[13], such as chemical-physical properties[14] or geometric structure[15, 16], to reach this goal. In the first case, two different approaches are possible:

  1. attention is focused on a chemical-physical feature of interest and the bio-molecules are labelled, also using a deep learning or SVM (Support Vector Machine) technique to obtain an accurate and automated classifier;

  2. an n-dimensional vector with the descriptors of the chemical-physical properties to be examined is produced with an attached class label[17].

Several studies have confirmed that combining protein characteristics yields better predictive information than using single protein characteristics[18].
In the second case, the spatial arrangement of the various amino acid chains is evaluated for classification, both with SVM[19] and with Machine Learning/Deep Learning[20, 21] techniques. Today, convolutional neural networks are widely used in image analysis: they allow the extraction of features with which object recognition or image classification can be performed[22, 23, 24]. Our research fits into this last line of work, with the aim of testing a way to correctly classify protein groups. Most of the work in this area has focused on the study of the sequence and position of amino acids in the protein chain (primary or secondary structure); our research has tried to preserve the information related to the sequences while simultaneously capturing all the geometric characteristics using 2D axonometric maps.

3 Methodology: the architecture of the system

The system is a classic binary classifier whose task is to correctly label the biomolecules given as input as “protein” or “non-protein”. The data relating to the molecules to be examined are passed to the neural network as two-dimensional images, which faithfully reconstruct the geometric structure of the molecule.

3.1 Data extraction and processing

In order to validate our methodology, we extracted from the PDB a number of records sufficient to train the network effectively; after several tests, we found that good results were obtained with around 3,000 samples. This allowed us to select a group of proteins (2,911 molecules, with 16,924,350 amino acids) with the best images of their structure and to focus our attention on the results. The whole process, from data extraction to image evaluation by the neural network, was performed in Python 3, with the help of well-known libraries for scientific computing and data processing, such as NumPy and Pandas.

Figure 1: Information extracted: spatial position, name and colour code of amino acids

The steps required for the generation and management of the dataset are listed below:
1) Data extraction in XML format from the PDB database, identification of the information to be kept, and its transformation into a single object managed by the Pandas library (Fig. 1).
2) Data cleaning, with the elimination of any duplicate records, possibly generated by the two different methods of measuring the crystal lattice structure of the molecule.
3) Association of a unique RGB colour code with each amino acid present in the proteins (e.g. Alanine 128, Glycine 65280, Lysine 8421376), in order to visualize the structure of the molecule as an image in which each amino acid has its own colour.
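
The three example codes in step 3 are consistent with 24-bit RGB values packed into a single integer. The following is a minimal sketch of that decoding, assuming the packing order 0xRRGGBB (the text does not state the byte order); the helper names `int_to_rgb` and `rgb_to_int` are hypothetical.

```python
# Sketch only: assumes colour codes are 24-bit integers packed as 0xRRGGBB.
# The byte order is an assumption; the paper only lists the integer values.

EXAMPLE_CODES = {"Alanine": 128, "Glycine": 65280, "Lysine": 8421376}


def int_to_rgb(code):
    """Split a packed 24-bit colour code into an (R, G, B) tuple."""
    if not 0 <= code <= 0xFFFFFF:
        raise ValueError("colour code must fit in 24 bits")
    return (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def rgb_to_int(red, green, blue):
    """Pack an (R, G, B) triple back into a single integer."""
    return (red << 16) | (green << 8) | blue


if __name__ == "__main__":
    for name, code in EXAMPLE_CODES.items():
        print(name, code, int_to_rgb(code))
```

Under this assumption, Alanine (128) decodes to (0, 0, 128), Glycine (65280) to (0, 255, 0) and Lysine (8421376) to (128, 128, 0).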

Figure 2: Orthogonal axonometry
  • Visualization of the molecule according to a multiview orthographic axonometric map, with orthogonal projections on three planes (horizontal plane, vertical plane and lateral/profile plane; an example of this type of axonometry is shown in Figure 2), in order to split and project the whole 3D image onto a flat surface without losing the isometry and the symmetries of the x, y and z components of the individual amino acids. From the analysis of the coordinates of all the amino acids present in the dataset, it was possible, through an appropriate translation of the origin of the Cartesian reference system and an integer mapping of the coordinate values, to represent every protein in the domain $D=[0,3200]^{3}\subset\mathbb{N}^{3}$; this allowed us to treat each point (x, y, z) of the cube D as a 3-index tuple into a tensor of dimensions 3200x3200x3200, capable of containing the entire biomolecule.

  • Each image has been processed to fit within 299x299 pixels, the input size expected by the 2D convolutional neural networks used; in order to avoid distorting the original axonometric proportions, the figures have been carefully cropped at the edges and in the central areas to reduce unnecessary black padding.
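
The projection and cropping described in the two points above can be sketched with NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `scale` parameter and the per-molecule translation are hypothetical (the paper derives one translation from the whole dataset and does not give the integer mapping), and the crop only trims the outer black border, not the central areas.

```python
import numpy as np

# (column axis, row axis) for the three orthogonal views; the naming is an assumption.
VIEWS = {"xy": (0, 1), "xz": (0, 2), "yz": (1, 2)}


def to_grid(coords, scale=1.0):
    """Translate coordinates so the minimum sits at the origin, then round to integers."""
    coords = np.asarray(coords, dtype=float)
    shifted = coords - coords.min(axis=0)
    return np.rint(shifted * scale).astype(np.int64)


def project_views(grid, colours):
    """Draw every point as one coloured pixel on each of the three planes."""
    colours = np.asarray(colours, dtype=np.uint8)
    size = int(grid.max()) + 1
    views = {}
    for name, (col_axis, row_axis) in VIEWS.items():
        image = np.zeros((size, size, 3), dtype=np.uint8)  # black background
        image[grid[:, row_axis], grid[:, col_axis]] = colours
        views[name] = image
    return views


def crop_to_content(image):
    """Remove the black border around the non-black pixels of an RGB image."""
    mask = image.any(axis=2)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:  # nothing drawn: return the image unchanged
        return image
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


if __name__ == "__main__":
    coords = [[1.0, 2.0, 3.0], [4.0, 2.0, 5.0]]  # two made-up residue positions
    colours = [[0, 0, 128], [0, 255, 0]]
    views = project_views(to_grid(coords), colours)
    print({name: crop_to_content(img).shape for name, img in views.items()})
```

When two residues project onto the same pixel, one colour overwrites the other; how the original pipeline resolves such collisions is not stated in the text.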

Figure 3: Images of 4 proteins obtained with our representation method (panels a to d)

Therefore, for each protein, the amino acids were projected as coloured dots onto the three main planes; Figure 3 shows four images obtained with our method for four different proteins.
4) Generation of false samples, necessary for the training of the neural network. We started from the original images: to the amino acids of each protein we applied a mutation probability, fixed at the beginning of the process (experimental tests led us to choose a value of 5%), which induced a colour change in the coloured points representing the amino acids. In this way 2,911 images of false proteins were produced. Fig. 4 shows a portion of a true protein (above) with its false analogue (below): the differences due to the mutation process are visible as pixels that occupy the same position in the two figures but have different colours.

Figure 4: Differences between the same portion of a real molecule (above) and its false analogue (below)
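
The generation of false samples in step 4 can be sketched as a per-residue random colour change. This is a minimal sketch, assuming that the replacement colour is drawn uniformly from the other colours of the amino acid palette; the text only states that a colour change is induced with a fixed probability of 5%. The function name `mutate_colours` and its parameters are hypothetical.

```python
import random


def mutate_colours(colour_codes, palette, p_mutation=0.05, seed=None):
    """Return a copy of colour_codes where each entry is replaced, with
    probability p_mutation, by a different colour taken from palette."""
    palette = list(dict.fromkeys(palette))  # drop duplicates, keep order
    if len(palette) < 2:
        raise ValueError("palette needs at least two distinct colours")
    rng = random.Random(seed)
    mutated = []
    for code in colour_codes:
        if rng.random() < p_mutation:
            # Assumption: pick uniformly among the colours that differ from the original.
            alternatives = [c for c in palette if c != code]
            mutated.append(rng.choice(alternatives))
        else:
            mutated.append(code)
    return mutated


if __name__ == "__main__":
    palette = [128, 65280, 8421376]  # example codes quoted in step 3
    true_protein = [128, 65280, 65280, 8421376, 128] * 20
    false_protein = mutate_colours(true_protein, palette, p_mutation=0.05, seed=0)
    changed = sum(a != b for a, b in zip(true_protein, false_protein))
    print(f"{changed} of {len(true_protein)} residues changed colour")
```

Applying the mutation to the residue colours before drawing, rather than to the finished image, keeps the positions of the points identical in the true and false samples, which matches the description of Fig. 4.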

3.2 Training and validation

The performance of two neural networks was analyzed: InceptionV3 and InceptionResNetV2. These networks were trained with the transfer learning technique, which allowed us to obtain high accuracy values with a reduced number of training epochs.
Transfer learning is the process by which a neural network learns a new task by transferring knowledge from a related task previously learned. With transfer learning, the weights of the convolutional layers, which are generally used for feature extraction, are imported as initial parameters of the network, and only the final layers, which are generally used for classification, are trained[25, 26]. Both networks are pre-trained on ImageNet, an image dataset used for object recognition that consists of 14 million photographs.
The model used for the InceptionV3 neural network is shown in Table 1, while the model used for the InceptionResNetV2 neural network is shown in Table 2.

Layer (type)                 Output Shape          Param #
inception_v3                 (None, 8, 8, 2048)    21802784
flatten (Flatten)            (None, 131072)        0
dense (Dense)                (None, 64)            8388672
dense_1 (Dense)              (None, 64)            4160
dense_2 (Dense)              (None, 1)             65
=====================================
Total Params: 30,195,681
Trainable Params: 30,161,249
Table 1: InceptionV3 model

Layer (type)                 Output Shape          Param #
inception_resnet_v2          (None, 8, 8, 1536)    54336736
flatten (Flatten)            (None, 98304)         0
dense (Dense)                (None, 64)            6291520
dense_1 (Dense)              (None, 64)            4160
dense_2 (Dense)              (None, 1)             65
=====================================
Total Params: 60,632,481
Trainable Params: 60,571,937
Table 2: InceptionResNetV2 model
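
The parameter counts of the classification head in Tables 1 and 2 follow from the usual formula for a dense layer (inputs × units + units). The sketch below checks them in plain Python and then shows how such a model could be assembled with Keras. The activations, the `freeze_base` switch and the function names are assumptions, since the text gives only layer types and shapes; TensorFlow is imported lazily so that the arithmetic check runs without it.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def head_params(feature_shape, hidden_units=64):
    """Parameter counts of the Flatten -> Dense(64) -> Dense(64) -> Dense(1) head."""
    height, width, channels = feature_shape
    flat = height * width * channels
    return [
        dense_params(flat, hidden_units),
        dense_params(hidden_units, hidden_units),
        dense_params(hidden_units, 1),
    ]


def build_classifier(base_name="InceptionV3", weights="imagenet", freeze_base=True):
    """Assemble a binary classifier on top of a pre-trained backbone.

    Assumptions: ReLU/sigmoid activations and the freeze_base option are
    illustrative choices, not taken from the paper.
    """
    import tensorflow as tf  # lazy import: only needed to build the model

    base_cls = getattr(tf.keras.applications, base_name)
    base = base_cls(include_top=False, weights=weights, input_shape=(299, 299, 3))
    base.trainable = not freeze_base
    return tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])


if __name__ == "__main__":
    # Backbone counts are copied from Tables 1 and 2.
    print(21802784 + sum(head_params((8, 8, 2048))))  # InceptionV3 total
    print(54336736 + sum(head_params((8, 8, 1536))))  # InceptionResNetV2 total
```

Calling `build_classifier("InceptionResNetV2")` would give the second architecture. The trainable counts in the tables are close to the totals, so whether the backbone was frozen during training is left here as a caller's choice.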

The results of the training are summarized in Table 3, which shows that the InceptionV3 network distinguishes false proteins from true ones with higher accuracy than InceptionResNetV2.

                       InceptionV3    InceptionResNetV2
Accuracy Train Set     99.06%         98.36%
Accuracy Test Set      94.57%         92.14%
Model Size             461MB          465MB
Table 3: Final Results

The confusion matrices[27] of the two models indicate that both are able to classify proteins with good results; we also believe that appropriate parameter tuning, through the realization of ad-hoc models, can further increase the accuracy. The matrices are shown in Figure 5.

Figure 5: Confusion matrices (panels a to d)
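
For a binary classifier, the summary figures usually read off a confusion matrix are accuracy, precision and recall. The sketch below shows how they follow from the four cells; the function name is hypothetical and the counts in the usage example are invented for illustration only, they are not the values of Figure 5.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the cells of a 2x2 confusion matrix.

    tp/fn: true proteins classified as true/false;
    fp/tn: false proteins classified as true/false.
    """
    def ratio(numerator, denominator):
        # Avoid a division by zero when a row or column is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


if __name__ == "__main__":
    # Made-up counts, used only to exercise the function.
    print(binary_metrics(tp=90, fp=10, fn=5, tn=95))
```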

4 Conclusions and future works

In this work we presented an improved Convolutional Neural Network for the classification of protein structures. Preliminary results appear encouraging, showing that the methodology used for the representation of the protein geometry, based on a 2D projected image associated with each molecule, retains most of the spatial information and is suitable for recognition. The computational performance also seems quite good: calculations are extremely fast, since we provide the neural network with a single image for each molecule.

However, considerations on the general nature of the CNNs used lead us to think that specifically designed neural networks could significantly improve the results. A further research path worth following is to train the neural network with a greater number of samples, to better analyze the link between the structural complexity of the samples and the classification capacity of the network. We believe that even better performance might be achieved with a neural network customised for the graphical representation we proposed: our representation is quite specific, and the areas with high information content are located in well-defined sectors of the images. Furthermore, a customised neural network could also reduce the size in megabytes of the resulting model.

5 Acknowledgments

AL and NFL thank the Dipartimento di Chimica, Biologia e Biotecnologie dell’Università di Perugia (FRB, Fondo per la Ricerca di Base 2019 and 2020) and the Italian MIUR and the University of Perugia for the financial support of the AMIS project through the program “Dipartimenti di Eccellenza”. AL thanks the OU Supercomputing Center for Education & Research (OSCER) at the University of Oklahoma, for allocated computing time.

References

  • [1] Damiano Perri et al. “Binary Classification of Proteins by a Machine Learning Approach” In International Conference on Computational Science and Its Applications, 2020, pp. 549–558 Springer
  • [2] H. Berman et al. “The Protein Data Bank” In Nucleic Acids Research 28, 2000, pp. 235–242 URL: http://www.rcsb.org/
  • [3] Hugh M Cartwright “Artificial neural networks in biology and chemistry—the evolution of a new analytical tool” In Artificial Neural Networks Springer, 2008, pp. 1–13
  • [4] Vidhya Gomathi Krishnan and David R. Westhead “A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function” In Bioinformatics 19.17 Oxford University Press, 2003, pp. 2199–2209
  • [5] Pavlo O Dral “Quantum chemistry in the age of machine learning” In The Journal of Physical Chemistry Letters 11.6 ACS Publications, 2020, pp. 2336–2347
  • [6] Brian B Goldman and W Patrick Walters “Machine learning in computational chemistry” In Annual Reports in Computational Chemistry 2 Elsevier, 2006, pp. 127–140
  • [7] Jane Panteleev, Hua Gao and Lei Jia “Recent applications of machine learning in medicinal chemistry” In Bioorganic & Medicinal Chemistry Letters 28.17 Elsevier, 2018, pp. 2807–2815
  • [8] Adam C Mater and Michelle L Coote “Deep learning in chemistry” In Journal of Chemical Information and Modeling 59.6 ACS Publications, 2019, pp. 2545–2559
  • [9] Mohammad Reza Bakhtiarizadeh, Mohammad Moradi-Shahrbabak, Mansour Ebrahimi and Esmaeil Ebrahimie “Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology” In Journal of Theoretical Biology 356 Elsevier, 2014, pp. 213–222
  • [10] Si-sheng Ou-Yang et al. “Computational drug discovery” In Acta Pharmacologica Sinica 33.9 Nature Publishing Group, 2012, pp. 1131–1140
  • [11] Gregory Sliwoski, Sandeepkumar Kothiwale, Jens Meiler and Edward W Lowe “Computational methods in drug discovery” In Pharmacological Reviews 66.1 ASPET, 2014, pp. 334–395
  • [12] Ali Akbar Jamali et al. “DrugMiner: comparative analysis of machine learning algorithms for prediction of potential druggable proteins” In Drug Discovery Today 21.5 Elsevier, 2016, pp. 718–724
  • [13] Rishi Das Roy and Debasis Dash “Selection of relevant features from amino acids enables development of robust classifiers” In Amino Acids 46.5 Springer, 2014, pp. 1343–1351
  • [14] Masateru Taniguchi “Combination of single-molecule electrical measurements and machine learning for the identification of single biomolecules” In ACS Omega 5.2 ACS Publications, 2020, pp. 959–964
  • [15] Simon Brandt, Florian Sittel, Matthias Ernst and Gerhard Stock “Machine learning of biomolecular reaction coordinates” In The Journal of Physical Chemistry Letters 9.9 ACS Publications, 2018, pp. 2144–2150
  • [16] Zixuan Cang, Lin Mu and Guo-Wei Wei “Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening” In PLoS Computational Biology 14.1 Public Library of Science, 2018, pp. e1005929
  • [17] M Michael Gromiha, Shandar Ahmad and Makiko Suwa “Neural network based prediction of protein structure and Function: Comparison with other machine learning methods” In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008, pp. 1739–1744 IEEE
  • [18] Serene AK Ong et al. “Efficacy of different protein descriptors in predicting protein functional families” In BMC Bioinformatics 8.1 BioMed Central, 2007, pp. 1–14
  • [19] CZ Cai et al. “SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence” In Nucleic Acids Research 31.13 Oxford University Press, 2003, pp. 3692–3697
  • [20] Juan Cui et al. “Advances in exploration of machine learning methods for predicting functional class and interaction profiles of proteins and peptides irrespective of sequence homology” In Current Bioinformatics 2.2 Bentham Science Publishers, 2007, pp. 95–112
  • [21] Chris HQ Ding and Inna Dubchak “Multi-class protein fold recognition using support vector machines and neural networks” In Bioinformatics 17.4 Oxford University Press, 2001, pp. 349–358
  • [22] Damiano Perri et al. “Towards a Learning-Based Performance Modeling for Accelerating Deep Neural Networks” In Computational Science and Its Applications – ICCSA 2019 Cham: Springer International Publishing, 2019, pp. 665–676
  • [23] Giulio Biondi, Valentina Franzoni, Osvaldo Gervasi and Damiano Perri “An Approach for Improving Automatic Mouth Emotion Recognition” In Computational Science and Its Applications – ICCSA 2019 Cham: Springer International Publishing, 2019, pp. 649–664
  • [24] Paolo Sylos Labini et al. “On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond” In ACM Trans. Archit. Code Optim. 18.1 New York, NY, USA: Association for Computing Machinery, 2021 DOI: 10.1145/3434402
  • [25] Valentina Franzoni, Giulio Biondi, Damiano Perri and Osvaldo Gervasi “Enhancing Mouth-Based Emotion Recognition Using Transfer Learning” In Sensors 20.18, 2020 DOI: 10.3390/s20185222
  • [26] Priscilla Benedetti et al. “Skin Cancer Classification Using Inception Network and Transfer Learning” In Computational Science and Its Applications – ICCSA 2020 Cham: Springer International Publishing, 2020, pp. 536–545
  • [27] Sofia Visa, Brian Ramsay, Anca L Ralescu and Esther Van Der Knaap “Confusion Matrix-based Feature Selection.” In MAICS 710, 2011, pp. 120–127