1 University of Florence, Dept. of Mathematics and Computer Science, Florence, Italy
2 University of Perugia, Dept. of Chemistry, Biology and Biotechnology, Perugia, Italy
3 University of Perugia, Dept. of Mathematics and Computer Science, Perugia, Italy

A new method for binary classification of proteins with Machine Learning

Damiano Perri ^{1} ^{ORCID:0000-0001-6815-6659}
Marco Simonetti ^{1} ^{ORCID:0000-0003-2923-5519}
Andrea Lombardi ^{2} ^{ORCID:0000-0002-7875-2697}
Noelia Faginas-Lago ^{2} ^{ORCID:0000-0002-4056-3364}
Osvaldo Gervasi ^{3} ^{ORCID:0000-0003-4327-520X}

Abstract

In this work we set out to find a method to classify protein structures using a Deep Learning methodology. Our model has been trained to recognize complex biomolecule structures extracted from the Protein Data Bank (PDB) database and reprocessed as images. For this purpose, various tests have been conducted with pre-trained Convolutional Neural Networks, such as InceptionResNetV2 and InceptionV3, in order to extract significant features from these images and correctly classify the molecule. A comparative analysis of the performance of the various networks is then presented.

Keywords:

Machine Learning, Computational Chemistry, Protein Data Bank, Convolutional Neural Network, Image Processing, Orthogonal Axonometry

1 Introduction

The classification of the geometric structures of proteins, and the identification of simple criteria on which to base their discrimination, are complex tasks and a long-standing issue in the chemical sciences. The ability to assess the “correctness” and similarity of possible spatial arrangements of such macromolecules is a prerequisite both for investigating the relationships between structure and activity and for a satisfactory theoretical understanding of the protein folding process.
The recent increase in the practicability of computational approaches based on Deep Learning and Neural Networks further motivates renewed efforts in this direction, since it permits approaches based on the search for hidden patterns and regularities across large sets of experimentally resolved protein structures.

In a recent paper [1], we addressed the basic problem of classifying a protein as “real” given its amino acid sequence, using a Deep Learning approach based on a Convolutional Neural Network (CNN) trained on a large set of data.
The present paper is intended as a continuation of that work: a Convolutional Neural Network is again used to classify a given structure as “true” or “false”, but the CNN has been developed after significant improvements to the original approach for recognizing geometric structures. The idea was to avoid losing valuable spatial information about the shape of the protein under examination, so we moved beyond the model of the molecule as a simple sequence of amino acids towards a more effective and realistic description that preserves spatial information. To this end we exploited the well-known suitability of Convolutional Neural Networks for image analysis, where they are particularly valued for recognizing images and their characteristics, for both two-dimensional and three-dimensional objects, by extracting features that allow different kinds of objects, such as people or things, to be correctly classified.

In this article we illustrate our approach to the problem using a two-dimensional representation based on 2D Convolutional Neural Networks, where each protein is mapped onto a two-dimensional grid of coloured pixels and then processed by the CNN, which extracts the relevant features and characteristic properties needed to carry out the classification. As in the previous work, the set of protein structures used to train the neural network was obtained from the Protein Data Bank (PDB) [2], an open-access repository containing data about the structures of proteins and nucleic acids.

The paper is organized as follows. In Sec. 2 we briefly point out some key points about data extraction for protein classification. Sec. 3 illustrates the characteristics of our CNN and reports details about data extraction and processing, together with preliminary applications for training and validation. Conclusions and perspectives are in Sec. 4.

2 Strategies for classification of complex molecular structures

For some years now, the use of Machine Learning techniques has become more and more pervasive in Biology [3, 4] and Chemistry [5, 6, 7, 8], especially in the classification of the macromolecules generally encountered in the modeling of protein and biomolecular structures [9]. Correctly identifying whether a generic biomolecule is a protein is of considerable importance both for genomic mapping and for the preparation of new and more specific groups of drugs [10, 11, 12].
Various methodologies are continuously being proposed that take different aspects into consideration [13], such as chemical-physical properties [14] or geometric structure [15, 16], to reach this goal. In the first case, two different approaches are possible:

1. attention is focused on a chemical-physical feature of interest and the bio-molecules are labelled, also using a deep learning or SVM (Support Vector Machine) technique to obtain an accurate and automated classifier;
2. an n-dimensional vector with the descriptors of the chemical-physical properties to be examined is produced with an attached class label [17].

Several studies have confirmed that combining several protein characteristics yields better predictive information than using a single characteristic [18].
In the second case, the spatial arrangement of the various amino acid chains is evaluated for classification, both with SVM [19] and with Machine Learning/Deep Learning [20, 21] techniques. Today, convolutional neural networks are widely used in image analysis. They allow the extraction of features with which object recognition or image classification can be performed [22, 23, 24]. Our research fits into this last line of work, with the aim of testing a way to correctly classify protein groups. Most of the work in this area has focused on the study of the sequence and position of amino acids in the protein chain (primary or secondary structure); our research has tried to maintain the information related to the sequences while simultaneously capturing all the geometric characteristics using 2D axonometric maps.

3 Methodology: the architecture of the system

The system is a classic binary classifier whose task is to correctly label the biomolecules given as input as “protein” or “non-protein”. The data describing each molecule are passed to the neural network as two-dimensional images that faithfully reconstruct the geometric structure of the molecule itself.

3.1 Data extraction and processing

In order to validate our methodology, we extracted from the PDB enough records to train the network effectively; after several tests, we found that good results were obtained with around 3,000 samples. This allowed us to select a group of proteins (2,911 molecules, with 16,924,350 amino acids) with the best images of their structure and to focus our attention on the results. The whole process, from data extraction to image evaluation by the neural network, was performed in Python3, with the help of well-known libraries for scientific computing and data processing, such as NumPy and pandas.

Figure 1: Information extracted: spatial position, name and colour code of amino acids

The various steps required for the generation and management of the dataset are listed below:
1) Data extraction in XML format from the PDB database, identification of the information we wish to keep, and its subsequent transformation into a single object managed by the pandas library (Fig. 1);
2) Data cleaning, with the elimination of any duplicate records, possibly generated by the two different methods of measuring the crystal lattice structure of the molecule;
3) Association of a unique RGB colour code with each amino acid present in proteins (e.g. Alanine 128, Glycine 65280, Lysine 8421376), in order to visualize the structure of the molecule as an image in which every amino acid is coloured differently:

• Visualization of the molecule according to a multiview orthographic axonometric map with orthogonal projections on three planes (horizontal plane, vertical plane and lateral/profile plane; an example of this type of axonometry is shown in Figure 2), in order to split the whole 3D image and project it onto a flat surface without losing the isometry and the symmetries of the x, y and z components of the individual amino acids. From the analysis of the coordinates of all the amino acids present in the dataset, an appropriate translation of the origin of the Cartesian reference system and an integer mapping of the coordinate values made it possible to represent every protein in the domain $D=[0,3200]^{3}\subset\mathbb{N}^{3}$; each point (x, y, z) of the cube D can therefore be used as a three-index tuple into a tensor of dimensions 3200x3200x3200, capable of containing the entire biomolecule;
• Each image has been processed to fit within 299x299 pixels, the input size expected by the 2D convolutional neural networks used here; in order to avoid distorting the original axonometric proportions, the figures have been carefully cropped at the edges and in the central areas to reduce unnecessary black padding.

Therefore, for each protein, the various amino acids were projected, as coloured dots, onto the three main planes: Figure 3 shows four images obtained with our method for four different proteins.
4) Generation of false samples, necessary for the learning of the neural network. We proceeded from the original images: to each amino acid belonging to a protein we applied a mutation probability, fixed at the beginning of the process (in our case experimental tests made us lean towards a value of 5%), which induced a colour change in the coloured point representing that amino acid. In this way 2,911 images of false proteins were produced. In Fig. 4 a portion of a true protein is reproduced (above) with its false analogue (below): the differences due to the mutation process are visible as pixels that occupy the same position in the two figures but have different colours.
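The colour coding, projection and mutation steps listed above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the code used in this work. The function names (`int_to_rgb`, `to_grid`, `project`, `mutate`), the `scale` parameter, the three-letter residue keys, the 0xRRGGBB byte order of the integer colour codes, the labelling of the three planes, and the rule that a mutated residue takes the colour of another amino acid chosen uniformly at random are all hypothetical. A sparse dictionary stands in for the dense image of each plane, and the cropping to 299x299 pixels is not covered.

```python
import random

# Integer colour codes quoted in the text; the remaining residues would be added likewise.
COLOUR_CODES = {"ALA": 128, "GLY": 65280, "LYS": 8421376}


def int_to_rgb(code):
    """Split a 24-bit integer into (R, G, B), assuming 0xRRGGBB byte order."""
    return (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def to_grid(coords, scale=1.0):
    """Translate so that the minimum on each axis is 0, then map to integers."""
    mins = [min(c[i] for c in coords) for i in range(3)]
    return [tuple(int(round((c[i] - mins[i]) * scale)) for i in range(3))
            for c in coords]


def project(residues, scale=1.0):
    """Return three sparse images (pixel -> RGB), one per coordinate plane."""
    grid = to_grid([xyz for _, xyz in residues], scale)
    views = {"xy": {}, "xz": {}, "yz": {}}
    for (name, _), (x, y, z) in zip(residues, grid):
        rgb = int_to_rgb(COLOUR_CODES[name])
        views["xy"][(x, y)] = rgb  # assumed to be the horizontal plane
        views["xz"][(x, z)] = rgb  # assumed to be the vertical plane
        views["yz"][(y, z)] = rgb  # assumed to be the lateral/profile plane
    return views


def mutate(residues, p=0.05, rng=None):
    """Recolour each residue with probability p to build a false sample."""
    if rng is None:
        rng = random.Random()
    names = sorted(COLOUR_CODES)
    mutated = []
    for name, xyz in residues:
        if rng.random() < p:
            # Assumption: the new colour belongs to a different amino acid.
            name = rng.choice([n for n in names if n != name])
        mutated.append((name, xyz))
    return mutated


if __name__ == "__main__":
    # Toy residues with made-up coordinates, for illustration only.
    demo = [("ALA", (1.0, -3.0, 0.0)), ("GLY", (3.0, 0.0, 5.0)), ("LYS", (-1.0, 4.0, 2.0))]
    true_views = project(demo)
    false_views = project(mutate(demo, p=0.05, rng=random.Random(0)))
    print(true_views["xy"])
    print(false_views["xy"])
```

Because `mutate` changes only the residue label and leaves the coordinates untouched, the true and the false image have coloured pixels in exactly the same positions and differ only in colour, which is the property described for Fig. 4.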

3.2 Training and validation

The performance of two neural networks was analyzed: InceptionV3 and InceptionResNetV2. These networks were trained with the transfer learning technique, which allowed us to obtain high accuracy values with a reduced number of learning epochs.
Transfer learning is the process by which a neural network learns a new task by transferring knowledge from a related task it has previously learned. Using transfer learning, the weight values of the convolutional layers (which are generally used for feature extraction) are imported as the initial parameters of the network, and only the final layers of the neural network, which are generally used for classification, are trained [25, 26]. Both networks are pre-trained on ImageNet, an image dataset used for object recognition consisting of about 14 million images.
The model used for the InceptionV3 neural network is shown in Table 1, while the model used for the InceptionResNetV2 neural network is shown in Table 2.

Table 1: Model used for the InceptionV3 neural network
Layer (type)	Output Shape	Param #
inception_v3	(None, 8, 8, 2048)	21802784
flatten (Flatten)	(None, 131072)	0
dense (Dense)	(None, 64)	8388672
dense_1 (Dense)	(None, 64)	4160
dense_2 (Dense)	(None, 1)	65
=====================================
Total Params:		30,195,681
Trainable Params:		30,161,249

Table 2: Model used for the InceptionResNetV2 neural network
Layer (type)	Output Shape	Param #
inception_resnet_v2	(None, 8, 8, 1536)	54336736
flatten (Flatten)	(None, 98304)	0
dense (Dense)	(None, 64)	6291520
dense_1 (Dense)	(None, 64)	4160
dense_2 (Dense)	(None, 1)	65
=====================================
Total Params:		60,632,481
Trainable Params:		60,571,937
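The layer stacks listed in the two model tables above can be reproduced with a short Keras sketch, and their parameter counts can be checked by hand. The code below is only an illustration under assumptions. The function names `head_param_counts` and `build_model`, the `freeze_base` switch and the activation functions are hypothetical, since the text does not state them. The 299x299 input size and the ImageNet weights are taken from the text. `build_model` needs TensorFlow and imports it lazily, so the arithmetic helper runs without it.

```python
def head_param_counts(feature_shape, units=(64, 64, 1)):
    """Parameter counts of Dense layers stacked on a flattened feature map."""
    n_in = 1
    for dim in feature_shape:
        n_in *= dim  # size of the Flatten output
    counts = []
    for n_out in units:
        counts.append(n_in * n_out + n_out)  # weights plus biases
        n_in = n_out
    return counts


def build_model(base_name="InceptionV3", freeze_base=True):
    """Pre-trained base + Flatten + Dense(64) + Dense(64) + Dense(1)."""
    import tensorflow as tf  # lazy import: only needed to build the network

    base_cls = getattr(tf.keras.applications, base_name)
    base = base_cls(include_top=False, weights="imagenet",
                    input_shape=(299, 299, 3))
    base.trainable = not freeze_base  # frozen base: only the head is trained
    return tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),    # activation assumed
        tf.keras.layers.Dense(64, activation="relu"),    # activation assumed
        tf.keras.layers.Dense(1, activation="sigmoid"),  # activation assumed
    ])


if __name__ == "__main__":
    # Base parameter counts are copied from the tables.
    for name, shape, base_params in [
        ("InceptionV3", (8, 8, 2048), 21802784),
        ("InceptionResNetV2", (8, 8, 1536), 54336736),
    ]:
        head = head_param_counts(shape)
        print(name, head, base_params + sum(head))
```

The helper reproduces the Dense rows and the totals of both tables (30,195,681 and 60,632,481 parameters). The classification heads alone hold 8,392,897 and 6,295,745 parameters, while the reported trainable totals are much larger, so those summaries appear to count the base weights as trainable as well; `freeze_base` is exposed so that either configuration can be inspected.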

Metric	InceptionV3	InceptionResNetV2
Accuracy Train Set	99.06%	98.36%
Accuracy Test Set	94.57%	92.14%
Model Size	461 MB	465 MB