Overview

EvoRator uses machine learning to predict residue-level evolutionary rates from protein structure. The prediction is based on a set of features that capture the structural information characterizing the protein.

Input specification

Mandatory inputs: EvoRator requires an atomic coordinates file in PDB format or a valid PDB identifier, as well as a valid chain identifier, to predict evolutionary rates for a protein of interest.
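
To illustrate what the two mandatory inputs select, the sketch below reads ATOM records from PDB-format text and keeps the Cα atoms of one chain. It is not EvoRator's parser: the names `ca_atoms_for_chain` and `atom_line` are hypothetical, reading only the first model is an assumption, and the column positions follow the fixed-width layout of the PDB ATOM record.

```python
def ca_atoms_for_chain(pdb_text, chain_id):
    """Collect (residue number, residue name, (x, y, z)) for the C-alpha atoms of one chain."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # assumption: only the first model is used
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        # Fixed-width columns: atom name 13-16, chain 22, residue number 23-26.
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((int(line[22:26]), line[17:20].strip(), xyz))
    return atoms


def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper that formats a minimal ATOM record for the example."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up coordinates for two chains, used only to exercise the function.
example = "\n".join([
    atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614),
    atom_line(2, " CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    atom_line(3, " CA", "GLY", "B", 1, 10.000, 11.000, 12.000),
])
print(ca_atoms_for_chain(example, "A"))
```

A chain identifier that does not occur in the file yields an empty list, which is one simple way to detect an invalid chain before any features are computed.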

Optional input: The user can either choose to let EvoRator attempt to retrieve a ConSurf table of rates from ConSurf-DB, or provide their own ConSurf table.

Output

The predicted rates are mapped onto the three-dimensional structure of the query protein, which can be viewed using the NGL viewer.

If a ConSurf table of rates is provided, the user can choose either to predict evolutionary rates for gapped regions (PERfGR) or to identify functional residues (PERfIFR).

In PERfGR, the visual output of ConSurf is augmented with the EvoRator predictions.

In PERfIFR, the differences between the ConSurf rates and the EvoRator-predicted rates are mapped onto the three-dimensional structure, and the results of the regression are printed to the screen.

In all cases, the predictions and the features that were used to obtain them can be downloaded as a CSV file.

Feature extraction

For a given amino acid in a protein, EvoRator extracts various function-, sequence- and structure-based features, as well as graph-based features derived from a network representation of the protein structure. EvoRator then feeds these features to a machine learning regression algorithm.

Sites

An amino acid can reside at a specific site, e.g., an active or binding site, or in a disordered region. EvoRator parses the input PDB file and checks whether the amino acid occurs at a site which is:

  • Catalytic - To detect catalytic residues, EvoRator queries a database of catalytic residues obtained from the Mechanism and Catalytic Site Atlas.
  • Binding - To detect binding residues, EvoRator parses and extracts information from the REMARK 800 field.
  • Disordered - To detect disordered residues, EvoRator checks if the residue has disordered atoms using BioPython.
  • Glycosylated - To detect glycosylated residues, EvoRator uses Glycosylator.
  • Interfacing - To detect interfacing residues, EvoRator uses GetInterfaces.py with default parameters. This script detects residues within 4.5 Å of each other between two molecules, as well as residues within 10 Å in the neighborhood of the contact residues.
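
The 4.5 Å contact criterion for interfacing residues can be sketched as follows. This is a minimal illustration and not the GetInterfaces.py script: the function name `interface_residues` and the input layout (a list of residue identifier and atom coordinate pairs per molecule) are assumptions, and the additional 10 Å neighborhood step is not reproduced.

```python
import math


def interface_residues(atoms_a, atoms_b, cutoff=4.5):
    """Return the residues of each molecule that have an atom within `cutoff` Å of the other.

    atoms_a, atoms_b: lists of (residue_id, (x, y, z)), one entry per atom.
    """
    contacts_a, contacts_b = set(), set()
    for res_a, xyz_a in atoms_a:
        for res_b, xyz_b in atoms_b:
            # Euclidean distance between the two atoms.
            if math.dist(xyz_a, xyz_b) <= cutoff:
                contacts_a.add(res_a)
                contacts_b.add(res_b)
    return contacts_a, contacts_b


# Made-up coordinates: only residue 10 of molecule A and residue 5 of molecule B are close.
molecule_a = [(10, (0.0, 0.0, 0.0)), (11, (20.0, 0.0, 0.0))]
molecule_b = [(5, (3.0, 0.0, 0.0)), (6, (50.0, 0.0, 0.0))]
print(interface_residues(molecule_a, molecule_b))
```

The all-against-all loop is quadratic in the number of atoms; for large complexes a spatial index would be the usual replacement, but the criterion itself stays the same.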

Sequence-based features

The different amino acids have different physicochemical properties that can influence the evolutionary rate. EvoRator checks the amino-acid composition of a site:

  • Amino acid type - EvoRator uses DSSP (Dictionary of Secondary Structure in Proteins) to check the specific identity of the amino acid (i.e. A/C/D/E/F/G/H/I/K/L/M/N/P/Q/R/S/T/V/W/Y).
  • Amino acid grouped by different physicochemical properties - EvoRator uses BioPython to check whether a given amino acid is aliphatic (I/V/L), aromatic (F/Y/W/H), charged (K/R/D/E), tiny (G/A/C/S), diverse (T/M/Q/N/P), polar (A/G/T/S/N/Q/D/E/H/R/K/P), or hydrophobic (C/M/F/I/L/V/W/Y).
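
The physicochemical grouping listed above can be written as a simple lookup. The sketch below uses plain Python rather than BioPython, and the 0/1 encoding and the names `GROUPS` and `group_flags` are assumptions made for illustration; only the group memberships are taken from the list.

```python
# Group memberships as listed above (one-letter amino acid codes).
GROUPS = {
    "aliphatic": "IVL",
    "aromatic": "FYWH",
    "charged": "KRDE",
    "tiny": "GACS",
    "diverse": "TMQNP",
    "polar": "AGTSNQDEHRKP",
    "hydrophobic": "CMFILVWY",
}


def group_flags(amino_acid):
    """Return a 0/1 flag per physicochemical group for a one-letter amino acid code."""
    aa = amino_acid.upper()
    return {name: int(aa in members) for name, members in GROUPS.items()}


print(group_flags("H"))
```

Because the groups overlap, a residue can set more than one flag: histidine, for example, is both aromatic and polar in this scheme.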

Biophysical features

Different sites have different biophysical properties that can influence the evolutionary rate. EvoRator extracts the following biophysical properties:

  • Weighted contact number (WCN) - Contact number (CN) is the number of neighbouring residues located in a protein structure within a given distance (for example, 10 Å) from a focal residue. WCN is similar to the CN, but the neighbouring residues are weighted by their inverse square distance to the focal residue, and all residues in a structure are considered to be neighbouring residues.
  • Relative solvent accessibility (RSA) - Measures the proportion of the surface of an amino acid that is accessible to solvent (that is, water) in the folded protein structure, from 0 (completely inaccessible) to 1 (completely accessible). EvoRator calculates the Cα-based and side-chain-based WCN as well as the RSA for each amino acid using DSSP and scripts provided by Wilke et al., 2017.
  • Secondary structure (according to the DSSP nomenclature) - EvoRator uses DSSP to assign a secondary structure state for each amino acid.
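
The difference between CN and WCN can be made concrete with a short sketch. For residue i, WCN is the sum over all other residues j of 1 / r_ij^2, where r_ij is the distance between the two residues. The code below is an illustration of these definitions and not the scripts of Wilke et al.; the function names are hypothetical, and each residue is assumed to be represented by a single point (for example its Cα atom or its side-chain centroid).

```python
import math


def contact_numbers(coords, cutoff=10.0):
    """CN: number of other residues within `cutoff` Å of each residue."""
    return [
        sum(1 for j, b in enumerate(coords) if j != i and math.dist(a, b) <= cutoff)
        for i, a in enumerate(coords)
    ]


def weighted_contact_numbers(coords):
    """WCN: sum of inverse squared distances to all other residues."""
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if j != i:
                total += 1.0 / math.dist(a, b) ** 2
        wcn.append(total)
    return wcn


# Three made-up points on a line, 2 Å apart.
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(contact_numbers(points, cutoff=3.0))
print(weighted_contact_numbers(points))
```

The middle point has the highest value in both measures, but WCN needs no cutoff: distant residues still contribute, only with a small weight.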

Graph-based features

A protein molecule can be represented as a graph in which the nodes represent the amino acids and the edges represent the interactions between amino acids. EvoRator obtains a graph representation from the NAPS web server and extracts the following features:

  • Node degree - the number of edges connected to a node.
  • Betweenness centrality - measures the influence a node has over the flow of information in a graph (see LC Freeman, 1977). It is often used to find nodes that connect one part of a graph to another.
  • Average neighbor degree - the average degree of the neighborhood of each node.
  • Clustering coefficient - the clustering of a node is the fraction of possible triangles through that node that actually exist (see Watts D. & Strogatz S., 1998).
  • Degree centrality - the fraction of nodes a given node is connected to.
  • Eigenvector centrality - measures how well connected a node is to other well-connected nodes in the network (see Bonacich P., 1987).
  • k-clique membership - the number of complete subgraphs (i.e. subgraphs in which every pair of nodes is connected by an edge) of size k that contain a given node (see the clique problem).
  • Graphlet degree vector - a vector of 73 coordinates that serves as a node signature describing the topology of the node's neighborhood (see Milenkovic T. & Przulj N., 2008).
  • Median neighbor RSA - the median RSA of the node's neighborhood.
  • Median neighbor WCN - the median Cα-based and side-chain-based WCN of the node's neighborhood.
  • Total number of each amino acid in the neighborhood of each node.
  • Total number of each amino-acid physicochemical group in the neighborhood of each node.
  • Total number of each secondary structure state in the neighborhood of each node.
  • Total number of glycosylated residues in the neighborhood of each node.
  • Total number of disordered residues in the neighborhood of each node.
  • Total number of catalytic residues in the neighborhood of each node.
  • Total number of binding residues in the neighborhood of each node.
  • Total number of interfacing residues in the neighborhood of each node.
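
Several of the local graph features above follow directly from an adjacency structure. The sketch below computes node degree, degree centrality, average neighbor degree and the clustering coefficient in plain Python. It is an illustration only: the function name `local_graph_features`, the adjacency dictionary, and the normalization of degree centrality by the number of other nodes (n - 1) are assumptions, and the graph is a toy example rather than one obtained from NAPS.

```python
from itertools import combinations


def local_graph_features(adj):
    """Compute simple per-node features from an undirected graph.

    adj: dict mapping each node to the set of its neighbors.
    """
    n = len(adj)
    features = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        # A triangle through `node` exists for every connected pair of its neighbors.
        triangles = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        possible = k * (k - 1) / 2
        features[node] = {
            "degree": k,
            "degree_centrality": k / (n - 1) if n > 1 else 0.0,
            "avg_neighbor_degree": sum(len(adj[u]) for u in nbrs) / k if k else 0.0,
            "clustering": triangles / possible if possible else 0.0,
        }
    return features


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
for node, values in local_graph_features(graph).items():
    print(node, values)
```

In the toy graph, node C touches every other node, yet only one of the three possible pairs among its neighbors is connected, so its clustering coefficient is lower than that of A or B.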

The following feature is extracted exclusively for PERfGR and PERfIFR:

  • Average neighbor evolutionary rate - the average ConSurf rate of the node's neighborhood.
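
A neighborhood average of this kind can be sketched as follows. The function name `average_neighbor_rate` and the input layout are hypothetical, and the handling of neighbors without a ConSurf rate (skipping them, and returning None when no neighbor has a rate) is an assumption made for illustration rather than documented EvoRator behavior.

```python
from statistics import mean


def average_neighbor_rate(adj, rates):
    """Average the known rates over each node's neighbors.

    adj: dict mapping each node to the set of its neighbors.
    rates: dict mapping nodes to a rate, or to None where no rate is available.
    """
    averages = {}
    for node, nbrs in adj.items():
        # Assumption: neighbors without a rate are ignored.
        known = [rates[u] for u in nbrs if rates.get(u) is not None]
        averages[node] = mean(known) if known else None
    return averages


# Toy graph and made-up rates; residue C has no rate.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
rates = {"A": 1.0, "B": -0.5, "C": None, "D": 2.0}
print(average_neighbor_rate(graph, rates))
```

The same pattern, with `median` or a counter in place of `mean`, covers the other neighborhood features in the list above, such as the median neighbor RSA or the per-neighborhood counts.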