Computational and Systems Biology

Genome-scale annotation of protein binding sites via language model and geometric deep learning

Qianmu Yuan
Chong Tian
Yuedong Yang

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China

https://doi.org/10.7554/eLife.93695.2

Open access
Figures and data

The overview of GPSite. The protein sequence is input to the pre-trained language model ProtTrans and the folding model ESMFold to generate the sequence embedding and predicted structure, respectively. According to the structure, a protein radius graph is constructed where residues constitute the nodes and adjacent nodes are connected by edges. In addition to the pre-computed residue features of ProtTrans embedding and DSSP structural properties, a comprehensive, end-to-end geometric featurizer is employed to extract the geometric node features including distance, direction and angle, as well as geometric edge features between residues including distance, direction and orientation. Here, the R group denotes the centroid of the heavy sidechain atoms. The resulting geometric-aware attributed graph is input to a shared GNN to perform edge-enhanced message passing for capturing the common binding-relevant characteristics among different molecules. Finally, ten ligand-specific MLPs are adopted to learn the binding patterns of particular molecules in a multi-task manner. Examples of the applications of GPSite include binding site identification and protein-level Gene Ontology (GO) ⁴¹ function prediction.
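The radius-graph step in this caption can be sketched as below. This is only an illustration: the caption does not give the distance cutoff or say which atom represents each residue, so the `radius` value, the single-point-per-residue representation, and the helper name `radius_graph` are all assumptions made here.

```python
import math

def radius_graph(coords, radius):
    """Return directed edges (i, j) for residue pairs closer than `radius`.

    coords: one 3D point per residue (assumption: e.g. the C-alpha position).
    radius: distance cutoff in angstroms (assumption: not given in the caption).
    """
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j and math.dist(a, b) < radius:
                edges.append((i, j))
    return edges

# Toy coordinates for four residues; the last one is far from the rest.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
edges = radius_graph(coords, radius=5.0)
print(edges)
```

With this toy input only neighbouring residues are connected and the distant residue stays isolated. The quadratic loop is fine for a sketch; a spatial index would be the usual choice for long chains.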

The performance of GPSite and the state-of-the-art methods. (A) The ROC and precision-recall curves of GPSite on the ten binding site test sets. The numbers in the legends are the areas under the curves. (B-C) The AUPR values of the top-performing methods in each test set. Methods marked with * were evaluated using the ESMFold-predicted structures as input.
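For readers unfamiliar with AUPR, the sketch below computes one common estimator of the area under the precision-recall curve (step-wise average precision) from per-residue labels and scores. The captions do not say which estimator was used for the reported values, so this is an assumption, and the labels and scores are made-up toy values.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for binding residues, 0 for non-binding residues.
    scores: predicted binding scores, higher meaning more likely binding.
    Tied scores are not treated specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank residues from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

# Toy example, not data from the study.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))
```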

The performance of GPSite on low-quality predicted structures. (A) The performance of GPSite on structures of different qualities, and the comparisons with the best structure-based methods in the test sets of DNA, RNA and peptide. Experimental structure-based methods given ESMFold-predicted structures as input are marked with *. (B) Distributions of the TM-scores between native and predicted structures in the DNA, RNA and peptide datasets. (C) The correlations between the prediction quality of ESMFold and the performance of GPSite and GraphBind on the RNA-binding site test set when TM-score < 0.5. The scatter points denote the average TM-score and AUPR for each bin after sorting the proteins according to the TM-scores and evenly dividing them into 20 discrete bins. The lines are fit to the original data (without binning) using linear regression. (D) The glucocorticoid receptor (GR) in complex with DNA, a coactivator peptide, and Zn²⁺ ions (PDB: 7PRW). The ESMFold-predicted protein structure (gray) is superimposed onto the native structure (cyan) using US-align (TM-score = 0.72). The ligands are colored in orange. (E) Superposition of the native (cyan) and predicted (gray) DNA-binding domains of GR (TM-score = 0.96). (F-H) The Zn²⁺, DNA and peptide binding site predictions by GPSite for the predicted GR structure in cartoon or surface view. True positives, false positives and false negatives are colored in green, red and yellow, respectively. The ligands in orange were subsequently added based on the native complex structure to show the quality of the predictions by GPSite.
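The binning and line-fitting procedure described for panel (C) can be sketched as follows. The helper names `bin_means` and `linear_fit` are hypothetical, the use of one AUPR value per protein is an assumption, and the input values are synthetic toy numbers rather than results from the study.

```python
def bin_means(tm_scores, auprs, n_bins=20):
    """Sort proteins by TM-score, split them evenly into bins, average each bin."""
    pairs = sorted(zip(tm_scores, auprs))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            mean_tm = sum(t for t, _ in chunk) / len(chunk)
            mean_aupr = sum(a for _, a in chunk) / len(chunk)
            points.append((mean_tm, mean_aupr))
    return points

def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic toy data: 40 proteins with TM-score < 0.5.
tm = [0.10 + 0.01 * i for i in range(40)]
aupr = [0.2 + 0.5 * t for t in tm]

points = bin_means(tm, aupr)          # scatter points, one per bin
slope, intercept = linear_fit(tm, aupr)  # line fit to the unbinned data
print(len(points), round(slope, 3), round(intercept, 3))
```

Fitting the line to the unbinned data, as the caption states, keeps the regression independent of the choice of bin count; the bins only serve the visual summary.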

The effects of protein features and model designs. (A) Ablation studies on sequence and structure information in the DNA, RNA and peptide test sets. The average performance of the ten test sets is also shown. (B) Performance comparison between GPSite and the baseline model using MSA profile for proteins with different Neff values in the combined test set of the ten ligands. (C) Performance boosts in AUPR using GPSite compared to the single-task baseline. (D) Visualization of the distributions of residues encoded by raw feature vectors (left) or hidden embedding vectors from the pre-trained shared network in GPSite (right) for the unseen carbohydrate-binding site dataset using t-SNE. The binding and non-binding residues are colored in red and gray, respectively. (E) The performance when using the hidden embeddings from GPSite as input features to train an MLP for carbohydrate-binding site prediction, and its comparisons with other methods.

Analyses of Swiss-Prot based on the binding site annotations by GPSite. (A) The distributions of the binding scores assigned by GPSite for proteins with or without a given ligand-binding molecular function annotation in GO. (B) The ROC curves when using the GPSite binding scores to distinguish between binding and non-binding proteins of various ligands. (C) The percentage of proteins predicted by GPSite as binding to DNA and RNA that are annotated with a given biological process in Swiss-Prot. Only the specific biological process terms with depth ≥ 8 in the GO directed acyclic graph are considered, among which the top 15 terms with the highest percentages are displayed. (D) The percentage of surface pathogenic or benign natural variant sites within GPSite-predicted interfaces. The baseline is the probability of a random surface residue being annotated as an interface residue. (E) The pathogenic probabilities of variants located in non-binding sites or different types of binding sites predicted by GPSite.

Statistics of the ten binding site benchmark datasets used in this study

The performance of GPSite on the five-fold cross-validation and independent test sets

Performance comparison of GPSite with state-of-the-art methods on the ten binding site test sets

Performance comparison of GPSite with ScanNet and PeSTo on the protein-protein binding site test set from PeSTo ²⁴

The numbers of proteins with TM-score > 0.7 or ≤ 0.7 between native and ESMFold-predicted structures in the ten binding site datasets

The prediction quality of ESMFold measured by TM-score between native and predicted structures in the ten binding site datasets

The ablation studies on protein features and model designs in the ten binding site test sets

Performance comparison between GPSite and the baseline model using MSA profile for proteins with different Neff values in the combined test set of the ten ligands

Performance comparison on the ten binding site test sets under different training and evaluation settings

Cross-type performance by applying different ligand-specific MLPs in GPSite for the test sets of different ligands

Runtime comparison of the GPSite webserver with other top-performing servers. Five protein chains (i.e., 8HN4_B, 8USJ_A, 8C1U_A, 8K3V_A and 8EXO_A) comprising 100, 300, 500, 700, and 900 residues, respectively, were selected for testing, and the average runtime is reported for each method. Note that a significant portion of GPSite’s runtime (75 s, indicated in orange) is allocated to structure prediction using ESMFold.

The performance of GPSite when using native or predicted structures as input during the test phase.

Distributions of the TM-scores between native and predicted structures in the protein, ATP, HEM, Zn²⁺, Ca²⁺, Mg²⁺ and Mn²⁺ datasets.

The performance of GPSite on structures of different qualities, and the comparisons with the best structure-based methods in the test sets of protein, ATP, HEM, Zn²⁺, Ca²⁺, Mg²⁺ and Mn²⁺. Experimental structure-based methods given ESMFold-predicted structures as input are marked with *. Since there are only 5 proteins with TM-score ≤ 0.7 in the HEM and Mn²⁺ test sets (details shown in Appendix 2-table 5), the corresponding results may not be statistically significant.

The prediction results of GPSite and GraphBind for the ribosome biogenesis protein ERB1. (A) The state E2 nucleolar 60S ribosome biogenesis intermediate (PDB: 7R6Q). The ribosome biogenesis protein ERB1 (chain m) is highlighted in blue, while other protein chains are colored in gray. The RNA chains are shown in orange. (B) The RNA-binding sites on ERB1 (colored in red). (C) The ESMFold-predicted structure of ERB1 (TM-score = 0.24). The RNA-binding sites are also mapped onto this predicted structure (colored in red). (D-G) The prediction results of GPSite and GraphBind for the predicted and native ERB1 structures. The confidence of the predictions is represented with a gradient of color from blue for non-binding to red for binding.

The run time of ESMFold with respect to the sequence length in Swiss-Prot evaluated on an NVIDIA A100 GPU. The run time is presented as mean ± standard deviation per range of number of residues (range size equals 100).
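The per-range summary described here (mean ± standard deviation for each range of 100 residues) can be sketched as below. The range boundaries (1-100, 101-200, ...), the use of the population standard deviation, and the function name `runtime_by_length_range` are assumptions, and the timings are toy values rather than measurements.

```python
from collections import defaultdict
from statistics import mean, pstdev

def runtime_by_length_range(lengths, runtimes, range_size=100):
    """Group runtimes by sequence-length range and return mean and std per range.

    Ranges are 1-100, 101-200, ... (assumed boundaries).
    pstdev is the population standard deviation (assumed choice).
    """
    groups = defaultdict(list)
    for n, t in zip(lengths, runtimes):
        start = (n - 1) // range_size * range_size + 1
        groups[start].append(t)
    return {
        (start, start + range_size - 1): (mean(ts), pstdev(ts))
        for start, ts in sorted(groups.items())
    }

# Toy lengths (residues) and runtimes (seconds), not measured data.
stats = runtime_by_length_range([50, 100, 101, 250], [1.0, 3.0, 5.0, 9.0])
for (lo, hi), (m, s) in stats.items():
    print(f"{lo}-{hi}: {m:.1f} ± {s:.1f} s")
```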

The univariate and bivariate distributions of the protein length and the pTM estimated by ESMFold of the Swiss-Prot sequences. The probability density curves are fit using kernel density estimation. The darker region in the bivariate heatmap corresponds to a higher number of samples.

The percentage of proteins predicted by GPSite as binding to peptide, protein, ATP, HEM, Zn²⁺, Ca²⁺, Mg²⁺ and Mn²⁺ that are annotated with a given biological process in Swiss-Prot. Only the specific biological process terms with depth ≥ 8 in the GO directed acyclic graph are considered, among which the 15 terms with the highest percentages are displayed.

The percentage of pathogenic or benign natural variant sites within GPSite-predicted interfaces. The baseline is the probability of a random residue being annotated as an interface residue.
