Overview of GPSite. The protein sequence is input to the pre-trained language model ProtTrans and the folding model ESMFold to generate the sequence embedding and the predicted structure, respectively. From the predicted structure, a protein radius graph is constructed in which residues are the nodes and spatially adjacent nodes are connected by edges. In addition to the pre-computed residue features (the ProtTrans embedding and DSSP structural properties), a comprehensive, end-to-end geometric featurizer extracts geometric node features, including distance, direction and angle, as well as geometric edge features between residues, including distance, direction and orientation. Here, the R group denotes the centroid of the heavy sidechain atoms. The resulting geometry-aware attributed graph is input to a shared GNN, which performs edge-enhanced message passing to capture the binding-relevant characteristics common to different molecules. Finally, ten ligand-specific MLPs learn the binding patterns of particular molecules in a multi-task manner. Example applications of GPSite include binding site identification and protein-level Gene Ontology (GO) [41] function prediction.
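
The radius-graph construction described in this caption can be illustrated with a minimal sketch. The function name `radius_graph`, the use of one representative atom per residue, and the 15 Å cutoff are assumptions made for illustration; the caption does not state the cutoff or the atom used, and this is not the GPSite implementation.

```python
import math

def radius_graph(coords, radius=15.0):
    """Return directed edges (i, j) for residue pairs closer than `radius`.

    coords: list of (x, y, z) tuples, one representative atom per residue
            (for example C-alpha; an assumption for this sketch).
    radius: distance cutoff in angstroms. The 15.0 default is illustrative
            and is not a value taken from the caption.
    """
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            # Skip self-loops; connect residues within the cutoff.
            if i != j and math.dist(a, b) < radius:
                edges.append((i, j))
    return edges

# Toy chain of four residues spaced 10 angstroms apart along the x-axis.
toy_coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
toy_edges = radius_graph(toy_coords, radius=15.0)
```

In this toy chain only sequence neighbours fall within the cutoff, so each residue is connected to the residue before and after it. The quadratic loop is fine for a sketch; a spatial index would be preferable for long proteins.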

The performance of GPSite and the state-of-the-art methods. (A) The ROC and precision-recall curves of GPSite on the ten binding site test sets. (B-C) The AUPR values of the top-performing methods in each test set. Methods marked with * were evaluated using ESMFold-predicted structures as input.

The performance of GPSite on low-quality predicted structures. (A) The performance of GPSite on structures of different qualities, compared with the best structure-based methods in the DNA, RNA and peptide test sets. Experimental structure-based methods that were given ESMFold-predicted structures as input are marked with *. (B) Distributions of the TM-scores between native and predicted structures in the DNA, RNA and peptide datasets. (C) The correlations between the prediction quality of ESMFold and the performance of GPSite and GraphBind on the RNA-binding site test set when TM-score < 0.5. The scatter points denote the average TM-score and AUPR of each bin after sorting the proteins by TM-score and dividing them evenly into 20 bins. The lines are fit to the original data (without binning) using linear regression. (D) The glucocorticoid receptor (GR) in complex with DNA, a coactivator peptide, and Zn2+ ions (PDB: 7PRW). The ESMFold-predicted protein structure (gray) is superimposed onto the native structure (cyan) using US-align (TM-score = 0.72). The ligands are colored in orange. (E) Superposition of the native (cyan) and predicted (gray) DNA-binding domains of GR (TM-score = 0.96). (F-H) The Zn2+, DNA and peptide binding site predictions by GPSite for the predicted GR structure in cartoon or surface view. True positives, false positives and false negatives are colored in green, red and yellow, respectively. The ligands in orange were added afterwards from the native complex structure to show the quality of the predictions by GPSite.
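
The binning and line-fitting procedure described for panel (C) can be sketched as follows. The function names `binned_means` and `fit_line` are hypothetical, the regression is ordinary least squares, and the data are synthetic values made up to exercise the code; none of this reproduces the paper's results.

```python
from statistics import mean

def binned_means(tm_scores, auprs, n_bins=20):
    """Sort proteins by TM-score, split them evenly into n_bins bins, and
    return the (mean TM-score, mean AUPR) pair of each non-empty bin."""
    pairs = sorted(zip(tm_scores, auprs))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            points.append((mean(t for t, _ in chunk), mean(a for _, a in chunk)))
    return points

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic per-protein values (not data from the paper).
tm = [0.1 + 0.01 * i for i in range(40)]
aupr = [0.2 + 0.5 * t for t in tm]

points = binned_means(tm, aupr, n_bins=20)   # scatter points, one per bin
slope, intercept = fit_line(tm, aupr)        # line fit to the unbinned data
```

Fitting the line to the unbinned values, as the caption specifies, keeps every protein's contribution to the regression, while the binned averages only serve to make the scatter plot readable.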

The effects of protein features and model designs. (A) Ablation studies on sequence and structure information in the DNA, RNA and peptide test sets. The average performance over the ten test sets is also shown. (B) Performance comparison between GPSite and the baseline model using MSA profiles for proteins with different Neff values in the combined test set of the ten ligands. (C) Performance gains in AUPR of GPSite over the single-task baseline. (D) t-SNE visualization of the distributions of residues encoded by raw feature vectors (left) or by hidden embedding vectors from the pre-trained shared network in GPSite (right) for the unseen carbohydrate-binding site dataset. The binding and non-binding residues are colored in red and gray, respectively. (E) The performance when using the hidden embeddings from GPSite as input features to train an MLP for carbohydrate-binding site prediction, compared with other methods.

Analyses of Swiss-Prot based on the binding site annotations by GPSite. (A) The distributions of the binding scores assigned by GPSite for proteins with or without a given ligand-binding molecular function annotation in GO. (B) The ROC curves when using the GPSite binding scores to distinguish between binding and non-binding proteins of various ligands. (C) The percentage of proteins predicted by GPSite as binding to DNA and RNA that are annotated with a given biological process in Swiss-Prot. Only the specific biological process terms with depth ≥ 8 in the GO directed acyclic graph are considered, among which the 15 terms with the highest percentages are displayed. (D) The percentage of surface pathogenic or benign natural variant sites within GPSite-predicted interfaces. The baseline is the probability of a random surface residue being annotated as an interface residue. (E) The pathogenic probabilities of variants located in non-binding sites or in different types of binding sites predicted by GPSite.

Statistics of the ten binding site benchmark datasets used in this study

The performance of GPSite on the five-fold cross-validation and independent test sets

Performance comparison of GPSite with state-of-the-art methods on the ten binding site test sets

The performance of GPSite and its comparison with PeSTo when re-splitting the protein-protein binding site benchmark dataset

The numbers of proteins with TM-score > 0.7 or ≤ 0.7 between native and ESMFold-predicted structures in the ten binding site datasets

The prediction quality of ESMFold measured by TM-score between native and predicted structures in the ten binding site datasets

The ablation studies on protein features and model designs in the ten binding site test sets

The performance of GPSite when using native or predicted structures as input during the test phase.

Distributions of the TM-scores between native and predicted structures in the protein, ATP, HEM, Zn2+, Ca2+, Mg2+ and Mn2+ datasets.

The performance of GPSite on structures of different qualities, compared with the best structure-based methods in the protein, ATP, HEM, Zn2+, Ca2+, Mg2+ and Mn2+ test sets. Experimental structure-based methods that were given ESMFold-predicted structures as input are marked with *. Since there are only 5 proteins with TM-score ≤ 0.7 in the HEM and Mn2+ test sets (details shown in Appendix 2-table 5), the corresponding results may not be statistically significant.

The run time of ESMFold with respect to the sequence length in Swiss-Prot evaluated on an NVIDIA A100 GPU. The run time is presented as mean ± standard deviation per range of number of residues (range size equals 100).
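
The per-range summary described in this caption can be sketched as below. The function name `runtime_by_length`, the half-open bins [0, 100), [100, 200), and so on, and the use of the sample standard deviation are assumptions; the caption only states that the range size is 100. The timings are made-up toy values, not measurements.

```python
from collections import defaultdict
from statistics import mean, stdev

def runtime_by_length(lengths, runtimes, bin_size=100):
    """Group run times by sequence-length range and return
    {bin_start: (mean, standard deviation)} for each range."""
    groups = defaultdict(list)
    for n, t in zip(lengths, runtimes):
        groups[(n // bin_size) * bin_size].append(t)
    summary = {}
    for start in sorted(groups):
        times = groups[start]
        # stdev needs at least two samples; report 0.0 for a single sample.
        spread = stdev(times) if len(times) > 1 else 0.0
        summary[start] = (mean(times), spread)
    return summary

# Toy sequence lengths and run times in seconds (not measurements).
toy_lengths = [50, 80, 150, 160, 170]
toy_runtimes = [1.0, 3.0, 4.0, 5.0, 6.0]
toy_summary = runtime_by_length(toy_lengths, toy_runtimes)
```

Each entry of `toy_summary` corresponds to one point with an error bar in a plot of run time against sequence length.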

The univariate and bivariate distributions of the protein length and the pTM estimated by ESMFold of the Swiss-Prot sequences. The probability density curves are fit using kernel density estimation. The darker region in the bivariate heatmap corresponds to a higher number of samples.

The percentage of proteins predicted by GPSite as binding to peptide, protein, ATP, HEM, Zn2+, Ca2+, Mg2+ and Mn2+ that are annotated with a given biological process in Swiss-Prot. Only the specific biological process terms with depth ≥ 8 in the GO directed acyclic graph are considered, among which the 15 terms with the highest percentages are displayed.

The percentage of pathogenic or benign natural variant sites within GPSite-predicted interfaces. The baseline is the probability of a random residue being annotated as an interface residue.