Introduction

Proteins perform most biological functions by specifically interacting with other molecules such as nucleic acids, peptides, proteins, or small ligands of various kinds 13. Knowledge of these binding interfaces benefits protein function prediction, disease mechanism understanding, and novel drug design 46. Although experimental techniques such as X-ray crystallography and nuclear magnetic resonance spectroscopy can solve the native complex structures to detect binding sites, they are costly, time-consuming, and unsuitable for proteins with unknown binding partners. With the explosive growth of proteins in sequence databases 7,8, developing effective and efficient computational methods to recognize potential binding regions from sequences in a large-scale manner is imperative to fill the gap between genome and phenome.

A conventional way to predict binding interfaces is comparative modeling, which employs alignment algorithms to transfer known binding residues from similar templates to the query proteins 911. Nevertheless, this strategy will be seriously restricted when no high-quality template exists. Therefore, methods based on machine learning and deep learning have prevailed in recent years, which can be divided into sequence-based and structure-based approaches according to their used protein information. Sequence-based methods leverage machine learning classifiers (e.g., support vector machine) to learn local binding characteristics from sequence contexts in a sliding window 1215, or employ deep learning models like transformer 16 to capture global dependencies 17,18. Despite requiring only readily available protein sequences, these predictors are of limited accuracy due to the lack of tertiary structure information. On the other hand, experimental structure-based approaches are often more effective. Earlier methods encode protein structures as 2D images 19 or 3D voxels 20, which are processed via grid-based convolutional neural networks. Current approaches tend to handle protein structures as graphs composed of surface point clouds 21,22, atoms 23,24 or residues 25,26, and adopt geodesic convolution or graph neural networks (GNNs) to learn the binding-relevant spatial patterns. Unfortunately, the expressive capacities of most methods remain restricted, as the geometry of the structure is not yet fully explored. More importantly, both present sequence- and structure-based predictors are hampered for high-throughput practices at the genome scale for two reasons. Firstly, the dependency on evolutionary profiles from multiple sequence alignments (MSA) for most methods leads to high computational expense and occasionally subpar performance for shallow sequence alignments. Secondly, albeit powerful when native structures are available, structure-based methods will exhibit performance declines for unbound or predicted structures, probably owing to their sensitivity towards details and errors in the structures.

Our previous work 27 has validated the feasibility of exploiting predicted structures from AlphaFold2 28 for training to enhance the model’s robustness, yet the computationally intensive structure prediction pipeline still restrains its application to novel sequences absent from the AlphaFold Protein Structure Database 29. Since protein sequence can be regarded as a language in biology, unsupervised pre-training with language models has recently been applied to protein sequence representation learning and has displayed competitive or better results than manually-engineered evolutionary features in different downstream tasks 18,3033. Based on this, ESMFold 34 replaces evolutionary information from MSA with a large-scale pre-trained protein language model to directly infer atomic-level protein structure from single sequence. This results in an order-of-magnitude acceleration of prediction while maintaining accuracy nearly as alignment-based methods including AlphaFold2. Therefore, it is promising to develop a fast and accurate model tailored for large-scale sequence-based binding site prediction based on the recent advances in protein representation learning and structure prediction with language models.

To facilitate protein structure modeling, geometric deep learning techniques have recently flourished in protein docking 35, protein structure pre-training 36, protein design 37,38, and binding site prediction 2124, since they can better manipulate data with no natural grid-like topology than 3D convolutional neural networks. Among these, current geometric binding site predictors are mostly built on surface point clouds 21,22 or atom graphs 23,24. However, although the binding interface is mainly comprised of surface atoms, the global structure of the protein in general influences how the interface is formulated and how the binding partner is interacted with, which should be modeled. Besides, the calculation of protein surfaces and mapping of their properties are usually time-consuming, while methods based on full atom graphs are memory-consuming and thus difficult to process long sequences. Consequently, designing a geometry-aware message passing mechanism on residue graphs to synergistically integrate sequence and structure information is potentially more suitable for the binding site prediction task.

In this study, we propose GPSite (Geometry-aware Protein binding Site predictor), a fast, accurate and versatile network for concurrently predicting binding residues of ten types of biologically relevant molecules including DNA, RNA, peptide, protein, ATP, HEM, and metal ions (Zn2+, Ca2+, Mg2+, Mn2+) in a multi-task framework. GPSite was trained on informative sequence embeddings and predicted structures generated by pre-trained protein language models, and thus does not rely on MSA or native structures. To better capture the high-level bio-physicochemical characteristics in the predicted structures, a comprehensive geometric featurizer along with an edge-enhanced graph neural network is designed to extract the residual and relational geometric contexts in an end-to-end manner. Experiments demonstrate that GPSite substantially surpasses state-of-the-art sequence-based and structure-based approaches on various benchmark datasets, even under conditions where the predicted structures are of lower quality. GPSite runs fast enough to process genome-scale sequence databases such as the entire Swiss-Prot 7, allowing for rapid binding residue annotations for over 568,000 sequences. Further analyses indicate that such annotations can promote discoveries in associations of binding sites with molecular functions, biological processes, and genetic variants. Besides the standalone code, we also provide the GPSite webserver and annotation database at https://bio-web1.nscc-gz.cn/app/GPSite.

Results

The geometry-aware protein binding site predictor (GPSite)

GPSite is a geometry-aware versatile protein binding site predictor that can fast and accurately identify binding residues of DNA, RNA, peptide, protein, ATP, HEM, and metal ions (Zn2+, Ca2+, Mg2+, Mn2+) from protein sequences. As shown in Figure 1 and detailed in Methods, an input protein sequence is fed to the pre-trained language model ProtTrans 31 and the folding model ESMFold to generate informative sequence embedding and predicted structure, respectively. From the predicted structure, the coordinates of the N, Cα, C and O atoms as well as the centroid of the heavy sidechain atoms are gathered, and DSSP39 is employed to calculate the relative solvent accessibility and secondary structure for each residue. Then, a protein radius graph is built where residues constitute the nodes and adjacent nodes are connected by edges. In addition to the residue features by ProtTrans and DSSP, an end-to-end geometric featurizer is designed to construct a local coordinate system for each residue and extract geometric features capturing the arrangements of backbone and sidechain atoms in or between residues. Specifically, the geometric node features consist of intra-residue distances between two atoms (including the sidechain centroid), relative directions of other inner atoms or sidechain centroid to Cα, as well as the bond and torsion angles. Similarly, the geometric edge features comprise inter-residue distances between atoms from the two neighboring residues, relative directions of all atoms in the adjacent residue to Cα of the central residue, and spatial orientation between the two reference frames of the neighboring nodes.

The overview of GPSite. The protein sequence is input to the pre-trained language model ProtTrans and the folding model ESMFold to generate the sequence embedding and predicted structure, respectively. According to the structure, a protein radius graph is constructed where residues constitute the nodes and adjacent nodes are connected by edges. In addition to the pre-computed residue features of ProtTrans embedding and DSSP structural properties, a comprehensive, end-to-end geometric featurizer is employed to extract the geometric node features including distance, direction and angle, as well as geometric edge features between residues including distance, direction and orientation. Here, the R group denotes the centroid of the heavy sidechain atoms. The resulting geometric-aware attributed graph is input to a shared GNN to perform edge-enhanced message passing for capturing the common binding-relevant characteristics among different molecules. Finally, ten ligand-specific MLPs are adopted to learn the binding patterns of particular molecules in a multi-task manner. Examples of the applications of GPSite include binding site identification and protein-level Gene Ontology (GO) 41 function prediction.

To learn the residue representations by considering multi-scale interactions in node, edge, and global context levels, we design a GNN with message passing, edge update and global node update acting on this geometric-aware attributed graph. The message passing layer adopts the multi-head attention in transformer enhanced by edge features to aggregate information from the local neighborhood and update the central node. Then the features of an edge are updated using its connecting nodes. Finally, a gated attention is applied to update the node states using the global context information. Benefiting from the above pipeline, GPSite is invariant to rotation and translation. Besides, GPSite leverages the multi-task learning strategy, where the GNN is shared among different ligands to capture the common binding-relevant characteristics, followed by ten ligand-specific multilayer perceptrons (MLPs) to mine the binding patterns of particular molecules. This framework also reduces the time for inference of multiple ligands by the concurrent prediction fashion. In conclusion, GPSite is distinguished from the previous approaches in four key aspects. First, profiting from the effectiveness and low computational cost of ProtTrans and ESMFold, GPSite is liberated from the reliance on MSA and native structures, thus enabling genome-wide binding site prediction. Second, unlike methods that only explore the Cα models of proteins 25,40, GPSite exploits a comprehensive geometric featurizer to fully refine knowledge in the backbone and sidechain atoms. Third, the employed message propagation on residue graphs is global structure-aware and time-efficient compared to the methods based on surface point clouds 21,22, and memory-efficient unlike methods based on full atom graphs 23,24. Residue-based message passing is also less sensitive towards errors in the predicted structures. Last but not least, instead of predicting binding sites for a single molecule type or learning binding patterns separately for different molecules, GPSite applies multi-task learning to better model the latent relationships among different binding partners.

GPSite outperforms state-of-the-art methods

We collected ten binding site benchmark datasets for the ten ligands from Protein Data Bank (PDB) 42 as detailed in Methods, which were combined to train and evaluate GPSite using the five-fold cross-validation and independent test sets. As shown in Appendix 2-table 2, GPSite obtains average area under the receiver operating characteristic curve (AUC) values over the ten ligand types of 0.918 and 0.921; as well as average area under the precision-recall curve (AUPR) values of 0.603 and 0.594 on the cross-validation and independent tests, respectively. The consistent performance on the validation and test sets indicates the robustness of our model. In Figure 2A, the receiver operating characteristic (ROC) curves and the precision-recall curves on the ten test sets are plotted to overview the performance of GPSite for different ligands.

The performance of GPSite and the state-of-the-art methods. (A) The ROC and precision-recall curves of GPSite on the ten binding site test sets. The numbers in the legends are areas under the curves. (B-C) The AUPR values of the top-performing methods in each test set. The methods marked with * denote evaluations using the ESMFold-predicted structures as input.

To demonstrate the effectiveness of our method, we compared GPSite with 8 sequence-based predictors including DRNApred 12, NCBRPred 43, SVMnuc 44, PepBind 14, PepNN-Seq 45, PepBCL 17, TargetS 15, and LMetalSite 18, as well as 15 structure-based predictors including NucBind 44, COACH-D46, GraphBind 25, GeoBind 22, aaRNA 47, PepNN-Struct 45, DeepPPISP 48, SPPIDER 49, MaSIF-site 21, GraphPPIS 26, ScanNet 23, PeSTo 24, DELIA 19, MIB 50, and IonCom51 (see Appendix 1-note 1 for more details). Figure 2B and 2C show the results of the top-performing predictors in the test sets, where GPSite surpasses all other sequence-based and even experimental structure-based methods in AUPR for more than 16.5%, 13.2%, 55.4%, 1.7%, 27.7%, 10.8%, 7.0%, 14.8%, 17.1%, and 13.4% in the DNA, RNA, peptide, protein, ATP, HEM, Zn2+, Ca2+, Mg2+, and Mn2+ binding site test sets, respectively. The results of more contending methods and criteria (e.g., AUC and F1-score) are tabulated in Appendix 2-table 3. Given the substantial overlap between our protein-binding site test set and the training set of PeSTo, we conducted separate training and comparison using the datasets of PeSTo, where GPSite still demonstrates a remarkable improvement over PeSTo (Appendix 1-note 2). Moreover, GPSite is computationally efficient, achieving comparable or faster prediction speed compared to other top-performing methods (Appendix 3-figure 1).

Though trained on predicted protein structures, GPSite can also adopt native structures as input for prediction whenever applicable. By doing this, extra performance boosts can be gained with average AUPR increase of 7.8% (Appendix 3-figure 2). However, experimental structures are not always available in real-world scenarios, such as genome-scale sequence databases. To this end, for the best structure-based method (measured by AUPR) in each test set, we also investigated the impact on performance when using ESMFold-predicted structures as input. As expected, the performance of these methods mostly decreases substantially utilizing predicted structures for testing, because they were trained with high-quality native structures. For example, the AUPR of GraphBind for predicting RNA-binding sites decreases from 0.506 to 0.433, compared to the AUPR of 0.573 by GPSite. Similarly, the AUPR of ScanNet drops from 0.476 to 0.399, compared to the AUPR of 0.484 by GPSite for predicting protein-binding sites. Therefore, in the practical situations where experimental structures are unavailable, the superiority of our method will be further reflected.

GPSite is robust for low-quality predicted structures

Since GPSite is built on ESMFold, it is necessary to examine the quality of the predicted structures and its impact on the model performance. Figure 3B and Appendix 3-figure 3 show the distributions of TM-scores between native and predicted structures calculated by US-align 52 in the ten benchmark datasets, where most proteins are accurately predicted with TM-score > 0.7 (see also Appendix 2-table 5). Overall, ESMFold achieves median TM-scores of 0.89, 0.76, 0.93, 0.93, 0.94, 0.94, 0.93, 0.94, 0.95, and 0.96 for the DNA, RNA, peptide, protein, ATP, HEM, Zn2+, Ca2+, Mg2+, and Mn2+ datasets, respectively (Appendix 2-table 6). We next explored whether GPSite can maintain its performance on low-quality predicted structures. Figure 3A presents the performance of GPSite on ESMFold-predicted structures with TM-score > 0.7 or ≤ 0.7, and the comparisons with the leading structure-based methods in the test sets of DNA, RNA and peptide. Reasonably, compared to the well-predicted proteins, the performance of GPSite is inferior on the subsets of proteins with TM-score ≤ 0.7. Nevertheless, GPSite continues to outshine the most advanced structure-based methods input with ESMFold-predicted structures or even experimental structures. Similar trends are also observed for the rest of the ligands in Appendix 3-figure 4. Given the infrequency of low-quality predicted structures except for the RNA test set, we took a closer inspection of the 104 proteins with predicted structures of TM-score < 0.5 in the RNA test set. In this subset, GraphBind achieves AUPR values of 0.455 and 0.376 using native and predicted structures, respectively, compared to the AUPR of 0.516 by GPSite. As shown in Figure 3C with lines fit to the per-protein TM-score and AUPR using linear regression, GPSite consistently outperforms GraphBind using predicted structures regardless of the prediction quality of ESMFold, and is only surpassed by GraphBind input with native structures on proteins of extremely low quality (TM-score < 0.3). An example is presented in Appendix 1-note 3 for illustration. To sum up, ESMFold could produce high-quality structures in most cases, and even for the low-quality predicted structures, GPSite is robust enough to generate reliable predictions better than current state-of-the-art structure-based methods.

The performance of GPSite on low-quality predicted structures. (A) The performance of GPSite on structures of different qualities, and the comparisons with the best structure-based methods in the test sets of DNA, RNA and peptide. The experimental structure-based methods input with ESMFold-predicted structures are marked with *. (B) Distributions of the TM-scores between native and predicted structures in the DNA, RNA and peptide datasets. (C) The correlations between the prediction quality of ESMFold and the performance of GPSite and GraphBind on the RNA-binding site test set when TM-score < 0.5. The scatters denote the average TM-score and AUPR for each bin after sorting the proteins according to the TM-scores and evenly dividing them into 20 discrete bins. The lines are fit to the original data (without binning) using linear regression. (D) The glucocorticoid receptor (GR) in complex with DNA, a coactivator peptide, and Zn2+ ions (PDB: 7PRW). The ESMFold-predicted protein structure (gray) is superimposed to the native structure (cyan) using US-align (TM-score = 0.72). The ligands are colored in orange. (E) Superposition of the native (cyan) and predicted (gray) DNA-binding domains of GR (TM-score = 0.96). (F-H) The Zn2+, DNA and peptide binding site predictions by GPSite for the predicted GR structure in cartoon or surface view. True positives, false positives and false negatives are colored in green, red and yellow, respectively. The ligands in orange were subsequently added based on the native complex structure to show the quality of the predictions by GPSite.

We finally illustrate a potential reason for the robustness of GPSite by an example from the test set where GPSite is able to discern among various interfaces even though the structure is not perfectly predicted. Figure 3D shows the structure of the human glucocorticoid receptor (GR), a transcription factor that binds DNA and assembles a coactivator peptide to regulate gene transcription (PDB: 7PRW, chain A). The DNA-binding domain of GR also consists of two C4-type zinc fingers to bind Zn2+ ions. Although the structure of this protein is not perfectly predicted (TM-score = 0.72), the local structures of the binding domains of peptide and DNA are actually predicted accurately as viewed by the superpositions of the native and predicted structures in Figure 3D and 3E. Therefore, GPSite can correctly predict all Zn2+ binding sites and precisely identify the binding sites of DNA and peptide with AUPR values of 0.949 and 0.924, respectively (Figure 3F, G and H).

The effects of protein features and model designs

To reveal the roles of distinct protein features and model designs in GPSite, we conducted comprehensive ablation studies. As shown in Figure 4A and Appendix 2-table 7, when removing the ProtTrans embeddings from GPSite, the model yields inadequate performance (average AUPR of 0.516 among the ten test sets) due to the complete neglect of the sequence information in proteins. The introduction of one-hot sequence encodings or MSA profile (elaborated in Appendix 1-note 4) partially restores the performance to average AUPR of 0.545 or 0.568, respectively. Nevertheless, the utilization of language model representations in GPSite attains the highest average AUPR of 0.594. To further understand the advantages of ProtTrans over the evolutionary features from MSA, we compared their performance against the number of effective homologous sequences (Neff) of the proteins from the combined test set of the ten ligands. Neff is an HHblits 53 parameter quantifying the effective size of homologous sequence cluster. As evidenced in Figure 4B and Appendix 2-table 8, ProtTrans consistently obtains competitive or superior performance compared to the MSA profile. Notably, for the target proteins with few homologous sequences (Neff < 2), ProtTrans surpasses MSA profile significantly with an improvement of 3.9% on AUC (P-value = 4.3×10−8). On the other hand, removing the structure information (implemented by a transformer model solely input with ProtTrans sequence features) obtains the worst performance with an average AUPR of 0.484 (Figure 4A). This observation indicates that the knowledge of protein structure may be more critical than sequence information in binding site prediction tasks. Additionally, the removal of the geometric featurizer within GPSite also causes a substantial decline in performance (average AUPR from 0.594 to 0.533), attesting to the significance of GPSite’s perception of protein geometry. We also assessed the benefit of training with predicted instead of native structures, which brings an average AUPR increase of 4.2% as detailed in Appendix 1-note 5.

The effects of protein features and model designs. (A) Ablation studies on sequence and structure information in the DNA, RNA and peptide test sets. The average performance of the ten test sets is also shown. (B) Performance comparison between GPSite and the baseline model using MSA profile for proteins with different Neff values in the combined test set of the ten ligands. (C) Performance boosts in AUPR using GPSite compared to the single-task baseline. (D) Visualization of the distributions of residues encoded by raw feature vectors (left) or hidden embedding vectors from the pre-trained shared network in GPSite (right) for the unseen carbohydrate-binding site dataset using t-SNE. The binding and non-binding residues are colored in red and gray, respectively. (E) The performance when using the hidden embeddings from GPSite as input features to train an MLP for carbohydrate-binding site prediction, and its comparisons with other methods.

We next elucidate the benefits of the multi-task framework in GPSite by comparing it with a baseline approach in which a model is trained and evaluated for each dataset separately. As depicted in Figure 4C and Appendix 2-table 7, GPSite consistently outperforms the single-task baseline, especially for the ligands with limited training data. For instance, directly fitting a model on the HEM training set with only 176 proteins reaches an AUPR of 0.716 for the test set. Alternatively, combining datasets of diverse ligands in a multi-task framework brings an AUPR increase of 0.086 for HEM. This suggests that multi-task learning can compensate for the scarcity of training data by leveraging a shared network trained on a larger dataset encompassing different types of ligand-binding proteins that potentially share similar binding patterns. We also conducted cross-type evaluations to investigate the specificity of the ligand-specific MLPs and the inherent similarities among different ligands in Appendix 1-note 6.

Residues that are conserved during evolution, exposed to solvent, or inside a pocket-shaped domain are inclined to participate in ligand binding. During the preceding multi-task training process, the shared network in GPSite should have learned to capture such common binding mechanisms. Here we show how GPSite can be easily extended to the binding site prediction for other unseen ligands by adopting the pre-trained shared network as a feature extractor. We considered a carbohydrate-binding site dataset from 54 which contains 100 proteins for training and 49 for testing. We first visualized the distributions of residues in this dataset using t-SNE 55, where the residues are encoded by raw feature vectors encompassing ProtTrans embeddings and DSSP structural properties, or latent embedding vectors from the shared network of GPSite trained on the ten molecule types previously. As shown in Figure 4D, the binding and non-binding residues overlap and are indistinguishable when encoded by raw feature vectors. On the contrary, the latent representations from GPSite effectively improve the discriminability between the binding and non-binding residues. Employing these informative hidden embeddings as input features to train a simple MLP exhibits remarkable performance with an AUC of 0.881 (Figure 4E), higher than that of training a single-task version of GPSite from scratch (AUC of 0.853) or other state-of-the-art methods such as MTDsite 54 and SPRINT-CBH 56. These results highlight the effectiveness of multi-task learning and the scalability of GPSite to unseen ligands.

Large-scale binding site annotation for Swiss-Prot

In light of the efficiency and effectiveness of GPSite, we sought to annotate and analyze the potential binding interfaces of various kinds for the entire Swiss-Prot database. For this task, we applied ESMFold to predict the structures of 568,326 sequences in Swiss-Prot, which required approximately 8.5 days as described in Methods. Typically, it takes 16 seconds to predict the structure of a protein with 500 residues, or 100 seconds for 1000 residues (Appendix 3-figure 6). The feature extraction and GPSite inference procedures overall cost about 5 hours. All ESMFold-predicted structures accompanied by the binding site annotations for Swiss-Prot are freely available in our user-friendly GPSiteDB database (https://bio-web1.nscc-gz.cn/database/GPSiteDB). Appendix 3-figure 7 further illustrates the distributions of the protein length and the predicted TM-score (pTM) estimated by ESMFold for the Swiss-Prot sequences, where most proteins are no longer than 500 residues and predicted with high confidence (pTM > 0.8). For the subsequent downstream analyses, we only considered the predicted structures with pTM > 0.7, resulting in a total of 370,140 structures.

Exploiting the residue-level binding site annotations, we could readily extend GPSite to discriminate between binding and non-binding proteins of various ligands. Specifically, a protein-level binding score indicating the overall binding propensity to a specific ligand can be generated by averaging the top k predicted scores among all residues. Empirically, we set k to 5 for metal ions and 10 for other ligands, considering the distributions of the numbers of binding residues per sequence observed in the training set. As depicted in Figure 5A, the GPSite binding scores for proteins with the corresponding ligand-binding molecular functions are significantly higher than those without such annotations in Swiss-Prot (P-value < 10−165 for all ligands according to Mann–Whitney U test 57). The accuracy of the GPSite protein-level binding scores is further validated by the ROC curves in Figure 5B, where GPSite achieves satisfactory AUC values for all ligands except protein (AUC of 0.608). This may be ascribed to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete (Appendix 1-note 7). Moreover, we attempted to gather the top 20,000 proteins with the highest GPSite binding scores for each ligand to expand the binding function annotations in Swiss-Prot. We could immediately notice that the GPSite-predicted binding proteins are involved in biological processes consistent with existing knowledge as shown in Figure 5C and Appendix 3-figure 8. For instance, the DNA-binding proteins predicted by GPSite are prone to participate in DNA repair, DNA-templated transcription, DNA recombination and replication, while the RNA-binding proteins are inclined to perform translation and RNA modification.

Analyses of Swiss-Prot based on the binding site annotations by GPSite. (A) The distributions of the binding scores assigned by GPSite for proteins with or without certain ligand-binding molecular function in GO. (B) The ROC curves when using the GPSite binding scores to distinguish between binding and non-binding proteins of various ligands. (C) The percentage of proteins predicted as binding to DNA and RNA by GPSite to be annotated with certain biological process in Swiss-Prot. Only the specific biological process terms with depth ≥ 8 in the GO directed acyclic graph are considered, among which the top 15 terms with the highest percentages are displayed. (D) The percentage of surface pathogenic or benign natural variant sites within GPSite-predicted interfaces. The baseline is the probability of a random surface residue being annotated as an interface residue. (E) The pathogenic probabilities of variants located in non-binding sites or different types of binding sites predicted by GPSite.

Capitalizing on the predicted structures and annotations within GPSiteDB, cell biologists are empowered to easily locate the genetic variants and assess their potential disruptions in protein-ligand interactions and pathogenicity. This facilitates the establishment of rational working hypotheses to propel therapeutic development in a more informed manner. Here we conduct analyses on the associations between binding sites and genetic variants for the human proteome as an example. Notably, 20.67% of the pathogenic variant sites on the surfaces of the predicted structures fall in the GPSite-predicted interfaces, higher than the benign variants (11.97%) or the random baseline (16.35%) as described in Figure 5D. Consistent trend is observed in Appendix 3-figure 9 when considering variants in the entire structure (rather than solely on surface). Besides, we investigated the pathogenic probabilities of variants in different locations in Figure 5E. As expected, the pathogenicity of variants located in the predicted binding sites is higher than those in the non-binding sites. Interestingly, our analysis uncovered that the pathogenic probabilities of variants in the predicted binding sites of ATP and metal ions surpass those of other ligands. One possible reason is that the binding interfaces of ATP and metal ions typically comprise small pockets formed by a limited number of residues. Consequently, variants affecting even a single residue within such pockets may exert a substantial influence on the overall pocket functionality and lead to diseases.

Discussion

In this study, we present GPSite to accurately and efficiently predict protein binding sites of diverse biologically relevant molecules including DNA, RNA, peptide, protein, ATP, HEM, and metal ions. By leveraging the informative sequence embeddings and predicted structures from pre-trained language models, GPSite is liberated from the reliance on MSA or experimental protein structures. To encapsulate the high-level bio-physicochemical characteristics in the predicted structures, GPSite incorporates a comprehensive geometric featurizer and an edge-enhanced graph neural network to refine both residual and relational geometric contexts in an end-to-end manner. GPSite also stands out from the previous approaches by applying a multi-task framework to effectively model the intrinsic relationships among different binding partners. Results across various benchmark datasets indicate that GPSite substantially outperforms state-of-the-art sequence-based and structure-based methods, even under conditions where the predicted structures are of lower quality. Finally, we demonstrate GPSite’s scalability to genome-scale sequence databases by annotating binding sites for over 568,000 sequences in Swiss-Prot within 9 days. Further analyses suggest that these annotations are not only accordant with existing knowledge but also capable of facilitating discoveries of unexplored biology in protein function and genetic variant.

Despite the noteworthy advancements achieved by GPSite, there remains scope for further improvements. GPSite may be improved by pre-training 36 on the abundant predicted structures in ESM Metagenomic Atlas 34, and then fine-tuning on binding site datasets. Besides, the hidden embeddings from ESMFold may also serve as informative protein representations. Additional opportunities for upgrade exist within the network architecture. For example, a variational Expectation-Maximization framework 58 can be adopted to handle the hierarchical atom-to-residue graph structure inherent in proteins. Meta-learning 59 could also be explored in this multi-task scenario, which allows fast adaptation to unseen tasks with limited labels.

As the gap between unannotated and annotated sequences is expanding at an unparalleled rate, GPSite serves as a reliable, efficient, versatile and user-friendly tool for unraveling the extensive and dynamic landscape of protein-ligand interactions. By harnessing the capabilities of GPSite, researchers can readily uncover fresh biological functions of proteins, gain valuable insights into the underlying pathogenic mechanisms of gene mutations, or design novel drugs targeting specific binding pockets.

Methods

Benchmark datasets

The benchmark datasets for evaluating binding site predictions of DNA, RNA, peptide, ATP, and HEM are constructed from BioLiP 60, a database of biologically relevant protein-ligand complexes primarily from PDB. For each ligand, we collected the corresponding binding proteins with resolutions of ≤ 3.0 Å and lengths of 50–1500 from BioLiP released on 29 March 2023. A binding residue is defined if the smallest atomic distance between the target residue and the ligand is < 0.5 Å plus the sum of the Van der Waal’s radius of the two nearest atoms. We combined the binding site annotations of identical sequences and then removed redundant proteins sharing sequence identity > 25% over 30% alignment coverage using CD-HIT 61. Finally, each benchmark dataset was split into a training set with proteins released before 1 January 2021, as well as an independent test set with proteins released from 1 January 2021 to 29 March 2023. Besides, the benchmark dataset of protein-protein binding sites is directly from 26, which contains non-redundant transient heterodimeric protein complexes dated up to May 2021. Surface regions that become solvent inaccessible on complex formation are defined as the ground truth protein-binding sites. The benchmark datasets of metal ion (Zn2+, Ca2+, Mg2+ and Mn2+) binding sites are directly from 18, which contain non-redundant proteins dated up to December 2021 from BioLiP. Combining all these ten datasets results in a total of 8441 training sequences and 1838 test sequences. Details of the statistics of these benchmark datasets are given in Appendix 2-table 1.

Structure prediction and preprocessing

We harnessed ESMFold, a fast and accurate end-to-end model to predict protein structures from sequences. ESMFold is based on a language model with 3B parameters pre-trained over sequences in UniRef50 62, and a folding head similar to AlphaFold2 trained on experimental structures from PDB and predicted structures from AlphaFold2. The structure prediction for our whole benchmark datasets (∼10,200 sequences, ∼300 amino acids on average) cost only ∼28 h on an NVIDIA A100 GPU. For each residue in the predicted structures, we gathered the coordinates of the N, Cα, C and O atoms as well as the centroid of the heavy sidechain atoms (denoted as R). In this way, the structure of a protein can be represented by a coordinate matrix X ∈ ℝn×5×3, with n denoting the number of residues.

Protein features

GPSite leverages the pre-trained protein language model ProtTrans (version: ProtT5-XL-U50) to generate sequence features efficiently, thus bypassing slow sequence alignments. ProtTrans is a transformer-based auto-encoder named T5 63 pre-trained with the BERT’s denoising objective 64, essentially learning to predict the masked amino acids. Concretely, ProtTrans contains 3B parameters, which was first trained on BFD 65 and then fine-tuned on UniRef50. We extracted the output from the last ProtTrans encoder layer as sequence representations, containing a 1024-dimensional vector for each residue. The inference cost of ProtTrans is extremely low, and the embedding extraction process for our whole benchmark datasets can be done within 5 min on an NVIDIA A100 GPU. The feature values in the sequence embeddings are further normalized to scores between 0 and 1 as follows:

where v is the original feature value, and vmin and vmax are the minimum and maximum values of this feature type observed in the training set, respectively. In addition, we also calculated two structural properties from the predicted structures using DSSP: (i) Relative solvent accessibility (RSA), which is the normalized solvent accessible surface area (ASA) by the maximum ASA of the corresponding amino acid type. (ii) One-hot secondary structure profile representing one of the eight secondary structure states.

The architecture of GPSite

The overall architecture of GPSite is shown in Figure 1. First, the protein sequence is input to the pre- trained language model ProtTrans and the folding model ESMFold to generate the sequence embedding and predicted structure, respectively. Second, a protein radius graph is constructed from the structure, where residues constitute the nodes and adjacent nodes (distance between Cα <15Å) are connected by edges. In addition to the pre-computed residue features (ProtTrans embedding and structural properties by DSSP), a comprehensive, end-to-end geometric featurizer is employed to extract the geometric node features including distance, direction and angle, as well as geometric edge features between residues including distance, direction and orientation. Third, the resulting geometric-aware attributed graph is input to a shared GNN with message passing, edge update and global node update, to capture the common binding-relevant characteristics among different molecules. Finally, ten ligand-specific MLPs are adopted to learn the binding patterns of particular molecules in a multi-task manner.

The geometric featurizer

GPSite represents the protein as a radius graph derived from the Cα coordinates of the residues, where the radius is equal to 15Å. An end-to-end featurizer is utilized to act directly on the atomic coordinate matrix X for geometric feature extraction similar to 38, except that we additionally encode the sidechain conformations of the residues. In this representation, a local coordinate system is first defined at each residue based on the relative position of the Cα atom to other backbone atoms. Then, several geometric node and edge features are derived to capture the arrangements of backbone and sidechain atoms in or between residues.

(i) Local coordinate system. We define a local coordinate system Qi [bi, ni, bi × ni] for residue i, where biis the negative bisector of the angle formed by the N, Cα, and C atoms, and ni is a unit vector normal to this plane. Formally, we have:

Based on the local coordinate systems, we could construct geometric features that are invariant to rotation and translation for single or pair of residues.

(ii) Geometric node features. GPSite constructs distance, direction and angle features for each residue. Given the coordinates of two atoms A and B, the distance feature is computed via RBF(‖A - B‖), where RBF(·) is a radial basis function. For the intra-residue distance features of node i, A, B and AB. Here, R denotes the centroid of the heavy sidechain atoms. The direction features encoding relative directions of other inner atoms to Cα in residue i are computed via , where A ∈ {Ni, Ci, Oi, Ri}. As shown in Figure 1, we also incorporate the sine and cosine values of the bond angles (i, βi, γi) and torsion angles (ϕi, ψi, ωi) to consider the backbone geometry.

(ii) Geometric edge features. Similarly, we construct geometric features between neighboring residues including distance, direction and orientation. The inter-residue distance features RBF(‖A - B‖) between nodes i and j are computed with atoms and . The edge direction features consider relative directions of all atoms in residue j to , namely . To reflect the relative spatial rotation between the two reference frames of residue i and j, the orientation feature is employed, where q(·) is the quaternion encoding function representing 3D rotation matrices as four-element vectors 66.

The edge-enhanced graph neural network

The above-mentioned attributed graph with features from ProtTrans, DSSP and the geometric featurizer is input to several GNN layers with message passing, edge update and global node update modules, to learn the residue representations by considering multi-scale interactions in node, edge, and global context levels.

(i) Message passing with graph transformer. Since transformer is well-acknowledged as the most powerful network in modeling sequence and graph data 18,40,67, we adopt its multi-head attention mechanism while taking the edge features into account for message passing in graphs. Formally, we denote the hidden feature vectors of node i and edge ji in layer l as and , respectively. Before the first GNN layer, we apply an MLP to project the initial node and edge features to the d-dimensional space. To update node i, the message passing in layer l is performed as follows:

where the attention coefficient from node j to i is calculated by:

The learnable weight matrices are used to project the node feature vectors into the corresponding query, key and value representations. is used to transform the edge features which will be subsequently added to the key and value representations. N(i) denotes the neighbors of node i. In practice, we use multi-head attention to linearly project the queries, keys and values multiple times, perform the attention function in parallel and finally concatenate them together.

(ii) Edge update. To improve the model’s capability, we update the features of an edge using its connecting nodes:

where ∥ denotes the vector concatenation and EdgeMLP is an MLP for edge update.

(iii) Node update with global context attention. While the local node and edge interactions play crucial roles in learning residue representations, the global information is also valuable for improving accuracy. However, global self-attention across the whole protein is computationally intensive. Alternatively, here we learn a global context vector for each protein and use it to apply gated attention for the node representations similar to 38:

where n is the number of residues in a protein, GateMLP is an MLP for gated attention, σ(·) is the sigmoid function, and ⊙ denotes the element-wise product operation.

Multi-task learning

To better capture the intrinsic similarities of binding patterns among different ligands and enable efficient predictions in a concurrent fashion, GPSite employs a multi-task framework, where the shared edge-enhanced GNN is used to model the common binding-relevant characteristics, followed by ten ligand-specific MLPs to mine the binding patterns of particular molecules. In the training steps, different types of ligand-binding proteins are input to the same network, and predictions for the ten types of ligands are yielded. Nonetheless, only predictions with the corresponding known ligand-binding sites are used to calculate loss and perform backpropagation, while the predictions of other ligands without ground truth data are masked. That is, each protein is used to train the shared GNN and the corresponding ligand-specific MLP(s) of its known binding partner(s) without affecting other irrelevant MLP(s).

Implementation and evaluation

We performed five-fold cross-validation on the training data, where the ten training sets were mixed and split into five folds randomly, and then each time a model was trained on four folds and evaluated on the remaining fold. This process was repeated for five times and the average validation performance was used to optimize the hyperparameters of the network. In the test phase, all five trained models from cross-validation were used to make predictions, which were averaged as the final prediction of GPSite. Specifically, we adopted Pytorch 1.13.1 68 to implement GPSite, which contains a 4-layer shared GNN with 128 hidden units and 4 attention heads. The Adam optimizer 69 with the one-cycle learning rate policy 70 was used for model optimization on the binary cross entropy loss. Within each epoch, we randomly drew 25,000 samples from the training data with replacement to train our model using a batch size of 16. The training process lasted at most 25 epochs and we performed early stopping based on the validation performance, which took ∼1.5 h on an NVIDIA A100 GPU.

Similar to the previous works, we use recall, precision, accuracy, F1-score, Matthews correlation coefficient (MCC), AUC, and AUPR to evaluate the prediction performance, whose detailed definitions are given in Appendix 1-note 8. AUC and AUPR are independent of thresholds, thus reflecting the overall performance of a model. The other metrics are calculated by converting the predicted binding probabilities to binary predictions with a threshold for each ligand, which is determined by maximizing MCC on the validation sets. We adopted AUPR for hyperparameter selections as it is more sensitive and informative than AUC in imbalanced two-class classification tasks 71.

High-throughput annotation and analysis on Swiss-Prot

We downloaded all the available 569,516 sequences in Swiss-Prot (release: 2023-05-03) and then removed sequences longer than 2700 residues (0.21%) due to the memory limit of GPUs, resulting in a total of 568,326 sequences. Non-standard amino acids in these sequences were also removed. ESMFold was applied to predict the structures from sequences on 16 NVIDIA A100 (80 GB) GPUs, which cost ∼8.5 days. The structure preprocessing and feature extraction (by ProtTrans and DSSP) procedures overall cost ∼4 h on the same GPU cluster, and the inference of binding sites using GPSite took ∼1 h.

For the downstream analyses of protein function and variant, we only considered the predicted protein structures with length ≥ 50 and pTM > 0.7 evaluated by ESMFold, consisting of 370,140 structures eventually. Interface residues are defined as residues with predicted binding probabilities higher than the pre-defined thresholds described in implementation and evaluation. Surface residues on the predicted structures are defined as the residues with RSA > 5% 72 computed by DSSP. The annotated GO terms of molecular function and biological process for all sequences were downloaded from UniProt 7, and we up-propagated the annotations using all types of relationships defined in the hierarchical structure of GO (release: 2023-05-10). The binding proteins of a specific ligand were determined as those annotated with the corresponding ligand-binding molecular function, and the non-binding proteins were randomly sampled to the same number as binding proteins. Concretely, we collected 21680, 42074, 1240, 24108, 74428, 4960, 15030, 2088, 24161, and 4093 binding proteins from Swiss-Prot for DNA, RNA, peptide, protein, ATP, HEM, Zn2+, Ca2+, Mg2+, and Mn2+ respectively. The pathogenicity annotations of human protein altering variants were downloaded from UniProt (release: 2023_02), which contain UniProt manually reviewed natural variants, as well as variants imported from other public resources such as Ensembl Variation 73 and ClinVar 74, and the conflicting annotations were removed.

Acknowledgements

This work was supported by the National Key R&D Program of China (2022YFF1203100), National Natural Science Foundation of China (12126610), Guangdong Key Field R&D Plan (2019B020228001 and 2018B010109006), and Guangzhou S&T Research Plan (202007030010 and 202002020047).

Competing interests

The authors declare that no competing interests exist.

Data availability

The binding site benchmark datasets, source code and trained models of GPSite are available at https://github.com/biomed-AI/GPSite. The user-friendly GPSite webserver is freely available at https://bio-web1.nscc-gz.cn/app/GPSite. All predicted structures along with the binding site annotations for the Swiss-Prot sequences are available in our GPSiteDB database at https://bio-web1.nscc-gz.cn/database/GPSiteDB.

Appendix 1

Appendix 1-note 1. Brief introductions to the competitive methods

DRNApred 12

DRNApred is a sequence-based DNA- and RNA-binding site predictor, which is based on a logistic regression input with sequence features including evolutionary information, i.e., hidden Markov models (HMM) profile and predicted structural properties including secondary structure (SS), relative solvent accessibility (RSA), and disorder. We used its webserver (http://biomine.cs.vcu.edu/servers/DRNApred/) for performance evaluation.

NCBRPred 43

NCBRPred is a sequence-based DNA- and RNA-binding site predictor, which is based on a bidirectional Gated Recurrent Units (BiGRUs) input with sequence features including the evolutionary information PSSM (position-specific scoring matrix) and HMM, as well as the predicted RSA and SS. We used its webserver (http://bliulab.net/NCBRPred/server) for performance evaluation.

NucBind, SVMnuc and COACH-D

NucBind 44 is a structure-based DNA- and RNA-binding site predictor, which combines the predictions from a support vector machine (SVM) based ab-initio method SVMnuc and a template-based method COACH-D 46. SVMnuc was trained with sequence features from PSSM, HMM and predicted SS. We used its webserver (http://yanglab.nankai.edu.cn/NucBind/) for evaluation.

GraphBind 25

GraphBind is a structure-based nucleic-acid- and ligand-binding site predictor, which adopts a hierarchical graph neural network (GNN) for massage passing on protein residue graphs. Its input features mainly contain PSSM, HMM, SS, and atomic features. We used its standalone program with pre-trained model weights from http://www.csbio.sjtu.edu.cn/bioinf/GraphBind/sourcecode.html for inference and evaluation.

GeoBind 22

GeoBind is a structure-based nucleic-acid- and ligand-binding site predictor, which employs geodesic convolution to point cloud on the protein surface. Its input features contain HMM, atom type, and local curvature of the surface. We used its webserver (http://www.zpliulab.cn/GeoBind/) for evaluation.

aaRNA 47

aaRNA is a structure-based RNA-binding site predictor, which employs a fully connected neural network input with sequence features based on PSSM and HMM, as well as structural features like RSA, SS, and curvature of the protein surface. We used its webserver (https://sysimm.ifrec.osaka-u.ac.jp/aarna/) for evaluation.

PepBind 14

PepBind is a sequence-based peptide-binding site predictor, which combines the predictions from a SVM-based ab-initio method SVMpep and two template-based methods S-SITE and TM-SITE 10. SVMnuc was trained with sequence features from PSSM, HMM, predicted SS and predicted intrinsic disorder. The structures required by TM-SITE are predicted by the I-TASSER Suite 75. We used its webserver (http://yanglab.nankai.edu.cn/PepBind/) for evaluation.

PepNN-Struct and PepNN-Seq 45

PepNN-Struct is a structure-based peptide-binding site predictor which employs a graph transformer network to encode the protein representations and applies reciprocal multi-head attention to model the interaction between the protein structure and peptide sequence. A one-hot encoding is used to represent the protein and peptide sequence information. The pre-trained contextualized language model ProtBert 31 is also used to embed the protein sequences. To perform the peptide-agnostic binding site prediction, the model is input with random length poly-glycine peptides. PepNN-Seq is similar to PepNN-Struct, except that the graph transformer model is substituted with an MLP. We used its standalone program with pre-trained model weights from https://gitlab.com/oabdin/pepnn for inference and evaluation.

PepBCL 17

PepBCL is a sequence-based peptide-binding site predictor, which fine-tuned the pre-trained protein language model from 31 to predict peptide-binding sites. It also used contrastive learning to address the data imbalance problem. We adopted its standalone code with pre-trained model weights for inference and evaluation (https://github.com/Ruheng-W/PepBCL). There are two PepBCL models trained on two different datasets (1154 vs 640 training proteins), and here we report the results of the model trained on the larger dataset, since it performs slightly better.

DeepPPISP 48

DeepPPISP is a structure-based protein-protein binding site predictor, which utilizes one-hot protein sequence, PSSM, and SS as input. The model adopts a convolutional neural network (CNN) to capture local and global protein features. The predictions of DeepPPISP for the test proteins are directly obtained from our previous work 26, which were originally produced by re-training DeepPPISP on our training set using its standalone code in https://github.com/CSUBioGroup/DeepPPISP.

SPPIDER 49

SPPIDER is a structure-based protein-protein binding site predictor, which is based on a fully connected neural network input with sequence features including PSSM and structure features like RSA. The model measures the impacts from spatially neighboring residues by adopting weighted averages over features of spatially nearest neighbors. The predictions of SPPIDER for the test proteins are directly obtained from our previous work 26, which were originally generated by the SPPIDER webserver (https://sppider.cchmc.org/).

MaSIF-site 21

MaSIF-site is a structure-based protein-protein binding site predictor, which maps the geometric and chemical features on the protein surface to patches and uses the geodesic convolutional layers to capture the surface fingerprints. MaSIF-site does not rely on any features from multiple sequence alignments (MSA). The predictions of MaSIF-site for the test proteins are directly obtained from our previous work 26, which were originally generated by the standalone program with pre-trained model weights through a docker container from https://github.com/lpdi-epfl/masif.

GraphPPIS 26

GraphPPIS is a structure-based protein-protein binding site predictor, which exploits a deep graph convolutional neural network (GCN) with initial residual and identity mapping to refine information in the protein residue graphs. The input features of GraphPPIS consist of PSSM, HMM, and structural properties (RSA, SS and torsion angles) from DSSP. The prediction results for the test proteins can be obtained from our webserver (http://bio-web1.nscc-gz.cn/app/graphppis-v2) or the standalone program (https://github.com/biomed-AI/GraphPPIS).

ScanNet 23

ScanNet is a structure-based protein-protein binding site predictor, which adopts geometric deep learning for massage passing on protein atom graphs. Its input features mainly contain atomic features and PSSM. We used its standalone program with pre-trained model weights from https://github.com/jertubiana/ScanNet for inference and evaluation.

PeSTo 24

PeSTo is a structure-based protein-protein binding site predictor, which adopts a geometric transformer for massage passing on protein atom graphs. Its input feature only contains the atomic type. We used its standalone program with pre-trained model weights from https://github.com/LBM-EPFL/PeSTo for inference and evaluation.

TargetS 15

TargetS is a sequence-based ligand-binding site predictor, which extracts evolutionary information (PSSM), predicted SS and ligand-specific binding propensity from sequence context using a sliding-window strategy. It then employs several SVMs to learn the local binding patterns, which are assembled by the modified AdaBoost algorithm. We used its webserver (http://www.csbio.sjtu.edu.cn/TargetS/) for evaluation.

DELIA 19

DELIA is a structure-based ligand-binding site predictor, which uses the bidirectional long short-term memory (BiLSTM) networks to refine sequence features including binding propensity from S-SITE, PSSM, HMM, predicted SS and predicted RSA, as well as a CNN to extract characteristics from the protein distance matrices. We used its webserver (http://www.csbio.sjtu.edu.cn/bioinf/delia/) for evaluation.

MIB 50

MIB is a template-based metal ion-binding site predictor, where the fragment transformation method is used for structural comparison between query proteins and templates without any data training. The predictions of MIB for the test proteins are directly obtained from our previous work 18, which were originally generated from the MIB webserver (http://bioinfo.cmu.edu.tw/MIB/).

IonCom 51

IonCom is a structure-based metal and acid radical ion-binding site predictor, which combines the predictions from an SVM-based ab-initio method IonSeq and four template-based methods including COFACTOR 76, TM-SITE, S-SITE, and COACH 10. IonSeq was trained with sequence features from PSSM, ligand-specific binding propensity, predicted SS, predicted RSA, etc. We used its standalone program with pre-trained weights from https://zhanggroup.org/IonCom/ for inference and evaluation.

LMetalSite 18

LMetalSite is a sequence-based alignment-free metal ion-binding site predictor where the pre-trained protein language model ProtTrans is used to extract sequence embeddings and a transformer with multi-task learning is applied to capture the intrinsic similarities between different metal ions. The prediction results of the test proteins can be obtained from our webserver (http://bio-web1.nscc-gz.cn/app/lmetalsite) or the standalone program (https://github.com/biomed-AI/LMetalSite).

Appendix 1-note 2. Performance comparison between GPSite and PeSTo

Since 340 out of 375 proteins in our protein-protein binding site test set share > 30% identity with the training sequences of PeSTo, we performed a separate comparison between GPSite and PeSTo using the training and test datasets from PeSTo. By re-training with simply the same hyperparameters, GPSite achieves better performance than PeSTo (AUPR of 0.824 against 0.797) as shown in Appendix 2-table 4. Furthermore, when using ESMFold-predicted structures as input, the performance of PeSTo decreases substantially (AUPR of 0.691), and the superiority of our method will be further reflected. As in 24, the performance of ScanNet is also included (AUPR of 0.720), which is also largely outperformed by GPSite.

Appendix 1-note 3. Case study for the ribosome biogenesis protein ERB1

Here we present an example of an RNA-binding protein, i.e., the ribosome biogenesis protein ERB1 (PDB: 7R6Q, chain m), to illustrate the impact of predicted structure’s quality. As shown in Appendix 3-figure 5, ERB1 is an integral component of a large multimer structure comprising protein and RNA chains (i.e., the state E2 nucleolar 60S ribosome biogenesis intermediate). Likely due to the neglect of interactions from other protein chains, ESMFold fails to predict the correct conformation of the ERB1 chain (TM-score = 0.24). Using this incorrect predicted structure, GPSite achieves an AUPR of 0.580, lower than GraphBind input with the native structure (AUPR = 0.636). However, the performance of GraphBind substantially declines to an AUPR of 0.468 when employing the predicted structure as input. Moreover, if GPSite adopts the native structure for prediction, a notable performance boost can be obtained (AUPR = 0.681).

Appendix 1-note 4. Generation of the evolutionary features from MSA

Evolutionarily conserved residues may contain motifs related to important protein properties. Here, we also evaluated the widely used evolutionary features from MSA in our ablation studies, including position-specific scoring matrix (PSSM) and hidden Markov models (HMM) profile. PSSM is produced by running PSI-BLAST 77 to search the query sequence against UniRef90 62 with three iterations and an E-value of 0.001. HMM profile is generated by running HHblits 53 against UniClust30 78 with default parameters. Each residue is encoded into a 20-dimensional vector in PSSM or HMM. The feature values in the sequence representations from PSSM and HMM are further normalized to scores between 0 and 1 as follows:

where v is the original feature value, and vmin and vmax are the minimum and maximum values of this feature type observed in the training set, respectively.

Appendix 1-note 5. The effect of training with predicted structures

We examined the performance under different training and evaluation settings as shown in Appendix 2-table 9. As expected, the model yields exceptional performance (average AUPR of 0.656) when trained and evaluated using native structures. However, if this model is fed with predicted structures of the test proteins, the performance substantially declines to an average AUPR of 0.573. This trend aligns with the observations for other structure-based methods as illustrated in Figure 2. More importantly, in the practical scenario where only predicted structures are available for the target proteins, training the model with predicted structures (i.e., GPSite) results in superior performance than training the model with native structures (average AUPR of 0.594 against 0.573), probably owing to the consistency between the training and testing data. For completeness, the results in Appendix 3-figure 2 are also included where GPSite is tested with native structures (average AUPR of 0.637).

Appendix 1-note 6. The cross-type performance of the multi-task network in GPSite

We conducted cross-type evaluations by applying different ligand-specific MLPs in GPSite for the test sets of different ligands. As shown in Appendix 2-table 10, for each ligand-binding site test set, the corresponding ligand-specific network consistently achieves the best performance. This indicates that the ligand-specific MLPs have specifically learned the binding patterns of particular molecules. We also noticed that the cross-type performance is reasonable for the ligands sharing similar properties. For instance, the DNA-specific MLP exhibits a reasonable AUPR when predicting RNA-binding sites, and vice versa. Similar trends are also observed between peptide and protein, as well as among metal ions as expected. Interestingly, the cross-type performance between ATP and HEM is also acceptable, potentially attributed to their comparable molecular weights (507.2 and 616.5, respectively).

Appendix 1-note 7. GPSite is effective for completing the function annotations in Swiss-Prot

As depicted in Figure 5A, GPSite assigns relatively high prediction scores to the proteins without “protein binding” function in the Swiss-Prot annotations, leading to a modest AUC value of 0.608 (Figure 5B). This may be ascribed to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete. To support this hypothesis, we present two proteins as case studies, both sharing < 20% sequence identity with the protein-binding training set of GPSite. The first case is Aminodeoxychorismate synthase component 2 from Escherichia coli (UniProt ID: P00903). GPSite confidently predicted this protein as a protein-binding protein with a high prediction score of 0.936. Notably, this protein was not annotated with the “protein binding” function (GO:0005515) or any of its GO child terms in the Swiss-Prot database at the time of manuscript preparation (https://rest.uniprot.org/unisave/P00903?format=txt&versions=171, release: 2023-05-03). However, in the latest release of Swiss-Prot (https://rest.uniprot.org/unisave/P00903?format=txt&versions=174, release: 2023-11-08) during manuscript revision, this protein is annotated with the “protein heterodimerization activity” function (GO:0046982), which is a child term of “protein binding”. In fact, the heterodimerization activity of this protein has been validated through experiments in the year of 1996 (PMID: 8679677), indicating the potential incompleteness of the Swiss-Prot annotations. The other case is Hydrogenase-2 operon protein HybE from Escherichia coli (UniProt ID: P0AAN1), which was also predicted as a protein-binding protein by GPSite (score = 0.909). Similarly, this protein was not annotated with the “protein binding” function in the Swiss-Prot database at the time of manuscript preparation (https://rest.uniprot.org/unisave/P0AAN1?format=txt&versions=108). However, in the latest release of Swiss-Prot (https://rest.uniprot.org/unisave/P0AAN1?format=txt&versions=111), this protein is annotated with the “preprotein binding” function (GO:0070678), which is a child term of “protein binding”. In fact, the preprotein binding function of this protein has been validated through experiments in the year of 2003 (PMID: 12914940). These cases demonstrate the effectiveness of GPSite for completing the missing function annotations in Swiss-Prot.

Appendix 1-note 8. Evaluation metrics

Following the previous studies, we use recall (Rec), precision (Pre), accuracy (Acc), F1-score (F1), Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPR) to evaluate the prediction performance:

where true positives (TP) and true negatives (TN) denote the number of correctly predicted binding and non-binding residues, and false positives (FP) and false negatives (FN) denote the number of incorrectly predicted binding and non-binding residues, respectively. AUC and AUPR are independent of thresholds, thus reflecting the overall performance of a model. The other metrics are calculated using a threshold to convert the predicted binding probabilities to binary predictions. We go through 101 thresholds from 0 to 1 with an interval of 0.01, and select the best threshold that maximizes MCC on the validation sets. We adopt AUPR for hyperparameter selections as it is more sensitive and informative than AUC in imbalanced two-class classification tasks 71,79.

Appendix 2

Statistics of the ten binding site benchmark datasets used in this study

The performance of GPSite on the five-fold cross-validation and independent test sets

Performance comparison of GPSite with state-of-the-art methods on the ten binding site test sets

Performance comparison of GPSite with ScanNet and PeSTo on the protein-protein binding site test set from PeSTo

24

The numbers of proteins with TM-score > 0.7 or ≤ 0.7 between native and ESMFold-predicted structures in the ten binding site datasets

The prediction quality of ESMFold measured by TM-score between native and predicted structures in the ten binding site datasets

The ablation studies on protein features and model designs in the ten binding site test sets

Performance comparison between GPSite and the baseline model using MSA profile for proteins with different Neff values in the combined test set of the ten ligands

Performance comparison on the ten binding site test sets under different training and evaluation settings

Cross-type performance by applying different ligand-specific MLPs in GPSite for the test sets of different ligands

Appendix 3

Runtime comparison of the GPSite webserver with other top-performing servers. Five protein chains (i.e., 8HN4_B, 8USJ_A, 8C1U_A, 8K3V_A and 8EXO_A) comprising 100, 300, 500, 700, and 900 residues, respectively, were selected for testing, and the average runtime is reported for each method. Note that a significant portion of GPSite’s runtime (75 s, indicated in orange) is allocated to structure prediction using ESMFold.

The performance of GPSite when using native or predicted structures as input during the test phase.

Distributions of the TM-scores between native and predicted structures in the protein, ATP, HEM, Zn2+, Ca2+, Mg2+ and Mn2+ datasets.

The performance of GPSite on structures of different qualities, and the comparisons with the best structure-based methods in the test sets of protein, ATP, HEM, Zn2+, Ca2+, Mg2+ and Mn2+. The experimental structure-based methods input with ESMFold-predicted structures are marked with *. Since there are only 5 proteins with TM-score ≤ 0.7 in the HEM and Mn2+ test sets (details shown in Appendix 2-table 5), the corresponding results may not be statistically significant.

The prediction results of GPSite and GraphBind for the ribosome biogenesis protein ERB1. (A) The state E2 nucleolar 60S ribosome biogenesis intermediate (PDB: 7R6Q). The ribosome biogenesis protein ERB1 (chain m) is highlighted in blue, while other protein chains are colored in gray. The RNA chains are shown in orange. (B) The RNA-binding sites on ERB1 (colored in red). (C) The ESMFold-predicted structure of ERB1 (TM-score = 0.24). The RNA-binding sites are also mapped onto this predicted structure (colored in red). (D-G) The prediction results of GPSite and GraphBind for the predicted and native ERB1 structures. The confidence of the predictions is represented with a gradient of color from blue for non-binding to red for binding.

The run time of ESMFold with respect to the sequence length in Swiss-Prot evaluated on an NVIDIA A100 GPU. The run time is presented as mean ± standard deviation per range of number of residues (range size equals 100).

The univariate and bivariate distributions of the protein length and the pTM estimated by ESMFold of the Swiss-Prot sequences. The probability density curves are fit using kernel density estimation. The darker region in the bivariate heatmap corresponds to a higher number of samples.

The percentage of proteins predicted as binding to peptide, protein, ATP, HEM, Zn2+, Ca2+, Mg2+ and Mn2+ by GPSite to be annotated with certain biological process in Swiss-Prot. Only the specific biological process terms with depth ≥ 8 in the GO directed acyclic graph are considered, among which the 15 terms with the highest percentage are displayed.

The percentage of pathogenic or benign natural variant sites within GPSite-predicted interfaces. The baseline is the probability of a random residue being annotated as an interface residue.