Methods overview. a) catELMo is an ELMo-based bi-directional amino acid sequence representation model trained on TCR sequences. It takes a sequence of amino acid strings as input and predicts the next amino acid token to the right (forward direction) or to the left (backward direction). catELMo consists of a charCNN layer and four bidirectional LSTM layers followed by a softmax activation. For a given TCR sequence of length L, each layer returns L vectors of length 1,024. The size of an embedded TCR sequence is therefore [5, L, 1024]. Global average pooling with respect to the TCR length L is applied to obtain a representation vector of size 1,024. b) TCR-epitope binding affinity prediction task. An embedding method (e.g., catELMo) is applied to both TCR and epitope sequences. The embedding vectors are then fed into a neural network with three linear layers to train a binding affinity prediction model. The model predicts whether a given TCR and epitope sequence bind or not. c) Epitope-specific TCR sequence clustering. A hierarchical clustering algorithm is applied to the TCR embedding vectors to group TCR sequences based on their epitope specificity.
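The pooling step described above can be sketched as follows. This is a minimal illustration, assuming that the 5 layer outputs (charCNN + 4 biLSTMs) are averaged along with the length dimension; the exact reduction used by the authors may differ, and the input array here is random stand-in data.

```python
import numpy as np

# Hypothetical per-residue catELMo outputs: 5 layers
# (charCNN + 4 biLSTM layers), L residues, 1,024 dims each.
L = 15  # example CDR3 length
layer_outputs = np.random.rand(5, L, 1024)

# Global average pooling over the length axis collapses the
# per-residue vectors to one vector per layer ...
pooled = layer_outputs.mean(axis=1)   # shape (5, 1024)

# ... and averaging the layer outputs (one possible choice)
# yields a single fixed-size TCR representation.
tcr_embedding = pooled.mean(axis=0)   # shape (1024,)
```

The key point is that, regardless of the TCR length L, every sequence is mapped to the same 1,024-dimensional vector, which makes the embeddings directly usable as features for downstream models.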

Data Summary.

The number of unique epitopes, TCRs, and TCR-epitope pairs used for training catELMo and for the downstream task analyses.

Comparison of the amino acid embedding methods for TCR-epitope binding affinity prediction task.

We obtained TCR and epitope embeddings and used them as input features of the binding affinity prediction model. A binding affinity prediction model is trained on each embedding method's embedding dataset. Prediction performance is compared on the a), b), c) TCR split and d), e), f) epitope split. a), d) Receiver Operating Characteristic (ROC) curves and b), e) AUCs of the prediction models trained on different embedding methods. Error bars represent standard deviations over 10 trials. c), f) AUCs of the prediction models trained on different portions of the downstream datasets. Error bands represent 95% confidence intervals over 10 trials.
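The prediction head described above (embeddings in, binding probability out) can be sketched as a plain forward pass. This is an illustrative NumPy sketch with random weights and assumed layer widths (1,024, 512, 1); the paper's exact architecture, widths, and training procedure may differ.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical 1,024-d TCR and epitope embeddings,
# concatenated into a single 2,048-d input feature.
tcr = rng.normal(size=1024)
epi = rng.normal(size=1024)
x = np.concatenate([tcr, epi])

# Three linear layers (illustrative widths), ending in a
# sigmoid that outputs a binding probability.
W1, b1 = rng.normal(size=(2048, 1024)) * 0.01, np.zeros(1024)
W2, b2 = rng.normal(size=(1024, 512)) * 0.01, np.zeros(512)
W3, b3 = rng.normal(size=(512, 1)) * 0.01, np.zeros(1)

h = relu(x @ W1 + b1)
h = relu(h @ W2 + b2)
p_bind = sigmoid(h @ W3 + b3)  # probability the pair binds
```

In practice such a head would be trained with a binary cross-entropy loss on labeled binding and non-binding TCR-epitope pairs; only the forward computation is shown here.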

TCR-epitope binding affinity prediction performance of TCR split.

Average and standard deviation of 10 trials are reported. P-values are from two-sample t-tests between catELMo and the second best method (underlined).

TCR-epitope binding affinity prediction performance of epitope split.

Average and standard deviation of 10 trials are reported. P-values are from two-sample t-tests between catELMo and the second best method (underlined).

tSNE visualization for the five most frequent epitopes.

We visually compare the embedding models on the TCR-epitope binding affinity prediction task. We conduct tSNE analysis on the top 50 principal components of the last-hidden-layer features of the TCR-epitope binding affinity prediction models for out-of-sample epitopes. The clearer the boundary between positive pairs (lighter shade) and negative pairs (darker shade) associated with the same epitope sequence, the better the model discriminates binding from non-binding pairs.
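The PCA-then-tSNE pipeline above can be sketched with scikit-learn. The feature matrix here is random stand-in data for the last-hidden-layer features; the feature dimensionality and tSNE settings (perplexity, initialization) are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for last-hidden-layer features of the prediction
# model: e.g., 200 TCR-epitope pairs with 512-d features.
features = rng.normal(size=(200, 512))

# Reduce to the top 50 principal components first, then run
# tSNE on those components to obtain 2-D coordinates for
# plotting (colored by epitope and binding label).
pcs = PCA(n_components=50).fit_transform(features)
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(pcs)
```

Running PCA before tSNE is a common practice: it denoises the features and makes tSNE's pairwise-distance computation cheaper and more stable.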

Comparison of the amino acid embedding methods for epitope-specific TCR clustering.

Hierarchical clustering is applied on the McPAS [30] database. We cluster TCRs binding to the top eight epitopes of a) both human and mouse species, b) only human epitopes, and c) only mouse epitopes. Larger NMI scores indicate that TCR sequences binding to the same epitope are grouped into the same cluster while TCR sequences binding to different epitopes are separated into different clusters.
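The clustering-and-scoring procedure above can be sketched as follows. The embeddings are synthetic stand-ins with an artificial per-epitope offset so the groups are separable, and average linkage is an assumed choice; the paper's linkage criterion and number of epitope groups may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)

# Stand-in TCR embeddings for three epitope groups (the true
# labels), 30 sequences each, 1,024-d, with a group-specific
# offset so the groups are clearly separable.
labels = np.repeat([0, 1, 2], 30)
emb = rng.normal(size=(90, 1024)) + labels[:, None] * 5.0

# Agglomerative (hierarchical) clustering on the embeddings,
# cutting the dendrogram into three clusters.
Z = linkage(emb, method="average")
pred = fcluster(Z, t=3, criterion="maxclust")

# NMI compares predicted clusters against epitope specificity;
# 1.0 means the clusters match the epitope groups exactly.
nmi = normalized_mutual_info_score(labels, pred)
```

NMI is invariant to cluster label permutations, which is why it is a natural choice here: the clustering algorithm has no notion of which epitope each cluster corresponds to.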

AUCs of TCR-epitope binding affinity prediction models built on BERT-based embedding models.

Average and standard deviation of 10 trials are reported.

AUCs of TCR-epitope binding affinity prediction models trained on different sizes of catELMo embeddings.

Average and standard deviation of 10 trials are reported.

AUC comparison of TCR-epitope binding affinity prediction models with state-of-the-art prediction models.

All models are trained on the same dataset. Average and standard deviation of 10 trials are reported.