HLApollo, a transformer-based model for predicting MHC-I peptide presentation, incorporates pan-allelic training, peptide processing information, and multiallelic deconvolution to achieve an average precision (AP) of 74.83%, outperforming other predictors by >12%. An important contributor to this accuracy was a negative-set switching strategy that mitigates false negatives in the reference data. Adding gene expression data, or protein features from a protein language model, led to further improvements. At a threshold of >76% AP, predictions could be extended to more untrained alleles, demonstrating HLApollo’s capacity for pan-allelic generalization.

Contributed by Morgan Janes

ABSTRACT: Based on the success of cancer immunotherapy, personalized cancer vaccines have emerged as a leading oncology treatment. Antigen presentation on MHC class I (MHC-I) is crucial for the adaptive immune response to cancer cells, necessitating highly predictive computational methods to model this phenomenon. Here, we introduce HLApollo, a transformer-based model for peptide-MHC-I (pMHC-I) presentation prediction, leveraging the language of peptides, MHC, and source proteins. HLApollo provides end-to-end treatment of MHC-I sequences and deconvolution of multi-allelic data, using a negative-set switching strategy to mitigate misassigned negatives in unlabelled ligandome data. HLApollo shows a 12.65% increase in average precision (AP) on ligandome data and a 4.1% AP increase on immunogenicity test data compared to next-best models. Incorporating protein features from protein language models yields further gains and reduces the need for gene expression measurements. Guided by clinical use, we demonstrate pan-allelic generalization which effectively captures rare alleles in underrepresented ancestries.
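For readers unfamiliar with the headline metric, the following is a minimal sketch of how average precision can be computed from ranked predictions. It is illustrative only: the function name `average_precision` and the toy labels and scores are hypothetical, and the source does not state how AP was computed for HLApollo (tie handling and averaging across alleles or samples may differ).

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k at which a true positive appears.

    labels: 1 for a presented peptide (positive), 0 for a decoy (negative).
    scores: predicted presentation scores, higher meaning more likely presented.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    # Rank candidates by descending score; ties keep input order (a simplification).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank

    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


# Toy example (hypothetical data): positives are ranked 1st and 3rd of 5.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(f"AP = {average_precision(toy_labels, toy_scores):.4f}")  # AP = 0.8333
```

Because AP rewards placing true positives ahead of negatives across the whole ranking, it suits ligandome data, where presented peptides are heavily outnumbered by negatives.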

Author Info: (1) Early Clinical Development Artificial Intelligence, Genentech, South San Francisco, CA, USA. (2) Oncology Bioinformatics, Genentech, South San Francisco, CA, USA. (3) Early Clinical Development Artificial Intelligence, Genentech, South San Francisco, CA, USA. (4) Molecular Biology Department, Genentech, South San Francisco, CA, USA. (5) Molecular Biology Department, Genentech, South San Francisco, CA, USA. (6) Molecular Biology Department, Genentech, South San Francisco, CA, USA. (7) Microchemistry, Proteomics and Lipidomics, Genentech, South San Francisco, CA, USA. (8) Oncology Bioinformatics, Genentech, South San Francisco, CA, USA. (9) Cancer Immunology, Genentech, South San Francisco, CA, USA. (10) Cancer Immunology, Genentech, South San Francisco, CA, USA. (11) Microchemistry, Proteomics and Lipidomics, Genentech, South San Francisco, CA, USA. (12) Protein Chemistry, Genentech, South San Francisco, CA, USA. (13) Microchemistry, Proteomics and Lipidomics, Genentech, South San Francisco, CA, USA. (14) Molecular Biology Department, Genentech, South San Francisco, CA, USA. (15) Cancer Immunology, Genentech, South San Francisco, CA, USA. (16) Oncology Bioinformatics, Genentech, South San Francisco, CA, USA; Computational Science, Freenome, South San Francisco, CA, USA. (17) Early Clinical Development Artificial Intelligence, Genentech, South San Francisco, CA, USA; Artificial Intelligence, SES AI, Woburn, MA, USA. vincentliuk@gmail.com. (18) Oncology Bioinformatics, Genentech, South San Francisco, CA, USA. suchitj@gene.com.