HLApollo, a transformer-based model for predicting MHC-I peptide presentation, incorporates pan-allelic training, peptide processing information, and multiallelic deconvolution to achieve an average precision (AP) of 74.83%, outperforming other predictors by >12%. An important contributor to accuracy was a negative-set switching strategy that overcomes the problem of false negatives in reference data. Adding gene expression data or protein features from a protein language model led to further improvements. At a threshold of >76% AP, predictions could be extended to more untrained alleles, demonstrating HLApollo’s capacity for pan-allelic generalization.
Contributed by Morgan Janes
ABSTRACT: Based on the success of cancer immunotherapy, personalized cancer vaccines have emerged as a leading oncology treatment. Antigen presentation on MHC class I (MHC-I) is crucial for the adaptive immune response to cancer cells, necessitating highly predictive computational methods to model this phenomenon. Here, we introduce HLApollo, a transformer-based model for peptide-MHC-I (pMHC-I) presentation prediction, leveraging the language of peptides, MHC, and source proteins. HLApollo provides end-to-end treatment of MHC-I sequences and deconvolution of multi-allelic data, using a negative-set switching strategy to mitigate misassigned negatives in unlabelled ligandome data. HLApollo shows a 12.65% increase in average precision (AP) on ligandome data and a 4.1% AP increase on immunogenicity test data compared to next-best models. Incorporating protein features from protein language models yields further gains and reduces the need for gene expression measurements. Guided by clinical use, we demonstrate pan-allelic generalization that effectively captures rare alleles in underrepresented ancestries.
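
For readers unfamiliar with the headline metric, the following is a minimal sketch of how average precision (AP) can be computed from ranked predictions. It is illustrative only and is not the authors' evaluation code: the function name `average_precision`, the toy labels, and the toy scores are assumptions introduced here, and ties between scores are broken by input order rather than handled specially.

```python
def average_precision(labels, scores):
    """Average precision: the mean of precision@k over the ranks k
    at which a true positive (label == 1) appears.

    labels: iterable of 0/1 ground-truth values (1 = presented peptide)
    scores: iterable of predicted presentation scores, higher = more likely
    """
    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0               # true positives seen so far in the ranking
    precision_sum = 0.0    # running sum of precision at each hit
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank

    if hits == 0:
        raise ValueError("AP is undefined when there are no positive labels")
    return precision_sum / hits


# Hypothetical toy data: two presented peptides among five candidates.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"AP = {toy_ap:.4f}")
```

In this toy ranking the positives fall at ranks 1 and 3, so AP is the mean of 1/1 and 2/3. Because AP rewards placing true positives near the top of the ranking, it is a common choice for heavily imbalanced problems such as presentation prediction, where negatives far outnumber positives.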