Foundation Models for Protein Phenotype Prediction
Author(s)
Calef, Robert
Advisor
Kellis, Manolis
Zitnik, Marinka
Abstract
Understanding the roles of human proteins remains a major challenge, with approximately 20% of human proteins lacking known functions and more than 40% missing context-specific functional insights. Even well-annotated proteins are often poorly characterized in diverse biological contexts, disease states, and perturbations. We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. To support this, we created ProCyon-Instruct, a dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes. By co-training a large language model with multimodal molecular encoders, ProCyon integrates phenotypic and protein data. A novel architecture and instruction tuning strategy allow ProCyon to process arbitrarily interleaved protein-and-phenotype inputs, achieve zero-shot task transfer, and generate free-form text phenotypes interleaved with retrieved protein sequence, structure, and drug modalities in a single unified model.
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology