| dc.description.abstract | Understanding the roles of human proteins remains a major challenge, with approximately 20% of human proteins lacking known functions and more than 40% missing context-specific functional insights. Even well-annotated proteins are often poorly characterized in diverse biological contexts, disease states, and perturbations. We present ProCyon, a foundation model for modeling, generating, and predicting protein phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. To support this, we created ProCyon-Instruct, a dataset of 33 million protein phenotype instructions, representing a comprehensive resource for multiscale protein phenotypes. By co-training a large language model with multimodal molecular encoders, ProCyon integrates phenotypic and protein data. A novel architecture and instruction tuning strategy allow ProCyon to process arbitrarily interleaved proteinand-phenotype inputs, achieve zero-shot task transfer, and generate free-form text phenotypes interleaved with retrieved protein sequence, structure, and drug modalities in a single unified model. | |