| dc.description.abstract | The secrets of the genome have captivated scientists for well over a century, though the active role its spatial organization plays in gene regulation, cell determination, and disease formation has become clear only in recent decades. Significant strides have been made toward characterizing and understanding three-dimensional genome organization, but the scale, complexity, and heterogeneity of the genome and nuclear environment complicate investigations into this system. This thesis alleviates these challenges and holds the potential to accelerate genome organization research by presenting several methodological advances.
An efficient Hi-C inversion algorithm appears first. This technique extracts pairwise contact potentials from experimental Hi-C data, uncovering mechanistic details obscured by the correlation between Hi-C contact probabilities. This required the development of a spin-glass model of chromatin and the derivation of a corresponding model inversion; the model may find use in further theoretical studies of chromatin, while the inversion can be applied more broadly. The inversion successfully revealed the location of chromatin loop anchors, supported a phase-separation mechanism for the formation of chromatin compartments, and parameterized polymer models that reproduced the experimental Hi-C data with reasonable accuracy.
The focus then shifts toward ChromoGen, a generative AI model that predicts three-dimensional chromatin structures directly from DNA sequence and chromatin accessibility data. ChromoGen provided biologically accurate structural ensembles throughout the genome of two cell types, including one omitted from its training data. This transferability suggests that ChromoGen can provide access to the organization of chromatin in a wide variety of cell types while relying only on widely available sequencing data.
Afterward, we discuss several strategies to extend ChromoGen to full-chromosome structure prediction tasks. Preliminary results suggest that the technology of today can provide this capability, as we have generated physical chromosome conformations for mouse chromosomes, although sequencing data did not guide this generative process. Correspondingly, we explore the possibility of incorporating a multimodal model into ChromoGen, allowing it to condition structure generation on a wide variety of data types. Success in this area could enable true de novo structure prediction, greatly simplifying research aiming to understand the relationship between sequence, structure, and cellular function while also accelerating the development of treatments for diseases in which chromatin dysregulation is implicated. | |