Show simple item record

dc.contributor.author  Liao, Xudong
dc.contributor.author  Sun, Yijun
dc.contributor.author  Tian, Han
dc.contributor.author  Wan, Xinchen
dc.contributor.author  Jin, Yilun
dc.contributor.author  Wang, Zilong
dc.contributor.author  Ren, Zhenghang
dc.contributor.author  Huang, Xinyang
dc.contributor.author  Li, Wenxue
dc.contributor.author  Tse, Kin Fai
dc.contributor.author  Zhong, Zhizhen
dc.contributor.author  Liu, Guyue
dc.contributor.author  Zhang, Ying
dc.contributor.author  Ye, Xiaofeng
dc.contributor.author  Zhang, Yiming
dc.contributor.author  Chen, Kai
dc.date.accessioned  2025-09-10T19:26:51Z
dc.date.available  2025-09-10T19:26:51Z
dc.date.issued  2025-08-27
dc.identifier.isbn  979-8-4007-1524-2
dc.identifier.uri  https://hdl.handle.net/1721.1/162639
dc.description  SIGCOMM ’25, Coimbra, Portugal  en_US
dc.description.abstract  Mixture-of-Expert (MoE) models outperform conventional models by selectively activating different subnets, named experts, on a per-token basis. This gated computation generates dynamic communications that cannot be determined beforehand, challenging the existing GPU interconnects that remain static during distributed training. In this paper, we advocate for a first-of-its-kind system, called MixNet, that unlocks topology reconfiguration during distributed MoE training. Towards this vision, we first perform a production measurement study and show that the MoE dynamic communication pattern has strong locality, alleviating the need for global reconfiguration. Based on this, we design and implement a regionally reconfigurable high-bandwidth domain that augments existing electrical interconnects using optical circuit switching (OCS), achieving scalability while maintaining rapid adaptability. We build a fully functional MixNet prototype with commodity hardware and a customized collective communication runtime. Our prototype trains state-of-the-art MoE models with in-training topology reconfiguration across 32 A100 GPUs. Large-scale packet-level simulations show that MixNet achieves performance comparable to a non-blocking fat-tree fabric while boosting the networking cost efficiency (e.g., performance per dollar) of four representative MoE models by 1.2×–1.5× and 1.9×–2.3× at 100 Gbps and 400 Gbps link bandwidths, respectively.  en_US
dc.publisher  ACM|ACM SIGCOMM 2025 Conference  en_US
dc.relation.isversionof  https://doi.org/10.1145/3718958.3750465  en_US
dc.rights  Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.  en_US
dc.source  Association for Computing Machinery  en_US
dc.title  MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training  en_US
dc.type  Article  en_US
dc.identifier.citation  Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, and Kai Chen. 2025. MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training. In Proceedings of the ACM SIGCOMM 2025 Conference (SIGCOMM '25). Association for Computing Machinery, New York, NY, USA, 554–574.  en_US
dc.contributor.department  Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory  en_US
dc.identifier.mitlicense  PUBLISHER_POLICY
dc.eprint.version  Final published version  en_US
dc.type.uri  http://purl.org/eprint/type/ConferencePaper  en_US
eprint.status  http://purl.org/eprint/status/NonPeerReviewed  en_US
dc.date.updated  2025-09-01T07:53:41Z
dc.language.rfc3066  en
dc.rights.holder  The author(s)
dspace.date.submission  2025-09-01T07:53:42Z
mit.license  PUBLISHER_POLICY
mit.metadata.status  Authority Work and Publication Information Needed  en_US

