Microsoft’s DeepSpeed-MoE Makes Massive MoE Model Inference up to 4.5x Faster and 9x Cheaper
A Microsoft research team proposes DeepSpeed-MoE, comprising a novel MoE architecture design and a model compression technique that together reduce MoE model size by up to 3.7x, along with a highly optimized inference system that delivers 7.3x better latency and cost than existing MoE inference solutions. Combined, these advances enable serving massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models.