AI Machine Learning & Data Science Research

Hardware Savings Up to 46 Times for AIGC and Automatic Parallelism in New Colossal-AI Release

Colossal-AI (, the widely-used open-source library for training, inference and fine-tuning of large deep learning models, has released a new automatic parallelism feature and functionality that reduces hardware costs by up to 46 times for AI-Generate Content (AIGC) solutions.

Recently, large-scale AI models have gained significant attention and adoption. AIGC has remained popular, and the chatbot ChatGPT has garnered widespread interest and attracted millions of users in just two weeks, even causing ‘Code Red’ at Google because it could upend the search giant’s business. Other powerful AI applications, such as AlphaCode for software development and ESM2 for drug research, have also gained traction. These and other AI applications are constantly pushing the boundaries of what is possible with large AI models, exploring new areas of implementation and application.

The new Colossal-AI release version 0.2.0 includes several key features to significantly minimize the cost and accelerate the time-to-market of implementing such large models, while enhancing ease of use for developers.

  • Stable Diffusion 2.0 model recipe: Out-of-the-box configuration for Stable Diffusion 2.0 in Colossal-AI enables low-cost training, fine-tuning, and inference, while also reducing GPU memory consumption by up to 5.6 times and hardware costs by up to 46 times. All of which can be achieved with just one line of code.
  • BLOOM model recipe: The ready-made inclusion of the 175 billion parameter BLOOM model offers stand-alone inference with a 4-fold reduction in GPU memory consumption and hardware costs reduced by over 10 times.
  • Automatic parallelism: With just one line of code, users can automatically search for the best parallelism strategy, making distributed training easier and natively supporting popular AI model libraries like Hugging Face and Timm.

With users worldwide such as AWS, Meta, BioMap, as well as over 7,000 Github stars and proven unmatched performance, Colossal-AI is a trusted resource for those looking to leverage the power of large-scale AI.

Optimization Features for Stable Diffusion 2.0

AIGC is a highly sought-after topic in the AI field and has been recognized as a “2022 BREAKTHROUGH OF THE YEAR” by Science. Stable Diffusion has also recently been upgraded to version 2.0. Some users have remarked on the rapid pace of development for this technology, noting that version 2 is being released before version 1 has even been fully developed.

However, the high cost of implementing AIGC has limited its widespread adoption to some extent. For example, the Stability AI behind Stable Diffusion maintains over 4,000 NVIDIA A100 GPU clusters and has incurred operating costs of over $50 million. As AIGC continues to evolve rapidly with iterative models, algorithms, and downstream tasks, reducing costs has become a critical issue for its successful implementation.

Stable Diffusion 2.0 is built on the user-friendly PyTorch Lightning framework, and Colossal-AI has promptly released a comprehensive recipe containing out-of-the-box open-source training, fine-tuning, and inference solutions that are more efficient and have lower hardware requirements as the official large-scale model solution for PyTorch Lightning. The recipe includes:

  • Reducing GPU memory consumption of training by 5.6x and hardware cost by up to 46x
  • Supporting DreamBooth for fast personalized fine-tuning on a single GPU
  • Reducing inference GPU memory consumption by 2.5x

In the near future, this solution will also be integrated into Hugging Face, the most popular AI model community, to further enhance user experience.


Increasing the batch size is a widely used method to speed up training and reduce costs. However, the limited memory capacity of GPU limits batch size and raises hardware requirements.

Colossal-AI’s memory optimization technologies and support for Stable Diffusion 2.0 reduce the memory requirement for using Stable Diffusion with large batch size 16 training on every GPU from 64.5GB to 11.6GB, a 5.6x reduction. These technologies can also be extended to single or multiple GPUs in parallel. This allows users to meet their requirements with only consumer-grade graphics cards like the 3060, rather than the most advanced A100 80GB, resulting in hardware cost savings of up to 46 times. With Colossal-AI, more users can affordably conduct Stable Diffusion-related research and implementation on consumer-grade GPUs.

Optimization Strategies

Flash Attention
Colossal-AI was the first to introduce Flash Attention technology for Stable Diffusion 1.0, increasing attention speed by 104% and reducing peak end-to-end training memory by 23%. Flash Attention is an efficient implementation of attention for long sequence tasks that uses flatten to reduce memory reads/writes between GPU high bandwidth memory (HBM) and an approximate attention algorithm to block sparse attention, making it faster than existing approximate attention methods. While Stable Diffusion 1.0 only had a small number of attention layers, the potential for memory optimization with Flash Attention was further demonstrated in Stable Diffusion 2.0 by replacing many convolutional layers with attention layers.

ZeRO + Gemini
Colossal-AI uses Zero Redundancy Optimizer (ZeRO) to eliminate memory redundancy, greatly improving memory usage efficiency compared to classic data parallelism without sacrificing computational granularity and communication efficiency. Colossal-AI also incorporates Chunk-based memory management, which further improves the performance of ZeRO. Chunk-based memory management stores consecutive sets of parameters in operational order in a continuous memory space called Chunk, all the same size, for efficient use of network bandwidth between PCI-e and GPU-GPU, reduced communication, and avoidance of potential memory fragmentation.

Colossal-AI’s heterogeneous memory manager, Gemini, reduces GPU memory footprint by offloading optimizer states to CPU, allowing for simultaneous use of GPU memory and CPU memory (including CPU DRAM or NVMe SSD memory) to increase the scale of available models beyond the memory limit of a single GPU.

One line of code to get started quickly

As an official PyTorch Lightning partner, Colossal-AI’s memory optimizations can be easily accessed with just one line of code.

DreamBooth Fine-tuning

DreamBooth is a method that uses just 3-5 images of a desired subject to personalize text-to-image models like Stable Diffusion and generate a series of images of that subject. Colossal-AI’s memory optimization can be easily accessed with the file for fast, personalized model fine-tuning and enhanced ease of use.


Since model inference is insensitive to numerical precision, this offers the possibility of implementing low-cost inference with low precision. The Stable Diffusion 2.0 model can be Int8 quantized for inference with a single line of code, resulting in 2.5 times lower memory consumption (3.1 GB memory required) with minimal performance loss.

model = replace_module(model)

Low-Cost Inference for the 175 Billion BLOOM Model

As models become larger, memory consumption during inference becomes a key factor to consider. For example, the open source 175 billion parameters BLOOM model released by Hugging Face requires at least 87.5GB/43.8GB of memory per GPU when using common FP32/FP16 for inference on 8 GPUs in a single node. This large memory footprint can’t be supported by even the most advanced 8*A100 (80GB/40GB) server, and multi-node inference brings additional costs and communication overhead.

Colossal-AI now provides a dedicated BLOOM recipe that enables efficient Int8 quantized model parallel inference, allowing deployment of large model inference services like BLOOM (175 billion parameters) on 8-GPU servers using consumer-grade graphics cards like 3090/4090 without significant CPU memory usage increase or performance loss. This can reduce hardware deployment costs by more than 10 times compared to using A100 solutions.

By quantizing the model with Int8, the overall memory footprint of the model can be reduced from 352.3GB (FP16) to 185.6GB with Colossal-AI, while its model parallelism technique reduces the memory requirement per GPU to 23.2GB. Model parallelism avoids increasing the CPU memory footprint by using lazy_init in the main process, thereby obtaining a meta model with almost no memory footprint. Colossal-AI quantifies and slices the model in the main process, and uses lazy_init in each of the remaining processes to obtain a meta model that takes up almost no memory, and then passes the model parameters between processes via gloo backend. This results in peak CPU memory usage reaching a theoretically optimal level without loading model parameters in segments, improving memory usage efficiency under non-intensive requests compared to “pipeline-like” distribution of model slicing by layer.

Auto-Parallelism with One Line of Code

In recent years, the deployment of large-scale machine learning models has become increasingly important. However, distributed training systems often require manual parallelization plans, which can be complex and require expert knowledge in system engineering and configuration. This can be a challenge for most AI developers without the necessary skills. The need for manual parallelization can make deploying large-scale machine learning models difficult and expensive.

Colossal-AI’s one-line auto-parallelism system simplifies the process of deploying large-scale machine learning models for AI developers. Compared to other solutions that require manual configuration of complex parallel policies and model modification, Colossal-AI only requires one line of code from the user, along with cluster information and model configurations, to enable distributed training. It seamlessly integrates with popular AI model frameworks like Hugging Face and Timm.

# wrap the model using auto_engine
model, optimizer = auto_engine(model, optimizer, cluster_info)
# normal training loop

The new Colossal-AI release greatly reduces the requirements for AI developers to use distributed training techniques to pre-train and fine-tune large models. At the same time, the auto-parallelism system can search for optimal parallel strategies at a finer granularity.

Graph Tracing

Colossal-AI is the first auto-parallelism system that uses static graph analysis based on the PyTorch framework. Obtaining a static execution plan for PyTorch, a dynamic graph framework, has long been an area of research in the field of machine learning systems. Colossal-AI’s parallel computation system uses ColoTracer, a forked version of the torch.FX Tracer, to guide the search for an optimal parallelization strategy. The meta-information of each tensor, such as tensor shape, dims, dtype, etc., is computed and recorded during the tracing process. This approach has the advantage of better generalization, as it is not tied to specific models or configurations.

Fine-grained Parallelism Search

Colossal-AI’s auto-parallelism searches for strategies in regard to each operand with the goal of achieving the fastest runtime while meeting memory budget constraints. It ultimately determines the actual training time strategy, including the tensor split strategy for each tensor, the type of communication operators to be inserted between different computing nodes, whether to replace operators, etc. The tensor, data, and hybrid parallelism such as column and row split used by NVIDIA in Megatron-LM and other parallelism systems are all subsets of strategies that can be searched by Colossal-AI. In addition to these parallelisms that can be manually specified, Colossal-AI can specify a unique parallelism method for each operation and, potentially finding a better parallelism strategy than what human experts could provide.

Distributed Tensor and Shape-Consistency System

The Colossal-AI system uses a device-mesh, similar to PyTorch’s latest DTensor release, to manage its cluster. Colossal-AI uses a sharding-spec to annotate the storage status of each tensor and facilitate their distribution across the cluster. The system also employs a shape-consistency manager to automatically transform tensors between different sharding-specs, allowing for seamless slicing and dicing of tensors, while the shape-consistency manager ensures that the output of upstream operands is consistently stored in the cluster, regardless of how the input of downstream operands is stored. This makes Colossal-AI highly versatile and easy to use without users worrying about the storage status of tensors when performing operations on them.

Here are some key advantages of Colossal-AI compared to PyTorch DTensor:

  1. Colossal-AI’s device-mesh uses cluster performance metrics and profiling results to estimate the time consumption of different communication operators. This helps Colossal-AI optimize communication between nodes and improve overall system efficiency.
  2. Colossal-AI’s shape-consistency manager uses a greedy search algorithm to find relatively efficient ways to transform tensors between different sharding-specs, rather than simply transforming dimensions one by one. This can lead to more efficient and effective transformations.
  3. The integration of all-to-all operations in Colossal-AI increases the scalability of the system by enabling more efficient communication between nodes. This is especially useful for large-scale machine learning tasks that require the transfer of large amounts of data between nodes.

Integration with activation checkpoint

Colossal-AI uses activation checkpointing to compress memory usage during the training of large machine learning models. Instead of storing the entire model, activations of intermediate layers are stored. Colossal-AI’s automatic search function for activation checkpointing finds the most efficient checkpoint within a given memory budget, rather than just aiming for maximum memory compression. To avoid a lengthy search process for an optimal activation checkpoint, Colossal-AI has implemented a two-stage search process. This allows the system to find a feasible distributed training solution in a reasonable amount of time while still benefiting from activation checkpointing for memory management. The integration of activation checkpointing in Colossal-AI improves the efficiency and effectiveness of large model training.

About Colossal-AI

Colossal-AI is a unified deep learning system for the big model era, which supports efficient and fast deployment of AI large model training and inference, and reduces the cost of AI large model applications. Since becoming open source, Colossal-AI has ranked first on the GitHub Trending multiple times with over 7,000 Stars. It has been successfully accepted as the official tutorial for top international AI and HPC conferences such as SC, AAAI, and PPoPP.

Relevant solutions have been successfully applied by well-known tech giants for autonomous driving, cloud computing, retail, medicine, and chips, and have been widely acclaimed. For example, the recent hot ChatGPT is not yet open source and does not have internet access. Colossal-AI has successfully helped a Fortune 500 enterprise to develop a chatbot model with enhanced online search engine capabilities.

Find the source code and more information about Colossal-AI on GitHub at:


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

0 comments on “Hardware Savings Up to 46 Times for AIGC and Automatic Parallelism in New Colossal-AI Release

Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: