Supercharge Your AIGC Experience: Leverage xDiT for Multi-GPU Parallelism in the ComfyUI Flux.1 Workflow
ComfyUI is the most popular web-based interface for building Diffusion Model workflows. However, it was designed for single-GPU use, so it struggles with the demands of today's large DiTs such as Flux.1, resulting in unacceptably high latency for users. Leveraging the power of xDiT, we have successfully implemented a multi-GPU parallel processing workflow within ComfyUI, effectively addressing Flux.1's performance challenges.
Read the Chinese Version of this article
Image and video generation are key areas for multi-modal large-model applications. They rely on DiT-based diffusion models that process very long input sequences. However, because the computational complexity of the Transformer grows quadratically with sequence length, latency is a particularly prominent problem when deploying DiTs in practice. xDiT is an open-source, high-performance inference framework designed specifically for DiTs. It scales DiT inference in parallel to multiple GPUs, or even multiple machines, and, combined with compilation optimizations, meets the real-time requirements of applications.
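To see why sequence length is the bottleneck, consider a rough back-of-the-envelope calculation. The 16x16 patch size below is an assumption chosen for exposition, not a statement about Flux.1's actual architecture:

```python
# Rough illustration (assumed 16x16 latent patches, not Flux.1's real
# config): a square image is split into patches, each patch becomes one
# token, and self-attention cost scales with seq_len**2.

def attention_cost(resolution: int, patch: int = 16) -> int:
    """Relative self-attention cost for a square image: O(seq_len^2)."""
    seq_len = (resolution // patch) ** 2  # e.g. 1024px -> 64*64 = 4096 tokens
    return seq_len ** 2

# Doubling the resolution quadruples the sequence length,
# so attention cost grows 16x.
ratio = attention_cost(2048) // attention_cost(1024)
```

This quadratic growth is what makes single-GPU latency explode at high resolutions, and why spreading the sequence across GPUs pays off.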
Open source project address: https://github.com/xdit-project/xDiT
By applying xDiT, we successfully deployed multi-GPU parallel processing in the ComfyUI workflow, thereby achieving the following results:
Faster & Quality-Preserving
Without compromising image quality, xDiT significantly accelerates the FLUX.1 workflow in ComfyUI by leveraging multi-GPU parallelism. Taking FLUX.1-dev as an example, at 20 sampling steps the single-image generation time drops from 13.87 seconds to 6.98 seconds.
While increasing generation speed, we have also preserved the quality of the generated images. With identical prompt inputs, the resulting images show no significant difference in quality, and both demonstrate clear instruction-following capability.
Prompt: A spacious futuristic classroom with digital plants adorning the walls and floating 3D holographic projections of the solar system hanging from the ceiling. On one side, a large touch-sensitive smartboard reads “XDiT” in neon pen. In the center, there are several transformable smart desks, and students are engaged in interactive learning through augmented reality glasses. In the corner, a small robot is providing personalized learning support to the students.
Simple & Seamless
In the initial stage, to bring xDiT's capabilities into ComfyUI, we implemented two custom core nodes at the Pipeline granularity, XfuserPipelineLoader and XfuserSampler. Backed by an HTTP service, these nodes provided end-to-end generation in ComfyUI using xDiT.
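The early integration can be pictured as a custom node that forwards a whole generation request to an xDiT HTTP service. The node interface below follows ComfyUI's real custom-node conventions (INPUT_TYPES, RETURN_TYPES, FUNCTION), but the endpoint URL and request fields are hypothetical, for illustration only:

```python
# Hypothetical sketch of the early HTTP-based integration. The ComfyUI
# node interface (INPUT_TYPES / RETURN_TYPES / FUNCTION) follows real
# ComfyUI custom-node conventions; the service URL and request payload
# are assumptions, not xDiT's actual API.
import json
import urllib.request


class XfuserSamplerSketch:
    """Forwards an end-to-end generation request to an xDiT HTTP service."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "prompt": ("STRING", {"multiline": True}),
                "steps": ("INT", {"default": 20, "min": 1, "max": 100}),
            }
        }

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "sample"
    CATEGORY = "xDiT"

    def sample(self, prompt, steps, endpoint="http://localhost:6000/generate"):
        # One round trip per image: the entire diffusion pipeline runs
        # behind the service, which is what made this design monolithic.
        payload = json.dumps({"prompt": prompt, "steps": steps}).encode()
        req = urllib.request.Request(
            endpoint, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return (json.loads(resp.read())["image"],)
```

The drawback is visible in the shape of the code: the node hides the whole pipeline behind one call, leaving nothing for other ComfyUI nodes to compose with.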
However, such a highly customized implementation does not integrate well with the vast ComfyUI community ecosystem. Moreover, monolithic end-to-end generation runs counter to ComfyUI's highly modular design philosophy, which aims for greater flexibility.
We therefore aimed to add parallel acceleration to ComfyUI with minimal modifications. The core requirement is to transform only the diffusion-model-related parts of the workflow. Taking the official FLUX.1-dev ComfyUI workflow as an example:
The diffusion model is loaded by the #12 Load Diffusion Model node, processed and wrapped by the #30 Model Sampling Flux and #22 BasicGuider nodes, and then passed to #13 SamplerCustomAdvanced, which performs the multi-step denoising computation. However, since ComfyUI was originally designed for single-GPU personal computers, distributing the computation belonging to a single node across multiple GPUs, without major changes to the existing workflow, remains a significant challenge.
To address this issue, we implemented customized optimizations for the core #12 Load Diffusion Model loading node and the #13 SamplerCustomAdvanced computation node. By integrating the distributed computing capabilities of the Ray framework, simply replacing them with XDiTUNetLoader and XDiTSamplerCustomAdvanced automatically distributes the model's computation across multiple workers. Thanks to xDiT's excellent parallel performance, this significantly increases processing speed and optimizes computational efficiency without sacrificing the stability and flexibility of the existing workflow.
Plugin Support
For diffusion models, prompts and base models alone are far from sufficient for production-ready results. To address this, the ComfyUI community supports a range of plugins, including LoRAs, ControlNet, and IP-Adapter, for better stylization and controllability of generated content.
To accomplish this, xDiT has integrated the most popular of these, the LoRA plugin, into ComfyUI. Multiple LoRAs can now be loaded with a single node, XDiTFluxLoraLoader. Moreover, because the LoRA module is kept separate from the base model, switching between LoRAs does not require rewriting the model's base weights; only the weights of the newly applied LoRAs need to be adjusted. Based on this, xDiT supports dynamic LoRA switching at runtime, completing the switch almost instantly, with no waiting.
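The reason switching is cheap follows from how LoRA works: a LoRA contributes a low-rank delta on top of frozen base weights, W_eff = W + alpha * (B @ A), so swapping LoRAs only replaces the delta term while W stays untouched. A minimal numeric sketch with toy 2x2 matrices (not Flux's real shapes or xDiT's code):

```python
# Toy illustration of LoRA switching: the base weight W is frozen and
# only the low-rank delta B @ A changes. Shapes and values are made up.

def matmul(a, b):
    """Tiny dense matmul for the sketch (matrices as lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_scaled(w, delta, alpha):
    """W_eff = W + alpha * delta, computed without modifying W."""
    return [[wi + alpha * di for wi, di in zip(rw, rd)]
            for rw, rd in zip(w, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]                    # frozen base weight
lora_a = matmul([[1.0], [0.0]], [[0.0, 2.0]])   # rank-1 delta B @ A
lora_b = matmul([[0.0], [1.0]], [[3.0, 0.0]])   # a second LoRA's delta

W_a = add_scaled(W, lora_a, alpha=1.0)  # apply LoRA "a"
W_b = add_scaled(W, lora_b, alpha=1.0)  # switch to "b": W is untouched
```

Because only the small delta matrices are swapped, the switch costs far less than reloading or patching the full model weights.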
Currently, we are continuously working on completing the integration and adaptation of other commonly used community plugins such as ControlNet and IP-Adapter, which will be iteratively updated in subsequent versions.
Contact Us!
This feature is currently in the demo and experimental stage. If you are interested in the xDiT parallel version of ComfyUI, we welcome you to contact us by email at jiaruifang@tencent.com.