Flux Dev

Run the Flux Dev model with limited VRAM in 8-bit mode. It's possible, but impractical, since the downloads alone are "only" 40 GB.

Setup

pip install accelerate diffusers optimum-quanto transformers sentencepiece

In int4 mode there are places where the pre-trained fp16 weights overflow, resulting in a blank image, so stick to 8-bit.
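
The inference script below loads pre-quantized copies of the transformer and the T5 text encoder from the local directories ./flux-fp8 and ./flux-t5. Here is a minimal sketch of one way to produce them, assuming optimum-quanto's QuantizedDiffusersModel/QuantizedTransformersModel quantize and save_pretrained API with qfloat8 weights; adjust to match your installed version.

import torch
from diffusers import FluxTransformer2DModel
from optimum.quanto import qfloat8
from optimum.quanto.models import QuantizedDiffusersModel, QuantizedTransformersModel
from transformers import T5EncoderModel

class Flux2DModel(QuantizedDiffusersModel):
    base_class = FluxTransformer2DModel

class T5Model(QuantizedTransformersModel):
    auto_class = T5EncoderModel

if __name__ == '__main__':
    # Quantize the transformer to 8-bit float weights and save it locally.
    transformer = FluxTransformer2DModel.from_pretrained(
        'black-forest-labs/FLUX.1-dev', subfolder='transformer', torch_dtype=torch.float16)
    Flux2DModel.quantize(transformer, weights=qfloat8).save_pretrained('./flux-fp8')
    # Same treatment for the T5 text encoder.
    t5 = T5EncoderModel.from_pretrained(
        'black-forest-labs/FLUX.1-dev', subfolder='text_encoder_2', torch_dtype=torch.float16)
    T5Model.quantize(t5, weights=qfloat8).save_pretrained('./flux-t5')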

Inference

from diffusers import FluxPipeline, FluxTransformer2DModel
from optimum.quanto.models import QuantizedDiffusersModel, QuantizedTransformersModel
import torch
from transformers import T5EncoderModel

# Thin wrappers that tell Quanto which base class to rebuild on load.
class Flux2DModel(QuantizedDiffusersModel):
    base_class = FluxTransformer2DModel

class T5Model(QuantizedTransformersModel):
    auto_class = T5EncoderModel

if __name__ == '__main__':
    # Duct tape for Quanto support: Quanto rebuilds the model via from_config,
    # so patch it to return an fp16 module instead of the fp32 default.
    T5EncoderModel.from_config = lambda c: T5EncoderModel(c).to(dtype=torch.float16)
    # Load the pre-quantized weights, then unwrap back to the plain modules.
    t5 = T5Model.from_pretrained('./flux-t5')._wrapped
    transformer = Flux2DModel.from_pretrained('./flux-fp8')._wrapped
    pipe = FluxPipeline.from_pretrained('black-forest-labs/FLUX.1-dev',
                                        text_encoder_2=t5,
                                        transformer=transformer)
    # Move one whole model at a time to the GPU, only for the duration of its
    # forward pass; everything else waits in CPU RAM.
    pipe.enable_model_cpu_offload()
    image = pipe('cat playing piano', num_inference_steps=10, output_type='pil').images[0]
    image.save('cat.png')
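
If model-level offload still exceeds your VRAM, diffusers also offers pipe.enable_sequential_cpu_offload() as a drop-in replacement for the call above; it streams weights to the GPU submodule by submodule, which shrinks the VRAM peak further at a large cost in speed.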

Disclaimer

Use of this code and of the accompanying documentation requires citation and attribution to the author, via a link to their Hugging Face profile, in all resulting work.
