How to run Wan 2.2 open-source video generation model on a cloud GPU

Video has become a central medium for communication, entertainment, and sharing ideas, and AI is making video generation faster and more accessible. Open-source models are advancing rapidly, narrowing the gap with, and in some cases surpassing, proprietary systems. Tencent’s Hunyuan and Alibaba’s Wan (Tongyi Wanxiang) series are prime examples, with the latest release, Wan 2.2, standing out as the best open-source AI video generator currently available. Built on a Mixture-of-Experts diffusion architecture, it combines cinematic control with high-fidelity aesthetics and has outperformed popular models such as OpenAI Sora, Pika 2.2, Runway 3, and Luma Ray 2.
Here’s a brief overview of Wan 2.2 specifications:
| Specification | Alibaba Wan 2.2 |
|---|---|
| Architecture | Diffusion model with a two-expert Mixture of Experts (MoE): a high-noise expert plans global layout in early steps, and a low-noise expert refines details later |
| Model Variants | Text-to-video: Wan2.2-T2V-A14B (MoE); Text-image-to-video: Wan2.2-TI2V-5B (high-compression VAE, hybrid T2V+I2V); Image-to-video: Wan2.2-I2V-A14B (MoE); Speech-to-video: Wan2.2-S2V-14B |
| Resolution and Frame Rate | 720p at 24 fps (16:9 aspect ratio) |
| License | Apache 2.0 |
Wan 2.2 is currently the highest-performing open-weights video generation model on the Artificial Analysis leaderboard.

Source: Artificial Analysis
Benchmarks shared by the Wan team also indicate that it scores highly compared with other SOTA models such as OpenAI Sora, Seedance 1.0, and Kling 2.0.

Source: Wan Inc
How to run Wan 2.2 on H100 GPUs
Prerequisites to self-host Wan 2.2
Create a GPU virtual machine (VM) on Ori Global Cloud. We recommend NVIDIA H100 GPUs to reduce video generation time. With a single H100 GPU, generating a 720p video takes 20-25 minutes; a setup with multiple H100 GPUs can dramatically speed up the video generation process.
Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch and TensorFlow, and Jupyter Notebook are preinstalled for you.
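Once the VM is up, it is worth a quick sanity check that the GPU and driver are visible before installing anything. Assuming the init script ran and the NVIDIA driver is in place, the following should list the H100 along with its driver and CUDA versions:
nvidia-smi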
Step 1: SSH into your VM and create a virtual environment
apt install python3.12-venv
python3.12 -m venv wan-env
source wan-env/bin/activate
Step 2: Clone the GitHub repository
git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
Step 3: Install dependencies
pip install -r requirements.txt
Step 4: If you run into errors installing FlashAttention, install these packages first and then re-run the requirements command
pip install torch torchvision torchaudio
pip install packaging
pip install --upgrade psutil
pip install ninja
pip install flash-attn --no-build-isolation
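Before moving on, you can confirm the FlashAttention build succeeded by importing it. A quick check, assuming the wheel compiled against the PyTorch and CUDA versions in your environment:
python -c "import flash_attn; print(flash_attn.__version__)"
If this prints a version number without errors, the dependency install is complete.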
Step 5: We installed Jupyter to make it easy to run the prompts and download the video files
pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0
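Jupyter listens on port 8888 by default. Rather than exposing that port publicly, you can tunnel it over SSH from your local machine; a minimal sketch, assuming the default port and substituting your own VM username and IP address:
ssh -N -L 8888:localhost:8888 <user>@<vm-ip>
You can then open the tokenized URL printed by the jupyter notebook command in a local browser.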
Step 6: Download the models with the Hugging Face command-line interface (CLI)
pip install "huggingface_hub[cli]"
Text-to-video (T2V):
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
Image-to-video (I2V):
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
Speech-to-video (S2V):
huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./Wan2.2-S2V-14B
Note: There is also a 5B hybrid model that combines T2V and I2V capabilities. You can find the download instructions in this model card.
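These checkpoint downloads are large, so it is worth confirming that the VM's disk has enough free space before (or while) they run:
df -h .
If space is tight, download only the variant you plan to use.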
Step 7: Run video generation
Text-to-video (T2V):
python generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --offload_model True --convert_model_dtype --prompt "A large, majestic dragon with olive green scales and flaming red eyes is set against the backdrop of a serene, snowy valley. The video begins with a tracking shot of the valley and then the camera zooms in on the dragon highlighting the details on its face. To maintain visual clarity of this video, every element within the frame is crisp and discernible."
Image-to-video (I2V):
python generate.py --task i2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-I2V-A14B --offload_model True --convert_model_dtype --image ./servers.png --prompt "Turn the image into smooth first person view (FPV) footage"
Note: The speech-to-video model, which turns an image into a video with the help of an input audio file, took a long time to finish during our testing and could not integrate the audio well.
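If your VM has multiple H100s, the Wan2.2 repository's README also documents a multi-GPU invocation that shards the model with FSDP and parallelizes attention with Ulysses. A sketch for an 8-GPU machine, assuming the flag names below match your checkout (run python generate.py --help to confirm); the prompt is shortened here for brevity:
torchrun --nproc_per_node=8 generate.py --task t2v-A14B --size 1280*720 --ckpt_dir ./Wan2.2-T2V-A14B --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "A tracking shot of a majestic dragon in a serene, snowy valley"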
If you’d like to use ComfyUI on a cloud GPU to generate videos with Wan 2.2, check out our Genmo Mochi 1 tutorial that uses ComfyUI. The workflow download instructions are available here.
How good is Wan 2.2?
We found Wan 2.2 to be the best open-weights video generation model currently available. From top-notch aesthetics to flexible camera control and high-fidelity content, Wan 2.2 impressed us with both its text-to-video and image-to-video capabilities. The model showcases excellent prompt understanding and can generate videos with realistic facial expressions and fairly accurate anatomy. However, it did struggle with text rendering, especially longer and punctuated sentences.
Here are a couple of video clips from our model testing:
Text-to-video:
Image-to-video:
Chart your AI reality on Ori
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications:
- Deploy Private Clouds for flexible and secure enterprise AI.
- Leverage GPU Instances as on-demand virtual machines.
- Operate Inference Endpoints effortlessly at any scale.
- Scale GPU Clusters for training and inference.
- Manage AI workloads on Serverless Kubernetes without infrastructure overhead.
