Open-Sora-Plan is an ambitious open-source implementation of video generation technology, inspired by OpenAI's Sora model. While the official repository provides installation instructions, there are some crucial details and specific files needed for successful setup that aren't immediately apparent.
First, follow the installation instructions in the official repository. Once you've completed those steps, you'll need to manually download some essential model files that aren't covered in the basic setup.
I've created two Python scripts to handle the model downloads. The first downloads the Open-Sora model files:
```python
from huggingface_hub import snapshot_download

# Download the full Open-Sora-Plan v1.3.0 weights (transformer, VAE, etc.)
repo_id = "LanguageBind/Open-Sora-Plan-v1.3.0"
local_dir = "./open_sora_plan_v1_3_0"
snapshot_download(repo_id, local_dir=local_dir)
print(f"Repository downloaded to {local_dir}")
```
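If the download is interrupted, re-running `snapshot_download` resumes where it left off. As a quick sanity check afterwards, you can total up what landed on disk; this is a small sketch assuming the `./open_sora_plan_v1_3_0` directory from the script above:

```python
import os

local_dir = "./open_sora_plan_v1_3_0"

# Walk the download directory and report the total size on disk,
# to confirm the snapshot actually completed.
total = 0
for root, _dirs, files in os.walk(local_dir):
    for name in files:
        total += os.path.getsize(os.path.join(root, name))
print(f"{total / 1e9:.1f} GB downloaded to {local_dir}")
```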
The second script downloads the required text encoder. Be warned: Google's mT5-XXL is a substantial download, roughly 50 GB:
```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

repo_name = "google/mt5-xxl"
local_dir = "./mt5-xxl-pytorch"

print("Downloading tokenizer...")
tokenizer = MT5Tokenizer.from_pretrained(repo_name, cache_dir=local_dir)

print("Downloading PyTorch model...")
model = MT5ForConditionalGeneration.from_pretrained(repo_name, cache_dir=local_dir)
```
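Because the script above uses `cache_dir`, the weights land inside Hugging Face's cache layout (`models--google--mt5-xxl/snapshots/<hash>`), and it's that innermost snapshot directory, not `./mt5-xxl-pytorch` itself, that the sampling script needs. A small sketch to locate it, assuming the cache directory from the download script:

```python
from pathlib import Path

cache_dir = Path("./mt5-xxl-pytorch")

# Hugging Face caches models under models--<org>--<name>/snapshots/<hash>;
# the hash directory is what the sampling script expects as the encoder path.
snapshots = list(cache_dir.glob("models--google--mt5-xxl/snapshots/*"))
if snapshots:
    print(f"Text encoder snapshot path: {snapshots[0].resolve()}")
else:
    print("Snapshot not found; check that the download completed.")
```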
After downloading these models, the file placement can be tricky. Here's how your sample_t2v_v1_3.sh script should look when running from the root of your cloned Open-Sora-Plan repository:
```bash
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=1 --nproc_per_node 1 --master_port 29514 \
    -m opensora.sample.sample \
    --model_path "/home/r/Documents/Open-Sora-Plan/open_sora_plan_v1_3_0/any93x640x640" \
    --version "v1_3" \
    --num_frames 93 \
    --height 352 \
    --width 640 \
    --cache_dir "./cache_dir" \
    --text_encoder_name_1 "/home/r/Documents/Open-Sora-Plan/mt5-xxl-pytorch/models--google--mt5-xxl/snapshots/e07c395916dfbc315d4e5e48b4a54a1e8821b5c0" \
    --text_prompt "examples/sora.txt" \
    --ae "WFVAEModel_D8_4x8x8" \
    --ae_path "/home/r/Documents/Open-Sora-Plan/open_sora_plan_v1_3_0/vae" \
    --save_img_path "./train_1_3_nomotion_fps18" \
    --fps 18 \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --max_sequence_length 512 \
    --sample_method "EulerAncestralDiscrete" \
    --seed 1234 \
    --num_samples_per_prompt 1 \
    --rescale_betas_zero_snr \
    --prediction_type "v_prediction" \
    --save_memory \
    --device "cuda"
```

Note that the `--model_path`, `--text_encoder_name_1`, and `--ae_path` values must be adjusted to wherever the two download scripts placed the files on your machine.
A crucial consideration for running this model is its substantial VRAM requirement. In my testing it consumes approximately 23.5 GB of VRAM, which barely fits on 24 GB high-end consumer GPUs such as the NVIDIA RTX 3090 or 4090. The --save_memory flag helps keep peak usage down, and --device "cuda" ensures the model runs on the GPU.
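With so little headroom, it's worth checking free VRAM before launching so the run doesn't fail mid-sampling. A hedged sketch using `nvidia-smi` (the 24,000 MiB threshold is just an approximation of the ~23.5 GB peak observed above):

```python
import shutil
import subprocess

# Approximate peak VRAM usage of the sampler, in MiB (~23.5 GB).
REQUIRED_MIB = 24_000

smi = shutil.which("nvidia-smi")
if smi is None:
    print("nvidia-smi not found; cannot check VRAM.")
else:
    # Query free memory on the first GPU as a plain number of MiB.
    out = subprocess.run(
        [smi, "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    free_mib = int(out.stdout.splitlines()[0])
    print(f"Free VRAM: {free_mib} MiB")
    if free_mib < REQUIRED_MIB:
        print("Warning: free up GPU memory or rely on --save_memory offloading.")
```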