1. Character Identification Similarity (CIDS)
Evaluates character identity consistency. Grounding DINO detects characters, and features are then extracted with CLIP or face-embedding models (ArcFace, AdaFace, FaceNet). The metric reports both Cross-Similarity, between generated and reference images, and Self-Similarity, within the sequence of generated images.
1) Detection & Cropping: Grounding DINO detects character bounding boxes in reference and generated images based on text prompts. 2) Feature Extraction: realistic characters are embedded with ArcFace, AdaFace, and FaceNet; non-realistic characters with CLIP. Both pipelines produce 512-dimensional vectors. 3) Matching & Scoring: cosine similarity between feature vectors is used to match characters and calculate the similarity scores.
Cross-Similarity: Measures the similarity between characters in the generated images and their corresponding reference images.
Self-Similarity: Evaluates the identity consistency of a character across different generated shots within the same story. A higher score indicates that the character's appearance remains stable throughout the sequence.
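A minimal sketch of the matching-and-scoring step for both scores, assuming character crops have already been detected and embedded into 512-dimensional vectors; the function names and the best-match/pairwise averaging are illustrative, not the benchmark's exact implementation:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cids_scores(char_gen_embs: list[np.ndarray],
                char_ref_embs: list[np.ndarray]) -> tuple[float, float]:
    """Illustrative CIDS scoring for one character.

    char_gen_embs: embeddings of the character's crops across generated shots.
    char_ref_embs: embeddings of the character's reference crops.
    Returns (cross_similarity, self_similarity).
    """
    # Cross-Similarity: each generated crop vs. its best-matching reference.
    cross = np.mean([max(cosine_sim(g, r) for r in char_ref_embs)
                     for g in char_gen_embs])

    # Self-Similarity: mean pairwise similarity among the generated crops.
    pairs = [cosine_sim(char_gen_embs[i], char_gen_embs[j])
             for i in range(len(char_gen_embs))
             for j in range(i + 1, len(char_gen_embs))]
    self_sim = float(np.mean(pairs)) if pairs else 1.0
    return float(cross), self_sim
```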
2. Style Similarity (CSD)
Adopts a CSD (CLIP Style Disentanglement) based metric to evaluate both self-consistency (among the generated images) and reference consistency (against the reference images). Each image is encoded by a CLIP vision encoder trained on large-scale style datasets, and pairwise cosine similarity is computed between style embeddings.
This metric quantifies style consistency using CSD-CLIP features. The process involves: 1) Extracting features with a CLIP vision encoder. 2) Using CSD layers to separate content and style features. 3) Computing the cosine similarity between the resulting style feature embeddings to score both self-similarity (within generated images) and cross-similarity (vs. reference images).
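A minimal sketch of this scoring, where `encoder` stands in for the CLIP vision backbone and `style_head` for the CSD projection that isolates the style component; both names are assumptions rather than the exact CSD-CLIP API, and at least two generated images are assumed:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def style_scores(gen_imgs: torch.Tensor, ref_imgs: torch.Tensor,
                 encoder, style_head) -> tuple[float, float]:
    """Illustrative CSD-style scoring.
    Returns (self_similarity, cross_similarity)."""
    # Encode, project to the style subspace, and L2-normalize so that
    # the dot products below equal cosine similarities.
    gen_style = F.normalize(style_head(encoder(gen_imgs)), dim=-1)
    ref_style = F.normalize(style_head(encoder(ref_imgs)), dim=-1)

    # Self-similarity: mean pairwise cosine among generated images,
    # excluding the diagonal.
    sim = gen_style @ gen_style.T
    n = sim.shape[0]
    self_sim = (sim.sum() - sim.trace()) / (n * (n - 1))

    # Cross-similarity: mean cosine between generated and reference styles.
    cross_sim = (gen_style @ ref_style.T).mean()
    return self_sim.item(), cross_sim.item()
```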
3. Prompt Adherence (Alignment Score)
Evaluates how well generated images align with the storyboard descriptions. Using GPT-4.1, we assess four key aspects on a 0-4 Likert scale, which is then converted to a 100-point scale: Character Interaction, Shooting Method, Static Shot Description, and Individual Actions. Each aspect is described below, followed by a scoring sketch.
Addresses superfluous or missing characters by scoring the accuracy of the character count. The score is calculated as $OCCM = 100 \times \exp(-\frac{|D-E|}{\epsilon + E})$, where $D$ is the detected character count, $E$ is the expected count, and $\epsilon = 10^{-6}$ is a smoothing factor that avoids division by zero when $E = 0$.
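The formula is a one-liner; a direct transcription (variable names follow the definition above):

```python
import math

def occm(detected: int, expected: int, eps: float = 1e-6) -> float:
    """OCCM = 100 * exp(-|D - E| / (eps + E))."""
    return 100.0 * math.exp(-abs(detected - expected) / (eps + expected))

# Example: three characters expected, two detected -> ~71.7.
print(occm(2, 3))
```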
Character Interaction: Evaluates the alignment between the group-level interactions of characters in the generated image and the intended interactions described in the storyboard's static shot description.
Shooting Method: Assesses consistency between the camera perspective depicted in the generated image (e.g., close-up, wide shot) and the shot design specified in the storyboard.
Static Shot Description: Measures the overall correspondence between the generated scene and the narrative details provided in the storyboard's static shot description, including setting, mood, and layout.
Individual Actions: Evaluates the accuracy of each character's gestures, expressions, and poses relative to their described behavior in the static shot description.
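A minimal sketch of the Likert-to-100 conversion, assuming the judge returns one 0-4 rating per aspect and that the four aspect scores are averaged; the averaging and the key names are assumptions here, and the GPT-4.1 call itself is omitted:

```python
ASPECTS = ["character_interaction", "shooting_method",
           "static_shot_description", "individual_actions"]

def alignment_score(likert: dict[str, int]) -> float:
    """Convert per-aspect 0-4 Likert ratings to the 100-point scale
    and average across the four aspects."""
    assert set(likert) == set(ASPECTS), "one rating per aspect"
    return sum(v / 4 * 100 for v in likert.values()) / len(ASPECTS)

# Example: strong interaction/action fidelity, weaker shot framing -> 81.25.
print(alignment_score({"character_interaction": 4, "shooting_method": 2,
                       "static_shot_description": 3, "individual_actions": 4}))
```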
4. Image Quality & Aesthetic
Evaluates the visual and aesthetic quality of the generated images. This includes an Aesthetic Score from Aesthetic Predictor V2.5, diversity and clarity via Inception Score (IS), and a specific check for Copy-Paste behavior.
Uses Aesthetic Predictor V2.5 (a SigLIP-based model) to assess image aesthetics on a 1-10 scale. It penalizes blurry, noisy, or visually unappealing images. Scores above 5.5 are generally considered high quality.
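A hedged sketch of how the 5.5 threshold might be applied over a batch; `score_fn` is a stand-in for Aesthetic Predictor V2.5, whose real API may differ:

```python
def aesthetic_report(images, score_fn, threshold: float = 5.5):
    """Score a batch and report the mean score plus the share of images
    above the quality threshold. `score_fn` maps an image to a 1-10
    aesthetic score (stand-in for Aesthetic Predictor V2.5)."""
    scores = [score_fn(img) for img in images]
    mean_score = sum(scores) / len(scores)
    share_high = sum(s > threshold for s in scores) / len(scores)
    return mean_score, share_high
```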
Quantifies the model's tendency to directly reuse character reference images instead of generating novel depictions. For single-image input models, it measures the similarity gap between a generated character and both its true reference and an unrelated reference. A higher value indicates stronger copy-paste behavior. This metric is not applicable to models that take multiple reference images as input.
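One plausible reading of the similarity gap described above, sketched over character embeddings (the function name and the embedding inputs are hypothetical):

```python
import numpy as np

def copy_paste_score(gen: np.ndarray, true_ref: np.ndarray,
                     unrelated_ref: np.ndarray) -> float:
    """Similarity gap for one generated character: closeness to its own
    reference minus closeness to an unrelated reference. An unusually
    large gap suggests the reference was reused rather than redrawn."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos(gen, true_ref) - cos(gen, unrelated_ref)
```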
5. Diversity
Measures the model's ability to generate varied outputs. Diversity is a key aspect of generation quality, evaluated alongside clarity using the Inception Score (IS).
The diversity component of the Inception Score evaluates the variety of content across a batch of generated images. Higher scores suggest the model is producing a wider range of distinct images.
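For reference, a minimal single-split sketch of the standard Inception Score, computed from per-image class probabilities (e.g., Inception-v3 softmax outputs); the usual averaging over multiple splits is omitted:

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp(E_x[KL(p(y|x) || p(y))]) over class probabilities
    (one row per image). High per-image confidence (clarity) and a
    broad marginal over classes (diversity) both raise the score."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```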
6. Human Evaluation
A user study was conducted to validate the automated metrics. Participants assessed generated results on three key dimensions. The analysis revealed a strong correlation between the automated metrics and human preferences.
Style Consistency: Evaluates whether scenes within the same story appear visually cohesive. This dimension corresponds to the automated Style Similarity (CSD) score. Top performers in user ratings: UNO (82.0), GPT-4o (81.2), Doubao (80.4).
Character Consistency: Assesses how consistently main characters remain identifiable and coherent throughout the story. This dimension corresponds to the automated CIDS score. Top performers in user ratings: Doubao (92.6), AIbrm (88.4), UNO (84.0).
Aesthetic Quality: Gauges the overall artistic appeal, detail, and storytelling effectiveness of the generated images. This dimension corresponds to the automated Aesthetic Quality score. Top performers in user ratings: GPT-4o (85.6), Doubao (85.0), AIbrm (83.0).
Key correlations between human ratings and automated metrics: Self CSD (Kendall's $\tau = 0.42$, Spearman's $\rho = 0.56$, Pearson's $r = 0.60$), Self CIDS ($\tau = 0.50$, $\rho = 0.68$, $r = 0.80$), and Aesthetics ($\tau = 0.26$, $\rho = 0.40$, $r = 0.54$).
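These agreement statistics are standard rank and linear correlations; a minimal sketch using SciPy, with one automated score and one human score per model:

```python
from scipy import stats

def metric_human_agreement(auto_scores, human_scores):
    """Rank (Kendall, Spearman) and linear (Pearson) agreement between
    an automated metric and human ratings."""
    tau, _ = stats.kendalltau(auto_scores, human_scores)
    rho, _ = stats.spearmanr(auto_scores, human_scores)
    r, _ = stats.pearsonr(auto_scores, human_scores)
    return tau, rho, r
```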