
Measuring What the Eye Sees: PSNR, SSIM, and VMAF

If you work in video streaming long enough, you will inevitably face this question: how do you measure whether one encode looks better than another?

The naive answer is “just watch it.” But when you have hundreds of encoding configurations across dozens of test clips — different codecs, bitrates, resolutions, and preset levels — subjective evaluation does not scale. You need a number. The question is which number to trust.

This post walks through the three most important objective quality metrics in the streaming industry: PSNR, SSIM, and VMAF. We will look at how each one works, where it fails, and how they are used in practice to make real encoding decisions.

Why Objective Metrics Exist

The gold standard for video quality evaluation is a subjective test: you gather a panel of viewers, show them clips under controlled conditions, and collect their ratings. The result is a MOS (Mean Opinion Score) — typically on a 1-5 or 0-100 scale.

The problem is that subjective tests are expensive, slow, and not repeatable in a CI pipeline. An objective metric attempts to predict MOS computationally. The value of a metric is measured by how well its scores correlate with actual human ratings — typically evaluated via PCC (Pearson Correlation Coefficient) and SRCC (Spearman Rank-Order Correlation Coefficient) against subjective datasets.
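For concreteness, here is how the two correlation measures are computed. The MOS and metric values below are invented purely for illustration; Spearman is just Pearson applied to ranks:

```python
# Evaluating how well a metric tracks subjective scores: Pearson (linear)
# and Spearman (rank-order) correlation, on made-up data.
import numpy as np

mos    = np.array([4.5, 3.8, 3.1, 2.4, 1.6])       # subjective ratings (1-5)
metric = np.array([95.0, 88.0, 74.0, 55.0, 31.0])  # hypothetical metric scores

pcc = np.corrcoef(metric, mos)[0, 1]               # Pearson

def ranks(a):
    return np.argsort(np.argsort(a))               # rank transform

srcc = np.corrcoef(ranks(metric), ranks(mos))[0, 1]  # Spearman = Pearson on ranks
print(f"PCC={pcc:.3f}  SRCC={srcc:.3f}")  # → PCC=0.984  SRCC=1.000
```

Since the hypothetical metric is perfectly monotone in MOS, SRCC is exactly 1 even though the relationship is not perfectly linear.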

Every metric discussed below is a full-reference (FR) metric, meaning it requires access to both the original uncompressed source and the compressed output. This is in contrast to no-reference (NR) metrics that evaluate quality from the compressed signal alone.

PSNR: The Simplest Baseline

How It Works

PSNR (Peak Signal-to-Noise Ratio) is the oldest and simplest video quality metric. It measures the ratio between the maximum possible signal power and the power of the distortion (noise) introduced by compression.

The computation is straightforward. First, calculate the MSE (Mean Squared Error) between corresponding pixels of the reference and distorted frames:

$$MSE = \frac{1}{M \times N} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I_{ref}(i,j) - I_{dist}(i,j)]^2$$

Then derive PSNR in decibels:

$$PSNR = 10 \cdot \log_{10}\left(\frac{MAX^2}{MSE}\right)$$

where $MAX$ is the maximum pixel value (255 for 8-bit content, 1023 for 10-bit).

For video, PSNR is computed per-frame and then averaged (arithmetic mean) across all frames.
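The whole computation fits in a few lines. A minimal sketch, assuming 8-bit grayscale frames as numpy arrays:

```python
# Per-frame PSNR from MSE, for 8-bit content (MAX = 255).
import numpy as np

def psnr(ref: np.ndarray, dist: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: PSNR is undefined/infinite
    return 10 * np.log10(max_val ** 2 / mse)

ref = np.full((4, 4), 128, dtype=np.uint8)
dist = ref.copy()
dist[0, 0] = 120  # one pixel off by 8 -> MSE = 64/16 = 4
print(round(psnr(ref, dist), 2))  # → 42.11
```

For real video you would run this per frame on the luma plane and average; tools like FFmpeg's `psnr` filter also report per-plane values.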

Typical Score Ranges

| PSNR (dB) | Perceived Quality |
| --- | --- |
| > 50 | Virtually lossless — differences invisible even in A/B comparison |
| 40 - 50 | Excellent — minor artifacts visible only under close inspection |
| 30 - 40 | Good to fair — compression artifacts become noticeable |
| 20 - 30 | Poor — clearly degraded, blocking and blurring visible |
| < 20 | Unusable |

Why It Is Still Used

Despite being a purely mathematical error measure with no perceptual model, PSNR remains ubiquitous:

  1. It is essentially free to compute, even at scale.
  2. It is deterministic and universally understood, so results are reproducible and comparable across decades of codec literature.
  3. It maps directly onto what encoders optimize internally: rate-distortion optimization typically minimizes an MSE-based cost, so PSNR tracks encoder behavior closely.

Where It Fails

The fundamental limitation of PSNR is illustrated below: two images with the same PSNR score can have drastically different perceptual quality, because PSNR is blind to the type of distortion — it only measures the amount.

Two images can have the same PSNR score but look very different to the human eye. Source: TestDevLab

PSNR measures pixel-level error, not perceptual quality. This leads to well-known failure modes:

  1. Texture masking: A complex, high-motion scene can have low PSNR but look subjectively fine because the eye cannot track individual pixel errors in textured regions. PSNR penalizes these errors just as heavily as errors in smooth gradients where they are actually visible.

  2. Structural distortion insensitivity: PSNR treats all pixel errors equally. A spatially correlated blur (which humans find very objectionable) can produce the same MSE as uniformly distributed noise (which is far less noticeable).

  3. Cross-codec comparison: PSNR is unreliable when comparing fundamentally different codecs. The Moscow State University (MSU) codec comparisons historically showed x264 far ahead of MainConcept by PSNR, but subjective evaluations consistently placed them much closer. Different codecs introduce different types of artifacts, and PSNR cannot distinguish between a blocking artifact and a ringing artifact that may be equally objectionable.

  4. Resolution and content dependence: A PSNR of 35 dB on a 4K nature documentary means something very different from 35 dB on a 480p animation. The absolute number is not directly comparable across content or resolutions.

SSIM: Adding Structure

SSIM (Structural Similarity Index), proposed by Wang et al. in 2004, was the first widely adopted metric to incorporate aspects of human visual perception. Instead of measuring raw pixel error, it compares three properties between local patches of the reference and distorted images:

  1. Luminance — comparison of mean pixel intensities
  2. Contrast — comparison of standard deviations
  3. Structure — comparison of normalized pixel patterns (correlation coefficient)

$$SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

SSIM outputs a score between 0 and 1 (where 1 means identical). Its multi-scale variant, MS-SSIM, applies the computation at multiple resolutions and pools the results, which better captures distortions at different spatial frequencies.
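The formula translates directly into code. Note that real SSIM averages this computation over sliding local windows (typically an 11×11 Gaussian); the single-window global version below is only to make the formula concrete:

```python
# Single-window SSIM straight from the formula, with the standard
# constants C1 = (0.01*MAX)^2 and C2 = (0.03*MAX)^2.
import numpy as np

def ssim_global(x, y, max_val=255.0):
    C1, C2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64)).astype(np.float64)
noisy = ref + rng.normal(0, 10, ref.shape)  # additive noise

print(ssim_global(ref, ref))    # identical images -> 1.0
print(ssim_global(ref, noisy))  # degraded -> below 1.0
```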

SSIM is sensitive to color and structural changes that PSNR might miss entirely. Source: TestDevLab

SSIM correlates better with subjective ratings than PSNR, particularly for blur and structural distortions. However, it still struggles with temporal artifacts (since it is fundamentally a per-frame metric) and with content-adaptive quality assessment.

VMAF: Netflix’s Perceptual Fusion Metric

The Problem VMAF Solves

In 2016, Netflix published a metric that would change how the industry thinks about video quality. The motivation was practical: Netflix needed to optimize encoding across thousands of titles with wildly different visual characteristics — from animated shows to dark thrillers to high-motion sports. No single elementary metric predicted subjective quality well enough across all content types.

VMAF (Video Multi-Method Assessment Fusion) takes a fundamentally different approach: instead of designing a single better metric, it fuses multiple elementary metrics using machine learning and trains the fusion model against large-scale subjective data.

Architecture

VMAF follows a three-stage pipeline: feature extraction, ML fusion, and score output. The diagram below shows how the CPU-based feature extractors process each reference-distorted frame pair sequentially:

VMAF feature extractor processing on CPU — each feature (VIF, ADM, Motion) is computed sequentially per frame pair. Source: NVIDIA Developer Blog

Stage 1: Feature Extraction

Three elementary feature extractors run on each frame pair:

| Feature | Description | Scales |
| --- | --- | --- |
| VIF (Visual Information Fidelity) | Measures information loss using a natural scene statistics model. Computed across 4 scales of a Gaussian scale space. | 4 |
| ADM (Adaptive Detail Metric) | Evaluates detail visibility — how well fine structures survive compression. Also known as DLM in earlier literature. | 4 |
| Motion | Temporal activity computed as the mean absolute difference between adjacent frames. Captures how much movement is present. | 1 |

This produces a feature vector of 9 values per frame (4 VIF scales + 4 ADM scales + 1 motion).

Stage 2: Machine Learning Fusion

The feature vector is fed into a Support Vector Regressor (SVR) trained on a large corpus of subjective ratings. Netflix trained the default model (vmaf_v0.6.1) on HDTV viewing conditions (3 screen heights distance, 1080p display) using content spanning multiple genres and distortion levels.

The SVR learns non-linear relationships between the elementary features and subjective quality. For example, it learns that high motion reduces the perceptual impact of detail loss (because the eye cannot track fine details in fast-moving scenes), while low motion amplifies it.
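To make the data flow of the fusion stage concrete, here is a toy version on fully synthetic data. A plain least-squares linear model stands in for VMAF's non-linear SVR; only the shapes (9 features per frame, a 0-100 output) match the real pipeline:

```python
# Toy fusion stage: per-frame feature vectors (4 VIF + 4 ADM + 1 motion)
# regressed onto a 0-100 quality score. Linear least squares stands in
# for VMAF's trained SVR; every number here is invented.
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, (200, 9))        # 200 frames x 9 features
w_true = rng.uniform(0.0, 20.0, 9)         # fictitious ground-truth weights
y = np.clip(X @ w_true, 0.0, 100.0)        # synthetic MOS-like targets

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # "train" the fusion weights
score = float(np.clip(X[0] @ w, 0.0, 100.0))  # predict one frame's score
print(f"frame 0 predicted score: {score:.1f}")
```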

Stage 3: Score Output

VMAF outputs a score per frame on a 0-100 scale, designed to correlate directly with MOS:

VMAF quality scale — scores map directly to perceptual quality levels. Source: TestDevLab

| VMAF Score | Subjective Quality |
| --- | --- |
| 93 - 100 | Excellent — perceptually transparent |
| 80 - 93 | Good — minor artifacts, not objectionable |
| 60 - 80 | Fair — noticeable degradation |
| 40 - 60 | Poor — clearly impaired |
| < 40 | Bad — heavily distorted |

Available Models

Netflix ships several pre-trained models for different viewing scenarios:

| Model | Target Use Case |
| --- | --- |
| vmaf_v0.6.1 | HDTV at 3H viewing distance (1080p). The default. |
| vmaf_4k_v0.6.1 | 4K UHD at 1.5H viewing distance. Stricter on detail loss. |
| vmaf_b_v0.6.3 | Bootstrap model that outputs confidence intervals (CI). |
| vmaf_v0.6.1neg | NEG (No Enhancement Gain) mode. Penalizes sharpening or enhancement that artificially inflates scores. Critical for fair codec evaluation. |
The NEG model deserves special attention. Standard VMAF can be “gamed” by applying sharpening filters that boost VIF scores without genuinely improving quality. NEG mode disables this enhancement gain, making it essential for honest codec comparisons.

VMAF-CUDA: GPU Acceleration

As of libvmaf 3.0 (December 2023) and FFmpeg 6.1, VMAF computation can be offloaded to NVIDIA GPUs via libvmaf_cuda. The GPU implementation parallelizes the feature extraction stage across CUDA cores:

GPU-accelerated feature extraction — CUDA enables parallel processing of VIF, ADM, and Motion features. Source: NVIDIA Developer Blog

This achieves up to 4.4x throughput improvement and 37x lower latency at 4K resolution compared to CPU-only computation — which matters enormously when you are running quality analysis on thousands of encodes in a CI pipeline.

Relative feature extractor speedup on GPU vs CPU. Source: NVIDIA Developer Blog

In practice, the throughput difference is dramatic — at 4K resolution, GPU-accelerated VMAF in FFmpeg achieves significantly higher frame rates:

FFmpeg VMAF score calculation throughput — GPU vs CPU at different resolutions. Source: NVIDIA Developer Blog

Practical Usage with FFmpeg

Basic VMAF Measurement

```shell
ffmpeg -i distorted.mp4 -i reference.mp4 \
  -lavfi libvmaf="model=version=vmaf_v0.6.1:log_path=vmaf.json:log_fmt=json" \
  -f null -
```

VMAF with PSNR and SSIM (all at once)

```shell
ffmpeg -i distorted.mp4 -i reference.mp4 \
  -lavfi "libvmaf=model=version=vmaf_v0.6.1:feature=name=psnr|name=float_ssim:log_path=metrics.json:log_fmt=json" \
  -f null -
```

4K Model with NEG Mode

```shell
ffmpeg -i distorted_4k.mp4 -i reference_4k.mp4 \
  -lavfi "libvmaf=model=version=vmaf_4k_v0.6.1neg:log_path=vmaf_4k.json:log_fmt=json:n_threads=8" \
  -f null -
```

GPU-Accelerated VMAF (NVIDIA)

```shell
ffmpeg -hwaccel cuda -i distorted.mp4 -hwaccel cuda -i reference.mp4 \
  -lavfi "[0:v]hwupload_cuda[dist];[1:v]hwupload_cuda[ref];[dist][ref]libvmaf_cuda=log_path=vmaf_gpu.json:log_fmt=json" \
  -f null -
```

Reading the Output

The JSON output contains per-frame scores and pooled statistics:

```json
{
  "pooled_metrics": {
    "vmaf": {
      "min": 72.41,
      "max": 99.12,
      "mean": 91.34,
      "harmonic_mean": 90.87
    }
  },
  "frames": [
    {"frameNum": 0, "metrics": {"vmaf": 94.21, "psnr_y": 42.3, "float_ssim": 0.987}},
    {"frameNum": 1, "metrics": {"vmaf": 93.87, "psnr_y": 41.8, "float_ssim": 0.985}}
  ]
}
```

The harmonic mean is generally preferred over arithmetic mean for VMAF pooling, as it is more sensitive to low-scoring frames (which disproportionately affect perceived quality). A single badly encoded scene can ruin the viewing experience even if the rest of the video scores high.
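The effect is easy to see numerically: with 100 frames where a single frame scores badly, the two pooling methods diverge noticeably.

```python
# One bad frame among 100: arithmetic vs. harmonic mean pooling.
import statistics

frames = [95.0] * 99 + [20.0]   # one badly encoded frame
print(round(statistics.mean(frames), 2))           # → 94.25
print(round(statistics.harmonic_mean(frames), 2))  # → 91.57
```

Note that libvmaf's pooled harmonic mean handles near-zero scores with a small offset, so its exact value can differ slightly from the textbook harmonic mean, but the qualitative behavior is the same.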

How These Metrics Shape Real Encoding Decisions

Per-Title Encoding

The most impactful application of VMAF in the industry is per-title encoding (or more precisely, per-shot encoding). The idea, pioneered by Netflix in 2015, is simple: instead of using a fixed encoding ladder for all content, analyze each title’s visual complexity and generate a custom bitrate ladder that achieves a target VMAF score.

An animated show like BoJack Horseman might achieve VMAF 93 at 750 kbps, while a dark, grainy thriller like Mindhunter might need 4,500 kbps for the same VMAF score. A fixed ladder wastes bandwidth on easy content and under-delivers on hard content.

The chart below shows this clearly: different content encoded with the same CRF values produces wildly different quality-bitrate relationships.

Different content at the same CRF values — visual complexity determines how much bitrate is needed for a given quality level. Source: Fraunhofer Video-Dev

The typical workflow is to encode at multiple resolution-bitrate pairs, compute quality for each, and build rate-quality curves per resolution:

Rate-quality curves at different resolutions — each resolution has a saturation point beyond which adding more bitrate yields diminishing returns. Source: Fraunhofer Video-Dev

The convex hull is the envelope connecting the optimal (bitrate, quality) points across all resolutions. Points on or near this hull represent the most efficient encoding configurations:

The convex hull of optimal encoding points — the final per-title bitrate ladder is selected from points closest to this curve. Source: Fraunhofer Video-Dev
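The hull-selection step itself is a small computation. The sketch below pools made-up (bitrate, VMAF) points from several hypothetical resolutions and keeps the upper envelope, using Andrew's monotone-chain upper hull:

```python
# Upper convex hull over pooled (bitrate_kbps, vmaf) points. All numbers
# are illustrative, not measurements.
def upper_hull(points):
    """Return the upper envelope of (bitrate, quality) points, left to right."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop the middle point if it lies on or below the chord to p
            if (x2 - x1) * (p[1] - y1) >= (p[0] - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

points = [
    (400, 62.0), (800, 74.0), (1600, 83.0),    # e.g. 540p encodes
    (1200, 78.0), (2400, 88.0), (4800, 93.0),  # e.g. 720p encodes
    (3600, 90.0), (7200, 96.0),                # e.g. 1080p encodes
]
print(upper_hull(points))  # dominated points (1200, 78) and (3600, 90) drop out
```

The surviving points are the efficient configurations; the final ladder is chosen from among them.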

ABR Quality Monitoring

In an adaptive bitrate player, VMAF scores can be used to validate that the encoding ladder provides consistent perceptual quality across renditions. If the 720p@2Mbps rendition scores VMAF 85 but the 1080p@3Mbps rendition only scores VMAF 82 (because the higher resolution exposes more compression artifacts at insufficient bitrate), the ladder needs adjustment.

Codec Evaluation

When evaluating whether to adopt a new codec (say, AV1 over H.264), VMAF provides the most reliable quality axis for BD-rate (Bjontegaard Delta rate) calculations. BD-rate tells you the bitrate savings at equivalent quality:

$$BD\text{-}rate = \left(10^{\frac{1}{q_H - q_L}\int_{q_L}^{q_H}\left[\log_{10} R_{new}(q) - \log_{10} R_{ref}(q)\right]dq} - 1\right) \times 100\%$$

where $R_{new}(q)$ and $R_{ref}(q)$ are rate-quality curves fitted to the measured points (conventionally cubic fits of log-bitrate against quality) and $[q_L, q_H]$ is the quality range where the two curves overlap. A BD-rate of -30% with VMAF means the new codec achieves the same VMAF score at, on average, 30% lower bitrate.
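Following the conventional Bjontegaard procedure (fit log-bitrate as a cubic in quality for each codec, average the gap over the overlapping quality range, convert back to a percentage), a minimal implementation looks like this; the rate-quality points are illustrative:

```python
# BD-rate sketch: cubic fit of log10(bitrate) vs. quality per codec,
# integrated difference over the overlapping quality range.
import numpy as np

def bd_rate(ref_pts, new_pts):
    """ref_pts/new_pts: lists of (bitrate_kbps, quality) pairs."""
    r_ref = np.log10([r for r, _ in ref_pts]); q_ref = [q for _, q in ref_pts]
    r_new = np.log10([r for r, _ in new_pts]); q_new = [q for _, q in new_pts]
    p_ref = np.polyfit(q_ref, r_ref, 3)
    p_new = np.polyfit(q_new, r_new, 3)
    lo, hi = max(min(q_ref), min(q_new)), min(max(q_ref), max(q_new))
    # average of log10(R_new) - log10(R_ref) over [lo, hi]
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_new = np.polyval(np.polyint(p_new), hi) - np.polyval(np.polyint(p_new), lo)
    avg_diff = (int_new - int_ref) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

h264 = [(1000, 70.0), (2000, 80.0), (4000, 88.0), (8000, 94.0)]
av1  = [(600, 70.0),  (1200, 80.0), (2400, 88.0), (4800, 94.0)]
print(round(bd_rate(h264, av1), 1))  # → -40.0 (same quality at 40% less bitrate)
```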

Quality Gates in CI/CD

For a video SDK, VMAF enables automated quality regression testing:

```
# Pseudocode for a quality gate
- encode test clips with current SDK build
- compute VMAF against reference encodes
- assert VMAF_mean >= 90 and VMAF_min >= 75
- assert VMAF_delta vs baseline < 1.0 point
```

If an SDK update accidentally degrades encoding quality, the pipeline catches it before release.
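The core check is trivial to implement against libvmaf's JSON log. The pooled-metrics structure mirrors the sample output shown earlier; `check_quality` and the threshold values are hypothetical:

```python
# Hypothetical CI helper: fail the build when pooled VMAF drops below
# thresholds. `pooled` is the "pooled_metrics" -> "vmaf" object from a
# libvmaf JSON log, e.g.:
#   pooled = json.load(open("vmaf.json"))["pooled_metrics"]["vmaf"]

def check_quality(pooled, mean_floor=90.0, min_floor=75.0):
    """Return True when both the mean and the worst-frame score pass."""
    return pooled["mean"] >= mean_floor and pooled["min"] >= min_floor

pooled = {"min": 72.41, "max": 99.12, "mean": 91.34, "harmonic_mean": 90.87}
print(check_quality(pooled))  # mean passes, but min 72.41 < 75 -> False
```

Gating on the minimum (or a low percentile) as well as the mean catches the single-bad-scene failure mode that mean-only gates miss.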

Metric Comparison

| | PSNR | SSIM | VMAF |
| --- | --- | --- | --- |
| What it measures | Pixel error (MSE) | Structural similarity | Perceptual quality (ML fusion) |
| Perceptual model | None | Luminance, contrast, structure | VIF + ADM + motion + SVR |
| Score range | ~20-50 dB (higher = better) | 0-1 (higher = better) | 0-100 (higher = better) |
| MOS correlation | Moderate (~0.7 PCC) | Good (~0.8 PCC) | Excellent (~0.93 PCC) |
| Compute cost | Negligible | Low | High (10-50x PSNR) |
| Cross-codec reliability | Poor | Fair | Good |
| Temporal awareness | None | None | Motion feature |
| Content adaptiveness | None | None | Learned from diverse content |
| GPU acceleration | N/A | N/A | VMAF-CUDA (libvmaf 3.0+) |

Limitations and Open Questions

No metric is perfect. VMAF has its own blind spots:

  1. Banding: smooth-gradient banding is barely penalized even though viewers find it highly objectionable, which is why Netflix later developed the dedicated CAMBI banding detector.
  2. Viewing conditions: each model bakes in a specific display and distance; scores from the default 1080p/3H model are less meaningful for mobile or cinema viewing.
  3. Gameability: without the NEG model, sharpening and contrast enhancement can inflate scores without genuinely improving fidelity.
  4. Temporal artifacts: VMAF remains largely a per-frame metric; flicker, stutter, and other temporal impairments are only coarsely captured by the single motion feature.

The broader lesson is that no single number captures the full complexity of human visual perception. In practice, the best approach is to use VMAF as the primary metric, cross-check with PSNR for sanity, and always do spot-check subjective evaluation on the hardest scenes.

#video #streaming #vmaf #quality #ffmpeg