
How AndroidX Media3 Renders Subtitles: A Deep Dive into the Pipeline

Have you ever wondered what happens between the moment a subtitle file is downloaded and the text appears at the bottom of your video? In AndroidX Media3 (the successor to ExoPlayer), the answer involves a surprisingly deep pipeline of parsers, encoders, resolvers, and renderers.

Subtitles overlaid on video content. Charade (1963), public domain.

In this post, we’ll first explore the fundamentals of subtitle technology — what formats exist, how they differ, and why the problem is harder than it looks — then trace the full journey of a subtitle through Media3’s pipeline, from raw .vtt bytes to rendered pixels. All Media3 code references are based on AndroidX Media3 1.9.2 (the latest stable release as of February 2026).


Background: The World of Timed Text

Before diving into Media3’s implementation, it’s worth understanding the landscape of subtitle technology. The problem of synchronizing text with video has a surprisingly rich history, spanning decades of broadcast standards, web specifications, and community-driven formats.

Subtitles vs. Closed Captions vs. SDH

These terms are often used interchangeably, but they have distinct technical meanings:

Closed captions include non-speech audio descriptions like {{screaming}} in addition to dialogue. Image by Henrique, CC BY-SA 3.0.
Term | Target Audience | Content | Technical Implementation
Subtitles | Hearing viewers who don’t speak the language | Dialogue and narration only | Text-based file or embedded text track
Closed Captions (CC) | Deaf and hard-of-hearing viewers | Dialogue + sound effects + speaker identification + music descriptions | In North America, specifically refers to CEA-608/708 in-band caption data
SDH (Subtitles for the Deaf and Hard of Hearing) | Deaf and hard-of-hearing viewers | Same content as CC | Implemented as subtitle tracks (not CEA-608/708), used on Blu-ray and streaming because HDMI does not carry CEA-608/708 data

The distinction between “subtitles” and “captions” is primarily a North American convention. In Europe and Asia, “subtitles” is the umbrella term that covers both translation subtitles and accessibility captions. Technically, the key difference is the transport mechanism: closed captions are an in-band protocol embedded in the video signal, while subtitles are text or bitmap data carried as a separate track.

Text-Based vs. Bitmap-Based Subtitles

Subtitle formats fall into two fundamental categories:

Text-based formats store subtitle content as character strings with timing and styling metadata. The player must rasterize the text into pixels at playback time. This makes them small, searchable, and user-customizable (font size, color, etc.).

Bitmap-based formats store pre-rendered images. The player simply displays the image at the correct time — no text rendering needed. This guarantees exact visual appearance but at the cost of large file sizes, no searchability, and no user customization.

Category | Formats | Typical Use
Text-based | SRT, WebVTT, TTML, SSA/ASS, CEA-608/708 | Web streaming, broadcast, DVDs, fansubs
Bitmap-based | PGS, DVB-SUB, VOBSub | Blu-ray, European digital TV, DVDs

Major Subtitle Formats

SRT (SubRip Text)

SRT is the most ubiquitous subtitle format in the world, yet it has no formal specification. It originated from SubRip, a Windows program that used OCR to extract bitmap subtitles from DVDs and convert them to text. The format is strikingly simple:

1
00:00:01,000 --> 00:00:04,000
This is the first subtitle.

2
00:00:05,000 --> 00:00:08,000
This is the second subtitle.
It can span multiple lines.

Each block has a sequential number, timestamps (note the comma as millisecond separator), subtitle text, and a blank line separator. Basic HTML tags (<b>, <i>, <font color="...">) are unofficially supported.
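To make the timing format concrete, here is a small illustrative parser for the SRT timestamp syntax. This is a sketch written for this post, not Media3's SubripParser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: parse an SRT timestamp ("HH:MM:SS,mmm") into milliseconds.
public class SrtTimestamp {
    private static final Pattern TS =
        Pattern.compile("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})");

    static long parseMs(String ts) {
        Matcher m = TS.matcher(ts.trim());
        if (!m.matches()) throw new IllegalArgumentException("Malformed timestamp: " + ts);
        long h = Long.parseLong(m.group(1));
        long min = Long.parseLong(m.group(2));
        long s = Long.parseLong(m.group(3));
        long ms = Long.parseLong(m.group(4));
        return ((h * 60 + min) * 60 + s) * 1000 + ms;
    }

    public static void main(String[] args) {
        // Timestamps from the first block of the example above:
        System.out.println(parseMs("00:00:01,000")); // 1000
        System.out.println(parseMs("00:00:04,000")); // 4000
    }
}
```

Note the comma separator in the regex; accepting a period here as well is a common leniency in real-world SRT parsers, since files converted from WebVTT often mix the two.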

Its popularity stems from extreme simplicity — it is human-readable plain text that works in virtually every video player. The Library of Congress describes it as “the most universal format, supported by almost all software, platforms and social networks.”

WebVTT (Web Video Text Tracks)

WebVTT is the W3C standard for web-native subtitles (W3C Candidate Recommendation, 2019). It evolved directly from SRT — originally called “WebSRT” — but adds significant capabilities:

WEBVTT

00:00:11.000 --> 00:00:13.000
<v Roger>We are in New York City

00:00:13.000 --> 00:00:16.000 position:10% align:start size:50%
<v Roger>We're actually at the Luckey Lounge

Key differences from SRT:

  • A required WEBVTT header line, and a period (not a comma) as the millisecond separator
  • Cue settings (position, align, size, line) for placing cues anywhere in the frame
  • Voice tags such as <v Roger> for speaker identification
  • CSS-based styling of cues
  • Optional cue identifiers instead of mandatory sequential numbers

WebVTT is the native subtitle format for HLS (HTTP Live Streaming) and is the primary text format handled by Media3’s WebvttParser.
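The cue settings in the second cue above ("position:10% align:start size:50%") are plain key:value pairs. An illustrative sketch of how they can be split, not the real WebvttParser logic:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: split a WebVTT cue-settings string into key/value pairs.
public class CueSettings {
    static Map<String, String> parse(String settings) {
        Map<String, String> result = new HashMap<>();
        for (String token : settings.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            if (colon > 0) {
                result.put(token.substring(0, colon), token.substring(colon + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> s = parse("position:10% align:start size:50%");
        System.out.println(s.get("position")); // 10%
        System.out.println(s.get("align"));    // start
    }
}
```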

TTML (Timed Text Markup Language)

TTML is the W3C’s XML-based subtitle standard (W3C Recommendation, 2018). It is significantly more complex than WebVTT, designed for broadcast and enterprise use cases:

<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <head>
    <styling>
      <style xml:id="s1" tts:color="white" tts:fontFamily="proportionalSansSerif"/>
    </styling>
    <layout>
      <region xml:id="r1" tts:origin="10% 80%" tts:extent="80% 15%"/>
    </layout>
  </head>
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.000" style="s1" region="r1">
        Hello World
      </p>
    </div>
  </body>
</tt>

TTML spawned a family of industry profiles — SMPTE-TT (US broadcast), EBU-TT-D (European broadcast, EBU Tech 3380), and IMSC (W3C, the convergence point that harmonizes them all). TTML is the primary format for DASH streams and earned a Technology & Engineering Emmy Award in 2016.

Media3 handles TTML through its TtmlParser.

SSA/ASS (SubStation Alpha / Advanced SubStation Alpha)

SSA was created in 1996 for the anime fansubbing community. Its successor ASS (v4.00+, 2002) is the most feature-rich text-based subtitle format in existence, supporting:

  • Precise positioning, movement, rotation, and scaling via override tags
  • Karaoke timing effects
  • Animated transforms and fades
  • Detailed style definitions (fonts, colors, outlines, shadows)

ASS became the de facto standard for anime fansubs because it enables typesetting that matches Japanese on-screen text, translating signs with precise positioning, and karaoke effects for opening/ending songs. The open-source libass library (used by FFmpeg, mpv, VLC) is the reference renderer.

Media3 parses SSA/ASS through SsaParser, handling PlayResX/PlayResY resolution-relative positioning.

CEA-608 and CEA-708

These are fundamentally different from all the formats above — they are real-time protocols, not file formats.

CEA-608 (also known as “Line 21”) dates back to 1980 when ABC, NBC, and PBS first aired captioned programming. The data is transmitted in the Vertical Blanking Interval of the analog NTSC signal at a fixed 480 bits/second. It supports three caption modes:

  • Pop-on: the caption is composed in an off-screen buffer, then displayed all at once
  • Roll-up: lines scroll upward as new text arrives (common for live broadcasts)
  • Paint-on: characters appear on screen one by one as they are received

CEA-708 is the digital successor for ATSC digital television, embedded in MPEG-2 streams. It offers significantly richer capabilities: Unicode support, 8 font options, 64 text/background colors, adjustable transparency, and user-customizable presentation.

The critical distinction: CEA-608/708 are stateful, stream-oriented protocols where caption data is embedded in every frame of video. A decoder must maintain an internal state machine processing commands in real-time. This is why Media3 treats them fundamentally differently from file-based subtitles — they cannot use the modern SubtitleParser interface and instead require dedicated Cea608Decoder and Cea708Decoder implementations.
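To make the statefulness concrete, here is a toy model of pop-on captioning, loosely inspired by CEA-608's "resume caption loading" and "end of caption" commands. This is not the real Cea608Decoder: text accumulates in an off-screen buffer and only becomes visible when the buffers swap:

```java
// Toy model of CEA-608 pop-on captioning (not the real Cea608Decoder).
// Commands mutate decoder state; nothing is shown until the buffers swap.
public class ToyPopOnDecoder {
    private StringBuilder offScreen = new StringBuilder();
    private String displayed = "";

    void resumeCaptionLoading() { offScreen = new StringBuilder(); } // RCL: start composing
    void appendText(String text) { offScreen.append(text); }         // character data
    void endOfCaption() {                                            // EOC: swap buffers
        displayed = offScreen.toString();
        offScreen = new StringBuilder();
    }
    String getDisplayed() { return displayed; }

    public static void main(String[] args) {
        ToyPopOnDecoder d = new ToyPopOnDecoder();
        d.resumeCaptionLoading();
        d.appendText("HELLO ");
        d.appendText("WORLD");
        System.out.println("[" + d.getDisplayed() + "]"); // [] (nothing visible yet)
        d.endOfCaption();
        System.out.println("[" + d.getDisplayed() + "]"); // [HELLO WORLD]
    }
}
```

A file-based parser can process cues in any order and produce the same result; this decoder cannot, because each command's effect depends on all the commands before it.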

The Concept of a “Cue”

Across all subtitle systems, the fundamental unit of timed text is the cue — a block of content that should be displayed during a specific time interval. The W3C WebVTT specification formalizes this as a time-synchronized text segment, and the HTML5 API exposes it through the TextTrackCue interface.

The cue concept maps across every format: an SRT block, a TTML <p> element, an ASS Dialogue line, and a VTTCue in the browser DOM — they are all cues. In Media3, this universal concept is captured by the CuesWithTiming class, which we’ll explore next.


The Media3 Pipeline: From Bytes to Pixels

Now that we understand the subtitle landscape, let’s see how Media3 handles all of this. Its subtitle system is a multi-layered pipeline with 7 stages:

① Download:  DataSource (HTTP)
② Parse:     SubtitleParser (WebVTT / SRT)
③ Encode:    CueEncoder (→ bytes)
④ Decode:    CueDecoder (→ Cues)
⑤ Resolve:   CuesResolver (Merge / Replace)
⑥ Render:    TextRenderer (~60fps loop)
⑦ Display:   SubtitleView (Canvas / WebView)

Let’s walk through each one.

Stage 1: Where Do Subtitles Come From?

Subtitles can arrive in two fundamentally different ways:

In-stream Text Tracks

These live inside the media container itself. An HLS stream might contain a WebVTT text track alongside video and audio. A DASH manifest might reference a TTML text adaptation set. In these cases, the subtitle data flows through the same Extractor pipeline as video and audio — they are simply another track type (C.TRACK_TYPE_TEXT) extracted from the container.

Sideloaded Subtitle Tracks

These are separate files — typically .vtt or .srt — loaded via a separate HTTP request. Media3 calls this sideloading, and the recommended approach is MediaItem.SubtitleConfiguration:

val mediaItem = MediaItem.Builder()
    .setUri(videoUri)
    .setSubtitleConfigurations(listOf(
        MediaItem.SubtitleConfiguration.Builder(subtitleUri)
            .setMimeType(MimeTypes.TEXT_VTT)
            .setLanguage("en")
            .setSelectionFlags(C.SELECTION_FLAG_DEFAULT)
            .build()
    ))
    .build()

player.setMediaItem(mediaItem)

Under the hood, Media3’s DefaultMediaSourceFactory converts each SubtitleConfiguration into a ProgressiveMediaSource that downloads the subtitle file, runs it through a SubtitleExtractor (which wraps a SubtitleParser), and merges it with the video source via MergingMediaSource.

Note: The older SingleSampleMediaSource approach is now deprecated, as it only works with the legacy subtitle decoding path.

This distinction matters more than you might expect: sideloaded subtitles must be fully downloaded and parsed before any cue can be displayed, while in-stream subtitles arrive incrementally with the media segments.

Stage 2: Parsing — From Text to Data

Once the raw bytes arrive, a SubtitleParser converts them into structured CuesWithTiming objects.

The SubtitleParser Interface

public interface SubtitleParser {

    @CueReplacementBehavior
    int getCueReplacementBehavior();

    void parse(
        byte[] data,
        int offset,
        int length,
        OutputOptions outputOptions,
        Consumer<CuesWithTiming> output   // ← callback for each cue
    );

    void reset();
}

Notice the Consumer<CuesWithTiming> output callback — the parser doesn’t return a list. It streams results to the caller. This is a deliberate design choice: for large subtitle files with thousands of cues, this avoids allocating a massive intermediate list.
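The pattern can be modeled with a toy parser. This is illustrative only, and the pipe-delimited input format is invented for the sketch: each cue is handed to the callback as soon as it is parsed, so the caller can encode and discard it immediately:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy illustration of the streaming Consumer pattern (not the real SubtitleParser):
// the parser pushes each cue to the callback instead of returning a full list.
public class StreamingParse {
    record ToyCue(long startMs, long endMs, String text) {}

    // Input format is invented for this sketch: "startMs|endMs|text"
    static void parse(List<String> lines, Consumer<ToyCue> output) {
        for (String line : lines) {
            String[] parts = line.split("\\|", 3);
            // Emit each cue as soon as it is parsed; no intermediate list is built.
            output.accept(new ToyCue(
                Long.parseLong(parts[0]), Long.parseLong(parts[1]), parts[2]));
        }
    }

    public static void main(String[] args) {
        List<ToyCue> collected = new ArrayList<>();
        parse(List.of("1000|4000|First", "5000|8000|Second"), collected::add);
        System.out.println(collected.size()); // 2
    }
}
```

Here the caller happens to collect everything into a list, but it doesn't have to; Media3's SubtitleExtractor encodes each CuesWithTiming into a sample as it arrives.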

What CuesWithTiming Looks Like

public class CuesWithTiming {
    public final ImmutableList<Cue> cues;    // The subtitle text/styling
    public final long startTimeUs;            // When to show (microseconds)
    public final long durationUs;             // How long to show
    public final long endTimeUs;              // Computed: start + duration
}

Format-Specific Parsers

Media3 supports an impressive range of formats — reflecting the diversity we covered in the background section:

Format | Parser | Use Case
WebVTT (.vtt) | WebvttParser | Web streaming (HLS), W3C standard
SubRip (.srt) | SubripParser | Most common standalone format, no formal spec
SSA/ASS | SsaParser | Anime fansubs, advanced styling
TTML | TtmlParser | DASH, broadcast (W3C Recommendation)
CEA-608 | Cea608Decoder | US analog broadcast closed captions
CEA-708 | Cea708Decoder | US digital broadcast closed captions
PGS | PgsParser | Blu-ray bitmap subtitles
DVB | DvbParser | European digital broadcast (ETSI EN 300 743)

Each parser has its own quirks. WebvttParser delegates CSS styling to WebvttCssParser. SubripParser converts HTML tags (<b>, <i>) to Android Spannable text. SsaParser handles resolution-relative positioning via PlayResX/PlayResY.

The Cue: Media3’s Universal Subtitle Representation

Regardless of the source format — whether it’s a simple SRT block or a complex ASS dialogue line with rotation and alpha transparency — everything converges on the Cue class:

public final class Cue {
    @Nullable public final CharSequence text;     // Styled text (with Spans)
    @Nullable public final Bitmap bitmap;         // For image-based subs (PGS)
    @Nullable public final Alignment textAlignment; // Left, center, right
    public final float line;                      // Vertical position
    public final float position;                  // Horizontal position
    public final float size;                      // Cue box width
    public final float textSize;                  // Font size
    public final int windowColor;                 // Background color
    public final @VerticalType int verticalType;  // Vertical text (Japanese)
    // ... and more
}

A Cue is immutable and self-contained. It carries everything needed to render one block of subtitle text (or one subtitle bitmap) at a specific position on screen. This universality is what allows Media3 to support everything from simple .srt files to complex DVB bitmap subtitles through the same rendering pipeline.

This is the same “cue” concept from the W3C TextTrackCue interface, adapted for Android’s rendering model.

Stage 3 & 4: The Encode-Decode Round Trip

Here’s where things get interesting. After parsing, the CuesWithTiming objects are serialized to bytes by CueEncoder, passed through the media pipeline as samples, then deserialized back by CueDecoder.

SubtitleParser
  → CuesWithTiming (startTimeUs = 5000000, cues = ["Hello"])
  → serialize via CueEncoder
  → byte[] (Parcel-serialized Bundle data)
  → deserialize via CueDecoder
  → CuesWithTiming (reconstructed)

Why Serialize and Deserialize?

This might seem wasteful — why not just pass the CuesWithTiming objects directly? The answer lies in Media3’s architecture:

The media pipeline only speaks bytes. SampleQueue, SampleStream, DecoderInputBuffer — all of these transport raw byte data with timestamps. Subtitle cues need to travel through the same infrastructure as video and audio samples. Serialization is the price of architectural consistency.

The MIME type application/x-media3-cues signals to TextRenderer that this data has been pre-parsed and just needs decoding, rather than format-specific subtitle parsing.

CueEncoder Internals

public final class CueEncoder {
    public byte[] encode(List<Cue> cues, long durationUs) {
        // 1. Convert each Cue to a Bundle (Android's serialization format)
        // 2. Add durationUs to the Bundle
        // 3. Marshall to Parcel → byte[]
        return bytes;
    }
}

The use of Android’s Bundle/Parcel mechanism is pragmatic — it handles all the complex Cue fields (including Bitmap for PGS subtitles) without custom serialization code.
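Parcel is Android-only, but the round trip can be modeled on the plain JVM with DataStreams. This is a sketch of the idea, not Media3's actual wire format: structured cues become opaque bytes, then are reconstructed losslessly on the other side:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Toy model of the CueEncoder/CueDecoder round trip (not Media3's real format).
public class CueRoundTrip {
    record CueData(long startUs, long durationUs, String text) {}

    static byte[] encode(List<CueData> cues) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(cues.size());
            for (CueData c : cues) {
                out.writeLong(c.startUs());
                out.writeLong(c.durationUs());
                out.writeUTF(c.text());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<CueData> decode(byte[] data) {
        List<CueData> cues = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                cues.add(new CueData(in.readLong(), in.readLong(), in.readUTF()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cues;
    }

    public static void main(String[] args) {
        List<CueData> original = List.of(new CueData(5_000_000L, 3_000_000L, "Hello"));
        System.out.println(decode(encode(original)).equals(original)); // true
    }
}
```

The in-between byte[] is what travels through SampleQueue as an ordinary sample; only the endpoints know it represents cues.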

Stage 5: TextRenderer — The Orchestra Conductor

TextRenderer is the heart of subtitle display. It sits in ExoPlayer’s rendering loop, called on every frame, and decides what to show.

// Simplified render loop (called ~60 times per second)
public void render(long positionUs, long elapsedRealtimeUs) {
    // 1. Read samples from SampleStream
    while (canReadMore) {
        int result = stream.readData(formatHolder, buffer, ...);
        if (result == C.RESULT_BUFFER_READ) {
            ByteBuffer cueData = buffer.data;
            CuesWithTiming cues = cueDecoder.decode(
                buffer.timeUs, cueData.array(),
                cueData.arrayOffset(), cueData.limit());
            cuesResolver.addCues(cues, positionUs);
        }
    }

    // 2. Get current cues for this frame's timestamp
    ImmutableList<Cue> currentCues = cuesResolver.getCuesAtTimeUs(positionUs);

    // 3. Send to UI (on main thread)
    output.onCues(new CueGroup(currentCues, presentationTimeUs));
}

The Two Pipelines: Legacy vs. Modern

Media3 actually maintains two subtitle decoding paths:

TextRenderer receives sample
  → Modern path (default): MIME application/x-media3-cues
      CueDecoder → CuesWithTiming → CuesResolver
  → Legacy path (deprecated): MIME text/vtt, etc.
      SubtitleDecoder → Subtitle → getCues(time)

The modern path (default since Media3 1.4.0) does parsing during extraction. By the time TextRenderer sees the data, it’s already CueEncoder-encoded bytes. TextRenderer just decodes and resolves timing.

The legacy path receives raw subtitle data (e.g., raw WebVTT text) and does format-specific parsing inside TextRenderer. This requires experimentalSetLegacyDecodingEnabled(true) and is gradually being phased out.

The motivation for this shift? Parsing during extraction means subtitle work happens on the loading thread, not the playback thread. This prevents subtitle parsing from causing video frame drops.

CuesResolver: Merge or Replace?

One of the most elegant abstractions in the subtitle system is CuesResolver. It answers: “Given the current playback time, which cues should be visible?”

// Package-private interface — not part of the public API
interface CuesResolver {
    boolean addCues(CuesWithTiming cues, long currentPositionUs);
    ImmutableList<Cue> getCuesAtTimeUs(long timeUs);
    void discardCuesBeforeTimeUs(long timeUs);
    long getPreviousCueChangeTimeUs(long timeUs);
    long getNextCueChangeTimeUs(long timeUs);
    void clear();
}

Note that CuesResolver is package-private — it’s an internal implementation detail of TextRenderer, not a public API. But understanding it is key to grasping how subtitle timing works.

There are two implementations, and the choice directly reflects the difference between subtitle formats we covered earlier:

MergingCuesResolver — Used when multiple cues can overlap in time. This is the behavior for most text formats (WebVTT, SRT, SSA). If cue A runs from 0–5s and cue B from 3–8s, both are visible during 3–5s.

Time:      0s      3s      5s      8s
Cue A:     [------ A ------]
Cue B:             [------ B ------]
Visible:       A   | A + B |    B

ReplacingCuesResolver — Used for formats like CEA-608 where only one set of cues is shown at a time. As we discussed, CEA-608 is a stateful protocol where the caption “window” shows one thing at a time — new cues completely replace old ones.

Time:      0s      3s      5s      8s
Cue A:     [------ A ------]
Cue B:             [------ B ------]
Visible:       A   |        B

The choice is driven by CueReplacementBehavior, which each SubtitleParser declares via getCueReplacementBehavior().
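The two strategies can be sketched as pure functions. This is toy code matching the timelines above; the real resolvers are stateful and considerably more involved:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy versions of the two resolution strategies (the real CuesResolver
// implementations are internal to Media3). Cue A runs 0-5s, cue B runs 3-8s.
public class ToyResolvers {
    record TimedCue(long startUs, long endUs, String text) {}

    // MERGING: every cue whose interval contains timeUs is visible.
    static List<String> mergingCuesAt(List<TimedCue> cues, long timeUs) {
        return cues.stream()
            .filter(c -> timeUs >= c.startUs() && timeUs < c.endUs())
            .map(TimedCue::text)
            .collect(Collectors.toList());
    }

    // REPLACING: only the most recently started cue is visible.
    static List<String> replacingCuesAt(List<TimedCue> cues, long timeUs) {
        return cues.stream()
            .filter(c -> timeUs >= c.startUs())
            .max((a, b) -> Long.compare(a.startUs(), b.startUs()))
            .map(c -> List.of(c.text()))
            .orElse(List.of());
    }

    public static void main(String[] args) {
        List<TimedCue> cues = List.of(
            new TimedCue(0L, 5_000_000L, "A"),
            new TimedCue(3_000_000L, 8_000_000L, "B"));
        System.out.println(mergingCuesAt(cues, 4_000_000L));   // [A, B]
        System.out.println(replacingCuesAt(cues, 4_000_000L)); // [B]
    }
}
```

At 4s the merging strategy shows both overlapping cues, while the replacing strategy shows only B, because B's arrival at 3s replaced A.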

Stage 6 & 7: From Cues to Pixels

The final stretch: TextRenderer packages the resolved cues into a CueGroup and sends it to TextOutput.onCues(). In most apps, this arrives at SubtitleView.

SubtitleView: Two Rendering Engines

SubtitleView offers two rendering backends:

SubtitleView (FrameLayout) contains one of two output views:

CanvasSubtitleOutput (default, VIEW_TYPE_CANVAS)
  • Draws text directly on an Android Canvas
  • Fast, low overhead
  • Handles most styling (colors, alignment, size)
  • Sufficient for the vast majority of use cases

WebViewSubtitleOutput (optional, VIEW_TYPE_WEB)
  • Renders via HTML/CSS in a WebView
  • Supports vertical text (Japanese subtitles)
  • Supports complex CSS styling
  • Higher overhead

The Canvas renderer measures and draws each Cue directly:

  1. Calculate cue box position from line, position, size
  2. Apply text styling from Spannable (bold, italic, color)
  3. Draw background window (if windowColor is set)
  4. Draw text with proper alignment
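As a rough sketch of step 1, fractional cue geometry maps to pixels roughly like this. Illustrative assumptions: position anchors the horizontal center of the cue box, and line is a fraction of the view height; the real layout logic also handles anchor types, line counts, and text measurement:

```java
// Illustrative only: map fractional Cue geometry to a pixel rect.
public class CueLayout {
    record CueBox(int left, int top, int width) {}

    // position, line, size are fractions of the viewport, as in Cue.
    // Assumption for this sketch: position anchors the box's horizontal center.
    static CueBox layoutCue(float position, float line, float size,
                            int viewWidth, int viewHeight) {
        int width = (int) (size * viewWidth);
        int left = (int) (position * viewWidth) - width / 2;
        int top = (int) (line * viewHeight);
        return new CueBox(left, top, width);
    }

    public static void main(String[] args) {
        // A cue centered horizontally, 80% down the frame, half the frame wide:
        System.out.println(layoutCue(0.5f, 0.8f, 0.5f, 1920, 1080));
        // CueBox[left=480, top=864, width=960]
    }
}
```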

The WebView renderer converts each Cue to HTML + CSS and loads it into an invisible WebView — heavier, but necessary for features like vertical text rendering in Japanese subtitles (see: Improved Japanese subtitle support).

Vertical and horizontal Japanese cues rendered via WebViewSubtitleOutput. Images from Improved Japanese subtitle support by Ian Baker, AndroidX Media3 team.

Real-World Scenarios

VOD with Sideloaded Subtitles

User hits play
  → DefaultMediaSourceFactory sees SubtitleConfiguration
  → Creates ProgressiveMediaSource + SubtitleExtractor
  → Downloads entire .vtt file via DataSource
  → SubtitleExtractor → SubtitleParser.parse() → CuesWithTiming
  → CueEncoder serializes each cue to byte samples in SampleQueue
  → MergingMediaSource merges subtitle track with video/audio
  → TextRenderer reads samples, feeds to MergingCuesResolver

Key property: Subtitle parsing happens on the loading thread, not the playback thread.

HLS Live with In-stream Subtitles

Live stream playing
  → HLS playlist has #EXT-X-MEDIA:TYPE=SUBTITLES
  → Each segment contains WebVTT subtitle data
  → Extractor uses SubtitleTranscodingExtractorOutput to parse on the fly
  → SubtitleParser converts raw WebVTT to CueEncoder-encoded samples
  → New cues added to CuesResolver incrementally per segment
  → CuesResolver.discardCuesBeforeTimeUs() keeps memory bounded

Key property: Subtitles arrive incrementally, no full file needed.
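The memory-bounding step can be sketched as follows. This is toy code; the real CuesResolver implementations are internal to Media3:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch of bounded cue storage for live playback (not the real resolver):
// cues that ended before the current position are discarded as playback advances.
public class BoundedCues {
    record TimedCue(long startUs, long endUs, String text) {}

    private final Deque<TimedCue> cues = new ArrayDeque<>();

    void addCues(TimedCue cue) { cues.addLast(cue); }

    // Mirrors the role of CuesResolver.discardCuesBeforeTimeUs: drop elapsed cues.
    void discardCuesBeforeTimeUs(long timeUs) {
        while (!cues.isEmpty() && cues.peekFirst().endUs() <= timeUs) {
            cues.removeFirst();
        }
    }

    int size() { return cues.size(); }

    public static void main(String[] args) {
        BoundedCues resolver = new BoundedCues();
        for (int i = 0; i < 100; i++) { // 100 one-second cues from a live stream
            resolver.addCues(new TimedCue(i * 1_000_000L, (i + 1) * 1_000_000L, "cue " + i));
        }
        resolver.discardCuesBeforeTimeUs(90_000_000L); // playback is at 90s
        System.out.println(resolver.size()); // 10
    }
}
```

Without the discard step, an all-day live stream would accumulate every cue it ever received.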

CEA-608 Closed Captions

US broadcast content
  → CEA-608 data embedded in video stream (line 21 of NTSC signal)
  → TsExtractor detects CC track
  → TextRenderer creates Cea608Decoder (legacy path only)
  → Decoder maintains internal state machine (CC is stateful!)
  → Cues replace each other (ReplacingCuesResolver behavior)

Key property: CEA-608 is byte-oriented and stateful — fundamentally different from file-based subtitles.

The Engineering Trade-offs

Why Not Just Pass Strings?

A subtitle cue is never “just a string.” As we saw in the format overview, even the simplest SRT block carries timing information, and formats like ASS add positioning, rotation, transparency, and animation. The Cue class handles all of this, making it the universal currency of the subtitle system.

Why Encode → Transport → Decode?

The serialize/deserialize round trip exists because Media3 treats subtitles as first-class media samples. This enables:

  • Reuse of the existing sample infrastructure (SampleQueue, SampleStream, buffering, seeking) with no subtitle-specific plumbing
  • Parsing on the loading thread, with only cheap deserialization on the playback thread
  • A single uniform sample format (application/x-media3-cues) regardless of the source subtitle format

Why Two CuesResolver Strategies?

Different subtitle standards have fundamentally different semantics:

  • Most text formats (WebVTT, SRT, SSA) allow cues to overlap in time, so simultaneous cues must be merged
  • CEA-608 replaces the visible caption wholesale with each update, so merging would display stale text

Rather than forcing all formats into one model, Media3 lets each parser declare its behavior via CueReplacementBehavior and picks the right resolver automatically.

Wrapping Up

Media3’s subtitle system is a case study in thoughtful media engineering. It bridges a remarkably diverse set of standards — from the formal W3C specifications of WebVTT and TTML, to the community-driven SRT format with no formal spec, to the decades-old CEA-608 broadcast protocol — all through a unified pipeline.

From the streaming Consumer<CuesWithTiming> parser API to the MergingCuesResolver vs ReplacingCuesResolver strategy pattern, every layer reflects real-world constraints shaped by decades of timed text standards.

The next time you see subtitle text at the bottom of a video, you’ll know it traveled through at least 7 pipeline stages, survived a Parcel serialization round trip, and won a timing battle inside a CuesResolver — all in under 16 milliseconds.


#android #media3 #exoplayer #subtitle #video