
How AndroidX Media3 Renders Subtitles: A Deep Dive into the Pipeline

Have you ever wondered what happens between the moment a subtitle file is downloaded and the text appears at the bottom of your video? In AndroidX Media3 (the successor to ExoPlayer), the answer involves a surprisingly deep pipeline of parsers, encoders, resolvers, and renderers.

Subtitles overlaid on video content. Charade (1963), public domain.

In this post, we’ll first explore the fundamentals of subtitle technology — what formats exist, how they differ, and why the problem is harder than it looks — then trace the full journey of a subtitle through Media3’s pipeline, from raw .vtt bytes to rendered pixels. All Media3 code references are based on AndroidX Media3 1.9.2 (the latest stable release as of February 2026).


Background: The World of Timed Text

Before diving into Media3’s implementation, it’s worth understanding the landscape of subtitle technology. The problem of synchronizing text with video has a surprisingly rich history, spanning decades of broadcast standards, web specifications, and community-driven formats.

Subtitles vs. Closed Captions vs. SDH

These terms are often used interchangeably, but they have distinct technical meanings:

Closed captions include non-speech audio descriptions like {{screaming}} in addition to dialogue. Image by Henrique, CC BY-SA 3.0.
Term | Target Audience | Content | Technical Implementation
Subtitles | Hearing viewers who don’t speak the language | Dialogue and narration only | Text-based file or embedded text track
Closed Captions (CC) | Deaf and hard-of-hearing viewers | Dialogue + sound effects + speaker identification + music descriptions | In North America, specifically refers to CEA-608/708 in-band caption data
SDH (Subtitles for the Deaf and Hard of Hearing) | Deaf and hard-of-hearing viewers | Same content as CC | Implemented as subtitle tracks (not CEA-608/708), used on Blu-ray and streaming because HDMI does not carry CEA-608/708 data

The distinction between “subtitles” and “captions” is primarily a North American convention. In Europe and Asia, “subtitles” is the umbrella term that covers both translation subtitles and accessibility captions. Technically, the key difference is the transport mechanism: closed captions are an in-band protocol embedded in the video signal, while subtitles are text or bitmap data carried as a separate track.

Text-Based vs. Bitmap-Based Subtitles

Subtitle formats fall into two fundamental categories:

Text-based formats store subtitle content as character strings with timing and styling metadata. The player must rasterize the text into pixels at playback time. This makes them small, searchable, and user-customizable (font size, color, etc.).

Bitmap-based formats store pre-rendered images. The player simply displays the image at the correct time — no text rendering needed. This guarantees exact visual appearance but at the cost of large file sizes, no searchability, and no user customization.

Category | Formats | Typical Use
Text-based | SRT, WebVTT, TTML, SSA/ASS, CEA-608/708 | Web streaming, broadcast, DVDs, fansubs
Bitmap-based | PGS, DVB-SUB, VOBSub | Blu-ray, European digital TV, DVDs

Major Subtitle Formats

SRT (SubRip Text)

SRT is the most ubiquitous subtitle format in the world, yet it has no formal specification. It originated from SubRip, a Windows program that used OCR to extract bitmap subtitles from DVDs and convert them to text. The format is strikingly simple:

1
00:00:01,000 --> 00:00:04,000
This is the first subtitle.

2
00:00:05,000 --> 00:00:08,000
This is the second subtitle.
It can span multiple lines.

Each block has a sequential number, timestamps (note the comma as millisecond separator), subtitle text, and a blank line separator. Basic HTML tags (<b>, <i>, <font color="...">) are unofficially supported.
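To make the timing format concrete, here is a small illustrative parser for the SRT timestamp syntax. This is a sketch written for this post, not Media3's SubripParser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: parse an SRT timestamp ("HH:MM:SS,mmm") into milliseconds.
public class SrtTimestamp {
    private static final Pattern TS =
        Pattern.compile("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})");

    static long parseMs(String ts) {
        Matcher m = TS.matcher(ts.trim());
        if (!m.matches()) throw new IllegalArgumentException("Malformed timestamp: " + ts);
        long h = Long.parseLong(m.group(1));
        long min = Long.parseLong(m.group(2));
        long s = Long.parseLong(m.group(3));
        long ms = Long.parseLong(m.group(4));
        return ((h * 60 + min) * 60 + s) * 1000 + ms;
    }

    public static void main(String[] args) {
        // Timestamps from the first block of the example above:
        System.out.println(parseMs("00:00:01,000")); // 1000
        System.out.println(parseMs("00:00:04,000")); // 4000
    }
}
```

Note the comma separator in the regex; accepting a period here as well is a common leniency in real-world SRT parsers, since files converted from WebVTT often mix the two.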

Its popularity stems from extreme simplicity — it is human-readable plain text that works in virtually every video player. The Library of Congress describes it as “the most universal format, supported by almost all software, platforms and social networks.”

WebVTT (Web Video Text Tracks)

WebVTT is the W3C standard for web-native subtitles (W3C Candidate Recommendation, 2019). It evolved directly from SRT — originally called “WebSRT” — but adds significant capabilities:

WEBVTT

00:00:11.000 --> 00:00:13.000
<v Roger>We are in New York City

00:00:13.000 --> 00:00:16.000 position:10% align:start size:50%
<v Roger>We're actually at the Luckey Lounge

Key differences from SRT:

  • A required WEBVTT header line, and a period (not a comma) as the millisecond separator
  • Cue settings (position, align, size, line) for placing cues anywhere in the frame
  • Voice tags such as <v Roger> for speaker identification
  • CSS-based styling of cues
  • Optional cue identifiers instead of mandatory sequential numbers

WebVTT is the native subtitle format for HLS (HTTP Live Streaming) and is the primary text format handled by Media3’s WebvttParser.
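The cue settings in the second cue above ("position:10% align:start size:50%") are plain key:value pairs. An illustrative sketch of how they can be split, not the real WebvttParser logic:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: split a WebVTT cue-settings string into key/value pairs.
public class CueSettings {
    static Map<String, String> parse(String settings) {
        Map<String, String> result = new HashMap<>();
        for (String token : settings.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            if (colon > 0) {
                result.put(token.substring(0, colon), token.substring(colon + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> s = parse("position:10% align:start size:50%");
        System.out.println(s.get("position")); // 10%
        System.out.println(s.get("align"));    // start
    }
}
```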

TTML (Timed Text Markup Language)

TTML is the W3C’s XML-based subtitle standard (W3C Recommendation, 2018). It is significantly more complex than WebVTT, designed for broadcast and enterprise use cases:

<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <head>
    <styling>
      <style xml:id="s1" tts:color="white" tts:fontFamily="proportionalSansSerif"/>
    </styling>
    <layout>
      <region xml:id="r1" tts:origin="10% 80%" tts:extent="80% 15%"/>
    </layout>
  </head>
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.000" style="s1" region="r1">
        Hello World
      </p>
    </div>
  </body>
</tt>

TTML spawned a family of industry profiles — SMPTE-TT (US broadcast), EBU-TT-D (European broadcast, EBU Tech 3380), and IMSC (W3C, the convergence point that harmonizes them all). TTML is the primary format for DASH streams and earned a Technology & Engineering Emmy Award in 2016.

Media3 handles TTML through its TtmlParser.

SSA/ASS (SubStation Alpha / Advanced SubStation Alpha)

SSA was created in 1996 for the anime fansubbing community. Its successor ASS (v4.00+, 2002) is the most feature-rich text-based subtitle format in existence, supporting:

  • Precise positioning, movement, rotation, and scaling via override tags
  • Karaoke timing effects
  • Animated transforms and fades
  • Detailed style definitions (fonts, colors, outlines, shadows)

ASS became the de facto standard for anime fansubs because it enables typesetting that matches Japanese on-screen text, translating signs with precise positioning, and karaoke effects for opening/ending songs. The open-source libass library (used by FFmpeg, mpv, VLC) is the reference renderer.

Media3 parses SSA/ASS through SsaParser, handling PlayResX/PlayResY resolution-relative positioning.

CEA-608 and CEA-708

These are fundamentally different from all the formats above — they are real-time protocols, not file formats.

CEA-608 (also known as “Line 21”) dates back to 1980 when ABC, NBC, and PBS first aired captioned programming. The data is transmitted in the Vertical Blanking Interval of the analog NTSC signal at a fixed 480 bits/second. It supports three caption modes:

  • Pop-on: the caption is composed in an off-screen buffer, then displayed all at once
  • Roll-up: lines scroll upward as new text arrives (common for live broadcasts)
  • Paint-on: characters appear on screen one by one as they are received

CEA-708 is the digital successor for ATSC digital television, embedded in MPEG-2 streams. It offers significantly richer capabilities: Unicode support, 8 font options, 64 text/background colors, adjustable transparency, and user-customizable presentation.

The critical distinction: CEA-608/708 are stateful, stream-oriented protocols where caption data is embedded in every frame of video. A decoder must maintain an internal state machine processing commands in real-time. This is why Media3 treats them fundamentally differently from file-based subtitles — they cannot use the modern SubtitleParser interface and instead require dedicated Cea608Decoder and Cea708Decoder implementations.
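To make the statefulness concrete, here is a toy model of pop-on captioning, loosely inspired by CEA-608's "resume caption loading" and "end of caption" commands. This is not the real Cea608Decoder: text accumulates in an off-screen buffer and only becomes visible when the buffers swap:

```java
// Toy model of CEA-608 pop-on captioning (not the real Cea608Decoder).
// Commands mutate decoder state; nothing is shown until the buffers swap.
public class ToyPopOnDecoder {
    private StringBuilder offScreen = new StringBuilder();
    private String displayed = "";

    void resumeCaptionLoading() { offScreen = new StringBuilder(); } // RCL: start composing
    void appendText(String text) { offScreen.append(text); }         // character data
    void endOfCaption() {                                            // EOC: swap buffers
        displayed = offScreen.toString();
        offScreen = new StringBuilder();
    }
    String getDisplayed() { return displayed; }

    public static void main(String[] args) {
        ToyPopOnDecoder d = new ToyPopOnDecoder();
        d.resumeCaptionLoading();
        d.appendText("HELLO ");
        d.appendText("WORLD");
        System.out.println("[" + d.getDisplayed() + "]"); // [] (nothing visible yet)
        d.endOfCaption();
        System.out.println("[" + d.getDisplayed() + "]"); // [HELLO WORLD]
    }
}
```

A file-based parser can process cues in any order and produce the same result; this decoder cannot, because each command's effect depends on all the commands before it.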

The Concept of a “Cue”

Across all subtitle systems, the fundamental unit of timed text is the cue — a block of content that should be displayed during a specific time interval. The W3C WebVTT specification formalizes this as a time-synchronized text segment, and the HTML5 API exposes it through the TextTrackCue interface.

The cue concept maps across every format: an SRT block, a TTML <p> element, an ASS Dialogue line, and a VTTCue in the browser DOM — they are all cues. In Media3, this universal concept is captured by the CuesWithTiming class, which we’ll explore next.


The Media3 Pipeline: From Bytes to Pixels

Now that we understand the subtitle landscape, let’s see how Media3 handles all of this. Its subtitle system is a multi-layered pipeline with 7 stages:

① Download:  DataSource (HTTP)
② Parse:     SubtitleParser (WebVTT / SRT)
③ Encode:    CueEncoder (→ bytes)
④ Decode:    CueDecoder (→ Cues)
⑤ Resolve:   CuesResolver (Merge / Replace)
⑥ Render:    TextRenderer (~60fps loop)
⑦ Display:   SubtitleView (Canvas / WebView)

Let’s walk through each one.

Stage 1: Where Do Subtitles Come From?

Subtitles can arrive in two fundamentally different ways:

In-stream Text Tracks

These live inside the media container itself. An HLS stream might contain a WebVTT text track alongside video and audio. A DASH manifest might reference a TTML text adaptation set. In these cases, the subtitle data flows through the same Extractor pipeline as video and audio — they are simply another track type (C.TRACK_TYPE_TEXT) extracted from the container.

Sideloaded Subtitle Tracks

These are separate files — typically .vtt or .srt — loaded via a separate HTTP request. Media3 calls this sideloading, and the recommended approach is MediaItem.SubtitleConfiguration:

val mediaItem = MediaItem.Builder()
    .setUri(videoUri)
    .setSubtitleConfigurations(listOf(
        MediaItem.SubtitleConfiguration.Builder(subtitleUri)
            .setMimeType(MimeTypes.TEXT_VTT)
            .setLanguage("en")
            .setSelectionFlags(C.SELECTION_FLAG_DEFAULT)
            .build()
    ))
    .build()

player.setMediaItem(mediaItem)

Under the hood, Media3’s DefaultMediaSourceFactory converts each SubtitleConfiguration into a ProgressiveMediaSource that downloads the subtitle file, runs it through a SubtitleExtractor (which wraps a SubtitleParser), and merges it with the video source via MergingMediaSource.

Note: The older SingleSampleMediaSource approach is now deprecated, as it only works with the legacy subtitle decoding path.

This distinction matters more than you might expect: sideloaded subtitles must be fully downloaded and parsed before any cue can be displayed, while in-stream subtitles arrive incrementally with the media segments.

Stage 2: Parsing — From Text to Data

Once the raw bytes arrive, a SubtitleParser converts them into structured CuesWithTiming objects.

The SubtitleParser Interface

public interface SubtitleParser {

    @CueReplacementBehavior
    int getCueReplacementBehavior();

    void parse(
        byte[] data,
        int offset,
        int length,
        OutputOptions outputOptions,
        Consumer<CuesWithTiming> output   // ← callback for each cue
    );

    void reset();
}

Notice the Consumer<CuesWithTiming> output callback — the parser doesn’t return a list. It streams results to the caller. This is a deliberate design choice: for large subtitle files with thousands of cues, this avoids allocating a massive intermediate list.
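The pattern can be modeled with a toy parser. This is illustrative only, and the pipe-delimited input format is invented for the sketch: each cue is handed to the callback as soon as it is parsed, so the caller can encode and discard it immediately:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy illustration of the streaming Consumer pattern (not the real SubtitleParser):
// the parser pushes each cue to the callback instead of returning a full list.
public class StreamingParse {
    record ToyCue(long startMs, long endMs, String text) {}

    // Input format is invented for this sketch: "startMs|endMs|text"
    static void parse(List<String> lines, Consumer<ToyCue> output) {
        for (String line : lines) {
            String[] parts = line.split("\\|", 3);
            // Emit each cue as soon as it is parsed; no intermediate list is built.
            output.accept(new ToyCue(
                Long.parseLong(parts[0]), Long.parseLong(parts[1]), parts[2]));
        }
    }

    public static void main(String[] args) {
        List<ToyCue> collected = new ArrayList<>();
        parse(List.of("1000|4000|First", "5000|8000|Second"), collected::add);
        System.out.println(collected.size()); // 2
    }
}
```

Here the caller happens to collect everything into a list, but it doesn't have to; Media3's SubtitleExtractor encodes each CuesWithTiming into a sample as it arrives.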

What CuesWithTiming Looks Like

public class CuesWithTiming {
    public final ImmutableList<Cue> cues;    // The subtitle text/styling
    public final long startTimeUs;            // When to show (microseconds)
    public final long durationUs;             // How long to show
    public final long endTimeUs;              // Computed: start + duration
}

Format-Specific Parsers

Media3 supports an impressive range of formats — reflecting the diversity we covered in the background section:

Format | Parser | Use Case
WebVTT (.vtt) | WebvttParser | Web streaming (HLS), W3C standard
SubRip (.srt) | SubripParser | Most common standalone format, no formal spec
SSA/ASS | SsaParser | Anime fansubs, advanced styling
TTML | TtmlParser | DASH, broadcast (W3C Recommendation)
CEA-608 | Cea608Decoder | US analog broadcast closed captions
CEA-708 | Cea708Decoder | US digital broadcast closed captions
PGS | PgsParser | Blu-ray bitmap subtitles
DVB | DvbParser | European digital broadcast (ETSI EN 300 743)

Each parser has its own quirks. WebvttParser delegates CSS styling to WebvttCssParser. SubripParser converts HTML tags (<b>, <i>) to Android Spannable text. SsaParser handles resolution-relative positioning via PlayResX/PlayResY.

The Cue: Media3’s Universal Subtitle Representation

Regardless of the source format — whether it’s a simple SRT block or a complex ASS dialogue line with rotation and alpha transparency — everything converges on the Cue class:

public final class Cue {
    @Nullable public final CharSequence text;     // Styled text (with Spans)
    @Nullable public final Bitmap bitmap;         // For image-based subs (PGS)
    @Nullable public final Alignment textAlignment; // Left, center, right
    public final float line;                      // Vertical position
    public final float position;                  // Horizontal position
    public final float size;                      // Cue box width
    public final float textSize;                  // Font size
    public final int windowColor;                 // Background color
    public final @VerticalType int verticalType;  // Vertical text (Japanese)
    // ... and more
}

A Cue is immutable and self-contained. It carries everything needed to render one block of subtitle text (or one subtitle bitmap) at a specific position on screen. This universality is what allows Media3 to support everything from simple .srt files to complex DVB bitmap subtitles through the same rendering pipeline.

This is the same “cue” concept from the W3C TextTrackCue interface, adapted for Android’s rendering model.

Stage 3 & 4: The Encode-Decode Round Trip

Here’s where things get interesting. After parsing, the CuesWithTiming objects are serialized to bytes by CueEncoder, passed through the media pipeline as samples, then deserialized back by CueDecoder.

SubtitleParser
  → CuesWithTiming (startTimeUs = 5000000, cues = ["Hello"])
  → serialize via CueEncoder
  → byte[] (Parcel-serialized Bundle data)
  → deserialize via CueDecoder
  → CuesWithTiming (reconstructed)

Why Serialize and Deserialize?

This might seem wasteful — why not just pass the CuesWithTiming objects directly? The answer lies in Media3’s architecture:

The media pipeline only speaks bytes. SampleQueue, SampleStream, DecoderInputBuffer — all of these transport raw byte data with timestamps. Subtitle cues need to travel through the same infrastructure as video and audio samples. Serialization is the price of architectural consistency.

The MIME type application/x-media3-cues signals to TextRenderer that this data has been pre-parsed and just needs decoding, rather than format-specific subtitle parsing.

CueEncoder Internals

public final class CueEncoder {
    public byte[] encode(List<Cue> cues, long durationUs) {
        // 1. Convert each Cue to a Bundle (Android's serialization format)
        // 2. Add durationUs to the Bundle
        // 3. Marshall to Parcel → byte[]
        return bytes;
    }
}

The use of Android’s Bundle/Parcel mechanism is pragmatic — it handles all the complex Cue fields (including Bitmap for PGS subtitles) without custom serialization code.
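Parcel is Android-only, but the round trip can be modeled on the plain JVM with DataStreams. This is a sketch of the idea, not Media3's actual wire format: structured cues become opaque bytes, then are reconstructed losslessly on the other side:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Toy model of the CueEncoder/CueDecoder round trip (not Media3's real format).
public class CueRoundTrip {
    record CueData(long startUs, long durationUs, String text) {}

    static byte[] encode(List<CueData> cues) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(cues.size());
            for (CueData c : cues) {
                out.writeLong(c.startUs());
                out.writeLong(c.durationUs());
                out.writeUTF(c.text());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<CueData> decode(byte[] data) {
        List<CueData> cues = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                cues.add(new CueData(in.readLong(), in.readLong(), in.readUTF()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cues;
    }

    public static void main(String[] args) {
        List<CueData> original = List.of(new CueData(5_000_000L, 3_000_000L, "Hello"));
        System.out.println(decode(encode(original)).equals(original)); // true
    }
}
```

The in-between byte[] is what travels through SampleQueue as an ordinary sample; only the endpoints know it represents cues.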

Stage 5: TextRenderer — The Orchestra Conductor

TextRenderer is the heart of subtitle display. It sits in ExoPlayer’s rendering loop, called on every frame, and decides what to show.

// Simplified render loop (called ~60 times per second)
public void render(long positionUs, long elapsedRealtimeUs) {
    // 1. Read samples from SampleStream
    while (canReadMore) {
        int result = stream.readData(formatHolder, buffer, ...);
        if (result == C.RESULT_BUFFER_READ) {
            ByteBuffer cueData = buffer.data;
            CuesWithTiming cues = cueDecoder.decode(
                buffer.timeUs, cueData.array(),
                cueData.arrayOffset(), cueData.limit());
            cuesResolver.addCues(cues, positionUs);
        }
    }

    // 2. Get current cues for this frame's timestamp
    ImmutableList<Cue> currentCues = cuesResolver.getCuesAtTimeUs(positionUs);

    // 3. Send to UI (on main thread)
    output.onCues(new CueGroup(currentCues, presentationTimeUs));
}

The Two Pipelines: Legacy vs. Modern

Media3 actually maintains two subtitle decoding paths:

TextRenderer receives sample
  → Modern path (default): MIME application/x-media3-cues
      CueDecoder → CuesWithTiming → CuesResolver
  → Legacy path (deprecated): MIME text/vtt, etc.
      SubtitleDecoder → Subtitle → getCues(time)

The modern path (default since Media3 1.4.0) does parsing during extraction. By the time TextRenderer sees the data, it’s already CueEncoder-encoded bytes. TextRenderer just decodes and resolves timing.

The legacy path receives raw subtitle data (e.g., raw WebVTT text) and does format-specific parsing inside TextRenderer. This requires experimentalSetLegacyDecodingEnabled(true) and is gradually being phased out.

The motivation for this shift? Parsing during extraction means subtitle work happens on the loading thread, not the playback thread. This prevents subtitle parsing from causing video frame drops.

CuesResolver: Merge or Replace?

One of the most elegant abstractions in the subtitle system is CuesResolver. It answers: “Given the current playback time, which cues should be visible?”

// Package-private interface — not part of the public API
interface CuesResolver {
    boolean addCues(CuesWithTiming cues, long currentPositionUs);
    ImmutableList<Cue> getCuesAtTimeUs(long timeUs);
    void discardCuesBeforeTimeUs(long timeUs);
    long getPreviousCueChangeTimeUs(long timeUs);
    long getNextCueChangeTimeUs(long timeUs);
    void clear();
}

Note that CuesResolver is package-private — it’s an internal implementation detail of TextRenderer, not a public API. But understanding it is key to grasping how subtitle timing works.

There are two implementations, and the choice directly reflects the difference between subtitle formats we covered earlier:

MergingCuesResolver — Used when multiple cues can overlap in time. This is the behavior for most text formats (WebVTT, SRT, SSA). If cue A runs from 0–5s and cue B from 3–8s, both are visible during 3–5s.

Time:      0s      3s      5s      8s
Cue A:     [------ A ------]
Cue B:             [------ B ------]
Visible:       A   | A + B |    B

ReplacingCuesResolver — Used for formats like CEA-608 where only one set of cues is shown at a time. As we discussed, CEA-608 is a stateful protocol where the caption “window” shows one thing at a time — new cues completely replace old ones.

Time:      0s      3s      5s      8s
Cue A:     [------ A ------]
Cue B:             [------ B ------]
Visible:       A   |        B

The choice is driven by CueReplacementBehavior, which each SubtitleParser declares via getCueReplacementBehavior().
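The two strategies can be sketched as pure functions. This is toy code matching the timelines above; the real resolvers are stateful and considerably more involved:

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy versions of the two resolution strategies (the real CuesResolver
// implementations are internal to Media3). Cue A runs 0-5s, cue B runs 3-8s.
public class ToyResolvers {
    record TimedCue(long startUs, long endUs, String text) {}

    // MERGING: every cue whose interval contains timeUs is visible.
    static List<String> mergingCuesAt(List<TimedCue> cues, long timeUs) {
        return cues.stream()
            .filter(c -> timeUs >= c.startUs() && timeUs < c.endUs())
            .map(TimedCue::text)
            .collect(Collectors.toList());
    }

    // REPLACING: only the most recently started cue is visible.
    static List<String> replacingCuesAt(List<TimedCue> cues, long timeUs) {
        return cues.stream()
            .filter(c -> timeUs >= c.startUs())
            .max((a, b) -> Long.compare(a.startUs(), b.startUs()))
            .map(c -> List.of(c.text()))
            .orElse(List.of());
    }

    public static void main(String[] args) {
        List<TimedCue> cues = List.of(
            new TimedCue(0L, 5_000_000L, "A"),
            new TimedCue(3_000_000L, 8_000_000L, "B"));
        System.out.println(mergingCuesAt(cues, 4_000_000L));   // [A, B]
        System.out.println(replacingCuesAt(cues, 4_000_000L)); // [B]
    }
}
```

At 4s the merging strategy shows both overlapping cues, while the replacing strategy shows only B, because B's arrival at 3s replaced A.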

Stage 6 & 7: From Cues to Pixels

The final stretch: TextRenderer packages the resolved cues into a CueGroup and sends it to TextOutput.onCues(). In most apps, this arrives at SubtitleView.

SubtitleView: Two Rendering Engines

SubtitleView offers two rendering backends:

SubtitleView (FrameLayout) contains one of two output views:

CanvasSubtitleOutput (default, VIEW_TYPE_CANVAS)
  • Draws text directly on an Android Canvas
  • Fast, low overhead
  • Handles most styling (colors, alignment, size)
  • Sufficient for the vast majority of use cases

WebViewSubtitleOutput (optional, VIEW_TYPE_WEB)
  • Renders via HTML/CSS in a WebView
  • Supports vertical text (Japanese subtitles)
  • Supports complex CSS styling
  • Higher overhead

The Canvas renderer measures and draws each Cue directly:

  1. Calculate cue box position from line, position, size
  2. Apply text styling from Spannable (bold, italic, color)
  3. Draw background window (if windowColor is set)
  4. Draw text with proper alignment
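As a rough sketch of step 1, fractional cue geometry maps to pixels roughly like this. Illustrative assumptions: position anchors the horizontal center of the cue box, and line is a fraction of the view height; the real layout logic also handles anchor types, line counts, and text measurement:

```java
// Illustrative only: map fractional Cue geometry to a pixel rect.
public class CueLayout {
    record CueBox(int left, int top, int width) {}

    // position, line, size are fractions of the viewport, as in Cue.
    // Assumption for this sketch: position anchors the box's horizontal center.
    static CueBox layoutCue(float position, float line, float size,
                            int viewWidth, int viewHeight) {
        int width = (int) (size * viewWidth);
        int left = (int) (position * viewWidth) - width / 2;
        int top = (int) (line * viewHeight);
        return new CueBox(left, top, width);
    }

    public static void main(String[] args) {
        // A cue centered horizontally, 80% down the frame, half the frame wide:
        System.out.println(layoutCue(0.5f, 0.8f, 0.5f, 1920, 1080));
        // CueBox[left=480, top=864, width=960]
    }
}
```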

The WebView renderer converts each Cue to HTML + CSS and loads it into an invisible WebView — heavier, but necessary for features like vertical text rendering in Japanese subtitles (see: Improved Japanese subtitle support).

Vertical and horizontal Japanese cues rendered via WebViewSubtitleOutput. Images from Improved Japanese subtitle support by Ian Baker, AndroidX Media3 team.

Real-World Scenarios

VOD with Sideloaded Subtitles

User hits play
  → DefaultMediaSourceFactory sees SubtitleConfiguration
  → Creates ProgressiveMediaSource + SubtitleExtractor
  → Downloads entire .vtt file via DataSource
  → SubtitleExtractor → SubtitleParser.parse() → CuesWithTiming
  → CueEncoder serializes each cue to byte samples in SampleQueue
  → MergingMediaSource merges subtitle track with video/audio
  → TextRenderer reads samples, feeds to MergingCuesResolver

Key property: Subtitle parsing happens on the loading thread, not the playback thread.

HLS Live with In-stream Subtitles

Live stream playing
  → HLS playlist has #EXT-X-MEDIA:TYPE=SUBTITLES
  → Each segment contains WebVTT subtitle data
  → Extractor uses SubtitleTranscodingExtractorOutput to parse on the fly
  → SubtitleParser converts raw WebVTT to CueEncoder-encoded samples
  → New cues added to CuesResolver incrementally per segment
  → CuesResolver.discardCuesBeforeTimeUs() keeps memory bounded

Key property: Subtitles arrive incrementally, no full file needed.
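The memory-bounding step can be sketched as follows. This is toy code; the real CuesResolver implementations are internal to Media3:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch of bounded cue storage for live playback (not the real resolver):
// cues that ended before the current position are discarded as playback advances.
public class BoundedCues {
    record TimedCue(long startUs, long endUs, String text) {}

    private final Deque<TimedCue> cues = new ArrayDeque<>();

    void addCues(TimedCue cue) { cues.addLast(cue); }

    // Mirrors the role of CuesResolver.discardCuesBeforeTimeUs: drop elapsed cues.
    void discardCuesBeforeTimeUs(long timeUs) {
        while (!cues.isEmpty() && cues.peekFirst().endUs() <= timeUs) {
            cues.removeFirst();
        }
    }

    int size() { return cues.size(); }

    public static void main(String[] args) {
        BoundedCues resolver = new BoundedCues();
        for (int i = 0; i < 100; i++) { // 100 one-second cues from a live stream
            resolver.addCues(new TimedCue(i * 1_000_000L, (i + 1) * 1_000_000L, "cue " + i));
        }
        resolver.discardCuesBeforeTimeUs(90_000_000L); // playback is at 90s
        System.out.println(resolver.size()); // 10
    }
}
```

Without the discard step, an all-day live stream would accumulate every cue it ever received.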

CEA-608 Closed Captions

US broadcast content
  → CEA-608 data embedded in video stream (line 21 of NTSC signal)
  → TsExtractor detects CC track
  → TextRenderer creates Cea608Decoder (legacy path only)
  → Decoder maintains internal state machine (CC is stateful!)
  → Cues replace each other (ReplacingCuesResolver behavior)

Key property: CEA-608 is byte-oriented and stateful — fundamentally different from file-based subtitles.

The Engineering Trade-offs

Why Not Just Pass Strings?

A subtitle cue is never “just a string.” As we saw in the format overview, even the simplest SRT block carries timing information, and formats like ASS add positioning, rotation, transparency, and animation. The Cue class handles all of this, making it the universal currency of the subtitle system.

Why Encode → Transport → Decode?

The serialize/deserialize round trip exists because Media3 treats subtitles as first-class media samples. This enables:

  • Reuse of the existing sample infrastructure (SampleQueue, SampleStream, buffering, seeking) with no subtitle-specific plumbing
  • Parsing on the loading thread, with only cheap deserialization on the playback thread
  • A single uniform sample format (application/x-media3-cues) regardless of the source subtitle format

Why Two CuesResolver Strategies?

Different subtitle standards have fundamentally different semantics:

  • Most text formats (WebVTT, SRT, SSA) allow cues to overlap in time, so simultaneous cues must be merged
  • CEA-608 replaces the visible caption wholesale with each update, so merging would display stale text

Rather than forcing all formats into one model, Media3 lets each parser declare its behavior via CueReplacementBehavior and picks the right resolver automatically.

Wrapping Up

Media3’s subtitle system is a case study in thoughtful media engineering. It bridges a remarkably diverse set of standards — from the formal W3C specifications of WebVTT and TTML, to the community-driven SRT format with no formal spec, to the decades-old CEA-608 broadcast protocol — all through a unified pipeline.

From the streaming Consumer<CuesWithTiming> parser API to the MergingCuesResolver vs ReplacingCuesResolver strategy pattern, every layer reflects real-world constraints shaped by decades of timed text standards.

The next time you see subtitle text at the bottom of a video, you’ll know it traveled through at least 7 pipeline stages, survived a Parcel serialization round trip, and won a timing battle inside a CuesResolver — all in under 16 milliseconds.


#android #media3 #exoplayer #subtitle #video