This document serves as a starting point for understanding the design and implementation of the Ogg container format. If you're new to Ogg or merely want a high-level technical overview, start reading here. Other documents linked from the index page give distilled technical descriptions and references of the container mechanisms. This document is intended to aid understanding.
Ogg is intended to be a simplest-possible container, concerned only with framing, ordering, and interleave. It can be used as a stream delivery mechanism, for media file storage, or as a building block toward implementing a more complex, non-linear container (for example, see the Skeleton or Annodex/CMML).
The Ogg container is not intended to be a monolithic 'kitchen-sink'. It exists only to frame and deliver in-order stream data and as such is vastly simpler than most other containers. Elementary and multiplexed streams are both constructed entirely from a single building block (an Ogg page) composed of eight fields totalling twenty-seven bytes (the page header), a list of packet lengths (up to 255 bytes), and payload data (up to 65025 bytes). The structure of every page is the same. There are no optional fields or alternate encodings.
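As a minimal sketch of the page structure described above, the following Python reads one page from a byte buffer. The field order and widths follow the fixed 27-byte header layout given in RFC 3533; the names `PAGE_HEADER` and `parse_page` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
import struct

# Fixed page header, little-endian, 27 bytes (layout per RFC 3533):
# capture pattern, version, header flags, granule position,
# stream serial number, page sequence number, CRC, segment count.
PAGE_HEADER = struct.Struct("<4sBBqIIIB")

def parse_page(buf, pos=0):
    """Parse the page starting at buf[pos]; return (fields, total_page_size)."""
    (magic, version, flags, granule,
     serial, seqno, crc, nsegs) = PAGE_HEADER.unpack_from(buf, pos)
    if magic != b"OggS":
        raise ValueError("no capture pattern at this offset")
    table_start = pos + PAGE_HEADER.size
    lacing = buf[table_start:table_start + nsegs]   # list of packet lengths
    if len(lacing) != nsegs:
        raise ValueError("truncated segment table")
    body_len = sum(lacing)                          # at most 255 * 255 = 65025
    fields = {
        "version": version,
        "continued": bool(flags & 0x01),
        "bos": bool(flags & 0x02),                  # 'beginning of stream'
        "eos": bool(flags & 0x04),                  # 'end of stream'
        "granule": granule,
        "serial": serial,
        "sequence": seqno,
        "crc": crc,
        "body_len": body_len,
    }
    return fields, PAGE_HEADER.size + nsegs + body_len
```

Because every page has the same shape, the same function can be applied repeatedly, advancing by the returned size, to walk a whole stream. This sketch does not verify the checksum.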
Stream and media metadata is contained in Ogg and not built into the Ogg container itself. Metadata is thus compartmentalized and layered rather than part of a monolithic design, an especially good idea as no two groups seem able to agree on what a complete or complete-enough metadata set should be. In this way, the container and container implementation are isolated from unnecessary metadata design flux.
The Ogg container is primarily a streaming format, encapsulating chronological, time-linear mixed media into a single delivery stream or file. The design is such that an application can always encode and/or decode all features of a bitstream in one pass with no seeking and minimal buffering. Seeking to provide optimized encoding (such as two-pass encoding) or interactive decoding (such as scrubbing or instant replay) is not disallowed or discouraged; however, no container feature requires nonlinear access of the bitstream.
Ogg is designed to contain any size data payload with bounded, predictable efficiency. Ogg packets have no maximum size and a zero-byte minimum size. There is no restriction on size changes from packet to packet. Variable size packets do not require the use of any optional or additional container features. There is no optimal suggested packet size, though special consideration was paid to make sure 50-200 byte packets were no less efficient than larger packet sizes. The original design criterion was a 2% overhead at 50 byte packets, dropping to a maximum working overhead of 1% with larger packets, and a typical working overhead of 0.5-0.7% for most practical uses.
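The overhead figures above can be approximated with a little arithmetic. The sketch below is a simplified model, not the original design calculation. It assumes a 27-byte fixed header (per RFC 3533), one length byte per 255 bytes of packet plus a terminating byte, and pages that are always packed with the full 255 length entries. The function name `overhead` is hypothetical.

```python
HEADER_BYTES = 27          # fixed page header size (RFC 3533)
SEGMENTS_PER_PAGE = 255    # maximum entries in the packet-length list

def overhead(packet_len):
    """Approximate container overhead as a fraction of payload size.

    Assumes equal-size packets and fully packed pages (an idealisation).
    """
    if packet_len <= 0:
        raise ValueError("model only covers non-empty packets")
    # Each packet needs one length byte per full 255 bytes, plus a final one.
    length_bytes = packet_len // 255 + 1
    # Each length byte also carries its share of the page header.
    header_share = HEADER_BYTES / SEGMENTS_PER_PAGE
    return length_bytes * (1 + header_share) / packet_len

for size in (50, 200, 255, 4096):
    print(size, f"{overhead(size):.2%}")
```

Under these assumptions the model gives roughly 2.2% for 50-byte packets and stays below 1% from 255 bytes upward, which is in the same range as the figures quoted above. Real streams flush pages early for latency reasons, so actual overhead varies.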
Ogg is a byte-aligned container with no context-dependent, optional or variable-length fields. Ogg requires no repacking of codec data. The page structure is written out in-line as packet data is submitted to the streaming abstraction. In addition, it is possible to implement both Ogg mux and demux as MT-hot zero-copy abstractions (as is done in the Tremor sourcebase).
Ogg is designed for efficient and immediate stream capture with high confidence. Although packets have no size limit in Ogg, pages are a maximum of just under 64kB, meaning that any Ogg stream can be captured with confidence after seeing 128kB of data or less [worst case; typical figure is 6kB] from any random starting point in the stream.
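A minimal sketch of capture from an arbitrary byte offset might look like the following. It scans for the `OggS` capture pattern and accepts a candidate only if the page checksum verifies. The checksum parameters (polynomial 0x04C11DB7, zero initial value, no bit reflection, computed with the CRC field zeroed) follow the Ogg framing specification as I understand it; the function names are hypothetical.

```python
import struct

def ogg_crc(data):
    """Bitwise CRC-32 as used by Ogg pages (poly 0x04C11DB7, init 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc

def capture(buf, start=0):
    """Return the offset of the first verifiable page at or after start."""
    pos = start
    while True:
        pos = buf.find(b"OggS", pos)
        if pos < 0 or pos + 27 > len(buf):
            return None                       # no further candidate fits
        nsegs = buf[pos + 26]                 # number of length entries
        table_end = pos + 27 + nsegs
        page_end = table_end + sum(buf[pos + 27:table_end])
        if page_end <= len(buf):
            page = bytearray(buf[pos:page_end])
            stored = struct.unpack_from("<I", page, 22)[0]
            page[22:26] = b"\0\0\0\0"         # CRC is computed over a zeroed field
            if ogg_crc(page) == stored:
                return pos
        pos += 1                              # false sync: keep scanning
```

A false `OggS` match inside payload data fails the checksum test and is skipped, which is what allows capture to be confident rather than heuristic. A production implementation would use a table-driven CRC for speed.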
Ogg implements simple coarse- and fine-grained seeking by design.
Coarse seeking may be performed by simply 'moving the tone arm' to a new position and 'dropping the needle'. Rapid capture with accompanying timecode from any location in an Ogg file is guaranteed by the stream design. From the acquisition of the first timecode, all data needed to play back from that time code forward is ahead of the stream cursor.
Ogg implements full sample-granularity seeking using an interpolated bisection search built on the capture and timecode mechanisms used by coarse seeking. As above, once a search finds the desired timecode, all data needed to play back from that time code forward is ahead of the stream cursor.
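The interpolated bisection search can be sketched over a simplified model of a stream. Here a stream is a list of `(byte_offset, granule)` pairs, and `make_capture` stands in for the real capture step by returning the first page at or after a byte position. All names are hypothetical, and the fallback to plain bisection after a failed probe is one simple way to guarantee progress, not necessarily what any particular implementation does.

```python
import bisect

def make_capture(pages):
    """Simulate capture: first (offset, granule) page at or after byte_pos."""
    offsets = [off for off, _ in pages]
    def capture(byte_pos):
        i = bisect.bisect_left(offsets, byte_pos)
        return pages[i] if i < len(pages) else None
    return capture

def seek(capture, stream_len, target, end_granule):
    """Return the first page whose granule position is >= target, or None."""
    lo, hi = 0, stream_len          # the answer, if any, starts in [lo, hi]
    lo_g, hi_g = 0, end_granule     # granule estimates at those bounds
    best, plain_bisect = None, False
    while lo < hi:
        span = hi_g - lo_g
        if plain_bisect or span <= 0:
            frac = 0.5
        else:
            frac = min(max((target - lo_g) / span, 0.0), 1.0)
        guess = min(max(lo + int((hi - lo) * frac), lo), hi - 1)
        page = capture(guess)
        if page is None or page[0] >= hi:
            hi, plain_bisect = guess, True    # nothing new in [guess, hi)
        elif page[1] >= target:
            best, hi, hi_g = page, page[0], page[1]
            plain_bisect = False
        else:
            lo, lo_g = page[0] + 1, page[1]
            plain_bisect = False
    return best
```

Each probe either raises the lower bound or lowers the upper bound, so the loop always terminates. Once it does, playback can start from the returned page, since everything needed from that point on lies ahead of it in the stream.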
Both coarse and fine seeking use the page structure and sequencing inherent to the Ogg format. All Ogg streams are fully seekable from creation; seekability is unaffected by truncation or missing data, and is tolerant of gross corruption. Seek operations are neither 'fuzzy' nor heuristic.
Seeking without use of an index is a major point of the Ogg design. There are two primary reasons why Ogg transport forgoes an index:
In addition, it must be possible to create an Ogg stream in a single pass. Although an optional index can simply be tacked on the end of the created stream, some software groups object to end-positioned indexes and claim to be unwilling to support indexes not located at the stream beginning.
All this said, it's become clear that an optional index is a demanded feature. For this reason, the OggSkeleton now defines a proposed index.
Ogg multiplexes streams by interleaving pages from multiple elementary streams into a multiplexed stream in time order. The multiplexed pages are not altered. Muxing an Ogg AV stream out of separate audio, video and data streams is akin to shuffling several decks of cards together into a single deck; the cards themselves remain unchanged. Demultiplexing is similarly simple (as the cards are marked).
The goal of this design is to make the mux/demux operation as trivial as possible to allow live streaming systems to build and rebuild streams on the fly with minimal CPU usage and no additional storage or latency requirements.
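The card-shuffling analogy maps directly onto a merge of already-ordered sequences. The sketch below models a page as a `(time, serial, payload)` tuple (a simplification for illustration; real pages carry a granule position that the codec maps to time) and uses the standard library's `heapq.merge` to interleave whole pages without altering them.

```python
import heapq
from collections import defaultdict

def mux(*streams):
    """Interleave pages from several elementary streams in time order.

    Each stream is a list of (time, serial, payload) tuples, already in order.
    """
    return list(heapq.merge(*streams, key=lambda page: page[0]))

def demux(physical):
    """Sort pages back into per-stream lists using the serial number."""
    logical = defaultdict(list)
    for page in physical:
        logical[page[1]].append(page)
    return dict(logical)

audio = [(0.0, 101, b"a0"), (0.5, 101, b"a1"), (1.0, 101, b"a2")]
video = [(0.2, 202, b"v0"), (0.9, 202, b"v1")]
physical = mux(audio, video)
assert demux(physical) == {101: audio, 202: video}
```

Neither direction inspects or rewrites page contents, which is why the operation needs so little CPU and no extra storage. The sketch covers data pages only and leaves out the header ordering rules.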
Ogg streams belong to one of two categories, "Continuous" streams and "Discontinuous" streams.
A stream that provides a gapless, time-continuous media type with a fine-grained timebase is considered to be 'Continuous'. A continuous stream should never be starved of data. Examples of continuous data types include broadcast audio and video.
A stream that delivers data in a potentially irregular pattern or with widely spaced timing gaps is considered to be 'Discontinuous'. A discontinuous stream may be best thought of as data representing scattered events; although they happen in order, they are typically unconnected data often located far apart. One example of a discontinuous stream type is captioning such as Ogg Kate. Although it's possible to design captions as a continuous stream type, it's most natural to think of captions as widely spaced pieces of text with little happening between.
The fundamental reason for distinction between continuous and discontinuous streams concerns buffering.
A continuous stream is, by definition, gapless. Ogg buffering is based on the simple premise of never allowing an active continuous stream to starve for data during decode; buffering works ahead until all continuous streams in a physical stream have data ready and no further.
Discontinuous stream data is not assumed to be predictable. The buffering design takes discontinuous data 'as it comes' rather than working ahead to look for future discontinuous data for a potentially unbounded period. Thus, the buffering process makes no attempt to fill discontinuous stream buffers; their pages simply 'fall out' of the stream when continuous streams are handled properly.
Buffering requirements in this design need not be explicitly declared or managed in the encoded stream. The decoder simply reads as much data as is necessary to keep all continuous stream types gapless and no more, with discontinuous data processed as it arrives in the continuous data. Buffering is implicitly optimal for the given stream. Because all pages of all data types are stamped with absolute timing information within the stream, inter-stream synchronization timing is always maintained without the need for explicitly declared buffer-ahead hinting.
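The buffering rule can be sketched in a few lines. In this simplified model a page is a `(serial, payload)` tuple, `queues` maps serial numbers to pending pages, and `continuous` is the set of serial numbers the codecs have declared continuous. These names and the data model are assumptions made for illustration.

```python
from collections import deque

def fill_buffers(page_iter, queues, continuous):
    """Read pages until every continuous stream has data queued, then stop.

    Discontinuous pages are queued as they are encountered; no attempt is
    made to look ahead for more of them. Returns False if the physical
    stream ran out first.
    """
    while any(not queues.get(serial) for serial in continuous):
        page = next(page_iter, None)
        if page is None:
            return False
        queues.setdefault(page[0], deque()).append(page)
    return True
```

For example, with audio (serial 1) and video (serial 2) declared continuous and captions (serial 3) discontinuous, a call stops reading as soon as both audio and video have a page queued. Any caption pages passed along the way are already in their queue, having simply 'fallen out' of the stream.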
Ogg does not replicate codec-specific metadata into the mux layer in an attempt to make the mux and codec layer implementations 'fully separable'. Things like specific timebase, keyframing strategy, frame duration, etc, do not appear in the Ogg container. The mux layer is, instead, expected to query a codec through a centralized interface, left to the implementation, for this data when it is needed.
Though modern design wisdom usually prefers to predict all possible needs of current and future codecs then embed these dependencies and the required metadata into the container itself, this strategy increases container specification complexity, fragility, and rigidity. The mux and codec code becomes more independent, but the specifications become logically less independent. A codec can't do what a container hasn't already provided for. Novel codecs are harder to support, and you can do fewer useful things with the ones you've already got (eg, try to make a good splitter without using any codecs. Such a splitter is limited to splitting at keyframes only, or building yet another new mechanism into the container layer to mark what frames to skip displaying).
Ogg's design goes the opposite direction, where the specification is to be as simple, easy to understand, and 'proofed' against novel codecs as possible. When an Ogg mux layer requires codec-specific information, it queries the codec (or a codec stub). This trades a more complex implementation for a simpler, more flexible specification.
The Ogg container itself does not define a metadata system for declaring the structure and interrelations between multiple media types in a muxed stream. That is, the Ogg container itself does not specify data like 'which stream is the subtitle stream?' or 'which video stream is the primary angle?'. This metadata still exists, but is stored by the Ogg container rather than being built into the Ogg container itself. Xiph specifies the 'Skeleton' metadata format for Ogg streams, but this decoupling of container and stream structure metadata means it is possible to use Ogg with any metadata specification without altering the container itself, or without stream structure metadata at all.
Every Ogg page is stamped with a 64 bit 'granule position' that serves as an absolute timestamp for mux and seeking. A few nifty little tricks are usually also embedded in the granpos state, but we'll leave those aside for the moment (strictly speaking, they're part of each codec's mapping, not Ogg).
As mentioned above, granule positions are mapped into absolute timestamps by the codec, rather than being a hard timestamp. This allows maximally efficient use of the available 64 bits to address every sample/frame position without approximation while supporting new and previously unknown timebase encodings without needing to extend or update the mux layer. When a codec needs a novel timebase, it simply brings the code for that mapping along with it. This is not a theoretical curiosity; new, wholly novel timebases were deployed with the adoption of both Theora and Dirac. "Rolling INTRA" (keyframeless video) also benefits from novel use of the granule position.
Ogg codecs place raw compressed data into packets. Packets are octet payloads containing the data needed for a single decompressed unit, eg, one video frame. Packets have no maximum size and may be zero length. They do not generally have any framing information; strung together, the unframed packets form a logical bitstream of codec data with no internal landmarks.
Logical bitstream packets are grouped and framed into Ogg pages along with a unique stream serial number to produce a physical bitstream. An elementary stream is a physical bitstream containing only a single logical bitstream. Each page is a self contained entity, although a packet may be split and encoded across one or more pages. The page decode mechanism is designed to recognize, verify and handle single pages at a time from the overall bitstream.
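How packets are framed into a page body can be illustrated with the length encoding from the Ogg framing specification: a packet is recorded as a run of 255-valued entries followed by one entry below 255 (possibly zero). The sketch below is an illustration under that assumption; the function names are hypothetical and it handles a single page only.

```python
def lacing_values(packet_len):
    """Length entries for one packet: 255 per full segment, then the rest."""
    full, rest = divmod(packet_len, 255)
    return [255] * full + [rest]

def split_body(lacing, body):
    """Recover packets from one page body.

    Returns (complete_packets, partial), where partial holds the bytes of a
    packet that continues on the next page (empty if there is none).
    """
    packets, current, pos = [], bytearray(), 0
    for value in lacing:
        current += body[pos:pos + value]
        pos += value
        if value < 255:               # an entry below 255 ends the packet
            packets.append(bytes(current))
            current = bytearray()
    return packets, bytes(current)
```

Note that a zero-length packet costs exactly one entry of value 0, and a packet whose size is an exact multiple of 255 needs a trailing 0 entry to mark its end. A page whose final entry is 255 leaves its last packet open, which is how a packet is split across pages.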
Ogg Bitstream Framing specifies the page format of an Ogg bitstream, the packet coding process and elementary bitstreams in detail.
Multiple logical/elementary bitstreams can be combined into a single multiplexed bitstream by interleaving whole pages from each contributing elementary stream in time order. The result is a single physical stream that multiplexes and frames multiple logical streams. Each logical stream is identified by the unique stream serial number stamped in its pages. A physical stream may include a 'meta-header' (such as the Ogg Skeleton) comprising its own Ogg page at the beginning of the physical stream. A decoder recovers the original logical/elementary bitstreams out of the physical bitstream by taking the pages in order from the physical bitstream and redirecting them into the appropriate logical decoding entity.
Ogg Bitstream Multiplexing specifies proper multiplexing of an Ogg bitstream in detail.
Multiple Ogg physical bitstreams may be concatenated into a single new stream; this is chaining. The bitstreams do not overlap; the final page of a given logical bitstream is immediately followed by the initial page of the next.
Each logical bitstream in a chain must have a unique serial number within the scope of the full physical bitstream, not only within a particular link or segment of the chain.
Within Ogg, each stream must be declared (by the codec) to be continuous- or discontinuous-time. Most codecs treat all streams they use as either inherently continuous- or discontinuous-time, although this is not a requirement. A codec may, as part of its mapping, choose according to data in the initial header.
Continuous-time pages are stamped by end-time; discontinuous pages are stamped by begin-time. Pages in a multiplexed stream are interleaved in order of the time stamp regardless of stream type. Both continuous and discontinuous logical streams are used to seek within a physical stream; however, only continuous streams are used to determine buffering depth. Because discontinuous streams are stamped by start time, they will always 'fall out' at the proper time when buffering the continuous streams. See 'Examples' for an illustration of the buffering mechanism.
Multiplexing requirements within Ogg are straightforward. When constructing a single-link (unchained) physical bitstream consisting of multiple elementary streams:
The initial header for each stream appears in sequence, each header on a single page. All initial headers must appear with no intervening data (no auxiliary header pages or packets, no data pages or packets). Order of the initial headers is unspecified. The 'beginning of stream' flag is set on each initial header.
All auxiliary headers for all streams must follow. Order is unspecified. The final auxiliary header of each stream must flush its page.
Data pages for each stream follow, interleaved in time order.
The final page of each stream sets the 'end of stream' flag. Unlike initial pages, terminal pages for the logical bitstreams need not occur contiguously; indeed it may not be possible for them to do so.
Each grouped bitstream must have a unique serial number within the scope of the physical bitstream.
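Some of the rules above can be checked mechanically at the page level. The sketch below takes a sequence of `(serial, header_flags)` pairs for one unchained link and reports violations of the rules it can see: initial headers first and contiguous, unique serial numbers, and an 'end of stream' page for every stream. The flag values 0x02 and 0x04 are the 'beginning of stream' and 'end of stream' bits from the framing specification; the function name is hypothetical. Rules about auxiliary headers and time ordering need codec knowledge and are not covered.

```python
BOS, EOS = 0x02, 0x04

def check_link(pages):
    """Return a list of rule violations for one unchained physical stream."""
    problems = []
    open_streams, ended = set(), set()
    headers_done = False
    for index, (serial, flags) in enumerate(pages):
        if flags & BOS:
            if headers_done:
                problems.append(f"page {index}: initial header after other pages")
            if serial in open_streams or serial in ended:
                problems.append(f"page {index}: serial {serial} reused")
            open_streams.add(serial)
        else:
            headers_done = True
            if serial not in open_streams:
                problems.append(f"page {index}: serial {serial} not open")
        if flags & EOS:
            open_streams.discard(serial)
            ended.add(serial)
    problems.extend(f"serial {s} never ended" for s in sorted(open_streams))
    return problems
```

The terminal pages are allowed to appear at different points, so the check only requires that each stream ends eventually, not that the 'end of stream' pages are adjacent.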
Multiplexed and/or unmultiplexed bitstreams may be chained consecutively. Such a physical bitstream obeys all the rules of both chained and multiplexed streams. Each link, when unchained, must stand on its own as a valid physical bitstream. Chained streams do not mix or interleave; a new segment may not begin until all streams in the preceding segment have terminated.
Each codec is allowed some freedom in deciding how its logical bitstream is encapsulated into an Ogg bitstream (even if it is a trivial mapping, eg, 'plop the packets in and go'). This is the codec's mapping. Ogg imposes a few mapping requirements on any codec.
The framing specification defines 'beginning of stream' and 'end of stream' page markers via a header flag (it is possible for a stream to consist of a single page). A correct stream always consists of an integer number of pages, an easy requirement given the variable size nature of pages.
The first page of an elementary Ogg bitstream consists of a single, small 'initial header' packet that must include sufficient information to identify the exact CODEC type. From this initial header, the codec must also be able to determine its timebase and whether or not it is a continuous- or discontinuous-time stream. The initial header must fit on a single page. If a codec makes use of auxiliary headers (for example, Vorbis uses two auxiliary headers), these headers must follow the initial header immediately. The last header finishes its page; data begins on a fresh page.
As an example, Ogg Vorbis places the name and revision of the Vorbis CODEC, the audio rate and the audio quality into this initial header. Vorbis comments and detailed codec setup appear in the larger auxiliary headers.
Granule positions must be translatable to an exact absolute time value. As described above, the mux layer is permitted to query a codec or codec stub plugin to perform this mapping. It is not necessary for an absolute time to be mappable into a single unique granule position value.
Codecs are not required to use a fixed duration-per-packet (for example, Vorbis does not). The mux layer is permitted to query a codec or codec stub plugin for the time duration of a packet.
Although an absolute time need not be translatable to a unique granule position, a codec must be able to determine the unique granule position of the current packet using the granule position of a preceding packet.
Packets and pages must be arranged in ascending granule-position and time order.
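To make the granule position requirements concrete, here is a sketch of two possible codec mappings. The first treats the granule position as a running sample count, as an audio codec might. The second splits the 64 bits into a keyframe index and an offset from that keyframe, in the style of a video codec with a codec-declared shift. Both functions and their parameters are hypothetical illustrations, not the mapping of any specific codec.

```python
from fractions import Fraction

def sample_count_time(granule, rate):
    """Audio-style mapping: granule counts samples; rate is samples/second."""
    return Fraction(granule, rate)

def split_granule_time(granule, shift, fps):
    """Video-style mapping: high bits hold the index of the last keyframe,
    the low `shift` bits hold the number of frames since that keyframe."""
    keyframe = granule >> shift
    offset = granule & ((1 << shift) - 1)
    return Fraction(keyframe + offset) / fps

# Frame 67 reached from a keyframe at frame 64, and frame 67 as a keyframe:
# two different granule positions that map to the same absolute time.
a = (64 << 6) | 3
b = (67 << 6) | 0
assert split_granule_time(a, 6, 30) == split_granule_time(b, 6, 30)
```

The last lines show why an absolute time need not map to a single unique granule position, while every granule position still maps to one exact time. Exact rational arithmetic avoids rounding drift when timestamps are compared across streams.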
Below, we present an example of a multiplexed and chained bitstream:
In this example, we see pages from five total logical bitstreams multiplexed into a physical bitstream. Note the following characteristics: