

Last Updated: Jul 09, 2019

Basic terms

Before getting started with ApsaraVideo for VOD, understand the following basic terms: file format, container format, and codec.

File format

Most file names in an operating system contain file name extensions, for example, .doc, .jpg, and .avi. File name extensions indicate the file formats. They are used to associate files with applications in the operating system, so that the files can be recognized and opened by corresponding applications. Common file name extensions for videos include .avi, .mpg, and .mp4. Video files can be associated with and played by the video player installed on your computer.

Container format

A container format specifies how compressed video and audio tracks and metadata (basic information such as the title and subtitles) are packaged into a file.

Container formats are divided into storage-oriented formats and streaming media-oriented formats.

  • Common storage-oriented container formats include AVI, ASF (WMA or WMV), MP4, MKV, and RMVB (RM or RA).
  • Common streaming media-oriented container formats include FLV and TS (both of which must be used with a streaming media network transmission protocol such as RTMP or HLS), and MP4 (which must be used with HTTP).

The following section describes some key streaming media network transmission protocols and streaming media-oriented container formats.

  • MP4: a classic video container format that is well supported on most terminals, including mobile devices (iOS and Android) and web browsers on PCs. However, the file header of an MP4 file is large and complex. If the duration of an MP4 file is long, for example, lasting several hours, the large file header affects the video loading speed. Therefore, the MP4 format is more suitable for short videos.

    An MP4 file consists of boxes (formerly known as atoms) that hold all media description metadata, such as the media arrangement information and time information. The metadata provides references to media data such as video frames, and the arrangement of the media data in the boxes is described in the metadata of the primary file. The longer the video, the larger the file header and the slower the loading.

  • HTTP Live Streaming (HLS): an HTTP-based streaming media network transmission protocol developed by Apple. By default, it uses the TS container format and divides a stream into multiple TS fragments. It also defines an M3U8 index file (a text file) to control playback. Unlike the MP4 format, the TS format does not require a long time to buffer header data, which makes HLS suitable for VOD scenarios. HLS is well supported on mobile devices (iOS and Android), but is incompatible with Internet Explorer on PCs. Therefore, HLS requires a specifically developed player on PCs. (We recommend that you use ApsaraVideo Player for Web.)
  • FLV: a standard introduced by Adobe. It is well supported by Flash Player on PCs, but can be supported on mobile devices only after a dedicated player is implemented. (We recommend that you use ApsaraVideo Player.) Browsers on most mobile devices, especially Apple’s iOS devices, do not support FLV.
  • Dynamic Adaptive Streaming over HTTP (DASH): uses the fragmented MP4 (fMP4) format to divide an MP4 video into multiple fragments. Each fragment can have its own codec settings, such as the definition and bitrate. A player can play the required fragment to achieve bitrate adaptation and seamless switching between different definitions, providing better playback experience. In DASH, the media presentation description (MPD) file provides functions similar to those of the M3U8 file in HLS. Many video websites, such as YouTube and Netflix, use DASH to stream their content.
  • HTTP Live Streaming with fragmented MP4 (HLS+fMP4): essentially the HLS protocol. At WWDC 2016, Apple announced that the new HLS standard supports the fMP4 format in the same way as the TS format. This means that a video can be coded once into fMP4 fragments and then delivered through both HLS and DASH.
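
The role of the M3U8 index file in HLS can be illustrated with a small sketch. The playlist below is a hypothetical example (the segment names and durations are made up), and `parse_media_playlist` is an illustrative helper, not part of any ApsaraVideo SDK. It only handles the `#EXTINF` tag, which carries the duration of the segment that follows it.

```python
# Hypothetical media playlist; segment names and durations are made up.
PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXTINF:4.5,
segment2.ts
#EXT-X-ENDLIST
"""


def parse_media_playlist(text):
    """Return a list of (uri, duration_seconds) pairs from a media playlist."""
    segments = []
    pending_duration = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXTINF:"):
            # "#EXTINF:<duration>,[<title>]" precedes each segment URI.
            pending_duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif line and not line.startswith("#") and pending_duration is not None:
            segments.append((line, pending_duration))
            pending_duration = None
    return segments


segments = parse_media_playlist(PLAYLIST)
total_seconds = sum(duration for _, duration in segments)
print(len(segments), total_seconds)
```

A player reads this index first and then requests the listed TS fragments one by one, which is why playback can start without loading a large file header.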

We recommend that you use HLS (including HLS+fMP4) and DASH because they are the most commonly used adaptive streaming media technologies.
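
As a rough illustration of bitrate adaptation, the following sketch picks a stream from a bitrate ladder based on measured throughput. The ladder values, the `safety` factor, and the function name are assumptions made for this example. Real players use more elaborate logic (for example, buffer-based heuristics).

```python
def pick_rendition(bitrates_kbps, throughput_kbps, safety=0.8):
    """Pick the highest bitrate that fits within a fraction of the throughput."""
    budget = throughput_kbps * safety
    affordable = [b for b in sorted(bitrates_kbps) if b <= budget]
    # Fall back to the lowest bitrate when nothing fits the budget.
    return affordable[-1] if affordable else min(bitrates_kbps)


ladder = [400, 800, 1500, 3000]  # hypothetical bitrate ladder in kbit/s
print(pick_rendition(ladder, 2500))
print(pick_rendition(ladder, 300))
```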


Codec

A codec is a program or device that compresses and decompresses digital video. Such compression is generally lossy. The term can also refer to the compression technology used to convert a video from one format to another. Common codecs are as follows:

  1. H.26X family: Its development is led by ITU. The family includes H.261, H.262, H.263, H.264, and H.265.

    • H.261: mainly used in earlier videoconferencing and videophone products.
    • H.263: mainly used in video conferences, video calls, and online videos.
    • H.264: also known as MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC). It is a video compression standard and a format widely used for recording, compressing, and publishing high-definition videos.
    • H.265: also known as High Efficiency Video Coding (HEVC). It is a video compression standard designed as the successor to H.264/MPEG-4 AVC. Compared with H.264/MPEG-4 AVC, HEVC improves video quality and can achieve twice the compression ratio (that is, a 50% bitrate reduction at the same level of video quality). It supports resolutions up to 8192 × 4320, including 8K UHD, which is the current development trend.
  2. MPEG family: Its development is led by the Moving Picture Experts Group (MPEG), which is affiliated with ISO. The family includes the following video compression standards:

    • MPEG-1 Part 2: mainly used in VCD and some online videos. The video quality provided by this standard is similar to that of VHS.
    • MPEG-2 Part 2: equivalent to H.262. It is mainly used in DVD, SVCD, and most digital video broadcast systems and cable distribution systems.
    • MPEG-4 Part 2: used in network transmission, broadcast, and media storage. Compared with MPEG-2 and H.263 V1, this standard provides better compression performance.
    • MPEG-4 Part 10: has the same technical content as ITU-T H.264. ITU-T and MPEG jointly developed this standard. ITU-T named it H.264, while ISO and IEC named it MPEG-4 AVC.
  3. Audio Video Coding Standard (AVS): a series of digital audio and video coding standards formulated by the Audio and Video Coding Standard Workgroup of China. Currently, two generations of AVS standards have been formulated.

    • The first-generation AVS includes “Information Technology, Advanced Audio Video Coding, Part 2: Video” (AVS1) and “Information Technology, Advanced Audio Video Coding Part 16: Radio Television Video” (AVS+). AVS+ provides the same compression efficiency as H.264 or MPEG-4 AVC High Profile.
    • The second-generation AVS (AVS2) is mainly designed for efficiently compressing UHD (4K or above) and high dynamic range (HDR) videos. The compression efficiency of AVS2 is double that of AVS+ and H.264/MPEG-4 AVC, and surpasses that of H.265/HEVC.
  4. Other series: include VP8 and VP9 (developed under the lead of Google) and RealVideo (introduced by RealNetworks). They are rarely used in online videos and are not described in this topic.

When selecting a codec, you need to fully consider its compatibility with terminals, such as mobile applications and web browsers. Use the most common and widely supported codecs whenever possible. ApsaraVideo for VOD supports the following video codecs: H.264 or MPEG-4 AVC (default) and H.265 or HEVC. In addition, ApsaraVideo for VOD supports the following audio codecs: MP3 (default), AAC, VORBIS, and FLAC.

Basic transcoding terms


Transcoding

Video transcoding refers to the process of converting a coded video stream to another video stream to adapt to different network bandwidths, terminal processing capabilities, and user needs. Transcoding is essentially a process of decoding followed by coding. The streams before and after transcoding may use the same or different video coding standards.

Container format conversion

Container format conversion refers to the process of converting an audio or video file from one container format to another, for example, from AVI to MP4. The compressed video and audio streams are extracted from the file in the original container format and then packaged into a file in the target container format. No coding or decoding is involved in this process. Compared with transcoding, container format conversion has the following features:

  • Fast processing. The process of decoding and coding audio and video files is complex, which occupies most of the transcoding time. Container format conversion does not require coding or decoding, greatly reducing the processing time.
  • No loss of audio or video quality. Without the decoding (decompression) and coding (compression) process, container format conversion has no impact on the original audio or video quality.

The new file is almost the same as the original file in the resolution and bitrate. Therefore, the new file is also called the original-quality file when it is played.
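
The difference between the two operations can be sketched with the open-source FFmpeg command-line tool, which is used here only as an illustration and is not mentioned in this topic. The helper names and file names are hypothetical. The functions build argument lists without running them. Note that stream copy only succeeds when the target container supports the codecs of the source streams.

```python
def remux_command(src, dst):
    """Container format conversion: copy the compressed streams unchanged."""
    # "-c copy" tells FFmpeg to skip decoding and coding for all streams.
    return ["ffmpeg", "-i", src, "-c", "copy", dst]


def transcode_command(src, dst, video_bitrate="1M"):
    """Transcoding: decode, then code video as H.264 and audio as AAC."""
    return ["ffmpeg", "-i", src,
            "-c:v", "libx264", "-b:v", video_bitrate,
            "-c:a", "aac", dst]


print(" ".join(remux_command("input.avi", "output.mp4")))
print(" ".join(transcode_command("input.avi", "output.mp4")))
```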

Video coding terms


Bitrate

The bitrate refers to the amount of data that a video file uses per unit time. It is the most important item for image quality control in video coding. The bitrate is measured in bits per second (bit/s). The commonly used units are kbit/s and Mbit/s. Generally, the higher the bitrate of a video file at the same resolution, the smaller the compression ratio and the higher the image quality. A higher bitrate keeps more data per unit time, so the coded video is closer to the original file and the image quality and video definition are better. However, a higher bitrate also places a higher requirement on the decoding capability of the playback device.

However, the higher the bitrate, the larger the file size. Their relationship is reflected by the following formula: File size = Time × Bitrate/8. For example, if a common 60-minute 720p online video file has a bitrate of 1 Mbit/s, its size is about 450 MB (3,600 seconds × 1 Mbit/s/8).
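
The formula above can be checked with a few lines of code. This is a minimal sketch in which the function name is illustrative and 1 MB is treated as 8 Mbit, as in the example above.

```python
def file_size_mb(duration_seconds, bitrate_mbit_per_s):
    """File size (MB) = time (s) x bitrate (Mbit/s) / 8."""
    return duration_seconds * bitrate_mbit_per_s / 8


# A 60-minute video at 1 Mbit/s, as in the example above.
print(file_size_mb(60 * 60, 1))
```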


Resolution

The resolution refers to the capability to distinguish details of a video. It is the number of pixels in each dimension. For example, 1280 × 720 means the width is 1280 pixels and the height is 720 pixels. The resolution determines how fine the image details of a video are. In general, a video with a higher resolution contains more pixels and has clearer images.

The resolution is a main factor that determines the bitrate. Videos with different resolutions use different bitrates. In general, the higher the resolution of a video, the higher the required bitrate, but this is not always the case. Each resolution corresponds to a proper range of bitrates. If the bitrate is below the lower limit of this range, the image lacks detail and the video quality is poor. If the bitrate is above the upper limit of this range, network traffic and storage space are wasted, while the video quality improves only slightly or not at all.

Frame rate

The frame rate measures the number of video frames displayed per unit time, that is, the number of images refreshed per second. The unit is frames per second (FPS) or Hz.

A higher frame rate produces a smoother and more lifelike video. In general, 25 to 30 FPS is acceptable. When the frame rate increases to 60 FPS, the interaction and lifelikeness are significantly improved. However, frame rates over 75 FPS do not noticeably increase the smoothness. Using a frame rate higher than the refresh rate of your display only wastes graphics processing capability because the display cannot refresh itself at that frame rate. The higher the frame rate at the same resolution, the higher the requirement on the processing capability of the graphics card.


GOP

A group of pictures (GOP) is a group of continuous pictures in an MPEG-coded video or video stream. It starts with an I-frame and ends immediately before the next I-frame. A GOP contains the following types of frames:

  • I-frame (intra coded picture): the keyframe. An I-frame contains all the information needed to produce the picture for that frame. It is decoded independently of all other pictures. It can be regarded as a static picture. The first frame in the video sequence is always an I-frame, and each GOP starts with an I-frame.
  • P-frame (predictive coded picture): A P-frame is coded with reference to a preceding frame, which may be an I-frame or a P-frame, and contains motion-compensated difference information relative to that frame. During decoding, the difference defined by the current P-frame is superimposed on the previously cached picture to generate the final picture. Compared with I-frames, P-frames occupy fewer data bits. However, P-frames are sensitive to transmission errors because of their complex dependencies on the previous P and I reference frames.
  • B-frame (bidirectionally predictive coded picture): A B-frame contains motion-compensated difference information relative to the previous and subsequent frames. During decoding, the data of the current B-frame is superimposed with both the previously cached picture and the decoded subsequent picture to generate the final picture. B-frames provide a high compression ratio, but require high decoding performance.

The GOP value indicates the interval of keyframes, that is, the distance between two Instantaneous Decoding Refresh (IDR) frames or the maximum number of frames in a frame group. Generally, at least one keyframe is required for each second of video. Adding more keyframes improves video quality, but results in increased bandwidth consumption and higher network load. The GOP value (number of frames) divided by the frame rate is the time interval. For example, the default GOP value of ApsaraVideo for VOD is 250 frames and the frame rate is 25 FPS. Therefore, the time interval is 10 seconds.
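
The relationship between the GOP value, the frame rate, and the keyframe interval can be sketched as follows. The function name is illustrative.

```python
def keyframe_interval_seconds(gop_frames, frame_rate_fps):
    """Time between keyframes = GOP value (frames) / frame rate (FPS)."""
    return gop_frames / frame_rate_fps


# Default values mentioned above: GOP of 250 frames at 25 FPS.
print(keyframe_interval_seconds(250, 25))
```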

The GOP value must be within a proper range to achieve a balance among the video quality, file size (indicating the bandwidth consumption), and seeking effect (indicating the speed of responding to the drag and fast-forward operations).

  • Increasing the GOP value reduces the file size. However, if the GOP value is too large, the last frames of a GOP are distorted, affecting the video quality.
  • The GOP value is also a key factor in determining the speed of responding to seeking in a video. During seeking, the player needs to locate the closest keyframe before the specified position. If the GOP value is too large, the distance between the specified position and the closest keyframe is long, and more predictive frames need to be decoded. As a result, the time (buffering time) required to respond to seeking is long.
  • Coding P-frames and B-frames is more complex than coding I-frames. A large GOP value results in many P-frames and B-frames, which slows down coding.
  • However, if the GOP value is too small, the bitrate of the video must be increased to ensure that the image quality is not reduced. This increases the bandwidth consumption.
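
The effect of the GOP value on seeking can be estimated with a simplified model. This sketch assumes a fixed GOP size, a keyframe at frame 0, and decoding in display order (B-frame reordering is ignored). The function name is illustrative.

```python
def frames_to_decode_for_seek(target_frame, gop_frames):
    """Frames decoded to show target_frame, starting at the preceding keyframe."""
    keyframe = (target_frame // gop_frames) * gop_frames
    return target_frame - keyframe + 1


# Seeking to frame 249 with a GOP of 250 frames versus a GOP of 25 frames.
print(frames_to_decode_for_seek(249, 250))
print(frames_to_decode_for_seek(249, 25))
```

Under these assumptions, the larger GOP forces the player to decode ten times as many frames before it can display the requested position.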

IDR frame alignment

An IDR frame is a special type of I-frame. P-frames and B-frames after a normal I-frame can reference other I-frames before this I-frame. However, no frame after an IDR frame can reference other frames before the IDR frame. For the purpose of conveniently controlling coding and decoding, the first I-frame of a frame sequence is specified as an IDR frame.

An IDR frame tells the decoder to immediately refresh the reference frame buffer. In this way, errors in the frames before the IDR frame are not propagated to the frames after the IDR frame. An IDR frame and the frames after it are coded as a new sequence. An IDR frame can also be used for random access, while a normal I-frame does not support this feature. A player often allows users to seek to a random position (that is, to drag the progress slider to a random position). In this case, it is most convenient for the player to start playback from an IDR frame near the specified position. This avoids complicated reverse parsing because no frame after the IDR frame references frames before it.

When transcoding a video to multiple videos in different bitrates, you can enable IDR frame alignment. In this way, the IDR frames of all output videos are accurately aligned in terms of the time point and frame content. Then, a player can smoothly switch among the videos in different bitrates without obvious lags.
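
The idea of IDR frame alignment can be sketched by comparing the IDR timestamps of several output videos. The stream names and timestamps below are hypothetical, and integer seconds are used to keep the comparison exact.

```python
def aligned_switch_points(idr_times_by_stream):
    """Return the timestamps at which every stream has an IDR frame."""
    time_sets = [set(times) for times in idr_times_by_stream.values()]
    return sorted(set.intersection(*time_sets))


# Hypothetical IDR timestamps (seconds) for three output videos.
streams = {
    "low": [0, 2, 4, 6, 8],
    "high": [0, 2, 4, 6, 8],
    "unaligned": [0, 3, 6, 9],
}
aligned = {name: streams[name] for name in ("low", "high")}
print(aligned_switch_points(aligned))
print(aligned_switch_points(streams))
```

When the IDR frames are aligned, every IDR position is a valid switch point. When they are not, a player can switch cleanly only at the few timestamps that happen to coincide.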


Profile

A profile defines a set of capabilities targeting a specific class of applications. H.264 includes the following profiles:

  • Baseline Profile: uses I-frames and P-frames and supports only progressive videos and context-adaptive variable-length coding (CAVLC). It is used in low-cost applications or applications that require additional data loss robustness, for example, some videoconferencing and mobile applications.
  • Main Profile: uses I-frames, P-frames, and B-frames, supports progressive and interlaced videos, and supports CAVLC and context-adaptive binary arithmetic coding (CABAC). It is used in mainstream consumer electronic products, such as MP4 players, portable video players, PSPs, and iPods with relatively low decoding capabilities.
  • High Profile: supports 8 × 8 intra-prediction, custom quantization, lossless video coding, and more YUV formats (such as 4:4:4) in addition to the features of the Main Profile. It is used in broadcast and disc storage applications, particularly in high-definition television applications. For example, this profile is adopted by the Blu-ray Disc storage format.

Bit rate

The bit rate is the number of bits transmitted per second, measured in bits per second (bit/s). A bit is the smallest binary unit, with a value of either 0 or 1. The higher the bit rate, the larger the amount of data transmitted. In the video field, the bit rate is the same as the bitrate: it indicates the number of bits required to represent one second of coded (compressed) audio or video data. The higher the bit rate, the better the audio or video quality, but the larger the coded file. The lower the bit rate, the smaller the file size.

Bitrate control method

Bitrate control methods refer to the methods of controlling the bitrate of a coded stream. Common bitrate control methods are as follows:

  • Variable bitrate (VBR): When this method is used, the bitrate is not fixed. When compressing a video or audio file, the video or audio compression software dynamically determines the bitrate to be used based on the complexity of the video or audio data. This method considers both the quality and the file size.

  • Constant bitrate (CBR): When this method is used, the bitrate is fixed throughout the coded stream. Compared with VBR- and ABR-compressed files, CBR-compressed files are larger in size, and may not have significantly improved quality.

  • Average bitrate (ABR): a variation of VBR with interpolation parameters added. It was introduced by LAME to address the poor size-quality ratio of CBR and the unpredictable file size of VBR. For a given file size, ABR divides a stream into parts of 50 frames each (at about 30 FPS), and uses relatively low bitrates to code less complex parts and high bitrates to code more complex parts. ABR can be considered a compromise between VBR and CBR.

    The bitrate reaches the specified value when averaged over a certain time range, but the peak bitrate in some parts may exceed the specified value. In other words, ABR keeps the average output bitrate within a proper range and dynamically codes videos within this range based on the complexity. ABR is the default bitrate control method used by Alibaba Cloud.
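
The difference between a constant bitrate and an average bitrate can be illustrated with a toy allocation. This is not the algorithm used by LAME or by any real encoder. The complexity scores and the function name are assumptions made for this example.

```python
def allocate_bitrates(complexities, target_avg_kbps):
    """Scale each part's bitrate by its complexity while keeping the mean at the target."""
    mean_complexity = sum(complexities) / len(complexities)
    return [target_avg_kbps * c / mean_complexity for c in complexities]


complexities = [1.0, 0.5, 2.0, 0.5]  # hypothetical per-part complexity scores
abr_like = allocate_bitrates(complexities, 1000)
cbr = [1000] * len(complexities)

print(abr_like, sum(abr_like) / len(abr_like))
print(cbr, sum(cbr) / len(cbr))
```

Both allocations have the same average bitrate, and therefore about the same file size, but the first one spends more bits on the complex part and its peak exceeds the specified value.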

Audio coding terms


Codec

For the definition of a codec, see the description of codecs in the preceding section. Audio codecs are classified into lossy codecs and lossless codecs. According to the sampling theorem, a digitized signal can only approximate a natural signal, however closely. In that strict sense, all audio codecs are lossy. In the computer field, pulse-code modulation (PCM) is conventionally regarded as lossless because it achieves the highest fidelity. Common audio codecs on the Internet, such as MP3 and AAC, are lossy codecs.

Sample rate

The sample rate, or sample frequency, is the number of samples taken per second from a continuous-time (analog) signal to form a discrete-time (digital) signal. The unit is Hz. The higher the sample rate, the more real and natural the sound.


Bitrate

For more information, see the description of the bitrate in the video coding terms.

Sound channel

A sound channel is an independent audio signal that is captured or played back at a distinct spatial position during recording or playback. The number of sound channels refers to the number of sound sources during recording or the number of speakers during playback.
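
For uncompressed PCM audio, the sample rate, the number of sound channels, and the bitrate are related in a simple way. The sketch below additionally assumes a bit depth (bits per sample), a term that is not covered in this topic. The function name is illustrative.

```python
def pcm_bitrate_bits_per_s(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bitrate = sample rate x bits per sample x channels."""
    return sample_rate_hz * bits_per_sample * channels


# Two-channel (stereo) audio sampled at 44,100 Hz with 16 bits per sample.
print(pcm_bitrate_bits_per_s(44100, 16, 2))
```

Lossy codecs such as MP3 and AAC reduce this figure substantially, which is why their typical bitrates are far lower.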

Other terms

UTC (ISO 8601 standard time format)

Coordinated Universal Time (UTC) is also known as the world unified time, world standard time, and international coordinated time. The English acronym (CUT) and the French acronym (TUC) differ, so the acronym UTC is used as a compromise. UTC is a time metering system that is based on atomic seconds and is kept as close as possible to the universal time. Mainland China adopts the standard "Data elements and interchange formats - Information interchange - Representation of dates and times" (ISO 8601:1988 or GB/T 7408-1994), and refers to UTC as international coordinated time.

Unless otherwise specified, all time fields that are returned and time parameters in API requests in ApsaraVideo for VOD are in the UTC format by default. The time format is YYYY-MM-DDThh:mm:ssZ in accordance with ISO 8601. For example, 2017-01-11T12:00:00Z indicates 20:00:00 on January 11, 2017 in China Standard Time. The difference between China Standard Time and UTC is eight hours. Therefore, UTC+8 indicates China Standard Time.
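
The conversion in the preceding example can be reproduced with the Python standard library. This is a minimal sketch in which `parse_utc` is an illustrative helper, not part of any ApsaraVideo SDK.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # China Standard Time, UTC+8


def parse_utc(timestamp):
    """Parse a YYYY-MM-DDThh:mm:ssZ string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)


utc_time = parse_utc("2017-01-11T12:00:00Z")
local_time = utc_time.astimezone(CST)
print(local_time.isoformat())
```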