Terms - ApsaraVideo VOD - Alibaba Cloud Documentation Center

This topic describes the terms that are commonly used in ApsaraVideo VOD, such as video format, codec, and transcoding.

File format

File names in operating systems end with suffixes known as file name extensions, such as .doc, .jpg, and .avi. The operating system uses the extension to determine which application opens a file. Common file name extensions for videos include .avi, .mpg, and .mp4. Video files that have these extensions are associated with and played by the video players that are installed on your PC.

Container format

A container format is a file format that allows compressed video streams, audio streams, and metadata to be packaged into a single file based on specific specifications. The metadata includes basic information about videos, such as titles and subtitles.

Container formats are used for storing or streaming videos.

The following container formats are used for storing videos: AVI, ASF (WMA or WMV), MP4, MKV, and RMVB (RM or RA).
The following container formats are used for streaming videos: Flash Video (FLV), Transport Stream (TS), and MP4. The FLV and TS formats must be used with streaming protocols: TS is typically carried over HTTP Live Streaming (HLS), and FLV over Real-Time Messaging Protocol (RTMP). The MP4 format must be used with the HTTP protocol.

The following section describes the container formats that are used for video streaming based on streaming protocols:

MP4: a classic video container format that is supported by multiple clients, including mobile devices (iOS and Android) and web browsers on PCs. However, the file header of an MP4 file has a large size and a complex structure. If the duration of an MP4 file is long, such as a few hours, the time that is required to load the video increases due to the large file header. Therefore, the MP4 format is more suitable for short videos.
An MP4 file consists of boxes (formerly known as atoms). The metadata boxes hold all media metadata, such as the media arrangement and time information, and provide references to the media data such as video frames. A player must read this metadata, which acts as the file header, before it can locate and play the media data. A longer video has more frames to index and therefore a larger file header, which increases the time that is required to load the video.
HLS: an HTTP-based network transmission protocol for media streaming developed by Apple Inc. By default, this protocol uses the TS container format to divide a stream into multiple TS fragments. This protocol also defines an M3U8 index file (text file) to control playback. HLS requires a shorter period of time to buffer header data and is suitable for playing on-demand videos. HLS is supported on mobile devices (iOS and Android) but is incompatible with Internet Explorer on PCs. Therefore, HLS requires a custom player that is developed for PCs. We recommend that you use ApsaraVideo Player for Web.
FLV: a standard format that is developed by Adobe. This format is supported by Flash Player on PCs, but is not supported on mobile devices unless a player app is installed. Browsers on most mobile phones such as iPhones do not support FLV. We recommend that you use ApsaraVideo Player.
Dynamic Adaptive Streaming over HTTP (DASH): a streaming standard that uses the fragmented MP4 (fMP4) format to divide an MP4 video into multiple fragments. Fragments can have different codec settings, such as resolution and bitrate. You can play different fragments to implement adaptive bitrate streaming and seamlessly switch between different resolutions. This provides a better playback experience. In DASH, the media presentation description (MPD) file is analogous to an HLS M3U8 file. Many video websites such as YouTube and Netflix use DASH to stream video content.
HLS with fMP4: the HLS protocol in essence. During the Apple Worldwide Developers Conference (WWDC) in 2016, Apple Inc. announced that the new HLS standard supports the fMP4 format in a similar manner as the TS format. This allows video files to be transcoded and packaged in DASH and HLS formats at the same time.
HLS (including HLS with fMP4) and DASH are the most commonly used adaptive streaming technologies. We recommend that you use HLS and DASH.
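
To make the M3U8 index file described for HLS more concrete, the following minimal sketch parses a hand-written media playlist and sums the segment durations. The playlist text, the segment names, and the helper name `parse_media_playlist` are illustrative assumptions. Real playlists carry more tags than this sketch handles.

```python
# Minimal sketch: read segment URIs and durations from an HLS media playlist.
# The playlist below is a made-up example, not output from ApsaraVideo VOD.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXTINF:4.5,
segment2.ts
#EXT-X-ENDLIST
"""


def parse_media_playlist(text):
    """Return a list of (uri, duration_seconds) pairs in playback order."""
    segments = []
    pending_duration = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("#EXTINF:"):
            # "#EXTINF:<duration>,<optional title>" describes the next URI line.
            pending_duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif line and not line.startswith("#") and pending_duration is not None:
            segments.append((line, pending_duration))
            pending_duration = None
    return segments


segments = parse_media_playlist(SAMPLE_PLAYLIST)
total_seconds = sum(duration for _, duration in segments)
print(len(segments), "segments,", total_seconds, "seconds in total")
```

Because each TS fragment is listed separately, a player can start playback after it fetches the small index file and the first fragment, instead of loading one large file header.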

Codec

A codec is a program or device that compresses (encodes) or decompresses (decodes) digital videos. In most cases, the compression is lossy. The term codec also refers to the compression technology that is used to convert a video from one format to another. The following section describes common codecs:

H.26X: developed by the International Telecommunication Union (ITU). This family includes H.261, H.262, H.263, H.264, and H.265.
- H.261: used in earlier video conferencing and video calls.
- H.262: supports the encoding and transmission of standard definition (SD) videos and is mainly used in fields such as broadcasting, DVD, and digital TV.
- H.263: used in video conferencing, video calls, and online videos.
- H.264: also known as MPEG-4 Part 10, or MPEG-4 Advanced Video Coding (MPEG-4 AVC). It is a video compression standard and a format widely used for recording, compressing, and publishing high-definition videos.
- H.265: also known as High Efficiency Video Coding (HEVC). H.265 is a video compression standard that is the successor to H.264. Compared with H.264, H.265 provides about twice the compression ratio at the same video quality, which reduces the bitrate by about 50% without compromising the video quality. H.265 supports resolutions of up to 8192 × 4320, which covers 4K and 8K ultra-high definition (UHD) videos.
MPEG: developed by the Moving Picture Experts Group (MPEG), an alliance of working groups established by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). MPEG includes the following video compression standards:
- MPEG-1 Part 2: used in VCD and some online videos. The video quality provided by this standard is similar to that of VHS.
- MPEG-2 Part 2: the equivalent of H.262 and is used in DVD, SVCD, and most digital video broadcasting systems and cable distribution systems.
- MPEG-4 Part 2: used in network transmission, broadcasting, and media storage. This standard provides better compression performance than MPEG-2 and H.263 V1.
- MPEG-4 Part 10: provides the same technical content as H.264 of ITU-T. ITU-T and MPEG work together to develop the H.264 or MPEG-4 AVC standard. ITU-T named this standard H.264, while ISO and IEC named it MPEG-4 AVC.
Audio Video Coding Standard (AVS): a series of digital audio and video coding standards formulated by the Audio Video Coding Standards Workgroup of China. Two generations of AVS standards have been formulated.
- The first-generation AVS standard includes "Information Technology: Advanced Coding of Audio and Video, Part 2: Video" (AVS1) and "Information Technology: Advanced Coding of Audio and Video, Part 16: Radio Television Video" (AVS+). AVS+ provides the same compression efficiency as H.264 or MPEG-4 AVC High Profile.
- The second-generation AVS (AVS2) standard is designed for efficient compression of UHD (4K or higher) and high dynamic range (HDR) videos. The compression efficiency of AVS2 is twice that of AVS+ and H.264 or MPEG-4 AVC, and surpasses that of H.265 or HEVC.
Other codecs such as VP8 and VP9 that are developed by Google and RealVideo that is launched by RealNetworks are rarely used for online videos.

Before you select a codec, consider the compatibility of the codec with playback clients such as apps and web browsers. We recommend that you use the most common and widely supported codecs. ApsaraVideo VOD supports the following video codecs: H.264 or MPEG-4 AVC, and H.265 or HEVC. H.264 or MPEG-4 AVC is used as the default video codec. ApsaraVideo VOD also supports the following audio codecs: MP3, AAC, VORBIS, and FLAC. MP3 is used as the default audio codec.

Transcoding

Video transcoding refers to the process of converting a compressed video stream to another video stream to adapt to different network bandwidths, terminal processing capabilities, and user requirements. Transcoding is a process in which decoding and encoding are performed. Streams before and after transcoding may use the same or different video codecs.

Container format conversion

Container format conversion refers to the process of converting a compressed audio or video stream from one container format to another without encoding or decoding. For example, convert an AVI video to MP4. The compressed video and audio streams are obtained from the file in one container format and then packaged into a file in another container format. Container format conversion provides the following features:

Fast processing. Decoding and encoding audio and video files are complex and occupy most of the transcoding time. Container format conversion does not require encoding or decoding. This reduces the processing time.
Lossless audio or video quality. Container format conversion does not recompress audio or video streams because encoding and decoding are not involved.

The resolution and bitrate of the converted file are nearly the same as those of the original file. Therefore, the converted file is considered to be played in the original quality.
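
As an illustration of container format conversion outside ApsaraVideo VOD, the following sketch builds a command for the open-source FFmpeg tool, whose `-c copy` option copies the compressed streams into a new container without decoding or encoding. The file names and the helper name `build_remux_command` are hypothetical, and the sketch only builds the command. The conversion succeeds only if the target container supports the codecs of the source streams.

```python
# Sketch: build an FFmpeg command that repackages streams without re-encoding.
# Assumes the FFmpeg command-line tool; this is not an ApsaraVideo VOD API.
def build_remux_command(source_path, target_path):
    """Return the argument list for a stream-copy container conversion."""
    return [
        "ffmpeg",
        "-i", source_path,   # input file in the original container format
        "-c", "copy",        # copy audio and video streams as-is (no transcoding)
        target_path,         # the output extension selects the new container
    ]


command = build_remux_command("input.avi", "output.mp4")
print(" ".join(command))
# To run it where FFmpeg is installed: subprocess.run(command, check=True)
```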

Bitrate

Bitrate refers to the data traffic that video files use per unit of time. It is the most important item for image quality control in video encoding. Bitrate is measured in bits per second (bit/s), and often used in the units of Kbit/s and Mbit/s. For videos that have the same resolution, a higher bitrate indicates a smaller compression ratio and higher video quality. A higher bitrate indicates a higher sampling rate per unit of time and a higher data stream accuracy. Therefore, the quality and definition of the processed video file are close to those of the original file. The processed file requires excellent decoding capabilities from the playback device.

The higher the bitrate, the larger the file. You can calculate the file size based on the following formula: File size = Time × Bitrate/8. For example, if a 60-minute 720p online video file has a bitrate of 1 Mbit/s, the size of the file is calculated based on the following formula: 3,600 seconds × 1 Mbit/s/8 = 450 MB.
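
The file size formula above can be checked with a few lines of Python. This is a minimal sketch, and the function name `file_size_megabytes` is an illustrative choice. Like the formula, it ignores container overhead and audio tracks.

```python
def file_size_megabytes(duration_seconds, bitrate_mbit_per_s):
    """File size = Time × Bitrate / 8, with the result in megabytes (MB)."""
    # Dividing by 8 converts megabits to megabytes (8 bits per byte).
    return duration_seconds * bitrate_mbit_per_s / 8


# A 60-minute video at 1 Mbit/s, as in the example above.
size_mb = file_size_megabytes(60 * 60, 1)
print(size_mb)  # 450.0
```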

Resolution

Resolution is the number of pixels contained in each dimension. For example, a 1,280 × 720 video is 1,280 pixels wide and 720 pixels high. Video resolution determines the amount of detail in your video and how realistic and clear the video appears. A video that has a higher resolution contains more pixels and has clearer images.

Resolution is a key factor that determines the bitrate. Videos that have different resolutions use different bitrates. In most cases, higher resolutions require higher bitrates. Each resolution corresponds to a recommended range of bitrates. If, for a given resolution, you specify a bitrate that is lower than the lower limit of the recommended range, the video quality is poor. If you specify a bitrate that is higher than the upper limit of the recommended range, the video occupies more storage space and consumes more traffic to load, but the video quality is not significantly improved.

Frame rate

The frame rate measures the number of frames that are displayed per unit of time in a video, that is, the number of times the image is refreshed per second. The unit of frame rate is frames per second (FPS) or Hz.

The higher the frame rate, the smoother and more lifelike the video appears. In most cases, 25 to 30 FPS is sufficient. 60 FPS can deliver an immersive and realistic playback experience. If you increase the frame rate to more than 75 FPS, the improvement in playback experience is less significant. If you specify a frame rate higher than the refresh rate of your monitor, the monitor cannot properly display the frames and the processing potential of your graphics card is wasted. Higher frame rates at the same resolution require greater processing capabilities from the graphics card.

GOP

A Group of Pictures (GOP) is a group of continuous images in an MPEG-encoded video or video stream. A GOP starts with an I-frame and ends with the frame immediately before the next I-frame. A GOP contains the following image types:

I-frame (intra coded picture): the keyframe. An I-frame contains all information that is required to produce the picture for that frame. It is independently decoded and can be regarded as a static picture. The first frame in the video sequence is always an I-frame, and each GOP starts with an I-frame.
P-frame (predictive coded picture): A P-frame is encoded with reference to a preceding I-frame or P-frame and contains motion-compensated difference information relative to that frame. During decoding, the difference defined by the current P-frame is superimposed with the previously cached image to generate the final image. P-frames occupy fewer data bits compared with I-frames. However, P-frames are sensitive to transmission errors because they depend on the previous I-frame or P-frame.
B-frame (bidirectionally predictive coded picture): A B-frame contains motion-compensated difference information relative to the previous and subsequent frames. During decoding, the data of the current B-frame is superimposed with the previously cached image and the decoded subsequent image to generate the final image. B-frames provide a high compression ratio and require high decoding performance.

The GOP value indicates the interval of keyframes, which is the distance between two Instantaneous Decoding Refresh (IDR) frames or the maximum number of frames in a frame group. At least one keyframe is required for each second of video. More keyframes improve video quality but increase bandwidth consumption and network loads. The interval is calculated by dividing the GOP value (number of frames) by the frame rate. For example, the default GOP value of ApsaraVideo VOD is 250 frames and the frame rate is 25 FPS. The time interval is calculated based on the following formula: 250/25 = 10 seconds.
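
The keyframe interval calculation above is a single division. The following minimal sketch reproduces the example and shows the inverse calculation. The function names are illustrative.

```python
def keyframe_interval_seconds(gop_frames, frame_rate_fps):
    """Interval between keyframes = GOP value (frames) / frame rate (FPS)."""
    return gop_frames / frame_rate_fps


def gop_for_interval(interval_seconds, frame_rate_fps):
    """GOP value (frames) that yields the requested keyframe interval."""
    return round(interval_seconds * frame_rate_fps)


interval = keyframe_interval_seconds(250, 25)
print(interval)  # 10.0

# A 2-second keyframe interval at 25 FPS corresponds to a GOP value of 50.
print(gop_for_interval(2, 25))  # 50
```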

The GOP value must be within an appropriate range to achieve a balance among the video quality, file size (bandwidth consumption), and seeking effect (speed of response to the drag and fast-forward operations).

Increasing the GOP value reduces the file size. However, if the GOP value is too large, the last frames of a GOP are distorted, and the video quality is reduced.
The GOP value is also a key factor in determining the speed of response to seeking in a video. During seeking, the player locates the closest keyframe before the specified position. A larger GOP value indicates a longer distance between the specified position and the closest keyframe, which results in more predictive frames that need to be decoded. In this case, the loading time is extended and the seeking operation requires a long period of time to complete.
Encoding P-frames and B-frames is more complex compared with encoding I-frames. A large GOP value results in many P-frames and B-frames. This decreases the encoding efficiency.
However, if the GOP value is too small, the bitrate of the video must be increased to ensure that the image quality is not reduced. This process increases bandwidth consumption.
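
The effect of the GOP value on seeking can be sketched as follows. The sketch assumes fixed-length GOPs with a keyframe at every multiple of the GOP value and ignores B-frame reordering. The function name `frames_to_decode_for_seek` is illustrative.

```python
def frames_to_decode_for_seek(target_frame, gop_frames):
    """Return (keyframe_index, frames_decoded) for a seek to target_frame.

    Assumes a keyframe at frame 0 and then every gop_frames frames.
    """
    # The player starts at the closest keyframe at or before the target.
    keyframe_index = (target_frame // gop_frames) * gop_frames
    # It must decode every frame from that keyframe up to the target.
    frames_decoded = target_frame - keyframe_index + 1
    return keyframe_index, frames_decoded


# Seek to frame 1249 (just before the 50-second mark at 25 FPS).
for gop in (25, 250):
    keyframe, decoded = frames_to_decode_for_seek(1249, gop)
    print(f"GOP {gop}: start at frame {keyframe}, decode {decoded} frames")
```

With a GOP value of 25, the player decodes 25 frames to reach the target. With a GOP value of 250, it decodes 250 frames for the same seek.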

Scan mode

Progressive scanning: All lines of each frame are drawn in sequence to fill the screen.
Interlaced scanning: Each frame is split into two fields. The even scan lines are drawn on the screen first, and then the odd scan lines are drawn. The even field and the odd field together make up one video frame.
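
The relationship between a frame and its two fields can be sketched with a list of scan lines. This is an illustration only: scan lines are counted from 0, the frame has an even number of lines, and the function names are hypothetical.

```python
def split_into_fields(frame_rows):
    """Split a frame (a list of scan lines) into its even and odd fields."""
    even_field = frame_rows[0::2]  # scan lines 0, 2, 4, ...
    odd_field = frame_rows[1::2]   # scan lines 1, 3, 5, ...
    return even_field, odd_field


def weave_fields(even_field, odd_field):
    """Interleave two fields of equal length back into one full frame."""
    frame_rows = []
    for even_row, odd_row in zip(even_field, odd_field):
        frame_rows.append(even_row)
        frame_rows.append(odd_row)
    return frame_rows


frame = ["line0", "line1", "line2", "line3", "line4", "line5"]
even_field, odd_field = split_into_fields(frame)
print(even_field)  # ['line0', 'line2', 'line4']
print(odd_field)   # ['line1', 'line3', 'line5']
print(weave_fields(even_field, odd_field) == frame)  # True
```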

IDR frame alignment

An IDR frame is a special type of I-frame. P-frames and B-frames after a normal I-frame can reference frames before that I-frame. However, no frame after an IDR frame can reference frames before the IDR frame. To control the encoding and decoding processes, the first I-frame of a frame sequence is specified as an IDR frame.

An IDR frame tells the decoder to immediately refresh the reference frame buffer. This way, errors in the frames before the IDR frame are not propagated to frames after the IDR frame. The IDR frame and the frames after it are coded as a new sequence. An IDR frame can also be used for random access, whereas normal I-frames do not support random access. Most players allow users to seek (drag the slider) to a specific position on the progress bar. In this case, the most efficient playback method is to play the video from the IDR frame nearest to the specified position. This prevents complicated reverse parsing because no frame after the IDR frame references frames before it.

When you transcode a video to multiple videos in different bitrates, you can enable IDR frame alignment. This way, the IDR frames of all output videos are accurately aligned based on the time and frame content. Then, a player can smoothly switch among the videos in different bitrates without obvious lags.
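
The idea of IDR frame alignment can be expressed as a simple check: every output video must have its IDR frames at the same timestamps. The following sketch uses hypothetical rendition names and timestamps. How the timestamps are obtained from real files is outside the scope of the sketch.

```python
def idr_frames_aligned(renditions):
    """Return True if all renditions have IDR frames at identical timestamps.

    renditions maps a rendition name to a sorted list of IDR frame
    timestamps in seconds.
    """
    timestamp_lists = list(renditions.values())
    return all(timestamps == timestamp_lists[0] for timestamps in timestamp_lists)


# Hypothetical IDR timestamps for two bitrate ladders.
aligned = {"720p": [0.0, 10.0, 20.0], "480p": [0.0, 10.0, 20.0]}
misaligned = {"720p": [0.0, 10.0, 20.0], "480p": [0.0, 8.0, 16.0]}

print(idr_frames_aligned(aligned))     # True
print(idr_frames_aligned(misaligned))  # False
```

When the timestamps match, a player can switch to another bitrate at any shared IDR frame and continue decoding without needing frames from the stream it left.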

Profile

A profile defines a set of capabilities that focus on a specific class of applications. H.264 provides the following profiles:

Baseline Profile: uses I-frames and P-frames and supports only progressive videos and context-adaptive variable-length coding (CAVLC). This profile is used in low-end applications or applications that require additional fault tolerance, such as instant messaging applications that support video calls on mobile devices.
Main Profile: uses I-frames, P-frames, and B-frames, and supports progressive and interlaced videos. It also supports CAVLC and context-adaptive binary arithmetic coding (CABAC). This profile is used in mainstream consumer electronic products, such as MP4 players, portable video players, PSPs, and iPods that have relatively low decoding capabilities.
High Profile: supports 8 × 8 intra-prediction, custom quantization, lossless video coding, and more YUV formats, such as 4:4:4. This profile also supports the features of the Main Profile. It is used in broadcast and disc storage applications, especially in high-definition television applications. For example, the Blu-ray Disc storage format adopts this profile.

Bit rate

The bit rate refers to the number of bits transmitted per second. The unit is bits per second (bit/s). A higher bit rate indicates a larger amount of data that is transmitted. In the video field, the bit rate is equivalent to the bitrate: it indicates the number of bits that are required to represent one second of coded (compressed) audio or video data. The bit is the smallest binary unit and has a value of 0 or 1. As with bitrate, a higher bit rate indicates better audio or video quality and a larger file size, and a lower bit rate indicates a smaller file size.

Bit rate control method

Bit rate control methods refer to the methods that are used to control the bit rate of a coded stream. The following items describe the bit rate control methods:

Variable bit rate (VBR): When this method is used, the bit rate is not fixed. When you compress a video or audio file, the video or audio compression software determines the bit rate based on the complexity of the video or audio data. This method balances the quality and the file size.
Constant bit rate (CBR): When this method is used, the bit rate is fixed for the coded stream. CBR-compressed files are larger in size compared with VBR-compressed and ABR-compressed files. However, the quality of CBR-compressed files is not significantly better compared with the quality of VBR-compressed or ABR-compressed files.
Average bit rate (ABR): This method is a variation of VBR with interpolation parameters. LAME created this method to solve the size and quality mismatch of CBR-compressed files and the unpredictable file sizes of VBR-compressed files. For a given file size, ABR divides a stream into parts in units of 50 frames (at approximately 30 frames per second), and uses relatively low bit rates to code the less complex parts and relatively high bit rates to code the more complex parts. ABR can be regarded as a compromise between VBR and CBR.
With ABR, the bit rate reaches the specified value when averaged over a specific time range, although the peak bit rate in some parts can exceed the specified value. ABR therefore keeps the average output bit rate within an appropriate range and codes videos within this range based on their complexity. Alibaba Cloud uses ABR as the default bit rate control method.
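
The principle behind ABR, spending more bits on complex parts while holding the average constant, can be illustrated with a toy allocation. This sketch is not the algorithm that LAME or Alibaba Cloud uses. The complexity scores, the proportional rule, and the function name `allocate_abr_bitrates` are assumptions made for illustration.

```python
def allocate_abr_bitrates(complexities, target_average_kbps):
    """Give each part a bit rate proportional to its complexity score.

    The scaling keeps the mean of the returned bit rates equal to
    target_average_kbps.
    """
    mean_complexity = sum(complexities) / len(complexities)
    return [target_average_kbps * c / mean_complexity for c in complexities]


# Hypothetical complexity scores for four parts of a stream.
bitrates = allocate_abr_bitrates([1.0, 3.0, 2.0, 2.0], 1000)
average = sum(bitrates) / len(bitrates)

print(bitrates)  # [500.0, 1500.0, 1000.0, 1000.0]
print(average)   # 1000.0
```

The second part peaks at 1,500 Kbit/s, above the specified 1,000 Kbit/s, while the average over all parts stays at the specified value.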

Encoding format

For more information, see the description of codecs in the preceding section. Audio codecs are classified into lossy codecs and lossless codecs. Based on the sampling theory, a digital audio signal can only approximate the natural analog signal, so in the strict sense every audio codec is lossy. By convention in the computer field, pulse-code modulation (PCM) is treated as lossless because it achieves the highest fidelity among all audio codecs. Common audio codecs on the Internet, such as MP3 and AAC, are all lossy codecs.

Sampling rate

The sampling rate, or sampling frequency, defines the number of samples that are extracted from a continuous-time (analog) signal every second to form a discrete-time (digital) signal. Unit: Hz. A higher sampling rate provides a more realistic and natural sound.

Bitrate

For more information, see the description of the bit rate in the preceding section.

Sound channel

A sound channel refers to the independent audio signal that is collected or played when the sound is recorded or played in different spatial positions. The number of sound channels refers to the number of sound sources during recording or the number of speakers during playback.

UTC (ISO 8601 standard time format)

Coordinated Universal Time (UTC) is also known as the world unified time, world standard time, and international coordinated time. The English name abbreviates to CUT and the French name to TUC, so the acronym UTC is used as a compromise. UTC is a time metering system based on atomic seconds and is kept close to universal time. The Chinese mainland adopts the standard of Data Elements and Inter Change Formats-Information Interchange-Representation of Dates and Times (ISO 8601:1988 or GB/T 7408-1994), and refers to UTC as the international coordinated time.

By default, all time fields that are returned and all time parameters in API requests in ApsaraVideo VOD are in UTC. The time is displayed in the YYYY-MM-DDThh:mm:ssZ format and follows the ISO 8601 standard. For example, 2017-01-11T12:00:00Z indicates 20:00:00 on January 11, 2017 in UTC+8 (China Standard Time), because China Standard Time is 8 hours ahead of UTC.
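
The conversion in the example above can be reproduced with the Python standard library. This is a minimal sketch. The helper name `parse_utc_timestamp` is illustrative, and the sketch accepts only the exact YYYY-MM-DDThh:mm:ssZ form.

```python
from datetime import datetime, timedelta, timezone

UTC_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
CHINA_STANDARD_TIME = timezone(timedelta(hours=8))  # UTC+8


def parse_utc_timestamp(text):
    """Parse a YYYY-MM-DDThh:mm:ssZ string into a timezone-aware UTC datetime."""
    return datetime.strptime(text, UTC_FORMAT).replace(tzinfo=timezone.utc)


utc_time = parse_utc_timestamp("2017-01-11T12:00:00Z")
local_time = utc_time.astimezone(CHINA_STANDARD_TIME)

print(local_time.isoformat())         # 2017-01-11T20:00:00+08:00
print(utc_time.strftime(UTC_FORMAT))  # 2017-01-11T12:00:00Z
```

To build a time parameter for a request, convert the local time to UTC first and then format it with the same pattern.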