ffmpeg audio decoding

1. ffmpeg video decoding I
2. ffmpeg video decoding II
3. ffmpeg audio decoding I


The video decoding process was introduced earlier. Starting with this article we turn to audio decoding. As with video, there will be two articles: this one, which parses the raw data with a parser, and a second that follows the conventional demuxing process.

Some basic knowledge

  1. sample_rate:

    That is, the sampling frequency: the number of samples taken per second from a continuous signal to form a discrete signal, expressed in Hertz (Hz). The reciprocal of the sampling frequency is the sampling period (or sampling time), i.e. the time interval between samples. Put simply, the sampling frequency is how many signal samples the computer collects per second.

  2. Number of samples (frame_size):

    The number of samples per channel in one frame of audio.

  3. Sampling format (sample_fmt):

    Storage format of an audio sample.

    A sample can be represented by an 8-bit unsigned integer, a 16-bit signed integer, a 32-bit signed integer, a 64-bit signed integer, a single-precision float, or a double-precision float. However, 24-bit signed integers are not used, because these formats map to native C types, and C has no 24-bit type.

    We can use the following command to view the formats supported by ffmpeg:

    ffplay -sample_fmts

    Of course, you can also look at the source code. Here is the sample_fmt_info table (entries of type SampleFmtInfo, which also holds the information for AVSampleFormat conversion):

     static const SampleFmtInfo sample_fmt_info[AV_SAMPLE_FMT_NB] = {
     [AV_SAMPLE_FMT_U8]   = { .name =   "u8", .bits =  8, .planar = 0, .altform = AV_SAMPLE_FMT_U8P  },
     [AV_SAMPLE_FMT_S16]  = { .name =  "s16", .bits = 16, .planar = 0, .altform = AV_SAMPLE_FMT_S16P },
     [AV_SAMPLE_FMT_S32]  = { .name =  "s32", .bits = 32, .planar = 0, .altform = AV_SAMPLE_FMT_S32P },
     [AV_SAMPLE_FMT_S64]  = { .name =  "s64", .bits = 64, .planar = 0, .altform = AV_SAMPLE_FMT_S64P },
     [AV_SAMPLE_FMT_FLT]  = { .name =  "flt", .bits = 32, .planar = 0, .altform = AV_SAMPLE_FMT_FLTP },
     [AV_SAMPLE_FMT_DBL]  = { .name =  "dbl", .bits = 64, .planar = 0, .altform = AV_SAMPLE_FMT_DBLP },
     [AV_SAMPLE_FMT_U8P]  = { .name =  "u8p", .bits =  8, .planar = 1, .altform = AV_SAMPLE_FMT_U8   },
     [AV_SAMPLE_FMT_S16P] = { .name = "s16p", .bits = 16, .planar = 1, .altform = AV_SAMPLE_FMT_S16  },
     [AV_SAMPLE_FMT_S32P] = { .name = "s32p", .bits = 32, .planar = 1, .altform = AV_SAMPLE_FMT_S32  },
     [AV_SAMPLE_FMT_S64P] = { .name = "s64p", .bits = 64, .planar = 1, .altform = AV_SAMPLE_FMT_S64  },
     [AV_SAMPLE_FMT_FLTP] = { .name = "fltp", .bits = 32, .planar = 1, .altform = AV_SAMPLE_FMT_FLT  },
     [AV_SAMPLE_FMT_DBLP] = { .name = "dblp", .bits = 64, .planar = 1, .altform = AV_SAMPLE_FMT_DBL  },
     };

    Here name is the format name, bits is the number of bits a sample occupies, planar indicates the storage layout, and altform is the corresponding format under the other layout (for example, u8 has planar=0; its planar=1 counterpart is u8p).

    Samples can be stored in two ways: planar and packed. In planar storage, each channel occupies its own data plane; in packed storage, the samples of all channels are interleaved in a single plane.
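    To make the two layouts concrete, here is a small standalone sketch (plain C, not the ffmpeg API; the sample values are made up) that stores the same stereo audio both ways and reads each sample through both addressing schemes:

    ```c
    #include <stdio.h>

    int main(void) {
        /* Packed: one plane, channels interleaved: L0 R0 L1 R1 L2 R2 */
        short packed[6] = { 10, 20, 11, 21, 12, 22 };
        /* Planar: each channel in its own plane (in ffmpeg: frame->data[ch]) */
        short left[3]  = { 10, 11, 12 };
        short right[3] = { 20, 21, 22 };
        short *planes[2] = { left, right };

        int channels = 2, nb_samples = 3;
        for (int i = 0; i < nb_samples; i++) {
            for (int ch = 0; ch < channels; ch++) {
                /* The same sample, reached by two addressing schemes */
                short p = packed[i * channels + ch];
                short q = planes[ch][i];
                printf("sample %d ch %d: packed=%d planar=%d\n", i, ch, p, q);
            }
        }
        return 0;
    }
    ```

    Both loops visit identical values; the decode() function later in this article uses exactly the planar addressing (`frame->data[ch] + data_size * i`) while writing the bytes out interleaved, i.e. packed.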

  4. Channel information:

    channels is the number of audio channels (1, 2, 3, ...).
    channel_layout is the channel layout type, such as mono or stereo.

    For example, in an 8-bit mono sound file each sample is one unsigned byte (00H-FFH); in an 8-bit two-channel stereo file the samples are stored in 16-bit units, with the high 8 bits for the left channel and the low 8 bits for the right channel.

    A stereo file holds twice as many samples, so it is roughly twice as large.

Audio information

Suppose the audio has sample format fltp, a sampling rate of 44100 Hz, and 2 channels.
av_get_bytes_per_sample(AV_SAMPLE_FMT_FLTP) == 4;

  1. AAC (nb_samples and frame_size = 1024)
    Size of one frame of audio:
    4 * 2 * 1024 = 8192 (bytes)
    Playback time of one frame:
    1024 * 1000000 / 44100 ≈ 23219 us ≈ 23.22 ms

  2. MP3 (nb_samples and frame_size = 1152)
    Size of one frame of audio:
    4 * 2 * 1152 = 9216 (bytes)
    Playback time of one frame:
    1152 * 1000000 / 44100 ≈ 26122 us ≈ 26.12 ms

flow chart

The code flow is shown in the flow chart. Let's go over what some of the functions do.

  1. av_parser_init
    Instantiates a parser for the given decoder; we use it later when parsing the data.
  2. av_parser_parse2
    The raw data read from the input file (without going through ffmpeg's demuxing API) cannot be fed to the decoder directly. This function uses the parser instantiated above to split the raw data into frames, preparing it for decoding.
  3. avcodec_send_packet
    Sends the parsed data to the decoder.
  4. avcodec_receive_frame
    Retrieves the decoded data.

Source code

#pragma once

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern "C" {
#include "libavcodec/avcodec.h"
}

//Input buffer size (caches 5 frames of data, 5 * 8192 bytes)
#define AUDIO_INBUF_SIZE 40960
//Refill the input buffer when fewer bytes than this remain
#define AUDIO_REFILL_THRESH 4096
	//The audio file decoded by this code has the following format:
	//AAC, two channels (2), FLTP samples (32 bits, 4 bytes)
	//For this AAC file, frame_size and nb_samples are both 1024
	//Byte size of one frame of audio:
	//1024 * 2 * 4 = 8192 bytes

using namespace std;

#define INPUT_FILE_NAME "lh_online.aac"
#define OUTPUT_FILE_NAME "lh_online.pcm"

static int get_format_from_sample_fmt(const char** fmt,
	enum AVSampleFormat sample_fmt)
{
	int i;
	struct sample_fmt_entry {
		enum AVSampleFormat sample_fmt; const char* fmt_be, * fmt_le;
	} sample_fmt_entries[] = {
		{ AV_SAMPLE_FMT_U8,  "u8",    "u8"    },
		{ AV_SAMPLE_FMT_S16, "s16be", "s16le" },
		{ AV_SAMPLE_FMT_S32, "s32be", "s32le" },
		{ AV_SAMPLE_FMT_FLT, "f32be", "f32le" },
		{ AV_SAMPLE_FMT_DBL, "f64be", "f64le" },
	};
	*fmt = NULL;

	for (i = 0; i < FF_ARRAY_ELEMS(sample_fmt_entries); i++) {
		struct sample_fmt_entry* entry = &sample_fmt_entries[i];
		if (sample_fmt == entry->sample_fmt) {
			//Pick the big- or little-endian name depending on the host
			*fmt = AV_NE(entry->fmt_be, entry->fmt_le);
			return 0;
		}
	}

	av_log(NULL, AV_LOG_ERROR, "sample format %s is not supported as output format\n", av_get_sample_fmt_name(sample_fmt));
	return -1;
}

static void decode(AVCodecContext* dec_ctx, AVFrame* frame, AVPacket* pkt,
	FILE* ofile)
{
	int i, ch;
	int ret, data_size;

	//Send the packet to the decoder
	ret = avcodec_send_packet(dec_ctx, pkt);
	if (ret < 0) {
		av_log(NULL, AV_LOG_ERROR, "Error sending packet to decoder.\n");
		exit(1);
	}

	while (ret >= 0) {
		ret = avcodec_receive_frame(dec_ctx, frame);
		if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
			return;
		else if (ret < 0) {
			av_log(NULL, AV_LOG_ERROR, "Error during decoding.\n");
			exit(1);
		}
		printf("frame_number: %d \n", dec_ctx->frame_number);
		//Get the number of bytes one channel of one sample occupies
		data_size = av_get_bytes_per_sample(dec_ctx->sample_fmt);
		if (data_size < 0) {
			av_log(NULL, AV_LOG_ERROR, "Failed to calculate data size.\n");
			exit(1);
		}
		//Traverse the samples
		for (i = 0; i < frame->nb_samples; i++) {
			//Traverse the channels, writing the planar data out interleaved
			for (ch = 0; ch < dec_ctx->channels; ch++) {
				fwrite(frame->data[ch] + data_size * i, 1, data_size, ofile);
			}
		}
	}
}

int main(int argc, char* argv[])
{
	const AVCodec* codec;
	AVCodecParserContext* parser;
	AVCodecContext* c = NULL;
	FILE* ifile, * ofile;
	AVFrame* frame;
	AVPacket* pkt;
	uint8_t inbuf[AUDIO_INBUF_SIZE + AV_INPUT_BUFFER_PADDING_SIZE];
	uint8_t* data;
	size_t   data_size;
	int ret, len;
	enum AVSampleFormat sfmt;
	const char* fmt;

	//Zero the padding at the end of the input buffer
	memset(inbuf + AUDIO_INBUF_SIZE, 0, AV_INPUT_BUFFER_PADDING_SIZE);

	//Get the decoder (the file to be read here is AAC)
	codec = avcodec_find_decoder(AV_CODEC_ID_AAC);
	if (!codec) {
		av_log(NULL, AV_LOG_ERROR, "Codec not found.\n");
		exit(1);
	}
	//Initialize the parser
	parser = av_parser_init(codec->id);
	if (!parser) {
		av_log(NULL, AV_LOG_ERROR, "Parser not found.\n");
		exit(1);
	}
	//Allocate the codec context
	c = avcodec_alloc_context3(codec);
	if (!c) {
		av_log(NULL, AV_LOG_ERROR, "Could not allocate audio codec context.\n");
		exit(1);
	}
	//Open the decoder
	if (avcodec_open2(c, codec, NULL) < 0) {
		av_log(NULL, AV_LOG_ERROR, "Could not open codec.\n");
		exit(1);
	}
	//Allocate AVPacket
	pkt = av_packet_alloc();
	if (!pkt) {
		exit(1);
	}
	//Allocate AVFrame
	frame = av_frame_alloc();
	if (!frame) {
		exit(1);
	}
	//Open the input file
	ifile = fopen(INPUT_FILE_NAME, "rb");
	if (!ifile) {
		av_log(NULL, AV_LOG_ERROR, "Could not open %s.\n", INPUT_FILE_NAME);
		exit(1);
	}
	//Open the output file
	ofile = fopen(OUTPUT_FILE_NAME, "wb+");
	if (!ofile) {
		av_log(NULL, AV_LOG_ERROR, "Could not open %s.\n", OUTPUT_FILE_NAME);
		exit(1);
	}

	//Read data from the input stream ifile into the array inbuf
	data = inbuf;
	data_size = fread(inbuf, 1, AUDIO_INBUF_SIZE, ifile);

	while (data_size > 0) {
		//Use the parser to split the raw data into frames
		ret = av_parser_parse2(parser, c, &pkt->data, &pkt->size,
			data, data_size,
			AV_NOPTS_VALUE, AV_NOPTS_VALUE, 0);
		if (ret < 0) {
			fprintf(stderr, "Error while parsing\n");
			exit(1);
		}
		//Advance past the bytes the parser consumed
		data += ret;
		data_size -= ret;
		//Send a complete frame to be decoded
		if (pkt->size)
			decode(c, frame, pkt, ofile);

		//If less data remains in the buffer than one frame of audio,
		//move it to the front and refill from the file, then keep decoding
		if (data_size < AUDIO_REFILL_THRESH) {
			memmove(inbuf, data, data_size);
			data = inbuf;
			len = fread(data + data_size, 1,
				AUDIO_INBUF_SIZE - data_size, ifile);
			if (len > 0)
				data_size += len;
		}
	}

	//Flush the decoder
	decode(c, frame, NULL, ofile);

	//At this point the decoding is finished; we will play the audio with ffplay later.
	//The decoded PCM data carries none of this basic information, so we
	//read it from the codec context and print it.
	printf("channels: %d \n", c->channels);
	//Sampling rate
	printf("sample_rate: %d  \n", c->sample_rate);
	//Bytes occupied by one frame of audio
	printf("buffer: %d  \n", av_samples_get_buffer_size(NULL, c->channels, c->frame_size, c->sample_fmt, 1));
	//Sampling format
	sfmt = c->sample_fmt;
	printf("sample_fmt: %s  \n", av_get_sample_fmt_name(sfmt));
	//If the format is planar, map it to the corresponding packed format,
	//because decode() wrote the samples interleaved
	if (av_sample_fmt_is_planar(sfmt)) {
		sfmt = av_get_packed_sample_fmt(sfmt);
	}
	if (get_format_from_sample_fmt(&fmt, sfmt) < 0) {
		av_log(NULL, AV_LOG_ERROR, "Could not get format %s.\n", av_get_sample_fmt_name(sfmt));
		exit(1);
	}

	//Print the playback command
	printf("Play the output audio file with the command:\n"
		"ffplay -f %s -ac %d -ar %d %s\n",
		fmt, c->channels, c->sample_rate, OUTPUT_FILE_NAME);

	//Release resources
	fclose(ifile);
	fclose(ofile);
	avcodec_free_context(&c);
	av_parser_close(parser);
	av_frame_free(&frame);
	av_packet_free(&pkt);

	return 0;
}

This example demonstrates the process of decoding an AAC file into a PCM file.
The printed information is as follows:

As you can see, the decoded file has 2 channels, a sampling rate of 44100 Hz, and sample format fltp, with 1478 frames in total.

Next, use the command to play the decoded audio:

ffplay -f f32le -ac 2 -ar 44100 lh_online.pcm


At this point you should be able to hear the audio, and you're done.

That wraps up parser-based audio decoding.
The next article, as with the video series, will cover a purely API-based approach, which should be much more convenient than this.

Posted by jllydgnt on Sat, 06 Nov 2021 05:14:50 -0700