Playing with FFmpeg C code in Elixir

using NIFs to segfault the BEAM

Posted by Daniel Serrano on March 10, 2019

Foreign Function Interface

A foreign function interface (FFI) is a well-known mechanism whereby a program written in language X interoperates with code written in language Y.

This approach is useful when, while developing in language X, you run into a hard problem that has already been solved by a library written in language Y. One such example is dealing with video. FFmpeg is arguably the most common library for dealing with streams of video, and it is written in C. You can use it to analyse metadata, slice a video, extract only its audio, you name it.

When I started looking into ways to speed up an Elixir app that uses FFmpeg under the hood, I looked into NIFs as a way to avoid shelling out, and with it the problems that shelling out brings, particularly the time it takes. With that said, NIFs aren’t a silver bullet either (far from it), but we’ll go into that later on.

To interact with C code from Elixir using NIFs (Native Implemented Functions), you’ll want to use erl_nif, the C library developed by the Erlang team to marshal Erlang terms into C-land and unmarshal them back out.

Hello World

The “Hello World” example goes something like this. On the C side:

/* helloworld.c */

#include <erl_nif.h>

/* function that returns ERL_NIF_TERM, i.e., an Erlang term in C-land */
static ERL_NIF_TERM hello(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]) {
  return enif_make_string(env, "Hello world, from C!", ERL_NIF_LATIN1);
}

/* declare functions to export (and corresponding arity) */
static ErlNifFunc nif_funcs[] = {
  {"hello", 0, hello}
};

/* actually export the functions previously declared */
ERL_NIF_INIT(Elixir.HelloWorld, nif_funcs, NULL, NULL, NULL, NULL);

Compile it to helloworld.so with:

cc -fPIC -I$ERL_ROOT/include -dynamiclib -undefined dynamic_lookup -o helloworld.so helloworld.c

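Note: $ERL_ROOT is used here to tell your C compiler where Erlang is installed on your machine. I’ve used homebrew to install Erlang, so it is under /usr/local/lib/erlang/erts-10.2.3/. The -dynamiclib and -undefined dynamic_lookup flags are specific to macOS; on Linux you would typically use -shared instead.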

On the Elixir side:

# helloworld.ex

defmodule HelloWorld do
  # when module is loaded, load NIFs
  @on_load :load_nifs

  # call Erlang function to load NIF with specific name
  # in our case ./helloworld, previously compiled
  def load_nifs do
    :erlang.load_nif('./helloworld', 0)
  end

  # leave a default implementation in case NIF is not available
  def hello do
    raise "NIF hello/0 not implemented"
  end
end

Then run it with:

$> iex
iex> c "helloworld.ex"
[HelloWorld]
iex> HelloWorld.hello()
'Hello world, from C!'

And there you have it, a “Hello World” from the other side. The force is strong among us.

FFmpeg

Now, for FFmpeg, it gets a bit trickier, because we need to know what to look for. For this toy example we want the same information we get from running ffprobe, which ships with ffmpeg and prints metadata for a given video file.

$> ffprobe sample.mp4
ffprobe version 4.1 Copyright (c) 2007-2018 the FFmpeg developers
  built with Apple LLVM version 8.0.0 (clang-800.0.42.1)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.1_6 --enable-shared --enable-pthreads --enable-version3 --enable-hardcoded-tables --enable-avresample --cc=clang --host-cflags=-I/System/Library/Frameworks/JavaVM.framework/Versions/Current/Headers/ --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libmp3lame --enable-libopus --enable-librubberband --enable-libsnappy --enable-libtesseract --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librtmp --enable-libspeex --enable-videotoolbox --disable-libjack --disable-indev=jack --enable-libaom --enable-libsoxr
  libavutil      56. 22.100 / 56. 22.100
  libavcodec     58. 35.100 / 58. 35.100
  libavformat    58. 20.100 / 58. 20.100
  libavdevice    58.  5.100 / 58.  5.100
  libavfilter     7. 40.101 /  7. 40.101
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  3.100 /  5.  3.100
  libswresample   3.  3.100 /  3.  3.100
  libpostproc    55.  3.100 / 55.  3.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'sample.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isomavc1
    creation_time   : 2012-03-13T08:58:06.000000Z
    encoder         : HandBrake 0.9.6 2012022800
  Duration: 00:00:10.03, start: 0.000000, bitrate: 629 kb/s
    Chapter #0:0: start 0.000000, end 10.000000
    Metadata:
      title           : Chapter 1
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(tv, smpte170m/smpte170m/bt709), 320x176, 300 kb/s, 25 fps, 25 tbr, 90k tbn, 180k tbc (default)
    Metadata:
      creation_time   : 2012-03-13T08:58:06.000000Z
      encoder         : JVT/AVC Coding
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 160 kb/s (default)
    Metadata:
      creation_time   : 2012-03-13T08:58:06.000000Z
    Stream #0:2(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 160 kb/s
    Metadata:
      creation_time   : 2012-03-13T08:58:06.000000Z
    Stream #0:3(und): Data: bin_data (text / 0x74786574), 0 kb/s
    Metadata:
      creation_time   : 2012-03-13T08:58:06.000000Z

Note: The sample video and accompanying code can be found here.

At the bottom there, you can see that we have 4 streams (#0:0 through #0:3). We’re going to try and get that same information by calling the C code directly, instead of shelling out to call ffprobe itself.

So let’s start with our Elixir code and get that out of the way. The hardest part is going to happen next, with the C code!

defmodule FFbindings do
  @on_load :load_nifs

  def load_nifs do
    :erlang.load_nif('./ffbindings', 0)
  end

  def file_info(path) when is_binary(path) do
    path
    |> String.to_charlist()
    |> get_file_info()
  end

  def file_info(path) when is_list(path) do
    path
    |> get_file_info()
  end

  def file_info(path), do: raise "invalid type for path: #{inspect(path)}"

  defp get_file_info(_path) do
    raise "NIF get_file_info/1 not implemented"
  end
end

So nothing special here. We accept the file path as either a binary or a charlist and convert it so that we always deal with charlists. We pass that on to our NIF in C-land, which will get us the metadata for the video. By now the interface for the NIF should be pretty obvious: it receives a charlist and returns a map with the video metadata.

The NIF signature on the C side is always the same:

  • ErlNifEnv *env, which represents the environment where we’re hosting our Erlang terms
  • int argc, indicating how many arguments were passed to our NIF from the Elixir side
  • const ERL_NIF_TERM argv[], containing each of the arguments passed to our NIF from the Elixir side

You can see it in action here:

#define MAXBUFLEN 1024

static ERL_NIF_TERM
ffmpeg_get_file_info(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[]) {
  ...

  char path[MAXBUFLEN];
  (void)memset(&path, '\0', sizeof(path));
  enif_get_string(env, argv[0], path, sizeof(path), ERL_NIF_LATIN1);

  ...
}

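First thing we have to do is extract the file path. For that, we reserve a 1024-byte buffer, which caps the path at 1023 characters plus the terminator, and hopefully that should be enough. We then set each of its bytes to '\0', the string terminator in C. This is belt and braces, since enif_get_string() always null-terminates what it writes. All that’s left is actually translating argv[0] (of type ERL_NIF_TERM) into a “C string”. Note that the snippet ignores the return value of enif_get_string(): it is the number of bytes written (including the terminator), 0 if the term is not a string that can be encoded, and negative if the path had to be truncated to fit the buffer.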

The way we translate an Erlang ERL_NIF_TERM representing an Erlang charlist to a char* is using the erl_nif.h function enif_get_string():

  • pass it the env (same as before)
  • the ERL_NIF_TERM representing an Erlang charlist, argv[0]
  • the destination char* variable, path
  • the size of the destination char* variable, sizeof(path)
  • the character encoding to be used (only ERL_NIF_LATIN1 is available in the Erlang release used here; releases from OTP 26 onwards also offer ERL_NIF_UTF8)
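Since enif_get_string() packs three different outcomes into one int, it can help to give them names. What follows is only a sketch in plain C: path_status and classify_get_string() are names I made up for illustration and are not part of erl_nif.h.

```c
/* Hypothetical helper, not part of erl_nif.h: it names the three outcomes
 * that enif_get_string() encodes in its int return value. */
typedef enum {
  PATH_OK,           /* > 0: bytes written, including the trailing '\0' */
  PATH_NOT_A_STRING, /*   0: the term could not be encoded as a string  */
  PATH_TRUNCATED     /* < 0: the buffer was too small, copy was cut off  */
} path_status;

static path_status classify_get_string(int written) {
  if (written > 0) {
    return PATH_OK;
  }
  if (written == 0) {
    return PATH_NOT_A_STRING;
  }
  return PATH_TRUNCATED;
}
```

Inside the NIF you would feed it the result of enif_get_string(env, argv[0], path, sizeof(path), ERL_NIF_LATIN1) and, for anything other than PATH_OK, bail out early (for instance with enif_make_badarg(env)) instead of handing a bogus or truncated path to FFmpeg.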

Note: Follow the entire code here.

Next, after some sanity checks, we’ll want to build the map where we’ll store the video metadata. I’ll skip the validations for the sake of brevity and move right on to extracting the video format:

ERL_NIF_TERM fileinfo;
ERL_NIF_TERM key;
ERL_NIF_TERM val;

fileinfo = enif_make_new_map(env);

/* format */
key = enif_make_string(env, "format", ERL_NIF_LATIN1);
val = enif_make_string(env, av_context->iformat->long_name, ERL_NIF_LATIN1);
enif_make_map_put(env, fileinfo, key, val, &fileinfo);

To unmarshal the file path charlist from Erlang into C, we’ve previously used enif_get_string(). To go the other way, i.e., marshal a C char* into an Erlang charlist, we’ll want to use enif_make_string() instead. It turns out that erl_nif.h follows this convention for pretty much any data type: use enif_get_...() to convert from Erlang terms to C data types, and enif_make_...() to convert back from C to Erlang.

So, back to the code. We’re creating a new Erlang map in C-land, and for that we need to use enif_make_new_map(), which yields an ERL_NIF_TERM (representing a map). Then, we use two other ERL_NIF_TERMs to represent two charlists which we’ll use as the first key and value of our map of video metadata:

  • the key is the Erlang charlist 'format', created with enif_make_string()
  • the value is given by using libavformat (part of FFmpeg) to fetch the format name (present in av_context->iformat->long_name [1] [2])

At this point we would see the following if we were to return fileinfo back to Erlang and end this here:

iex> FFbindings.file_info("sample.mp4")
%{
  'format' => 'QuickTime / MOV'
}

But we’re not done yet. 😈

As in the ffprobe output we saw before, we’d still like to get specific information for each of the streams in the video. For that, we can iterate over av_context->streams and fetch the needed info from each stream.

ERL_NIF_TERM streams[av_context->nb_streams];

int i;
for(i = 0; i < av_context->nb_streams; i++) {
  av_stream = av_context->streams[i];
  ERL_NIF_TERM stream = enif_make_new_map(env);

  /* type */
  key = enif_make_string(env, "type", ERL_NIF_LATIN1);
  val = enif_make_string(env, av_get_media_type_string(av_stream->codecpar->codec_type), ERL_NIF_LATIN1);
  enif_make_map_put(env, stream, key, val, &stream);

  ...
}

You can see it’s pretty much the same thing as before, aside from the fact that we now need to use a somewhat esoteric function from FFmpeg to get the type correctly, but that’s a minor detail. Building the Erlang terms (i.e., charlists) from C char*s is the same. Storing the key-value pair in the new stream map (one for each stream) is also analogous.

ERL_NIF_TERM streams[av_context->nb_streams];

int i;
for(i = 0; i < av_context->nb_streams; i++) {
  ...

  streams[i] = stream;
}

key = enif_make_string(env, "streams", ERL_NIF_LATIN1);
val = enif_make_list_from_array(env, streams, av_context->nb_streams);
enif_make_map_put(env, fileinfo, key, val, &fileinfo);

In the end, we create an Erlang list of streams and add it to the fileinfo metadata map for the video.

In the full code snippet you can also check how you could extract the duration of each stream (with some very questionable C floating point to integer arithmetic), and present that to the user as well.
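For what it’s worth, the duration conversion doesn’t need floating point at all. FFmpeg stores a stream’s duration as a count of ticks in units of the stream’s time base (a numerator/denominator pair), so whole seconds can be had with integer arithmetic. The sketch below is plain C and is not the code from the repository: rational is my stand-in for FFmpeg’s AVRational, and ticks_to_seconds() is a made-up name. In real code FFmpeg’s own av_rescale_q() or av_q2d() would be the first thing to reach for.

```c
#include <stdint.h>

/* Stand-in for FFmpeg's AVRational (hypothetical name for this sketch). */
typedef struct {
  int num;
  int den;
} rational;

/* Convert a duration expressed in time-base ticks to whole seconds,
 * rounding to the nearest second, using only 64-bit integer arithmetic.
 * Returns -1 when the inputs cannot describe a duration, which also
 * covers FFmpeg's "unknown" marker AV_NOPTS_VALUE (a negative value). */
static int64_t ticks_to_seconds(int64_t ticks, rational time_base) {
  if (ticks < 0 || time_base.num <= 0 || time_base.den <= 0) {
    return -1;
  }
  /* Assumption: ticks * num fits in 64 bits, true for sane media files. */
  int64_t scaled = ticks * time_base.num;
  return (scaled + time_base.den / 2) / time_base.den;
}
```

With the 90k time base that ffprobe reported for the video stream, 900000 ticks come out as 10 seconds.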

Gluing it all together, you get something like this:

iex> FFbindings.file_info("sample.mp4")
%{
  'format' => 'QuickTime / MOV',
  'streams' => [
    %{'duration' => 10, 'type' => 'video'},
    %{'duration' => 10, 'type' => 'audio'},
    %{'duration' => 10, 'type' => 'audio'},
    %{'duration' => 10, 'type' => 'data'}
  ]
}

This is cool and all, but as I’ve hinted at previously in this post, NIFs are not a silver bullet. There are two major issues with them.

Slowing down the BEAM

Long-running NIFs may starve Elixir/Erlang-defined code.

It’s important to understand the concept of reductions and how the Erlang VM uses them to make everything work in a highly concurrent manner. The BEAM is a preemptive virtual machine: it suspends a process once that process has consumed more than a certain number of reductions (i.e., units of work) since it was last selected for execution. When you call native code, though, you’re outside the BEAM’s accounting and this concept of reductions disappears. With that, time becomes the best measurement. Or does it?

The Erlang team suggests that a well-behaved native function should return to its caller within 1 millisecond. Now, this can be hard to achieve, especially when you don’t fully control the native library you’re calling, which is our case with FFmpeg. If you are in control, you can use yielding NIFs or threaded NIFs, which chunk the work or let the Erlang side know about the state of the native processing (enif_consume_timeslice() reports consumed time from the C side so the scheduler can account for it). When you don’t have that control, and the native function runs for a long time, you can use a “dirty NIF” (either CPU-bound or I/O-bound): you flag the function in its ErlNifFunc entry with ERL_NIF_DIRTY_JOB_CPU_BOUND or ERL_NIF_DIRTY_JOB_IO_BOUND, which tells the Erlang VM it will have to look out for that NIF.

Then, if your runtime has dirty scheduler support, the BEAM creates two extra types of schedulers besides the ordinary scheduler threads: dirty CPU schedulers for CPU-bound jobs and dirty I/O schedulers for I/O-bound jobs, one of which will run your dirty NIF. This way ordinary scheduler threads are not held up by misbehaving native code.

Note: The Erlang runtime without SMP support does not support dirty schedulers even when the dirty scheduler support is explicitly enabled.
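To make the “chunk the work” idea behind yielding NIFs a bit more concrete, here is a small sketch in plain C with no erl_nif.h involved. The job (summing bytes) and the names checksum_job and checksum_step() are invented for illustration. The point is only the shape: all state lives in a struct, and each call does a bounded amount of work and then returns.

```c
#include <stddef.h>
#include <stdint.h>

/* All state needed to resume the job lives here (hypothetical example). */
typedef struct {
  const uint8_t *data; /* input buffer                    */
  size_t len;          /* total number of bytes           */
  size_t pos;          /* how many bytes are already done */
  uint64_t sum;        /* running result                  */
} checksum_job;

/* Process at most max_bytes of the job, then return.
 * Returns 1 when the whole job is finished, 0 when more work remains. */
static int checksum_step(checksum_job *job, size_t max_bytes) {
  size_t remaining = job->len - job->pos;
  size_t n = remaining < max_bytes ? remaining : max_bytes;
  for (size_t i = 0; i < n; i++) {
    job->sum += job->data[job->pos + i];
  }
  job->pos += n;
  return job->pos == job->len;
}
```

In a real yielding NIF, my understanding is that you would call something like checksum_step() in a loop, report progress with enif_consume_timeslice() after each step, and hand control back via enif_schedule_nif() once the timeslice is used up. None of this helps when the long-running call is a single opaque FFmpeg function, which is where dirty NIFs come in.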

Segfaulting the BEAM

Crashing the BEAM is easier than you’d think.

In the Erlang docs you can read that a NIF “is executed as a direct extension of the native code of the VM”. What this means is that even though the BEAM can’t reason about native code, it is intimately connected with it (given “it is dynamically linked into the emulator process”). Hence, if you have a problem in your native code, you might crash the entire VM, losing all of your long-running processes and state with it. This is definitely a deal breaker when it comes to delivering not only performant, but also reliable software. You don’t want some C memory safety issue grinding your entire Erlang VM to a halt.
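The code in this very post has a spot like that. av_get_media_type_string() is documented to return NULL for a media type it doesn’t know, and enif_make_string() expects a valid C string, so passing that NULL straight through would very likely take the VM down. A guard is cheap. The sketch below is plain C: media_type_name() is a made-up stand-in that only mimics the NULL-for-unknown behaviour, and or_default() is a hypothetical helper.

```c
#include <stddef.h>

/* Stand-in for av_get_media_type_string() (hypothetical, for this sketch):
 * like the real function, it returns NULL for a type it does not know. */
static const char *media_type_name(int type) {
  switch (type) {
    case 0:
      return "video";
    case 1:
      return "audio";
    default:
      return NULL;
  }
}

/* Never let a NULL char* travel further: swap in a fallback instead. */
static const char *or_default(const char *s, const char *fallback) {
  return s != NULL ? s : fallback;
}
```

In the stream loop, the value handed to enif_make_string() would then be or_default(av_get_media_type_string(...), "unknown") rather than the raw return value.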

If you’re feeling brave, you might want to try out rustler, which implements a “safe Rust bridge for creating Erlang NIF functions”. It looks good and apparently Discord is even using it in production! With all its memory safety guarantees and data-race-free threading, Rust seems like the perfect fit for writing safe NIFs.

You can alternatively look for other ways of interacting with code outside the BEAM. In this post I’ve covered NIFs, but there are also Ports, port drivers, JInterface and C nodes.
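To give a taste of what the Ports route looks like from the C side: a port is just an external program talking to the BEAM over stdin/stdout, and when it is opened with the {packet, 2} option (in Elixir, something like Port.open({:spawn_executable, path}, [:binary, {:packet, 2}])), every message is prefixed with its length as a 2-byte big-endian integer. The sketch below covers only that framing, in plain C, with function names I made up. The read/write loop around it is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill in the 2-byte big-endian length header used by {packet, 2} ports.
 * Returns 0 on success, -1 if the payload does not fit in 16 bits. */
static int encode_packet_len(size_t payload_len, uint8_t header[2]) {
  if (payload_len > 0xFFFF) {
    return -1;
  }
  header[0] = (uint8_t)(payload_len >> 8);   /* high byte first */
  header[1] = (uint8_t)(payload_len & 0xFF); /* then low byte   */
  return 0;
}

/* Read the payload length back out of a 2-byte header. */
static size_t decode_packet_len(const uint8_t header[2]) {
  return ((size_t)header[0] << 8) | (size_t)header[1];
}
```

The appeal is that if this external program segfaults, the BEAM only sees a closed port, not a dead VM.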

Conclusion

NIFs are not a silver bullet. Use them carefully. Or don’t use them at all? From what I could read online, the talks I’ve watched and the people I’ve talked to, it seems to me that Ports are what you want when you’re working with something outside of the BEAM.

Some useful links that helped me write this blog post follow: