March 23, 2013

Using riak to understand Erlang/OTP application structure

If you’re learning Erlang and want to build or understand real applications, or you’ve heard about Erlang and want to see a few of the concepts that underpin its ability to let you create massively scalable, robust, distributed applications, this post is just for you, my friend.

Shortly we will again be digging into the source code of riak, this time to see how a real-world, distributed Erlang/OTP application is structured. This follows on nicely from my last post as we journey down from a deployable Erlang package toward its inner workings.

We will see:

What state the Erlang VM is in when an application and its dependencies are started
How the application is actually started
What the supervision tree is, and how it is kicked off

An application is a promise

Erlang applications are often dependent on functionality in other applications, but not in the way you’re used to in other languages like C# or Java. In Erlang, an application’s dependencies are often other applications that actually need to be **running**, i.e. providing a service, and not just offering code for re-use.

Think of it this way: an Erlang application is a promise that the instance of the Erlang VM running it will also be running every dependency that needs to be running.

If you’re wondering why it is commonplace for an application to depend on other running applications, just think of it as a way to increase fault tolerance and gain other distributed-system advantages.
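One way to watch the promise being kept is to ask the VM to start an application and then list what is running. The sketch below is not taken from riak: the module name `promise_demo` is made up, and `ssl` is only an assumed stand-in for any application that declares dependencies of its own. It relies on `application:ensure_all_started/1`, which is only available in more recent OTP releases.

```erlang
-module(promise_demo).
-export([run/0]).

%% Starts an application together with everything it depends on, then
%% lists what the VM is now running. `ssl` is just an example app that
%% ships with OTP and has dependencies of its own.
run() ->
    %% ensure_all_started/1 starts each dependency that is not already
    %% running, then the application itself.
    {ok, Started} = application:ensure_all_started(ssl),
    io:format("started for ssl: ~p~n", [Started]),
    %% which_applications/0 returns {Name, Description, Vsn} tuples
    %% for every application currently running in this VM.
    Running = [Name || {Name, _Desc, _Vsn} <- application:which_applications()],
    io:format("now running: ~p~n", [Running]),
    Running.
```

Calling `application:start(ssl)` on its own in a fresh VM would instead return an error naming a dependency that has not been started, which is the promise being enforced.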

**riak’s promise…**

In the last post I showed riak’s release script. This contains the instructions about which applications to start. Here’s an extract:

https://gist.github.com/NTCoding/5229174.js

Some dependencies, such as folsom, have the optional third argument “load”, which means “don’t start the application, I just need to call some of its functions” (code re-use). All of the others assume the default value and so will be started up when riak is started.
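To make the started-versus-loaded distinction concrete, here is a rough sketch of a release resource (.rel) file. It is not riak’s actual file: the release name, application names and every version number are invented for illustration, and it assumes the standard .rel format in which each application entry may carry an optional third element.

```erlang
%% my_release.rel -- hypothetical release resource file
{release,
 {"my_release", "1.0.0"},        % release name and version (made up)
 {erts, "5.9.1"},                % erts version (made up)
 [
  {kernel, "2.15.1"},            % started when the release boots
  {stdlib, "1.18.1"},
  {my_service, "1.0.0"},         % no third element: started (default type `permanent`)
  {my_library, "1.0.0", load}    % loaded only, never started
 ]}.
```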

So a partial view of riak’s promise can be visualised like this:

Starting an application

When the time comes to start an Erlang/OTP application, the convention is to look in the {application_name}.app.src file, which points to a module implementing the application behaviour. This is akin (loosely) to a class inheriting a base class called “application” in C#, Java etc.

All applications that need to be started as part of the promise will also be started in this same way when the main application is started.

The naming convention for these entry-point modules is usually {application_name}_app.erl, and rebar will even guide you down this path.
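As a rough illustration of how the resource file points at the entry-point module, here is a minimal sketch of an .app.src file. The application name `my_app`, its description and its version are hypothetical; the keys themselves (`applications`, `mod` and so on) are the standard ones.

```erlang
%% src/my_app.app.src -- hypothetical application resource file
{application, my_app,
 [
  {description, "Example application"},
  {vsn, "0.1.0"},
  {registered, []},
  %% Applications that must already be running before my_app starts.
  {applications, [kernel, stdlib]},
  %% Entry point: the module implementing the application behaviour,
  %% plus the argument handed to its start/2 callback.
  {mod, {my_app_app, []}},
  {env, []}
 ]}.
```

The `applications` list is where the promise from earlier is written down, and `mod` is what leads the VM to the {application_name}_app.erl module.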

**How riak does it**

Funnily enough riak has no code; it just ensures all the applications it needs (its promise) will be running with some slight configuration. So there’s no entry-point to look at.

Cheer up, though, because we can definitely look at the entry-point of the riak_core application instead.

In riak_core’s case, according to the convention, the start point of the application should be riak_core_app.erl. Here are the first 20 or so lines of that file just to demonstrate its existence:

https://gist.github.com/NTCoding/5229196.js

The supervision tree

Erlang is built for concurrency and is fundamentally asynchronous, with thousands or even millions of processes alive at any one time, so it needs a facility to keep all of this under control. Supervision trees are the answer.

A supervision tree contains two broad types of nodes: supervisors and workers. Put simply, supervisors start and then watch children. When a child dies, the supervisor will restart it using a pre-configured strategy.

A common convention is that each application has a top-level supervisor with the name {application_name}_sup.erl. This supervisor is normally started in the {application_name}_app.erl file discussed previously.

Here’s the snippet from riak_core_app.erl that starts the tree in riak_core:

https://gist.github.com/NTCoding/5229244.js
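For readers who have not seen one before, an application callback module in its simplest form might look like the sketch below. This is not riak_core’s code: `my_app_app` and `my_app_sup` are hypothetical names following the conventions described above, and the supervisor module is assumed to exist elsewhere.

```erlang
-module(my_app_app).
-behaviour(application).

-export([start/2, stop/1]).

%% Called when my_app is started. It hands back whatever the top-level
%% supervisor's start_link/0 returns, normally {ok, Pid}.
start(_StartType, _StartArgs) ->
    my_app_sup:start_link().

%% Called after the application has stopped; nothing to clean up here.
stop(_State) ->
    ok.
```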

**riak_core’s supervision tree**

In a second we’ll look at riak_core’s supervision tree, but let me just fill you in on the basics of OTP supervisors.

Supervisor is another OTP behaviour, so supervisors are Erlang modules that “inherit” this behaviour. By convention such a module exports a function called start_link(), which is what we saw being invoked in the previous code example.

In riak_core’s supervisor, the function is implemented like this:

https://gist.github.com/NTCoding/5229285.js

Take my word for it (or learn Erlang) that the above code will look for a function inside the same module (riak_core_sup) called init() and invoke it. Inside init() we can see the supervisor starting its children:

https://gist.github.com/NTCoding/5229254.js

Each line starting with “?CHILD” contains the name of an Erlang module to start and supervise, with a flag indicating if it’s a worker or supervisor.

The bottom line says “if any one of these modules dies (has an error it can’t recover from), just restart that one” (one_for_one).
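Putting the pieces together, a bare-bones supervisor might look like the sketch below. It is not riak_core_sup: the module and child names are hypothetical, the CHILD macro is modelled on the one rebar’s supervisor template generates, and the restart limits (at most 10 restarts in 10 seconds) are purely illustrative.

```erlang
-module(my_app_sup).
-behaviour(supervisor).

-export([start_link/0]).
-export([init/1]).

%% Builds a child spec for module I, where Type is worker or supervisor:
%% {Id, {Module, Function, Args}, Restart, Shutdown, Type, Modules}
-define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).

%% Starts the supervisor process, registered locally under the module
%% name. OTP then calls init/1 in this module.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Returns the restart strategy and the list of children to start.
init([]) ->
    Children = [
        ?CHILD(my_app_worker, worker),        % hypothetical worker module
        ?CHILD(my_app_sub_sup, supervisor)    % hypothetical nested supervisor
    ],
    %% one_for_one: when a child dies, restart only that child.
    {ok, {{one_for_one, 10, 10}, Children}}.
```

Swapping one_for_one for one_for_all in the last line would make the supervisor restart every child whenever any one of them dies.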

Those modules tagged as “supervisor” will look similar to this one, with a list of modules they are going to start up and watch. But they might have a different restart strategy, such as “if one dies, restart them all” (one_for_all).

Here’s a partial view of riak_core’s supervision tree:
