Erlang: Building Fault-Tolerant Distributed Systems
AI TOOLS March 17, 2026, 11:30 p.m.

Erlang was born out of the need for highly reliable telecom systems, and its design still shines when you need fault‑tolerant distributed applications. In this article we’ll explore the core concepts that make Erlang resilient, walk through a few hands‑on examples, and see how real‑world services leverage these ideas to stay up even when parts of the system fail.

Erlang’s Concurrency Model

At the heart of Erlang lies the lightweight process. Unlike OS threads, an Erlang process starts with a footprint of just a few kilobytes, and a single node can run millions of them. Processes share no memory and communicate exclusively via asynchronous message passing, which eliminates an entire class of shared-state concurrency bugs.

Each process has a mailbox that stores incoming messages in FIFO order. The receive block pattern‑matches messages, allowing you to write clear, declarative logic for handling different events.
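As a minimal sketch of both ideas, the following module (a hypothetical echo server, not from any library) spawns a process, sends it one message, and pattern-matches the reply out of the caller's own mailbox:

```erlang
-module(echo).
-export([start/0, loop/0]).

%% Spawn an echo process, round-trip one message, return the reply.
start() ->
    Pid = spawn(?MODULE, loop, []),
    Pid ! {self(), hello},
    receive
        {Pid, Reply} -> Reply          %% selective receive: only this process's reply
    after 1000 ->
        timeout
    end.

%% The echo process: match a {From, Msg} message, send Msg back, recurse.
loop() ->
    receive
        {From, Msg} ->
            From ! {self(), Msg},
            loop()
    end.
```

The `after` clause guards against a reply that never arrives, a common idiom when talking to processes that might have died.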

Why immutable data matters

All Erlang data structures are immutable. When a process “updates” a value, it actually builds a new version (usually sharing structure with the old one), leaving the original untouched. This immutability guarantees that no process can inadvertently corrupt another's view of the data, a cornerstone of fault tolerance.
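A quick shell session makes this concrete: maps:put/3 returns a new map, and the original binding is unchanged.

```erlang
%% In the Erlang shell:
Original = #{count => 1}.
Updated  = maps:put(count, 2, Original).
1 = maps:get(count, Original).   %% the original map still holds the old value
2 = maps:get(count, Updated).    %% the "update" lives in a new map
```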

Building a Simple GenServer

GenServer is the workhorse behaviour for implementing server‑like processes. It abstracts the boilerplate of message loops, letting you focus on business logic. Below is a minimal key‑value store built with GenServer.

-module(kv_store).
-behaviour(gen_server).

%% API
-export([start_link/0, put/2, get/1, size/0]).

%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

put(Key, Value) ->
    gen_server:cast(?MODULE, {put, Key, Value}).

get(Key) ->
    gen_server:call(?MODULE, {get, Key}).

size() ->
    gen_server:call(?MODULE, size).

%% Callbacks
init([]) ->
    {ok, #{}}.

handle_cast({put, Key, Value}, State) ->
    NewState = maps:put(Key, Value, State),
    {noreply, NewState}.

handle_call({get, Key}, _From, State) ->
    Reply = maps:get(Key, State, undefined),
    {reply, Reply, State};
handle_call(size, _From, State) ->
    {reply, maps:size(State), State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

Notice how the server’s state is a simple map that lives only inside the process. If the process crashes, that state is gone; a supervisor can restart the process, but it comes back with a fresh, empty map unless you persist state elsewhere (for example in ETS or Mnesia).
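Assuming kv_store is compiled and loaded, a short shell session exercises the API:

```erlang
{ok, _Pid} = kv_store:start_link(),
ok = kv_store:put(name, "Joe"),        %% cast: returns ok immediately, no reply
"Joe" = kv_store:get(name),            %% call: synchronous round-trip
undefined = kv_store:get(missing).     %% maps:get/3 supplies the default
```

Note the asymmetry: put/2 is fire-and-forget (a cast), while get/1 blocks until the server replies (a call).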

Supervisors: The Safety Net

Supervisors monitor child processes and apply a predefined restart strategy when something goes wrong. The classic “one‑for‑one” strategy restarts only the failing child, keeping the rest of the tree untouched.

Let’s define a supervisor that starts our kv_store and a worker that periodically logs the store size.

-module(app_sup).
-behaviour(supervisor).

%% API
-export([start_link/0]).

%% supervisor callbacks
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    ChildSpecs = [
        #{id => kv_store,
          start => {kv_store, start_link, []},
          restart => permanent,
          shutdown => 5000,
          type => worker,
          modules => [kv_store]},
        #{id => logger,
          start => {kv_logger, start_link, []},
          restart => permanent,
          shutdown => 5000,
          type => worker,
          modules => [kv_logger]}
    ],
    {ok, {SupFlags, ChildSpecs}}.

The kv_logger process could look like this:

-module(kv_logger).
-behaviour(gen_server).

-export([start_link/0]).

-export([init/1, handle_info/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    timer:send_interval(5000, log_size),
    {ok, undefined}.

handle_info(log_size, State) ->
    Size = case whereis(kv_store) of
        undefined -> 0;
        _Pid ->
            %% Don't let a crashed or mid-restart store take the logger down
            try gen_server:call(kv_store, size)
            catch _:_ -> 0
            end
    end,
    io:format("~p entries in KV store~n", [Size]),
    {noreply, State};

handle_info(_Msg, State) ->
    {noreply, State}.

terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

Now, if kv_logger crashes due to a bad pattern match, the supervisor will restart only that logger, leaving the kv_store untouched. This isolation is the essence of Erlang’s “let it crash” philosophy.
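A quick shell experiment (assuming the modules above are compiled) demonstrates the restart: kill the logger brutally and verify the supervisor brings it back under a new pid.

```erlang
{ok, _Sup} = app_sup:start_link(),
OldPid = whereis(kv_logger),
exit(OldPid, kill),                    %% an untrappable, "brutal" kill
timer:sleep(100),                      %% give the supervisor a moment to restart it
NewPid = whereis(kv_logger),
true = is_pid(NewPid) andalso NewPid =/= OldPid.
```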

Pro tip: Keep your supervision tree shallow. Deep nesting makes it harder to reason about restart cascades and can hide latency spikes during mass restarts.

Distributed Nodes and Clustering

Erlang’s runtime can connect multiple nodes into a cluster with a single command line flag (-name or -sname). Once connected, any process can be addressed using the {Name, Node} tuple, enabling seamless distribution.

Let’s spin up two nodes, node_a@host and node_b@host, and have them share a single, globally registered counter.

% counter.erl
-module(counter).
-behaviour(gen_server).

-export([start_link/0, inc/0, value/0]).

-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({global, ?MODULE}, ?MODULE, 0, []).

inc() ->
    gen_server:cast({global, ?MODULE}, inc).

value() ->
    gen_server:call({global, ?MODULE}, get).

init(Initial) ->
    {ok, Initial}.

handle_cast(inc, Count) ->
    {noreply, Count + 1}.

handle_call(get, _From, Count) ->
    {reply, Count, Count}.

handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

The {global, ?MODULE} registration makes the process reachable from every node in the cluster. When node_a increments the counter, node_b sees the updated value on its next call, because both nodes are talking to the same single process.

To test, run:

# In terminal 1
erl -name node_a@127.0.0.1 -setcookie secret

# In terminal 2
erl -name node_b@127.0.0.1 -setcookie secret

Then, start the counter on one node and call it from either shell (note the module is counter, not the OTP counters module):

counter:start_link().
counter:inc().
counter:value().

Because the process is globally registered, both nodes operate on the same state, illustrating Erlang’s built‑in support for distributed state without external databases.

Handling Network Partitions

Network partitions are inevitable in real deployments. Erlang’s net_kernel monitors node connectivity and emits nodedown and nodeup messages. By handling these in a dedicated gen_server, you can trigger graceful degradation or rebalancing.

-module(cluster_watcher).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_info/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    net_kernel:monitor_nodes(true, [nodedown_reason]),
    {ok, #{}}.

%% Because monitor_nodes/2 was called with options, both messages
%% arrive as 3-tuples carrying an info list.
handle_info({nodeup, Node, _Info}, State) ->
    io:format("Node ~p joined the cluster~n", [Node]),
    {noreply, State};

handle_info({nodedown, Node, Info}, State) ->
    Reason = proplists:get_value(nodedown_reason, Info),
    io:format("Node ~p left: ~p~n", [Node, Reason]),
    % Insert custom logic: e.g., redistribute work
    {noreply, State};

handle_info(_Msg, State) -> {noreply, State}.

terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

This watcher can be added to your supervision tree, ensuring the cluster stays aware of topology changes and can react without manual intervention.

Pro tip: Use global:trans/2 when you need atomic operations across nodes. It provides a distributed lock that prevents race conditions during concurrent updates.
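A minimal sketch of global:trans/2 (the lock id my_lock is arbitrary): the fun runs only while the cluster-wide lock is held, and its return value is passed through to the caller.

```erlang
Result = global:trans({my_lock, self()},
                      fun() ->
                          %% critical section: at most one holder cluster-wide
                          done
                      end),
done = Result.   %% global:trans/2 returns the fun's result, or aborted
                 %% if the lock could not be taken
```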

Real‑World Use Cases

Telecommunications: The original domain for Erlang, where switches must handle millions of calls per second with near‑zero downtime. By structuring the switch as a hierarchy of supervisors, a single faulty call‑handling process never brings down the whole exchange.

Messaging Platforms: WhatsApp famously runs on Erlang, supporting billions of daily messages with a tiny engineering team. Its use of lightweight processes for each user session enables massive concurrency while keeping latency low.

IoT Gateways: Devices often suffer intermittent connectivity. An Erlang‑based gateway can spawn a dedicated process per device, automatically restarting failed connections without affecting the rest of the fleet.

All these scenarios share a common pattern: isolate state, supervise it, and let the runtime handle failures. When you adopt this pattern, you trade complex defensive coding for a simpler, more declarative architecture.

Best Practices for Fault‑Tolerant Design

  • Keep processes small: A process should do one thing well—e.g., handle a single client session or manage a specific resource.
  • Prefer cast over call for fire‑and‑forget actions: This reduces back‑pressure and keeps the system responsive.
  • Use supervisors everywhere: Even helper processes (like timers) benefit from being supervised.
  • Leverage OTP behaviours: GenServer, GenStatem, and GenEvent provide battle‑tested scaffolding.
  • Monitor external resources: Wrap database connections or HTTP clients in gen_servers so failures are contained.

Testing Fault Tolerance

Erlang’s observer tool visualizes process trees, message queues, and node health in real time. Pair it with sys:trace/2 to capture unexpected exits during load tests. Simulating crashes is as easy as calling exit(Pid, kill) on a child process and watching the supervisor react.

Pro tip: Automate crash injection in your CI pipeline. A simple script that randomly kills processes in a test cluster can surface hidden race conditions before they hit production.
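One possible shape for such a helper, sketched here as a hypothetical chaos module; the demo supervisor and idle workers exist only so the example is self-contained.

```erlang
-module(chaos).
-export([kill_random_child/1, demo/0]).
-export([init/1, start_worker/0]).

%% Pick a random live child of Sup and kill it brutally,
%% exercising the supervisor's restart logic.
kill_random_child(Sup) ->
    Pids = [Pid || {_Id, Pid, _Type, _Mods} <- supervisor:which_children(Sup),
                   is_pid(Pid)],
    case Pids of
        [] -> no_children;
        _  ->
            Victim = lists:nth(rand:uniform(length(Pids)), Pids),
            exit(Victim, kill),
            {killed, Victim}
    end.

%% Self-contained demo: a throwaway supervisor with two idle workers.
demo() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Result = kill_random_child(Sup),
    timer:sleep(100),                        %% let the supervisor restart the victim
    [_, _] = supervisor:which_children(Sup), %% both children are still present
    Result.

%% supervisor callback for the demo tree
init([]) ->
    Child = fun(Id) -> #{id => Id, start => {?MODULE, start_worker, []}} end,
    {ok, {#{strategy => one_for_one, intensity => 10, period => 5},
          [Child(w1), Child(w2)]}}.

start_worker() ->
    {ok, spawn_link(fun Idle() -> receive _ -> Idle() end end)}.
```

In CI you would point kill_random_child/1 at your real top-level supervisor and loop it under load instead of using the demo tree.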

Putting It All Together: A Mini Chat Service

Let’s combine the concepts into a tiny chat server. Users connect via TCP, each connection spawns a chat_session process, and a room_manager supervises chat rooms. The room manager uses a global name so any node can route messages to the correct room.

% chat_session.erl
-module(chat_session).
-behaviour(gen_server).

-export([start_link/1, send_msg/2]).

-export([init/1, handle_info/2, handle_cast/2, terminate/2, code_change/3]).

start_link(Socket) ->
    gen_server:start_link(?MODULE, Socket, []).

send_msg(Pid, Text) ->
    gen_server:cast(Pid, {msg, Text}).

init(Socket) ->
    inet:setopts(Socket, [{active, once}]),
    {ok, #{socket => Socket}}.

handle_info({tcp, Socket, Data}, State) ->
    % Assume the first word is the room name
    [Room|Rest] = string:tokens(string:trim(binary_to_list(Data)), " "),
    Msg = string:join(Rest, " "),
    room_manager:join(Room, self()),
    room_manager:publish(Room, Msg),
    inet:setopts(Socket, [{active, once}]),
    {noreply, State};

handle_info({tcp_closed, _Socket}, State) ->
    {stop, normal, State};

handle_info(_Other, State) -> {noreply, State}.

handle_cast({msg, Text}, #{socket := Sock}=State) ->
    gen_tcp:send(Sock, [Text, "\n"]),
    {noreply, State}.

terminate(_Reason, #{socket := Sock}) ->
    gen_tcp:close(Sock),
    ok.

code_change(_OldVsn, State, _Extra) -> {ok, State}.

% room_manager.erl
-module(room_manager).
-behaviour(gen_server).

-export([start_link/0, join/2, publish/2]).

-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({global, ?MODULE}, ?MODULE, #{}, []).

join(Room, Pid) ->
    gen_server:cast({global, ?MODULE}, {join, Room, Pid}).

publish(Room, Msg) ->
    gen_server:cast({global, ?MODULE}, {publish, Room, Msg}).

init(State) -> {ok, State}.

handle_cast({join, Room, Pid}, State) ->
    Users = maps:get(Room, State, []),
    case lists:member(Pid, Users) of
        true  -> {noreply, State};
        false ->
            % Monitor each member so dead sessions can be pruned
            erlang:monitor(process, Pid),
            {noreply, maps:put(Room, [Pid|Users], State)}
    end;

handle_cast({publish, Room, Msg}, State) ->
    Users = maps:get(Room, State, []),
    lists:foreach(fun(User) -> chat_session:send_msg(User, Msg) end, Users),
    {noreply, State}.

handle_info({'DOWN', _Ref, process, Pid, _Reason}, State) ->
    % A member died: remove it from every room
    Cleaned = maps:map(fun(_Room, Users) -> lists:delete(Pid, Users) end, State),
    {noreply, Cleaned};
handle_info(_Info, State) -> {noreply, State}.

terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

Both chat_session and room_manager are added to a top-level supervisor. If a client disconnects abruptly, its session process exits; the room manager notices (for example via a monitor on each member) and drops the dead pid from the room, leaving other participants unaffected. Adding a new node to the cluster automatically shares the same room_manager via the global registration, allowing users on different machines to chat together.

Conclusion

Erlang’s philosophy of “let it crash” combined with lightweight processes, powerful supervisors, and seamless distribution makes building fault‑tolerant systems remarkably straightforward. By keeping state isolated, leveraging OTP behaviours, and embracing the built‑in clustering primitives, you can design applications that stay alive even when individual components fail.

Whether you’re constructing a telecom switch, a massive messaging platform, or an IoT gateway, the patterns explored here scale from a single node to a global cluster with minimal code changes. Start experimenting with the examples, add supervision to every process, and watch your system become resilient by design.
