Watchdoging in Ada

By: Riccardo Bernardini

28 February 2020 at 16:46

This project was inspired by an article about how to write a thread watchdog in C. After reading it I thought "this would be a nice Ada project!"

So, here it is. This post is about my experience in writing it. My main motivation was to do an "exercise" in programming, but maybe it can be useful somewhere.

Task watchdog and how I did it

The problem is to monitor different tasks in a multi-task program and raise an alarm if a task stops working. A task proves that it is still alive by calling a specific procedure, I_Am_Alive. If it fails to call it regularly, it is considered dead and an alarm is raised.

Three ingredients are involved:

The watcher itself, that is, the task that regularly checks whether the other tasks are still alive.
A connection to the watchdog, used by the controlled task to tell the watchdog "Hey, I am still alive!"
An alarm handler that does something when the watchdog raises an alarm.

Let's examine the three ingredients.

The watcher

Let's check the package interface. The key ingredient is the watcher:

package Watchdogs.Connections is

   --
   -- Type representing the watcher, that is, the object that wakes up
   -- and checks if its tasks are still alive.
   --
   type Watcher_Type is private;

   --
   -- Create a new watcher type specifying an alarm handler and a sampling
   -- time, that is, the time interval between successive wake ups.
   --
   function Create (Alarm_Handler : Alarm_Handlers.Alarm_Handler_Access;
                    Sampling      : Duration := 1.0)
                    return Watcher_Type;

   -- Other stuff ...

end Watchdogs.Connections;

Watcher_Type represents the object that does the "dirty work." Each task to be monitored opens a connection to it and uses that connection to communicate with the watcher.

A watcher must be created with the function Create that expects as first parameter an access (a pointer in C jargon) to an alarm handler.

The alarm handler

The definition of an alarm handler is the following:

package Watchdogs.Alarm_Handlers is
   --
   -- Interface for an alarm handler.  Every handler you want to implement
   -- must descend from this and implement Task_Exited and Task_Unresponsive.
   --
   type Alarm_Handler_Interface is limited interface;


   type Alarm_Handler_Access is
     access all Alarm_Handler_Interface'Class;

For those not experienced in Ada: interface means that Alarm_Handler_Interface is an abstract type, so you cannot create variables of this type; it works as a template from which you derive new concrete types. limited means that you can derive limited types from it, that is, types that cannot be assigned (more on this below). Finally, Alarm_Handler_Interface'Class is a catch-all type that includes Alarm_Handler_Interface and every type derived from it. In other words, Alarm_Handler_Access can point to values of any type derived from Alarm_Handler_Interface.

The interface Alarm_Handler_Interface requires that any non-abstract descendant implement two procedures: Task_Unresponsive (called when a task does not respond anymore, maybe because it is stuck somewhere) and Task_Exited (called when a task exits).

   --
   -- Called when a task is unresponsive.  It receives the task identification
   -- (name and/or ID) and the latest registered checkpoint
   --
   procedure Task_Unresponsive (Handler    : in out Alarm_Handler_Interface;
                                ID         : Task_Identification.Task_Id;
                                Name       : String;
                                Checkpoint : Checkpoint_Type)
   is abstract
     with Pre'Class =>
       (Name /= "" or ID /= Task_Identification.Null_Task_Id);

   --
   -- Called when a task exits.  It receives the task identification
   -- (name and/or ID)
   --
   procedure Task_Exited (Handler    : in out Alarm_Handler_Interface;
                          ID         : Task_Identification.Task_Id;
                          Name       : String)
   is abstract
     with Pre'Class =>
       (Name /= "" or ID /= Task_Identification.Null_Task_Id);


end Watchdogs.Alarm_Handlers;

Both procedures expect as parameters an identification of the task, namely its Task_ID and/or a name. The ID or the name (but not both) can be empty. The procedure Task_Unresponsive also expects a Checkpoint value that can be used to know where the task got stuck; more about this later.

This approach allows users to implement their own alarm handler that can do anything. For convenience, package Watchdogs.Alarm_Handlers.To_Stderr defines a prêt-à-porter alarm handler that just prints a message to standard error.
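
To make the derivation mechanism concrete, here is a minimal, self-contained sketch. It does not use the library: the nested package Handlers is a reduced stand-in for Watchdogs.Alarm_Handlers (one primitive with a single Name parameter), and Counting.Handler_Type is a hypothetical handler invented for this illustration.

```ada
with Ada.Text_IO;

procedure Handler_Sketch is

   --  Stand-in for the library interface, reduced to one primitive
   --  (assumption: the real one has more parameters, see above).
   package Handlers is
      type Alarm_Handler_Interface is limited interface;

      type Alarm_Handler_Access is
        access all Alarm_Handler_Interface'Class;

      procedure Task_Unresponsive
        (Handler : in out Alarm_Handler_Interface;
         Name    : String)
      is abstract;
   end Handlers;

   --  Hypothetical concrete handler: prints a message and counts alarms.
   package Counting is
      type Handler_Type is
        limited new Handlers.Alarm_Handler_Interface with
         record
            Alarms : Natural := 0;
         end record;

      overriding procedure Task_Unresponsive
        (Handler : in out Handler_Type;
         Name    : String);
   end Counting;

   package body Counting is
      overriding procedure Task_Unresponsive
        (Handler : in out Handler_Type;
         Name    : String) is
      begin
         Handler.Alarms := Handler.Alarms + 1;
         Ada.Text_IO.Put_Line ("Task '" & Name & "' is unresponsive");
      end Task_Unresponsive;
   end Counting;

   Mine : aliased Counting.Handler_Type;

   --  The class-wide access can designate any descendant of the interface.
   H : constant Handlers.Alarm_Handler_Access := Mine'Access;
begin
   H.Task_Unresponsive (Name => "foo");   --  dispatches to Counting
   Ada.Text_IO.Put_Line ("Alarms so far:" & Natural'Image (Mine.Alarms));
end Handler_Sketch;
```

The Handler parameter is in out precisely so that a handler can carry state, such as the counter used here.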

This is an example (from main.adb) of how to create a watcher:

   --
   -- Get a watcher
   --
   Watcher : constant Connections.Watcher_Type :=
               Connections.Create (Alarm_Handler => new To_Stderr.Handler_Type);

The connection

The type for a connection to the watcher is Watchdog_Connection. Its definition is:

   --
   -- A watchdog connection allows a task to communicate with the
   -- watcher
   --
   type Watchdog_Connection (<>) is limited private;

If you have no experience with Ada you may find the syntax above a bit obscure. The (<>) means that Watchdog_Connection can have some unknown discriminants. Without going into technical details, this prevents the user from declaring a variable of this type without initializing it. The limited part means that you cannot copy a value of type Watchdog_Connection: a value is born, lives and dies in the same variable. This is useful for values that carry an "external connection" and that it makes no sense to copy.
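
The effect of the two keywords can be tried out with a small, self-contained sketch. The type Handle and its operations are hypothetical names made up for this illustration; they are not part of the library, and the full type chosen here (a record with a length discriminant) is just one possible completion.

```ada
with Ada.Text_IO;

procedure Limited_Sketch is

   package Handles is
      --  Unknown discriminants + limited: objects must be initialized
      --  at their declaration and can never be copied.
      type Handle (<>) is limited private;

      function Open (Name : String) return Handle;
      function Name_Of (H : Handle) return String;
   private
      type Handle (Length : Natural) is limited
         record
            Name : String (1 .. Length);
         end record;
   end Handles;

   package body Handles is
      function Open (Name : String) return Handle is
      begin
         --  The result of a limited type is built directly in the
         --  object being declared: no copy takes place.
         return (Length => Name'Length, Name => Name);
      end Open;

      function Name_Of (H : Handle) return String is (H.Name);
   end Handles;

   --  Accepted: the object is initialized by a function call.
   A : constant Handles.Handle := Handles.Open ("first");

   --  Rejected by the compiler: no initialization.
   --  B : Handles.Handle;

   --  Rejected by the compiler: this would copy a limited object.
   --  C : Handles.Handle := A;
begin
   Ada.Text_IO.Put_Line (Handles.Name_Of (A));
end Limited_Sketch;
```

Uncommenting either of the two rejected declarations should turn the mistake into a compile-time error rather than a run-time surprise.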

A connection is created with the function Open:

   --
   -- Open a connection with the watcher.  The task needs to introduce itself
   -- with a name or a Task_ID, possibly both.  Those values will be passed
   -- to the Alarm_Handler if the task becomes unresponsive.
   --
   function Open (Watchdog  : Watcher_Type;
                  Task_Name : String := "";
                  ID        : Task_Identification.Task_Id := Task_Identification.Null_Task_Id)
                  return Watchdog_Connection
     with
       Pre => (Task_Name /= "" or ID /= Task_Identification.Null_Task_Id);

The function expects the watcher to connect to and a way to identify the task. The identification can be a name, the task ID or both, but at least one of them must be present, as specified by the pre-condition.

Again, if you have no experience with Ada you may wonder what the Pre => ... part is. It is a pre-condition, a condition that must be satisfied when you call the function. It can be considered part of the documentation, but it has the advantage that the compiler (if instructed to do so) can add code that checks the pre-condition at run-time and raises an exception if it is not satisfied. A powerful bug trap...
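
Here is a minimal sketch of a pre-condition in action. Greeting is a made-up function, unrelated to the library; the pragma Assertion_Policy (Check) asks the compiler to generate the run-time check, which is one way of "instructing" it (with GNAT, the -gnata switch is another).

```ada
with Ada.Assertions;
with Ada.Text_IO;

procedure Pre_Sketch is
   --  Ask for assertion checks (pre-conditions included) in this scope
   pragma Assertion_Policy (Check);

   --  Hypothetical function: the name must not be empty
   function Greeting (Name : String) return String
     with Pre => Name /= "";

   function Greeting (Name : String) return String is
   begin
      return "Hello, " & Name;
   end Greeting;
begin
   Ada.Text_IO.Put_Line (Greeting ("foo"));

   --  This call violates the pre-condition: the check fails before
   --  the body of Greeting is executed
   Ada.Text_IO.Put_Line (Greeting (""));
exception
   when Ada.Assertions.Assertion_Error =>
      Ada.Text_IO.Put_Line ("Pre-condition violated");
end Pre_Sketch;
```

The first call prints the greeting; the second one never reaches the body of Greeting and ends up in the exception handler.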

When the task ends, the connection is closed automatically (and Task_Exited is called) by the destructor of the connection, that is, when the connection object is finalized.
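
The post does not show how the connection implements this, so what follows is only a sketch of the usual Ada mechanism for "do something when the object dies", a controlled type, under the assumption that the library does something similar. Guard is a hypothetical type; the message stands in for closing the connection.

```ada
with Ada.Finalization;
with Ada.Text_IO;

procedure Finalize_Sketch is

   package Guards is
      --  Limited_Controlled: a limited type whose Finalize procedure
      --  is called automatically when an object ceases to exist
      type Guard is
        new Ada.Finalization.Limited_Controlled with null record;

      overriding procedure Finalize (Object : in out Guard);
   end Guards;

   package body Guards is
      overriding procedure Finalize (Object : in out Guard) is
      begin
         --  A real connection would notify the watcher here
         Ada.Text_IO.Put_Line ("connection closed");
      end Finalize;
   end Guards;

   task Worker;

   task body Worker is
      G : Guards.Guard;   --  lives as long as the task body
   begin
      Ada.Text_IO.Put_Line ("working");
   end Worker;            --  G is finalized when the task completes

begin
   null;   --  the main procedure waits for Worker to terminate
end Finalize_Sketch;
```

The task never calls Finalize explicitly: the message is printed after "working", when the task body is left.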

After opening a connection, the task must declare that it is alive; it does so by calling I_Am_Alive:

   type Checkpoint_Type is mod 2 ** 16;
   --
   -- Let the watcher know that we are still alive.  If this procedure is
   -- called at different points of the task, the calls can be told
   -- apart via the Checkpoint parameter.  Its only purpose is to know
   -- which call to I_Am_Alive was the last one before the task got
   -- stuck.  Its value will be given to the alarm handler.
   --
   procedure I_Am_Alive (Connection : in out Watchdog_Connection;
                         Checkpoint : Checkpoint_Type := 0);

The procedure I_Am_Alive accepts a Checkpoint parameter (actually a 16-bit unsigned integer, declared as a modular type) that the task can use to distinguish between different calls to I_Am_Alive. If the task gets stuck, the latest Checkpoint is given to the alarm handler together with the task identification (ID or name), making it possible to identify where the task got stuck.

An example

Here is a very simple example of how a connection is used. It is a simplified version of what you find in main.adb:

   task body Foo is

      Connection : Connections.Watchdog_Connection :=
                     Connections.Open (Watchdog  => Watcher,
                                       Task_Name => "my name is foo, task foo",
                                       ID        => Task_Identification.Current_Task);

      Sleep_Time : Duration := 0.1;
   begin
      --
      -- Now we are connected with the watcher that will check that we
      -- call I_Am_Alive regularly
      --
      loop
         --
         -- At every iteration we increase the Sleep_Time so that sooner
         -- or later it will exceed the wake up time of the watcher
         --
         delay Sleep_Time;
         Sleep_Time := Sleep_Time + 0.2;

         --
         -- Tell the watcher we are alive
         --
         Connections.I_Am_Alive (Connection);
      end loop;
   end Foo;

Digging in the internals

The user API is nice and cool, but you want a bit of gory details about the implementation, right? OK, so let's check out the private definition of the watcher from package Watchdogs.Connections:

private
    type Watcher_Type is access Watchers.Watchdog_Core;

Uh?!? That's it? Just an access to a "core type"? That's cheating...

Well, let's check the definition of Watchdog_Core in Watchdogs.Watchers:

private package Watchdogs.Watchers is
   --
   -- Object doing all the work.  This exports an interface similar
   -- to the user visible Watcher_Type.  This object is multitask safe
   -- (with Ada it is just too easy...)
   --
   type Watchdog_Core is limited private;
private
   --
   -- Other stuff... 
   --

   type Watchdog_Core is limited
      record
         Doa_Table : Task_Table_Access;
         Watcher   : Watchdog_Task_Access;
         Handler   : Alarm_Handlers.Alarm_Handler_Access;
      end record;
end Watchdogs.Watchers;

Several comments are in order.

First, do you see the keyword private before package? It means that Watchdogs.Watchers is a private package: it cannot be made visible outside the Watchdogs hierarchy. In particular, the library user (i.e., the programmer who uses the library) cannot access the resources provided by Watchdogs.Watchers directly, but only through Watchdogs.Connections, which withs Watchdogs.Watchers as follows:

   private with Watchdogs.Watchers;

The keyword private before with says "Listen, I need the resources in Watchdogs.Watchers, but I promise, cross my heart, that never ever I'll let the user see it". Indeed, if you check watchdogs-connections.ads you'll see that Watchdogs.Watchers is referred to only in the private part of the package, out of reach of the prying hands of the user...

Second, the definition of Watchdog_Core looks simple, just three fields. The last one is the access to the alarm handler (this is easy), but what are the other two fields? Here the hard stuff lies... ;-)

Let's begin with the easier one: the field Watcher is a Watchdog_Task_Access, which we can guess is an access to Watchdog_Task, but what is the latter? Well, a task:

   task type Watchdog_Task is
      --
      -- This task is the real watchdog: it wakes up every now and then,
      -- checks the task table for dead tasks and, if necessary, calls
      -- the alarm handler
      --
      entry Init (Sampling : Duration;
                  Table    : Task_Table_Access;
                  Handler  : Alarm_Handlers.Alarm_Handler_Access);
   end Watchdog_Task;

   type Watchdog_Task_Access is access Watchdog_Task;

Since we declared it as a task type, Watchdog_Task behaves like a type and we can, for example, declare variables of this type. Declaring a variable of type Watchdog_Task starts a new task that runs in parallel. In Ada, synchronization between tasks is traditionally done by message passing, via calls to task entries (a rendezvous). In this case the entry is just used to hand the task a few parameters. The task will wake up every Sampling seconds, check for unresponsive tasks and call the handler if necessary.
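
The same pattern (a task type that receives its parameters through an entry and then works on its own) can be sketched in a few lines. Ticker and its parameters are hypothetical names chosen for this illustration only.

```ada
with Ada.Text_IO;

procedure Task_Sketch is

   task type Ticker is
      --  Used only to hand over the parameters, like Init in the
      --  watchdog task
      entry Init (Period : Duration; Count : Positive);
   end Ticker;

   task body Ticker is
      My_Period : Duration;
      My_Count  : Positive;
   begin
      --  Wait here until somebody calls Init (rendezvous)
      accept Init (Period : Duration; Count : Positive) do
         My_Period := Period;
         My_Count  := Count;
      end Init;

      --  From now on the task proceeds on its own
      for I in 1 .. My_Count loop
         delay My_Period;
         Ada.Text_IO.Put_Line ("tick" & Integer'Image (I));
      end loop;
   end Ticker;

   T : Ticker;   --  declaring the object creates the task
begin
   --  T has been activated and is waiting at its accept statement
   T.Init (Period => 0.1, Count => 3);

   --  The main procedure does not return until T has terminated
end Task_Sketch;
```

The caller is blocked only for the duration of the accept body; after that, caller and task run concurrently.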

OK, cool, and what about the Task_Table_Access that appears in the parameter list (Table) and in the definition of Watchdog_Core (Doa_Table)? Well, here is where most of the complexity is hidden. A Task_Table_Access is an access to a Task_Table, which in turn has the following definition:

   --
   -- The protected object Task_Table is the core data structure.
   -- It keeps track of which tasks are still alive and which ones
   -- did not confirm that they are alive.
   --
   protected type Task_Table is
      -- Register that the task associated with the connection
      -- just went by the checkpoint
      procedure Mark_Alive (Connection : Connection_ID;
                            Checkpoint : Checkpoint_Type);

      -- Get the set of tasks that did not declare themselves alive
      procedure Get_Dead_Tasks (Set : out Connection_To_Checkpoint_Tables.Map);

      -- Reset the state, setting all tasks as "to be confirmed alive"
      procedure Reset;

      -- Delete a task
      procedure Delete (Connection : Connection_ID);

      -- Allocate a new connection ID to a task
      procedure Get_New_Id (Connection : out Connection_ID;
                            Name       : String;
                            ID         : Task_Identification.Task_Id);

      function ID_Of (Connection : Connection_ID)
                      return Task_Identification.Task_Id;

      function Name_Of (Connection : Connection_ID)
                        return String;
   private
      --
      -- It works in this way: we keep two sets of tasks, Alive (the
      -- tasks that declared themselves alive) and Dead (the tasks that
      -- still have to declare themselves alive).  At timeout we read the
      -- Dead set and raise a warning for the tasks in it; then we copy
      -- (with a Reset) Alive to Dead, restarting the iteration
      --
      Alive   : Connection_To_Checkpoint_Tables.Map;
      Dead    : Connection_To_Checkpoint_Tables.Map;
      Next_Id : Connection_ID := Connection_ID'First;

      -- Keep name and ID of the tasks associated with a connection
      Connection_Table : Connection_To_Task_Tables.Map;
   end Task_Table;

A Task_Table is the object that stores the state of the monitored tasks: whether they are dead or alive, and their identification. It is a protected type, which means that it is accessed according to a reader/writer model: many tasks can read it at the same time (through protected functions), but writers (protected procedures) have exclusive access. The compiler takes care of inserting the required synchronization code.
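
To see the "compiler takes care of it" part at work, here is a small self-contained sketch, unrelated to the library code: Counter is a hypothetical protected type that four tasks update concurrently, without any explicit lock.

```ada
with Ada.Text_IO;

procedure Protected_Sketch is

   protected type Counter is
      procedure Increment;             --  writer: exclusive access
      function Value return Natural;   --  reader: read-only access
   private
      Count : Natural := 0;
   end Counter;

   protected body Counter is
      procedure Increment is
      begin
         Count := Count + 1;
      end Increment;

      function Value return Natural is
      begin
         return Count;
      end Value;
   end Counter;

   Shared : Counter;

   task type Worker;

   task body Worker is
   begin
      for I in 1 .. 1_000 loop
         Shared.Increment;   --  no explicit locking needed
      end loop;
   end Worker;

begin
   declare
      Workers : array (1 .. 4) of Worker;
   begin
      null;   --  the block is left only when all workers are done
   end;

   Ada.Text_IO.Put_Line ("Total:" & Natural'Image (Shared.Value));
end Protected_Sketch;
```

Since every Increment runs with exclusive access, no update is lost and the program prints a total of 4000.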

This object is manipulated mainly by the watcher task, whose body is:

   task body Watchdog_Task is
      Task_Table      : Task_Table_Access;
      Alarm_Handler   : Alarm_Handlers.Alarm_Handler_Access;
      Sampling_Period : Duration;

      Dead_Tasks      : Connection_To_Checkpoint_Tables.Map;

      use Connection_To_Checkpoint_Tables;
   begin
      -- Accept calls to the Init entry
      accept Init (Sampling : Duration;
                   Table    : Task_Table_Access;
                   Handler  : Alarm_Handlers.Alarm_Handler_Access)
      do
         Sampling_Period := Sampling;
         Task_Table := Table;
         Alarm_Handler := Handler;
      end Init;

      loop
         delay Sampling_Period;  -- get some sleep

         --
         -- Extract from the task table the tasks that did not
         -- claim to be alive
         --
         Task_Table.Get_Dead_Tasks (Dead_Tasks);

         --
         -- Iterate over the list of dead tasks
         --
         for Pos in Dead_Tasks.Iterate loop
            declare
               Connection : constant Connection_ID := Key (Pos);
               Checkpoint : constant Checkpoint_Type := Element (Pos);
            begin
               --
               -- Call the alarm handler with the task data
               --
               Alarm_Handler.Task_Unresponsive
                 (ID         => Task_Table.Id_Of (Connection),
                  Name       => Task_Table.Name_Of (Connection),
                  Checkpoint => Checkpoint);

               --
               --  Remove the task from the table
               --
               Task_Table.Delete (Connection);
            end;
         end loop;

         Dead_Tasks.Clear;

         --
         -- All the tasks that were declared alive get marked as
         -- dead.  Let them prove that they are alive! :-)
         --
         Task_Table.Reset;
      end loop;

   end Watchdog_Task;
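
The bodies of the Task_Table operations are not shown in this post. As a rough, single-task sketch of the two-set bookkeeping described in the comment of Task_Table, one could write something like the following. It assumes that Mark_Alive moves a connection from Dead to Alive and that Reset replaces Dead with Alive; the real code may differ, and the protected type is left out to keep the example short.

```ada
with Ada.Containers.Ordered_Maps;
with Ada.Text_IO;

procedure Two_Sets_Sketch is
   type Connection_ID is new Positive;
   type Checkpoint_Type is mod 2 ** 16;

   package Tables is new Ada.Containers.Ordered_Maps
     (Key_Type     => Connection_ID,
      Element_Type => Checkpoint_Type);

   Alive : Tables.Map;   --  reported during the current period
   Dead  : Tables.Map;   --  still have to report

   --  Assumed behaviour of Mark_Alive
   procedure Mark_Alive (Connection : Connection_ID;
                         Checkpoint : Checkpoint_Type) is
   begin
      Dead.Exclude (Connection);
      Alive.Include (Connection, Checkpoint);
   end Mark_Alive;

   --  Assumed behaviour of Reset: everybody must prove it again
   procedure Reset is
   begin
      Dead := Alive;
      Alive.Clear;
   end Reset;

   --  What the watcher does when it wakes up
   procedure Report is
   begin
      for Pos in Dead.Iterate loop
         Ada.Text_IO.Put_Line
           ("Connection" & Connection_ID'Image (Tables.Key (Pos))
            & " silent, last checkpoint"
            & Checkpoint_Type'Image (Tables.Element (Pos)));
      end loop;
   end Report;
begin
   Mark_Alive (1, 10);
   Mark_Alive (2, 20);
   Reset;                --  a new sampling period starts

   Mark_Alive (1, 11);   --  only connection 1 reports in time
   Report;               --  connection 2 is reported as silent
   Reset;
end Two_Sets_Sketch;
```

In this run only connection 2 is reported, together with checkpoint 20, the last one it declared before going silent.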

Conclusion

As I said, I wrote this for the fun of it and, indeed, fun it was (Yoda-style). I hope you found this interesting.
