Watchdogging in Ada
This project was inspired by an article about how to write a thread watchdog in
C. After reading it I thought "this would be a nice Ada project!"
So, here it is. This post is about my experience in writing it. My main motivation was to do an "exercise" in programming, but maybe it can be useful somewhere.
Task watchdog and how I did it
The problem is to monitor the tasks of a multitasking program and raise an alarm if one of them stops working. A task proves that it is still alive by regularly calling a specific procedure, I_Am_Alive. If it fails to do so, it is considered dead and an alarm is raised.
Three ingredients are involved:
- The watcher itself, that is, the task that regularly checks whether the other tasks are still alive.
- A connection to the watchdog, used by the controlled task to tell the watchdog "Hey, I am still alive!"
- An alarm handler that does something when the watchdog raises an alarm.
Let's examine the three ingredients
Let's check the package interface. The key ingredient is the watcher:
    package Watchdogs.Connections is
       --
       -- Type representing the watcher, that is, the object that wakes up
       -- and checks if its tasks are still alive.
       --
       type Watcher_Type is private;

       --
       -- Create a new watcher type specifying an alarm handler and a sampling
       -- time, that is, the time interval between successive wake ups.
       --
       function Create (Alarm_Handler : Alarm_Handlers.Alarm_Handler_Access;
                        Sampling      : Duration := 1.0)
                        return Watcher_Type;

       -- Other stuff ...
    end Watchdogs.Connections;
Watcher_Type represents the object that does the "dirty work." The task to be controlled will make a connection to it and use it to communicate with the watcher.
A watcher must be created with the function Create, which expects as its first parameter an access (a pointer, in C jargon) to an alarm handler.
The alarm handler
The definition of an alarm handler is the following:
    package Watchdogs.Alarm_Handlers is
       --
       -- Interface for an alarm handler. Every handler you want to implement
       -- must descend from this and implement Task_Exited and Task_Unresponsive.
       --
       type Alarm_Handler_Interface is limited interface;

       type Alarm_Handler_Access is access all Alarm_Handler_Interface'Class;
For those not experienced in Ada: Alarm_Handler_Interface is an abstract type, so you cannot create variables of this type; it works like an interface template from which you derive new concrete types. limited means that you can derive limited types from it, that is, types that cannot be assigned (more on this below). Finally, Alarm_Handler_Interface'Class is a catch-all type that includes Alarm_Handler_Interface and every type derived from it. In other words, Alarm_Handler_Access can point to values of any type derived from Alarm_Handler_Interface.
Alarm_Handler_Interface requires that any non-abstract descendant implement two procedures: Task_Unresponsive (called when a task does not respond anymore, maybe because it is stuck somewhere) and Task_Exited (called when a task exits):
       --
       -- Called when a task is unresponsive. It receives the task identification
       -- (name and/or ID) and the latest registered checkpoint
       --
       procedure Task_Unresponsive (Handler    : in out Alarm_Handler_Interface;
                                    ID         : Task_Identification.Task_Id;
                                    Name       : String;
                                    Checkpoint : Checkpoint_Type)
       is abstract
         with Pre'Class => (Name /= "" or ID /= Task_Identification.Null_Task_Id);

       --
       -- Called when a task exits. It receives the task identification
       -- (name and/or ID)
       --
       procedure Task_Exited (Handler : in out Alarm_Handler_Interface;
                              ID      : Task_Identification.Task_Id;
                              Name    : String)
       is abstract
         with Pre'Class => (Name /= "" or ID /= Task_Identification.Null_Task_Id);
    end Watchdogs.Alarm_Handlers;
Both procedures expect as parameters an identification of the task, namely its Task_ID and/or a name. Either the ID or the name (but not both) can be empty. The procedure Task_Unresponsive also expects a Checkpoint value that can be used to find out where the task got stuck; more about this later.
This approach lets users implement their own alarm handler, which can do anything. For convenience, the package Watchdogs.Alarm_Handlers.To_Stderr defines a prêt-à-porter alarm handler that just prints a message to the standard error.
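To give an idea of what a user-defined handler might look like, here is a minimal, self-contained sketch. It does not use the real Watchdogs.Alarm_Handlers package: the nested package Handlers is a cut-down stand-in with a single operation and a simplified parameter list, while Counting, its Alarms field and Handler_Demo are hypothetical names chosen for the illustration.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Handler_Demo is

   --  Cut-down stand-in for Watchdogs.Alarm_Handlers (an assumption made
   --  to keep the example self-contained; the real interface has more
   --  parameters and a second procedure, Task_Exited).
   package Handlers is
      type Alarm_Handler_Interface is limited interface;
      type Alarm_Handler_Access is access all Alarm_Handler_Interface'Class;

      procedure Task_Unresponsive
        (Handler : in out Alarm_Handler_Interface;
         Name    : String) is abstract;
   end Handlers;

   --  Hypothetical concrete handler: it counts the alarms and prints them.
   package Counting is
      type Handler_Type is limited new Handlers.Alarm_Handler_Interface
      with record
         Alarms : Natural := 0;
      end record;

      overriding procedure Task_Unresponsive
        (Handler : in out Handler_Type;
         Name    : String);
   end Counting;

   package body Counting is
      overriding procedure Task_Unresponsive
        (Handler : in out Handler_Type;
         Name    : String) is
      begin
         Handler.Alarms := Handler.Alarms + 1;
         Put_Line (Standard_Error, "task " & Name & " is unresponsive");
      end Task_Unresponsive;
   end Counting;

   --  The access type can point to any descendant of the interface.
   H : constant Handlers.Alarm_Handler_Access := new Counting.Handler_Type;

begin
   --  The call dispatches to Counting.Task_Unresponsive at run time.
   H.Task_Unresponsive (Name => "foo");
end Handler_Demo;
```

The code that calls the handler only knows about the interface, so any behaviour (logging, restarting the task, sending a message) can be plugged in without touching it.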
This is an example (from main.adb) of how to create a watcher:
    --
    -- Get a watcher
    --
    Watcher : constant Connections.Watcher_Type :=
                Connections.Create (Alarm_Handler => new To_Stderr.Handler_Type);
The type for a connection to the watcher is Watchdog_Connection. Its definition is:
    --
    -- A watchdog connection allows a task to communicate with the
    -- watcher
    --
    type Watchdog_Connection (<>) is limited private;
If you have no experience with Ada you may find the syntax above a bit obscure. The (<>) part says that Watchdog_Connection can have some unknown discriminants. Without entering into technical details, this prevents the user from declaring a variable of this type without initializing it. The limited part means that you cannot copy a value of type Watchdog_Connection: a value is born, lives and dies in the same variable. This is useful for values that carry an "external connection" and that it makes no sense to copy.
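The effect of the two features can be seen in a small, self-contained sketch. Everything in it (Tokens, Token, Make, Label_Of, Limited_Demo) is a hypothetical name invented for the example; it only mimics the shape of the Watchdog_Connection declaration.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Limited_Demo is

   package Tokens is
      --  Same shape as Watchdog_Connection: unknown discriminants + limited
      type Token (<>) is limited private;

      function Make (Label : String) return Token;
      function Label_Of (T : Token) return String;
   private
      type Token (Length : Natural) is limited record
         Label : String (1 .. Length);
      end record;
   end Tokens;

   package body Tokens is
      function Make (Label : String) return Token is
      begin
         --  A limited object is built directly in its destination
         return (Length => Label'Length, Label => Label);
      end Make;

      function Label_Of (T : Token) return String is (T.Label);
   end Tokens;

   --  Legal: the object is initialized by a constructor function
   T : constant Tokens.Token := Tokens.Make ("foo");

   --  Both declarations below are rejected by the compiler:
   --  U : Tokens.Token;        --  no initialization (unknown discriminants)
   --  V : Tokens.Token := T;   --  copy of a limited object

begin
   Put_Line (Tokens.Label_Of (T));
end Limited_Demo;
```

The only way to obtain a Token is to go through Make, which is exactly the guarantee a connection type needs: no connection can exist without having been opened properly.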
A connection is created with the function:
    --
    -- Open a connection with the watcher. The task needs to introduce itself
    -- with a name or a Task_ID, possibly both. Those values will be passed
    -- to the Alarm_Handler if the task becomes unresponsive.
    --
    function Open (Watchdog  : Watcher_Type;
                   Task_Name : String := "";
                   ID        : Task_Identification.Task_Id :=
                                 Task_Identification.Null_Task_Id)
                   return Watchdog_Connection
      with Pre => (Task_Name /= "" or ID /= Task_Identification.Null_Task_Id);
The function expects the watcher to connect to and a way to identify the task: a name, the task ID or both. At least one of them must be present, as specified by the pre-condition.
Again, if you have no experience with Ada you may wonder what the part with Pre => ... is. It is a pre-condition, a condition that must be satisfied when you call the function. It can be considered part of the documentation, but it has the advantage that the compiler (if instructed to do so) can add code that checks the pre-condition at run-time and raises an exception if it is not satisfied. A powerful bug trap...
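A minimal sketch of a pre-condition at work, assuming that checks are enabled (here through pragma Assertion_Policy; with GNAT the -gnata switch has the same effect). Describe and Pre_Demo are hypothetical names; the condition mirrors the "name or number, at least one" rule of Open.

```ada
pragma Assertion_Policy (Check);   --  ask the compiler to check contracts

with Ada.Assertions;
with Ada.Text_IO; use Ada.Text_IO;

procedure Pre_Demo is

   --  At least one of the two identifications must be given
   function Describe (Name : String := ""; Number : Natural := 0) return String
     with Pre => (Name /= "" or Number /= 0);

   function Describe (Name : String := ""; Number : Natural := 0) return String
   is
   begin
      return (if Name /= "" then Name else "task #" & Natural'Image (Number));
   end Describe;

begin
   --  Pre-condition satisfied: the call goes through
   Put_Line (Describe (Name => "foo"));

   begin
      --  Both parameters left at their defaults: the pre-condition fails
      Put_Line (Describe);
   exception
      when Ada.Assertions.Assertion_Error =>
         Put_Line ("pre-condition violated");
   end;
end Pre_Demo;
```

With checks disabled the second call would simply run with two empty identifications, which is why it pays to keep the checks on at least during testing.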
When the task ends, the connection is automatically closed (and Task_Exited is called) by the destructor of the connection.
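In Ada this kind of automatic clean-up is usually obtained with controlled types, whose Finalize procedure runs when an object ceases to exist. The sketch below shows the general mechanism only; it is an assumption about how such a destructor can be written, not the library's actual code, and Guards, Guard, Worker and Finalize_Demo are hypothetical names.

```ada
with Ada.Finalization;
with Ada.Text_IO; use Ada.Text_IO;

procedure Finalize_Demo is

   package Guards is
      --  Limited_Controlled: cannot be copied, gets a call to Finalize
      type Guard is new Ada.Finalization.Limited_Controlled with null record;

      overriding procedure Finalize (Object : in out Guard);
   end Guards;

   package body Guards is
      overriding procedure Finalize (Object : in out Guard) is
      begin
         --  A real connection would notify the watcher here
         Put_Line ("connection closed");
      end Finalize;
   end Guards;

   task Worker;

   task body Worker is
      G : Guards.Guard;   --  lives as long as the task body
   begin
      Put_Line ("working");
      --  When the task body ends, G is finalized automatically
   end Worker;

begin
   null;   --  the main procedure waits for Worker to terminate
end Finalize_Demo;
```

Since finalization also happens when the task body is left because of an exception, the task cannot forget to close its connection.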
After opening a connection, the task must declare that it is alive; it does so by calling I_Am_Alive:
    type Checkpoint_Type is mod 2 ** 16;

    --
    -- Let the watcher know that we are still alive. If this procedure is
    -- called in different points of the task it is possible to distinguish
    -- different calls via the Checkpoint parameter. The reason for having
    -- it is just to know what is the latest instance of I_Am_Alive
    -- called before the task crashed. Its value will be given to the
    -- alarm handler.
    --
    procedure I_Am_Alive (Connection : in out Watchdog_Connection;
                          Checkpoint : Checkpoint_Type := 0);
I_Am_Alive can accept a Checkpoint parameter (a 16-bit unsigned integer, actually) that the task can use to distinguish between different calls of I_Am_Alive. If the task gets stuck, the latest Checkpoint is given to the alarm handler together with the task identification (ID or name), making it possible to identify where the task got stuck.
This is a very simple example of how a connection is used, a simplified version of what you find in main.adb:
    task body Foo is
       Connection : Connections.Watchdog_Connection :=
                      Connections.Open (Watchdog  => Watcher,
                                        Task_Name => "my name is foo, task foo",
                                        ID        => Task_Identification.Current_Task);

       Sleep_Time : Duration := 0.1;
    begin
       --
       -- Now we are connected with the watcher that will check that we
       -- call I_Am_Alive regularly
       --
       loop
          --
          -- At every iteration we increase the Sleep_Time so that sooner
          -- or later it will exceed the wake up time of the watcher
          --
          delay Sleep_Time;
          Sleep_Time := Sleep_Time + 0.2;

          --
          -- Tell the watcher we are alive
          --
          Connections.I_Am_Alive (Connection);
       end loop;
    end Foo;
Digging in the internals
The user API is nice and cool, but you want a few gory details about the implementation, right? OK, so let's check out the private definition of the watcher from package Watchdogs.Connections:
    private
       type Watcher_Type is access Watchers.Watchdog_Core;
Uh?!? That's it? Just an access to a "core type"? That's cheating...
Well, let's check the definition of Watchdog_Core:
    private package Watchdogs.Watchers is
       --
       -- Object doing all the work. This exports an interface similar
       -- to the user visible Watcher_Type. This object is multitask safe
       -- (with Ada it is just too easy...)
       --
       type Watchdog_Core is limited private;

    private
       --
       -- Other stuff...
       --
       type Watchdog_Core is limited record
          Doa_Table : Task_Table_Access;
          Watcher   : Watchdog_Task_Access;
          Handler   : Alarm_Handlers.Alarm_Handler_Access;
       end record;
    end Watchdogs.Watchers;
Several comments are in order.
First, do you see the keyword private in front of package? This means that Watchdogs.Watchers is a private package and it cannot be made visible outside the hierarchy of Watchdogs. In particular, the library user (i.e., the programmer who uses the library) cannot access the resources provided by Watchdogs.Watchers directly, but only through Watchdogs.Connections, which imports it with
    private with Watchdogs.Watchers;
The private in front of with says "Listen, I need the resources in Watchdogs.Watchers, but I promise, cross my heart, that I'll never ever let the user see them". Indeed, if you check watchdogs-connections.ads you'll see that Watchdogs.Watchers is referred to only in the private part of the package, out of reach of the prying hands of the user...
Second, the definition of Watchdog_Core looks simple, just three fields. The last one is the access to the alarm handler (this is easy), but what are the other two fields? Here the hard stuff lies... ;-)
Let's begin with the easy stuff: the field Watcher is a Watchdog_Task_Access which, we can guess, is an access to Watchdog_Task. But what is the latter? Well, a task:
    task type Watchdog_Task is
       --
       -- This task is the real watchdog: it wakes up every now and then,
       -- checks the task table for dead tasks and, if necessary, calls
       -- the alarm handler
       --
       entry Init (Sampling : Duration;
                   Table    : Task_Table_Access;
                   Handler  : Alarm_Handlers.Alarm_Handler_Access);
    end Watchdog_Task;

    type Watchdog_Task_Access is access Watchdog_Task;
Since we declared it as a task type, Watchdog_Task behaves as a type and we can, for example, declare variables of this type. Declaring a variable of type Watchdog_Task starts a new task that proceeds in parallel. In Ada, synchronization is traditionally done by message passing, that is, by calling a task entry. In this case the entry is just used to give the task a few parameters. The task will wake up every Sampling seconds, check for unresponsive tasks and call, if necessary, the handler.
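The pattern of a task type that receives its parameters through an entry can be sketched in a few self-contained lines. Ticker, its Period and Count parameters and Entry_Demo are hypothetical names; only the structure (declare a task object, then call Init) follows Watchdog_Task.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Entry_Demo is

   task type Ticker is
      --  Used only to hand the start-up parameters to the task
      entry Init (Period : Duration; Count : Positive);
   end Ticker;

   task body Ticker is
      My_Period : Duration;
      My_Count  : Positive;
   begin
      --  Block here until somebody calls Init (a rendezvous)
      accept Init (Period : Duration; Count : Positive) do
         My_Period := Period;
         My_Count  := Count;
      end Init;

      for I in 1 .. My_Count loop
         delay My_Period;
         Put_Line ("tick" & Integer'Image (I));
      end loop;
   end Ticker;

   T : Ticker;   --  declaring the variable starts the task

begin
   --  The caller and the task meet at the entry; afterwards both proceed
   T.Init (Period => 0.05, Count => 3);
end Entry_Demo;
```

The task copies the parameters into its own variables inside the accept statement because the entry parameters are visible only while the rendezvous lasts.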
OK, cool, and what about Table in the parameter list and Doa_Table in the definition of Watchdog_Core? Well, here is where most of the complexity is hidden. A Task_Table_Access is an access to a Task_Table, which in turn has the following definition:
    --
    -- The protected object Task_Table is the core data structure.
    -- It keeps which tasks are still alive and which ones did not
    -- confirm that they are alive.
    --
    protected type Task_Table is
       -- Register that the task associated with the connection
       -- just went by the checkpoint
       procedure Mark_Alive (Connection : Connection_ID;
                             Checkpoint : Checkpoint_Type);

       -- Get the set of tasks that did not declare themselves alive
       procedure Get_Dead_Tasks (Set : out Connection_To_Checkpoint_Tables.Map);

       -- Reset the state, setting all tasks as "to be confirmed alive"
       procedure Reset;

       -- Delete a task
       procedure Delete (Connection : Connection_ID);

       -- Allocate a new connection ID to a task
       procedure Get_New_Id (Connection : out Connection_ID;
                             Name       : String;
                             ID         : Task_Identification.Task_Id);

       function ID_Of (Connection : Connection_ID) return Task_Identification.Task_Id;

       function Name_Of (Connection : Connection_ID) return String;
    private
       --
       -- It works in this way: we keep two sets of "tasks": Alive (the
       -- tasks that declared to be alive) and Dead (the tasks that still
       -- have to declare to be alive). At timeout we read the Dead list
       -- and raise a warning for the tasks in the list; afterwards we copy
       -- (with a Reset) Alive to Dead, restarting the iteration
       --
       Alive : Connection_To_Checkpoint_Tables.Map;
       Dead  : Connection_To_Checkpoint_Tables.Map;

       Next_Id : Connection_ID := Connection_ID'First;

       -- Keep name and ID of the tasks associated with a connection
       Connection_Table : Connection_To_Task_Tables.Map;
    end Task_Table;
Task_Table is the object that stores the state of the monitored tasks: whether they are dead or alive and how they are identified. It is a protected type, which means that it is accessed according to a reader/writer model: many tasks can read it at the same time (through its functions), but writers (its procedures) have exclusive access. The compiler takes care of inserting the required synchronization code.
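The two-set bookkeeping can be tried out with a much reduced, self-contained sketch. It is not the library's Task_Table: it uses plain sets instead of maps, drops checkpoints and task names, and returns the dead set through a function rather than an out parameter. Table, Register, Dead_Tasks, ID_Sets and Table_Demo are hypothetical names.

```ada
with Ada.Containers.Ordered_Sets;
with Ada.Text_IO; use Ada.Text_IO;

procedure Table_Demo is

   type Connection_ID is new Positive;

   package ID_Sets is new Ada.Containers.Ordered_Sets (Connection_ID);

   --  Reduced stand-in for Task_Table
   protected Table is
      procedure Register (ID : Connection_ID);     --  writer: exclusive
      procedure Mark_Alive (ID : Connection_ID);   --  writer: exclusive
      procedure Reset;                             --  writer: exclusive
      function Dead_Tasks return ID_Sets.Set;      --  reader: can be shared
   private
      Alive : ID_Sets.Set;
      Dead  : ID_Sets.Set;
   end Table;

   protected body Table is
      procedure Register (ID : Connection_ID) is
      begin
         Dead.Include (ID);   --  a new task still has to prove it is alive
      end Register;

      procedure Mark_Alive (ID : Connection_ID) is
      begin
         if Dead.Contains (ID) then
            Dead.Delete (ID);
            Alive.Include (ID);
         end if;
      end Mark_Alive;

      procedure Reset is
      begin
         Dead := Alive;       --  everybody must confirm again
         Alive.Clear;
      end Reset;

      function Dead_Tasks return ID_Sets.Set is
      begin
         return Dead;
      end Dead_Tasks;
   end Table;

begin
   Table.Register (1);
   Table.Register (2);
   Table.Mark_Alive (1);      --  only connection 1 reports in

   declare
      Silent : constant ID_Sets.Set := Table.Dead_Tasks;
   begin
      for ID of Silent loop
         Put_Line ("unresponsive connection:" & Connection_ID'Image (ID));
      end loop;
   end;

   Table.Reset;               --  start the next sampling period
end Table_Demo;
```

Note that no lock or mutex appears anywhere: declaring the object as protected is all that is needed to make the calls safe when they come from different tasks.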
This object is manipulated mainly by the watcher task, whose body is:
    task body Watchdog_Task is
       Task_Table      : Task_Table_Access;
       Alarm_Handler   : Alarm_Handlers.Alarm_Handler_Access;
       Sampling_Period : Duration;
       Dead_Tasks      : Connection_To_Checkpoint_Tables.Map;

       use Connection_To_Checkpoint_Tables;
    begin
       -- Accept calls to the Init entry
       accept Init (Sampling : Duration;
                    Table    : Task_Table_Access;
                    Handler  : Alarm_Handlers.Alarm_Handler_Access)
       do
          Sampling_Period := Sampling;
          Task_Table := Table;
          Alarm_Handler := Handler;
       end Init;

       loop
          delay Sampling_Period; -- get some sleep

          --
          -- Extract from the task table the tasks that did not
          -- claim to be alive
          --
          Task_Table.Get_Dead_Tasks (Dead_Tasks);

          --
          -- Iterate over the list of dead tasks
          --
          for Pos in Dead_Tasks.Iterate loop
             declare
                Connection : constant Connection_ID := Key (Pos);
                Checkpoint : constant Checkpoint_Type := Element (Pos);
             begin
                --
                -- Call the alarm handler with the task data
                --
                Alarm_Handler.Task_Unresponsive
                  (ID         => Task_Table.Id_Of (Connection),
                   Name       => Task_Table.Name_Of (Connection),
                   Checkpoint => Checkpoint);

                --
                -- Remove the task from the table
                --
                Task_Table.Delete (Connection);
             end;
          end loop;

          Dead_Tasks.Clear;

          --
          -- All the tasks that were declared alive get marked as
          -- dead. Let them prove that they are alive! :-)
          --
          Task_Table.Reset;
       end loop;
    end Watchdog_Task;
As I said, I wrote this for the fun of it and, indeed, fun it was (Yoda-style). I hope you found this interesting.