2021-11-17 - Scheduler API


1. Background
-------------

The scheduler relies on two major parts:
  - the wait queue or timers queue, which contains an ordered tree of the next
    timers to expire

  - the run queue, which contains tasks that were already woken up and are
    waiting for a CPU slot to execute.

There are two types of schedulable objects in HAProxy:
  - tasks: they contain one timer and can be in the run queue without leaving
    their place in the timers queue.

  - tasklets: they do not have the timers part and are either sleeping or
    running.

Each thread will have both a timers queue and a run queue. A task or tasklet
may only be queued in a single of each at a time. The timers queue may only
be manipulated by its owner. A thread has two runqueues, one local and one
shared. The local one is only ever accessed by the thread, while the shared
one is there so that other threads can add tasks to be run on that thread, and
is locked.

In case of doubt, keep in mind that it's not permitted to manipulate another
thread's private task or tasklet, and that any task held by another thread
might vanish while it's being looked at.

Internally a large part of the task and tasklet struct is shared between
the two types, which reduces code duplication and eases the preservation
of fairness in the run queue by interleaving all of them. As such, some
fields or flags may not always be relevant to tasklets and may be ignored.


Tasklets do not use a thread mask but use a thread ID instead, to which they
are bound. If the thread ID is -1, the tasklet is not bound but may only
be run on the calling thread.

A task with a negative thread ID is a shared task, and can run on any thread.
If its thread ID is -1, it means it is currently not owned by any thread. When
a thread adds it to its timers queue, or its run queue, it will change the
task's thread ID to -2 - calling thread ID, indicating that it now temporarily
owns that task.

If any other thread wants to have that task scheduled at a later time, or
run, it will add it to that thread's run queue. If the goal is to add it to
the timers queue, the TASK_WOKEN_WQ flag will be used, and the target thread
will queue it to its timers queue.

The task thread ID will be set to -1 by the owner thread when it is no longer
in any queue, or running, indicating that that task no longer has any owner,
and any thread can run it.

To try to make sure a task won't stick on a thread forever, when a thread
removes a task from its timers queue because it should run, if it detects
that it has many more tasks than another random thread, it will just give
the ownership to that other thread, and get the task to run on it.

Tasklets do not use negative thread IDs to indicate a thread affinity, however
since tasks may occasionally be aliased as tasklets, some of the tasklet
handling functions take care of this specificity in order to preserve a full
compatibility between them.


2. API
------

There are few functions exposed by the scheduler. A few more ones are in fact
accessible but if not documented there they'd rather be avoided or used only
when absolutely certain they're suitable, as some have delicate corner cases.
In doubt, checking the sched.pdf diagram may help.

int total_run_queues()
        Return the approximate number of tasks in run queues. This is racy
        and a bit inaccurate as it iterates over all queues, but it is
        sufficient for stats reporting.

int task_in_rq(t)
        Return non-zero if the designated task is in the run queue (i.e. it was
        already woken up).

int task_in_wq(t)
        Return non-zero if the designated task is in the timers queue (i.e. it
        has a valid timeout and will eventually expire).

int thread_has_tasks()
        Return non-zero if the current thread has some work to be done in the
        run queue. This is used to decide whether or not to sleep in poll().

void task_wakeup(t, f)
        Will make sure task <t> will wake up, that is, will execute at least
        once after the start of the function is called. The task flags <f> will
        be ORed on the task's state, among TASK_WOKEN_* flags exclusively. In
        multi-threaded environments it is safe to wake up another thread's task
        and even if the thread is sleeping it will be woken up. Users have to
        keep in mind that a task running on another thread might very well
        finish and go back to sleep before the function returns. It is
        permitted to wake the current task up, in which case it will be
        scheduled to run another time after it returns to the scheduler.

struct task *task_unlink_wq(t)
        Remove the task from the timers queue if it was in it, and return it.
        It may only be done for the local thread, or a task owned by the
        current thread. It must not be done for another thread's task.

void task_queue(t)
        Place or update task <t> into the timers queue, where it may already
        be, scheduling it for an expiration at date t->expire. If t->expire is
        infinite, nothing is done, so it's safe to call this function without
        prior checking the expiration date. It is only valid to call this
        function for tasks local to this thread, or for shared tasks, not for
        other threads' local tasks.

void task_set_thread(t, id)
        Change task <t>'s thread ID to new value <id>. This may only be
        performed by the task itself while running. This is only used to let a
        task voluntarily migrate to another thread. Thread id -1 is used to
        indicate "any thread". It's ignored and replaced by zero when threads
        are disabled.

void tasklet_wakeup(tl, [flags])
        Make sure that tasklet <tl> will wake up, that is, will execute at
        least once. The tasklet will run on its assigned thread, or on any
        thread if its TID is negative. An optional <flags> value may be passed
        to set a wakeup cause on the tasklet's flags, typically TASK_WOKEN_* or
        TASK_F_UEVT*. When not set, 0 is passed (i.e. no flags are changed).

struct list *tasklet_wakeup_after(head, tl, [flags])
        Schedule tasklet <tl> to run immediately the current one if <head> is
        NULL, or after the last queued one if <head> is non-null. The new head
        is returned, to be passed to the next call. The purpose here is to
        permit instant wakeups of resumed tasklets that still preserve
        ordering between them. A typical use case is for a mux' I/O handler to
        instantly wake up a series of urgent streams before continuing with
        already queued tasklets. This may induce extra latencies for pending
        jobs and must only be used extremely carefully when it's certain that
        the processing will benefit from using fresh data from the L1 cache.
        An optional <flags> value may be passed to set a wakeup cause on the
        tasklet's flags, typically TASK_WOKEN_* or TASK_F_UEVT*. When not set,
        0 is passed (i.e. no flags are changed).

void tasklet_wakeup_here(tl, [flags])
        Make sure that tasklet <tl> will wake up on the calling thread, that
        is, will execute at least once. The tasklet must not be bound to any
        other thread; its tid must be either -1 or the caller's tid. Just in
        case, only use tasklet_wakeup() which will check the tasklet's assigned
        thread ID. An optional <flags> value may be passed to set a wakeup
        cause on the tasklet's flags, typically TASK_WOKEN_* or TASK_F_UEVT*.
        When not set, 0 is passed (i.e. no flags are changed).

void tasklet_wakeup_on(tl, thr, [flags])
        Make sure that tasklet <tl> will wake up on thread <thr>, that is, will
        execute at least once. Any valid thread number may be specified. If the
        tasklet is not yet queued, then it will be added to the designated
        thread's shared queue. While this method also works for wakeups on the
        calling thread, it's less efficient and whenever the calling thread
        knows it will wake up one of its own tasks, it ought to use
        tasklet_wakeup_here() or tasklet_wakeup() instead. An optional <flags>
        value may be passed to set a wakeup cause on the tasklet's flags,
        typically TASK_WOKEN_* or TASK_F_UEVT*. When not set, 0 is passed (i.e.
        no flags are changed).

struct tasklet *tasklet_new()
        Allocate a new tasklet and set it to run by default on the calling
        thread. The caller may change its tid to another one before using it.
        The new tasklet is returned.

struct task *task_new_anywhere()
        Allocate a new task to run on any thread, and return the task, or NULL
        in case of allocation issue. Note that such tasks will be marked as
        shared and will go through the locked queues, thus their activity will
        be heavier than for other ones. See also task_new_here().

struct task *task_new_here()
        Allocate a new task to run on the calling thread, and return the task,
        or NULL in case of allocation issue.

struct task *task_new_on(t)
        Allocate a new task to run on thread <t>, and return the task, or NULL
        in case of allocation issue.

void task_destroy(t)
        Destroy this task. The task will be unlinked from any timers queue,
        and either immediately freed, or asynchronously killed if currently
        running. This may only be done by one of the threads this task is
        allowed to run on. Developers must not forget that the task's memory
        area is not always immediately freed, and that certain misuses could
        only have effect later down the chain (e.g. use-after-free).

void tasklet_free()
        Free this tasklet, which must not be running, so that may only be
        called by the thread responsible for the tasklet, typically the
        tasklet's process() function itself.

void task_schedule(t, d)
        Schedule task <t> to run no later than date <d>. If the task is already
        running, or scheduled for an earlier instant, nothing is done. If the
        task was not in queued or was scheduled to run later, its timer entry
        will be updated. This function assumes that it will never be called
        with a timer in the past nor with TICK_ETERNITY. Only one of the
        threads assigned to the task may call this function.

The task's ->process() function receives the following arguments:

  - struct task *t: a pointer to the task itself. It is always valid.

  - void *ctx     : a copy of the task's ->context pointer at the moment
                    the ->process() function was called by the scheduler. A
                    function must use this and not task->context, because
                    task->context might possibly be changed by another thread.
                    For instance, the muxes' takeover() function do this.

  - uint state    : a copy of the task's ->state field at the moment the
                    ->process() function was executed. A function must use
                    this and not task->state as the latter misses the wakeup
                    reasons and may constantly change during execution along
                    concurrent wakeups (threads or signals).

The possible state flags to use during a call to task_wakeup() or seen by the
task being called are the following; they're automatically cleaned from the
state field before the call to ->process()

  - TASK_WOKEN_INIT    each creation of a task causes a first wakeup with this
                       flag set. Applications should not set it themselves.

  - TASK_WOKEN_TIMER   this indicates the task's expire date was reached in the
                       timers queue. Applications should not set it themselves.

  - TASK_WOKEN_IO      indicates the wake-up happened due to I/O activity. Now
                       that all low-level I/O processing happens on tasklets,
                       this notion of I/O is now application-defined (for
                       example stream-interfaces use it to notify the stream).

  - TASK_WOKEN_SIGNAL  indicates that a signal the task was subscribed to was
                       received. Applications should not set it themselves.

  - TASK_WOKEN_MSG     any application-defined wake-up reason, usually for
                       inter-task communication (e.g filters vs streams).

  - TASK_WOKEN_RES     a resource the task was waiting for was finally made
                       available, allowing the task to continue its work. This
                       is essentially used by buffers and queues. Applications
                       may carefully use it for their own purpose if they're
                       certain not to rely on existing ones.

  - TASK_WOKEN_OTHER   any other application-defined wake-up reason.

  - TASK_WOKEN_WQ      The task should be added in our own timers queue.
                       If no other TASK_WOKEN* flag was set, then don't
                       actually run the task, it's just been woken up so that
                       it can be added to the queue.

  - TASK_F_UEVT1       one-shot user-defined event type 1. This is application
                       specific, and reset to 0 when the handler is called.

  - TASK_F_UEVT2       one-shot user-defined event type 2. This is application
                       specific, and reset to 0 when the handler is called.

  - TASK_F_UEVT3       one-shot user-defined event type 3. This is application
                       specific, and reset to 0 when the handler is called.

In addition, a few persistent flags may be observed or manipulated by the
application, both for tasks and tasklets:

  - TASK_SELF_WAKING   when set, indicates that this task was found waking
                       itself up, and its class will change to bulk processing.
                       If this behavior is under control temporarily expected,
                       and it is not expected to happen again, it may make
                       sense to reset this flag from the ->process() function
                       itself.

  - TASK_RT            when set, indicates that the task or tasklet has strong
                       real-time constraints. A task with this flag will bypass
                       the priority ordering of the run queue in order to
                       guarantee a wakeup time as close as possible to the
                       scheduled one even under load. A task or tasklet with
                       this flag will be executed in its own queue before all
                       other queues are consulted. When tune.sched.low-latency
                       is set, no new tasks will be dequeued from the priority
                       queue if such a task is present in the run queue. Only
                       one such task or tasklet may be executed per round so
                       this must be restricted to a few timing-critical tasks
                       only (those for which a one millisecond skew is not
                       acceptable). Such tasks/tasklets are not meant to be
                       woken up by other threads than the one they are supposed
                       to run on, otherwise(their constraints may not be
                       honored.

  - TASK_HEAVY         when set, indicates that this task does so heavy
                       processing that it will become mandatory to give back
                       control to I/Os otherwise big latencies might occur. It
                       may be set by an application that expects something
                       heavy to happen (tens to hundreds of microseconds), and
                       reset once finished. An example of user is the TLS stack
                       which sets it when an imminent crypto operation is
                       expected.

  - TASK_F_USR1        This is the first application-defined persistent flag.
                       It is always zero unless the application changes it. An
                       example of use cases is the I/O handler for backend
                       connections, to mention whether the connection is safe
                       to use or might have recently been migrated.

Finally, when built with -DDEBUG_TASK, an extra sub-structure "debug" is added
to both tasks and tasklets to note the code locations of the last two calls to
task_wakeup() and tasklet_wakeup().
