[DGD] Kernel library question

Thu Aug 23 15:10:40 CEST 2018

bart at wotf.org wrote:

>[...]
> I ran into some issues with this on wotf, which inspired me to implement
> suspending call_outs.
>
> However, wotf does not rebuild all objects in a single task, rather, it does
> an atomic recompile of critical components in a single task, and if that works
> out, it does a recompile of the remaining objects, and allows those to fail if
> they don't compile (and keep those that failed in the update queue so they can
> easily be recompiled after fixing possible errors).
>
> The 'critical' part is enough to give an admin access to the 'mud shell' so
> they can run code or recompile things etc.
>
> Anyway, recompiling the remaining objects happens with a chain of call_outs,
> and while it is running, all other call_outs are suspended (intercepted when
> they fire, and rescheduled).
>
> Not suspending call_outs did result in new code calling outdated code (and the
> other way around) every so often, which at times caused problems.
>
> As everything gets tested on a dev instance first, objects not compiling is
> rather exceptional, so while the risk of old code calling new code (or the
> other way around) still exists when an object fails to recompile, this tends
> to be more exceptional than a call_out firing and causing trouble while the
> recompile is still running.
>
> Anyway, I am curious what issues you ran into with call_out suspension beyond
> the one mentioned (which imo should be solvable, without needing some
> centralized administration that would cause lots of potential rollbacks for
> hydra).
>
> What I'm looking at for lpcdb (which currently uses the same approach as wotf,
> ie the old lpmud style find/remove_call_out api, but this is problematic for
> that codebase) is to use an lwo as proxy for the call_out handle, and pass
> this lwo to the call_out handler in the auto object. That allows rescheduling
> the call_out, informing the proxy object of the new handle, and hiding that from
> any code using the proxy call_out handle. This way I won't need to keep track
> of the mapping between 'virtual' call_out handles and the ones used by DGD in
> some shared (between call_outs) data structure.
>
> While lpcdb does not run on hydra, at least not on the limited version (it
> needs the closures extension, and unless I only have a small user database,
> will easily outgrow the number of objects allowed in the limited version), I
> wonder if this approach would work with hydra without causing excessive rollbacks.
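
The proxy idea described in the quoted text can be sketched roughly as
follows. This is a toy model in Python rather than LPC, so that it can be
run standalone; it is not lpcdb or DGD code. The names Scheduler,
CallOutProxy, advance and retry_delay are all hypothetical, and the
"suspended" flag stands in for whatever intercepts a call_out when it
fires.

```python
import heapq
import itertools


class CallOutProxy:
    """Stable handle given to user code; hides the real (changing) handle."""

    def __init__(self, scheduler):
        self._scheduler = scheduler
        self.handle = None          # current low-level handle, None once done

    def cancel(self):
        """Remove the call_out, whichever low-level handle it has by now."""
        if self.handle is None:
            return False
        removed = self._scheduler.remove(self.handle)
        self.handle = None
        return removed


class Scheduler:
    """Toy stand-in for a driver's call_out queue (assumption, not DGD's API)."""

    def __init__(self):
        self._queue = []                    # heap of (due_time, handle)
        self._pending = {}                  # handle -> (proxy, func, args)
        self._handles = itertools.count(1)
        self.now = 0
        self.suspended = False

    def _schedule(self, proxy, delay, func, args):
        handle = next(self._handles)
        self._pending[handle] = (proxy, func, args)
        heapq.heappush(self._queue, (self.now + delay, handle))
        proxy.handle = handle               # inform the proxy of the new handle

    def call_out(self, delay, func, *args):
        proxy = CallOutProxy(self)
        self._schedule(proxy, delay, func, args)
        return proxy

    def remove(self, handle):
        return self._pending.pop(handle, None) is not None

    def advance(self, ticks, retry_delay=1):
        """Move time forward; retry_delay must be positive."""
        self.now += ticks
        while self._queue and self._queue[0][0] <= self.now:
            _, handle = heapq.heappop(self._queue)
            entry = self._pending.pop(handle, None)
            if entry is None:
                continue                    # was removed before it fired
            proxy, func, args = entry
            if self.suspended:
                # Intercepted: reschedule under a new handle, same proxy.
                self._schedule(proxy, retry_delay, func, args)
            else:
                proxy.handle = None
                func(*args)
```

Code that holds the proxy can still cancel a call_out after it has been
rescheduled any number of times, and no shared table mapping virtual
handles to real ones is needed; the mapping lives in each proxy.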

The basic problem I was trying to solve with suspended callouts was a
Hydra issue: how can I perform a task which changes a lot of objects,
making it susceptible to rollback, but which, when broken up into smaller
tasks, runs the risk of allowing other intervening tasks to run with
partially changed data or code?  The prime example being a global
recompile which changes an API.

A global recompile is expensive, and a global recompile which is rolled
back several times is even more expensive.  So the idea was to suspend
callouts, do the recompile in a single task which will not be rolled
back because there are no competing tasks running, and then release
the callout suspension.

There were several problems with the implementation.  There were bugs,
especially with saving callouts that were triggered during suspension.
There was extra overhead for callout management even when callouts
were not suspended.  Callouts which were triggered while suspended would
still run a task in the object, and could thereby still prevent a task
which modifies that object along with many others from completing
without rollback.  And suspended callouts were saved in a central
object, which would also be used by any other callouts triggered during
suspension, meaning that when there are thousands of callouts trying to
run and getting suspended, a large number of them will be rolled back.
The cure was starting to look worse, or at least no better, than the
disease.
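
The effect of a central object on rollbacks can be illustrated with a
toy model. It is written in Python rather than LPC so that it can be run
standalone, and it is not Hydra code. It rests on assumptions that are
not stated in this message: up to `width` tasks run in parallel per round,
they commit in order, and a task is rolled back and retried if an earlier
task in the same round already wrote to an object it also writes. The
function name count_rollbacks and the object names are made up.

```python
def count_rollbacks(tasks, width):
    """Count rollbacks in a simplified optimistic-execution model.

    tasks: list of sets, each holding the names of objects a task writes.
    width: number of tasks assumed to run in parallel per round.
    """
    pending = list(tasks)
    rollbacks = 0
    while pending:
        batch, pending = pending[:width], pending[width:]
        committed = set()       # objects written by tasks committed this round
        retry = []
        for writes in batch:
            if writes & committed:
                # Conflict with an earlier commit: roll back, retry later.
                rollbacks += 1
                retry.append(writes)
            else:
                committed |= writes
        # Rolled-back tasks go first in the next round.
        pending = retry + pending
    return rollbacks


# Eight triggered callouts that each touch their own object plus one
# shared object where suspended callouts are saved.
central = [{"obj%d" % i, "central"} for i in range(8)]

# The same callouts without the shared object.
independent = [{"obj%d" % i} for i in range(8)]

print(count_rollbacks(central, 4))      # 18
print(count_rollbacks(independent, 4))  # 0
```

In this model every task that shares the central object conflicts with
whichever task commits first in its round, so only one of them makes
progress per round, while the independent tasks never conflict at all.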

Furthermore, as I gained insight into the issue, I also found different
ways to prevent rollbacks within Hydra, for cases where I had assumed
them to be inevitable.  Also, it is quite simple to start a task that
must modify many objects with an action that will cause an immediate
rollback if the completion of the task is not already guaranteed, for
example by writing to a file.  Even though this does not prevent a
rollback, the rollback will occur before the expensive global recompile,
rather than after it.

In the end, this was a major factor in stopping development of the
kernel library, taking a snapshot of it, and altering that radically
in backward-incompatible ways to better suit Hydra, as part of the
Cloud Server library.

Regards,
Felix Croes