lib/background: upgrade Routine interface with context and errors#62136
lib/background: upgrade Routine interface with context and errors#62136
Routine interface with context and errors#62136Conversation
7e51934 to
9d4d372
Compare
eab1493 to
f32a05b
Compare
eseliger
left a comment
There was a problem hiding this comment.
A couple questions before I feel comfortable with this:
(primarily to indicate deadline, and of course, cancellation if desired)
I am not too sure about the cancellation part here, what does cancelling a stop mean? I don't think most processes are fully reversible. You would probably need to continue a full stop and then call start again, but that seems overly complex.
For deadline, we have a hardcoded variable called goroutine.GracefulShutdownTimeout that implementors should ideally stick to. It feels a bit random to me that Start does not take a context (I think for good reasons, the factory function for the routine should handle that) but stop does, that could cause wildly different contexts to be passed with different dependencies stored on the context, like an internal actor or else.
Do you think documenting that you should stick to that graceful shutdown period where possible, vs documenting that you should respect the context deadline would be much different?
Right now, we have two ways here now, which will realistically never be consolidated if this PR doesn't do it so wondering if we could simplify here 😬
The Name() method seems to often return not very useful information. How about instead we make a NamedRoutine interface that has Name() and embeds Routine and when you want to implement name, you can implement Name() but otherwise we return nothing? That seems cleaner than relying on the current hard-coded values that provide just about as much value as "" or <undefined>
A few more inline questions
| // Monitor will start the given background routines in their own goroutine. If | ||
| // the given context is canceled or a signal is received, the Stop method of | ||
| // each routine will be called. This method blocks until the Stop methods of | ||
| // each routine have returned. Two signals will cause the app to shut down |
There was a problem hiding this comment.
Not sure if this is from a formatted but meta comment: The change of line breaks makes this diff kinda hard to read 🙈
There was a problem hiding this comment.
🥹 sorry yes it's basically just re-formatted.
| // each routine have returned. Two signals will cause the app to shut down | ||
| // immediately. | ||
| func Monitor(ctx context.Context, routines ...Routine) { | ||
| func Monitor(ctx context.Context, routines ...Routine) error { |
There was a problem hiding this comment.
Monitor is meant to be called as the last thing in an application as it calls os.Exit, I'm not sure if the error here wouldn't be misleading 🙈
There was a problem hiding this comment.
I think the way it has always worked is that if a signal is caught, this function can return, but on multiple signals or reaching a deadline, this function can choose to exit immediately. In that world and the new stop-error semantic, we can still have a case where Monitor never calls os.Exit directly, and instead returns an indicator of whether the shutdown was clean or not. I think that is the new desired behaviour, and the caller can then report an error indicating shutdown was unclean
There was a problem hiding this comment.
To add on Robert's comment: While making this PR compile, I actually do notice that although it is often the last thing being called, many places are just returning nil right after calling it, but now it can actually return any real error now.
One step back, the reason the error is propagated to caller is because I do not think we should assume how caller what to handle the error, or where to print/send the error.
| } | ||
|
|
||
| ctx, cancel := context.WithTimeout(context.Background(), goroutine.GracefulShutdownTimeout) | ||
| ctx, cancel := context.WithTimeout(ctx, goroutine.GracefulShutdownTimeout) |
There was a problem hiding this comment.
This is now potentially overwriting the timeout that is passed, that feels a bit counter intuitive 🤔
There was a problem hiding this comment.
I actually couldn't find a lot online about this, isn't this the case with every context, the child could always be overwriting a parent WithTimeout at any point it uses WithTimeout?
There was a problem hiding this comment.
By the new semantics though I think this is the correct behaviour, and the user of httpserver should stop with an appropriate timeout/deadline of their choice (which IMO makes more sense than the magic timeout that exists here right now)
There was a problem hiding this comment.
The way I understand how context.WithTimeout (or in general, how context.With) works is that it always creates a new context (with the given timeout). In this particular case, it shouldn't change any semantic meaning, i.e. whoever gets the ctx on the left still have the same goroutine.GracefulShutdownTimeout as the timeout as before.
There was a problem hiding this comment.
whoever gets the ctx on the left still have the same goroutine.GracefulShutdownTimeout as the timeout as before.
I think that's where the confusion is:
- Caller calls
foobar, telling it the deadline isgoroutine.GracefulShutdownTimeout - 30sfor example foobarusesWithTimeoutand actually has a deadline ofgoroutine.GracefulShutdownTimeout
There was a problem hiding this comment.
IDK how we would enforce it - not saying everything needs updating in this PR, but I do think "what timeout is actually enforced" is tricky
| s.logger.Info("Background sync successfully stopped", | ||
| log.Duration("elapsed", time.Since(start))) | ||
| return stopErr |
There was a problem hiding this comment.
feels a bit strange to say successful and then potentially return an error 😬
There was a problem hiding this comment.
I think this log message was always bad (oops), the only thing this shutdown should care about is the redis-locking part, and in this case releasing the lock was successful
There was a problem hiding this comment.
It feels strange for sure, but as stated in the PR description, I DID NOT try to interpret the code logic and make anything better other than fixing compile and test errors. 😁 If we want to improve the case here, I'd rather doing a follow-up, since that is technically changing some behavior who knows may or may not rely on it.
"Cancellation" is just happen to be the term that Go context's package uses, which IMO it just means caller wants to signal that "stop what you're doing and walk away". How exactly routines needs to be stopped, is not a concern of the caller, nor does caller imply things must be reversed. In other words, I think it is in routines' interests to do graceful shutdown whenever possible (either it's abort DB transaction, clean up temp files, or simply exit the loop, etc.). Deadline is also just a form of cancellation, with a (generally longer) grace period.
The approach of sticking to
By nested routines, not the best illustration but here it is 😂 "routine 1" wants to stop within 30 seconds, "routine 1-2" needs to some work before its sub-routines can be shutdown, say, that takes 5 seconds, which realistically only left 25 seconds for "routine 1-2-{1,2,3}" to shutdown (not 30 seconds). It is not a problem for routines can be stopped freely, but for routines that are absolutely critical to "save the work" before quitting, getting the wrong timeout is fatal.
@bobheadxi asked the same question and proposed this same idea in the original PR, please see my response for why I did not adopt and let me know WDYT (reply here)! |
|
@eseliger - thanks for the review! I should have addressed all the comments, PTAL! |
|
Thanks again for the reviews! Merging as-is (3 other repos are eagerly waiting to import rather than replicate the same logic proposed here 😅), happy to address post-merge feedback! |
Update `lib/background` from the upstream https://github.com/sourcegraph/sourcegraph/pull/62136. Also noticed that this repository is missing the golangci-lint config, thus added one.
This PR is a result/followup of the improvements we've made in the SAMS repo that allows call sites to pass down a context (primarily to indicate deadline, and of course, cancellation if desired) and collects the error returned from
background.RoutinesStopmethod.Note that I did not adopt returning error from
Stopmethod because I realize in monorepo, the more common (and arguably the desired) pattern is to hang on the call ofStartmethod untilStopis called, so it is meaningless to collect errors fromStartmethods as return values anyway, and doing that would also complicate the design and semantics more than necessary.All usages of the the
background.Routineandbackground.CombinedRoutinesare updated, I DID NOT try to interpret the code logic and make anything better other than fixing compile and test errors.The only file that contains the core change is the
lib/background/background.go.Test plan
CI