Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

lib/background: upgrade Routine interface with context and errors#62136

Merged
unknwon merged 9 commits intomainfrom
jc/lib-background
May 24, 2024
Merged

lib/background: upgrade Routine interface with context and errors#62136
unknwon merged 9 commits intomainfrom
jc/lib-background

Conversation

@unknwon
Copy link
Contributor

@unknwon unknwon commented Apr 23, 2024

This PR is a result/followup of the improvements we've made in the SAMS repo that allows call sites to pass down a context (primarily to indicate deadline, and of course, cancellation if desired) and collects the error returned from background.Routines Stop method.

Note that I did not adopt returning error from Stop method because I realize in monorepo, the more common (and arguably the desired) pattern is to hang on the call of Start method until Stop is called, so it is meaningless to collect errors from Start methods as return values anyway, and doing that would also complicate the design and semantics more than necessary.

All usages of the the background.Routine and background.CombinedRoutines are updated, I DID NOT try to interpret the code logic and make anything better other than fixing compile and test errors.

The only file that contains the core change is the lib/background/background.go.

Test plan

CI

@cla-bot cla-bot bot added the cla-signed label Apr 23, 2024
@unknwon unknwon force-pushed the jc/lib-background branch 3 times, most recently from 7e51934 to 9d4d372 Compare April 24, 2024 21:22
@unknwon unknwon force-pushed the jc/lib-background branch 10 times, most recently from eab1493 to f32a05b Compare May 4, 2024 14:27
Copy link
Member

@eseliger eseliger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple questions before I feel comfortable with this:

(primarily to indicate deadline, and of course, cancellation if desired)

I am not too sure about the cancellation part here, what does cancelling a stop mean? I don't think most processes are fully reversible. You would probably need to continue a full stop and then call start again, but that seems overly complex.

For deadline, we have a hardcoded variable called goroutine.GracefulShutdownTimeout that implementors should ideally stick to. It feels a bit random to me that Start does not take a context (I think for good reasons, the factory function for the routine should handle that) but stop does, that could cause wildly different contexts to be passed with different dependencies stored on the context, like an internal actor or else.
Do you think documenting that you should stick to that graceful shutdown period where possible, vs documenting that you should respect the context deadline would be much different?
Right now, we have two ways here now, which will realistically never be consolidated if this PR doesn't do it so wondering if we could simplify here 😬


The Name() method seems to often return not very useful information. How about instead we make a NamedRoutine interface that has Name() and embeds Routine and when you want to implement name, you can implement Name() but otherwise we return nothing? That seems cleaner than relying on the current hard-coded values that provide just about as much value as "" or <undefined>


A few more inline questions

Comment on lines +28 to +31
// Monitor will start the given background routines in their own goroutine. If
// the given context is canceled or a signal is received, the Stop method of
// each routine will be called. This method blocks until the Stop methods of
// each routine have returned. Two signals will cause the app to shut down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is from a formatted but meta comment: The change of line breaks makes this diff kinda hard to read 🙈

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥹 sorry yes it's basically just re-formatted.

// each routine have returned. Two signals will cause the app to shut down
// immediately.
func Monitor(ctx context.Context, routines ...Routine) {
func Monitor(ctx context.Context, routines ...Routine) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Monitor is meant to be called as the last thing in an application as it calls os.Exit, I'm not sure if the error here wouldn't be misleading 🙈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way it has always worked is that if a signal is caught, this function can return, but on multiple signals or reaching a deadline, this function can choose to exit immediately. In that world and the new stop-error semantic, we can still have a case where Monitor never calls os.Exit directly, and instead returns an indicator of whether the shutdown was clean or not. I think that is the new desired behaviour, and the caller can then report an error indicating shutdown was unclean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add on Robert's comment: While making this PR compile, I actually do notice that although it is often the last thing being called, many places are just returning nil right after calling it, but now it can actually return any real error now.

One step back, the reason the error is propagated to caller is because I do not think we should assume how caller what to handle the error, or where to print/send the error.

}

ctx, cancel := context.WithTimeout(context.Background(), goroutine.GracefulShutdownTimeout)
ctx, cancel := context.WithTimeout(ctx, goroutine.GracefulShutdownTimeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now potentially overwriting the timeout that is passed, that feels a bit counter intuitive 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually couldn't find a lot online about this, isn't this the case with every context, the child could always be overwriting a parent WithTimeout at any point it uses WithTimeout?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the new semantics though I think this is the correct behaviour, and the user of httpserver should stop with an appropriate timeout/deadline of their choice (which IMO makes more sense than the magic timeout that exists here right now)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I understand how context.WithTimeout (or in general, how context.With) works is that it always creates a new context (with the given timeout). In this particular case, it shouldn't change any semantic meaning, i.e. whoever gets the ctx on the left still have the same goroutine.GracefulShutdownTimeout as the timeout as before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoever gets the ctx on the left still have the same goroutine.GracefulShutdownTimeout as the timeout as before.

I think that's where the confusion is:

  1. Caller calls foobar, telling it the deadline is goroutine.GracefulShutdownTimeout - 30s for example
  2. foobar uses WithTimeout and actually has a deadline of goroutine.GracefulShutdownTimeout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, we could do something similar to SAMS pub/sub routine, where we can check the parent deadline, and apply whoever is shorter here?

CleanShot 2024-05-22 at 17 00 12@2x

(example here only checks whether parent imposes a deadline and leave a 5-second buffer)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK how we would enforce it - not saying everything needs updating in this PR, but I do think "what timeout is actually enforced" is tricky

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT @eseliger ?

Comment on lines 252 to +254
s.logger.Info("Background sync successfully stopped",
log.Duration("elapsed", time.Since(start)))
return stopErr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels a bit strange to say successful and then potentially return an error 😬

Copy link
Member

@bobheadxi bobheadxi May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this log message was always bad (oops), the only thing this shutdown should care about is the redis-locking part, and in this case releasing the lock was successful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels strange for sure, but as stated in the PR description, I DID NOT try to interpret the code logic and make anything better other than fixing compile and test errors. 😁 If we want to improve the case here, I'd rather doing a follow-up, since that is technically changing some behavior who knows may or may not rely on it.

Copy link
Member

@bobheadxi bobheadxi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, minor questions + Erik's comments :) Thank you for doing this!

@unknwon
Copy link
Contributor Author

unknwon commented May 22, 2024

@eseliger:

(primarily to indicate deadline, and of course, cancellation if desired)

I am not too sure about the cancellation part here, what does cancelling a stop mean? I don't think most processes are fully reversible. You would probably need to continue a full stop and then call start again, but that seems overly complex.

"Cancellation" is just happen to be the term that Go context's package uses, which IMO it just means caller wants to signal that "stop what you're doing and walk away". How exactly routines needs to be stopped, is not a concern of the caller, nor does caller imply things must be reversed. In other words, I think it is in routines' interests to do graceful shutdown whenever possible (either it's abort DB transaction, clean up temp files, or simply exit the loop, etc.).

Deadline is also just a form of cancellation, with a (generally longer) grace period.

For deadline, we have a hardcoded variable called goroutine.GracefulShutdownTimeout that implementors should ideally stick to. It feels a bit random to me that Start does not take a context (I think for good reasons, the factory function for the routine should handle that) but stop does, that could cause wildly different contexts to be passed with different dependencies stored on the context, like an internal actor or else. Do you think documenting that you should stick to that graceful shutdown period where possible, vs documenting that you should respect the context deadline would be much different? Right now, we have two ways here now, which will realistically never be consolidated if this PR doesn't do it so wondering if we could simplify here 😬

The approach of sticking to goroutine.GracefulShutdownTimeout would only work consistently and reliably within the context of monorepo, where we know what we are doing. Now, the lib/background is clearly aiming to be more generically helpful package, there are two major cons of it:

  1. It is not a safe assumption anymore, the needs and constraints of services can be vastly different, maybe longer, maybe shorter.
  2. A hardcoded timeout more or less restricts that routines cannot/should not be nested, otherwise, it's subtle and not-obvious-illusion that nested routines are still having the same timeout, which in practice, let's say 50/50 true.

By nested routines, not the best illustration but here it is 😂

routine 1/
├─ routine 1-1/
├─ routine 1-2/
│  ├─ routine 1-2-1
│  ├─ routine 1-2-2
│  ├─ routine 1-2-3
├─ routine 1-2

"routine 1" wants to stop within 30 seconds, "routine 1-2" needs to some work before its sub-routines can be shutdown, say, that takes 5 seconds, which realistically only left 25 seconds for "routine 1-2-{1,2,3}" to shutdown (not 30 seconds).

It is not a problem for routines can be stopped freely, but for routines that are absolutely critical to "save the work" before quitting, getting the wrong timeout is fatal.

The Name() method seems to often return not very useful information. How about instead we make a NamedRoutine interface that has Name() and embeds Routine and when you want to implement name, you can implement Name() but otherwise we return nothing? That seems cleaner than relying on the current hard-coded values that provide just about as much value as "" or <undefined>

@bobheadxi asked the same question and proposed this same idea in the original PR, please see my response for why I did not adopt and let me know WDYT (reply here)!

@unknwon unknwon requested a review from eseliger May 22, 2024 18:48
@unknwon
Copy link
Contributor Author

unknwon commented May 22, 2024

@eseliger - thanks for the review! I should have addressed all the comments, PTAL!

@unknwon
Copy link
Contributor Author

unknwon commented May 24, 2024

Thanks again for the reviews! Merging as-is (3 other repos are eagerly waiting to import rather than replicate the same logic proposed here 😅), happy to address post-merge feedback!

@unknwon unknwon merged commit 2589fef into main May 24, 2024
@unknwon unknwon deleted the jc/lib-background branch May 24, 2024 14:04
unknwon referenced this pull request in sourcegraph/sourcegraph-accounts-sdk-go May 24, 2024
Update `lib/background` from the upstream
https://github.com/sourcegraph/sourcegraph/pull/62136.

Also noticed that this repository is missing the golangci-lint config,
thus added one.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants