This repository was archived by the owner on Sep 30, 2024. It is now read-only.

fix: move licensing checks to worker process#54854

Merged
kopancek merged 9 commits into main from milan/worker_job
Jul 24, 2023

Conversation

@kopancek
Contributor

@kopancek kopancek commented Jul 12, 2023

Moving licensing checks to a singleton worker job. This way we make sure that only one license check API call is made on each 12-hour tick.

Previously this lived in a frontend service. If multiple frontend services were running, this would fire multiple requests, which is undesirable from many points of view: wasted resources, wasted bandwidth, and potential concurrency issues.

Having it fire multiple times would also break an assumption we have in the license check handler on dotcom. For checking whether the same site_id is used on multiple instances, we assume that the time difference between license check calls is roughly 10-12 hours. Calling twice in the same second breaks this assumption.

Test plan

Tested locally that the worker starts

Since worker is a singleton, this way we make sure that only one license check goroutine is running.

```go
	})
	goroutine.Go(func() {
		productsubscription.StartCheckForAnomalousLicenseUsage(logger, db)
	})
```
Contributor Author

The productsubscription package is under enterprise/internal; I did not want to get deep into that can of worms.

Contributor

Tbh it's probably fine to move that to internal/; we're moving loads of stuff already, so you could do a precursor PR to move it before this one.

Contributor Author

I think there are more licensing changes upcoming, which would be a good opportunity to deal with the enterprise code split. I don't think spending additional effort here right now is worth it.

@kopancek kopancek marked this pull request as ready for review July 12, 2023 15:04
@sourcegraph-bot
Contributor

sourcegraph-bot commented Jul 12, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff d37d8d9...faf06a3.

Notify File(s)
@efritz cmd/worker/internal/licensecheck/BUILD.bazel
cmd/worker/internal/licensecheck/check.go
cmd/worker/internal/licensecheck/check_test.go
cmd/worker/internal/licensecheck/worker.go
cmd/worker/shared/BUILD.bazel
cmd/worker/shared/main.go
@sourcegraph/delivery doc/admin/workers.md
@unknwon enterprise/cmd/frontend/internal/licensing/init/init.go
internal/licensing/BUILD.bazel
internal/licensing/constants.go
internal/licensing/licensing.go
internal/licensing/licensing_test.go

@Strum355
Contributor

> Since worker is a singleton

Worker isn't a singleton; it's designed so you can run multiple instances to scale out other jobs. Can't remember how jobs are chosen though. @efritz, do you remember how it's configured and what the default is, e.g. run all jobs by default?

@efritz
Contributor

efritz commented Jul 12, 2023

See WORKER_JOB_ALLOWLIST and WORKER_JOB_BLOCKLIST
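For reference, worker job selection is driven by those environment variables. A hypothetical deployment that pins this check to a single worker instance might look like the following; the job name `licensecheck` is an assumption based on the new cmd/worker/internal/licensecheck package and should be checked against the actual job registration:

```shell
# Worker instance 1: run only the license check job (assumed job name)
WORKER_JOB_ALLOWLIST="licensecheck"

# Worker instances 2..N: run everything except the license check job
WORKER_JOB_BLOCKLIST="licensecheck"
```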

@kopancek
Contributor Author

kopancek commented Jul 13, 2023

> See WORKER_JOB_ALLOWLIST and WORKER_JOB_BLOCKLIST

So, based on the docs, this can still run concurrently depending on how the worker service is scaled.

It seems like a DB-backed job is the only sure way of doing this, other than using a Redis-backed distributed lock.

cc @eseliger @mrnugget

@Strum355
Contributor

Why are we moving this to worker, btw? (I could search Slack, but it'd be better for posterity to have it mentioned here 😄)

@mrnugget
Contributor

> Why are we moving this to worker, btw?

Because it ran in a goroutine in frontend inside conf.Watch, meaning multiple goroutines were spawned when there were multiple frontend pods. Worker can be used to make this a singleton.

@kopancek
Contributor Author

> Why are we moving this to worker, btw?

The main reason is that today, if there are multiple frontend instances, we fire concurrent requests to the license check API.

There is logic on the dotcom instance that relies on these requests coming at roughly 12-hour intervals; receiving multiple requests breaks it. There is other logic on the instance itself that also relies on these calls being made at 12-hour intervals. All of this can lead to concurrent DB/Redis writes, so there is potential for race conditions, although I have not yet come up with a scenario where the concurrency itself would break the business logic.

All in all, running the goroutine logic in a singleton avoids the concurrency problem and seems safer. We could achieve that with a DB-backed worker, but I wanted to try some quicker approaches first, because refactoring to a DB-backed worker would take more time and effort.

@mrnugget
Contributor

As far as I know, customers don't just duplicate worker instances; they scale out specific jobs by adding worker instances with job names in the ALLOWLIST. We have multiple jobs that don't make sense to scale out to more worker instances.

Just look through these: https://github.com/sourcegraph/sourcegraph/blob/d58eccf2cdfd792c979d2d57130535316669aff3/enterprise/cmd/worker/shared/shared.go#L34-L77

I think it's fine if you add a doc entry for your worker to https://docs.sourcegraph.com/admin/workers with the note that this should not be scaled out.

@eseliger
Member

Agree with @mrnugget. Worker can either be scaled vertically (a non-issue anyway) or horizontally, by splitting jobs across multiple machines. That by no means implies that three workers with the same set of jobs should exist; that would be a misconfiguration.

@mrnugget
Contributor

@kopancek what's the status here? We kinda forgot this in sprint planning

@kopancek kopancek requested a review from a team July 21, 2023 12:59
kopancek added 4 commits July 21, 2023 15:12
The worker uses database.DB, but it is forbidden to import that from cmd/ directories. Because the worker previously lived in the licensing package, the licensing package became dependent on db, which it previously was not.

This caused dev/check/go-dbconn-import.sh to fail.

Moving the files to a different package within the worker space fixed this without affecting functionality.
@kopancek kopancek marked this pull request as draft July 24, 2023 09:07
@kopancek kopancek marked this pull request as ready for review July 24, 2023 09:07
@kopancek kopancek merged commit 6732a5f into main Jul 24, 2023
@kopancek kopancek deleted the milan/worker_job branch July 24, 2023 11:38
@github-actions
Contributor

The backport to 5.1 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

```shell
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.1 5.1
# Navigate to the new working tree
cd .worktrees/backport-5.1
# Create a new branch
git switch --create backport-54854-to-5.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 6732a5fe357e77fd52854e8b44738c6557243f4b
# Push it to GitHub
git push --set-upstream origin backport-54854-to-5.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.1
```

Then, create a pull request where the base branch is 5.1 and the compare/head branch is backport-54854-to-5.1.

@github-actions github-actions bot added backports failed-backport-to-5.1 release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases labels Jul 24, 2023
@kopancek kopancek mentioned this pull request Jul 25, 2023
kopancek added a commit that referenced this pull request Jul 25, 2023
Moving licensing checks to a singleton worker job. This way we make sure
that only one license check API call is being made on each 12-hour tick.

Previously this lived in a frontend service. If multiple frontend
services were running, this would fire multiple requests, which is
undesirable from many points of view - waste of resources, waste of
bandwidth, potential concurrency issues.

Having it fire multiple times would also break an assumption that we
have in the license check handler on dotcom. For checking if same
site_id is used on multiple instances, we assume that the time
difference between license check calls would be roughly 10-12 hours.
Calling twice in the same second breaks this assumption.

Tested locally that the worker starts

(cherry picked from commit 6732a5f)
kopancek added a commit that referenced this pull request Jul 25, 2023
Update changelog with #54854 

## Test plan

N/A, just changelog update
github-actions bot pushed a commit that referenced this pull request Jul 25, 2023
Update changelog with #54854

## Test plan

N/A, just changelog update

(cherry picked from commit aab863b)
unknwon pushed a commit that referenced this pull request Jul 25, 2023
Backport #54854 into 5.1

## Test plan

backport PR, N/A
MaedahBatool pushed a commit that referenced this pull request Jul 28, 2023
Moving licensing checks to a singleton worker job. This way we make sure
that only one license check API call is being made on each 12-hour tick.

Previously this lived in a frontend service. If multiple frontend
services were running, this would fire multiple requests, which is
undesirable from many points of view - waste of resources, waste of
bandwidth, potential concurrency issues.

Having it fire multiple times would also break an assumption that we
have in the license check handler on dotcom. For checking if same
site_id is used on multiple instances, we assume that the time
difference between license check calls would be roughly 10-12 hours.
Calling twice in the same second breaks this assumption.

## Test plan

Tested locally that the worker starts
MaedahBatool pushed a commit that referenced this pull request Jul 28, 2023
Update changelog with #54854 

## Test plan

N/A, just changelog update
davejrt pushed a commit that referenced this pull request Aug 9, 2023
Update changelog with #54854 

## Test plan

N/A, just changelog update

Labels

backports cla-signed release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases
