Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

PermsSyncer: track recent job states in redis#44258

Merged
bobheadxi merged 14 commits intomainfrom
11-10-PermsSyncer_track_recent_job_states_in_redis
Nov 16, 2022
Merged

PermsSyncer: track recent job states in redis#44258
bobheadxi merged 14 commits intomainfrom
11-10-PermsSyncer_track_recent_job_states_in_redis

Conversation

@bobheadxi
Copy link
Member

@bobheadxi bobheadxi commented Nov 11, 2022

Tracks perms sync job outcomes, notably per-provider states (https://github.com/sourcegraph/sourcegraph/pull/44257), in Redis. We set an aggressive TTL of 5 minutes by default because in practice, only the details for the last few sync jobs are relevant, and this prevents us from putting too much pressure on Redis on larger instances.

This can be extended in the future to be database-backed, but for now I think Redis is the right choice - the volume of entries can be quite high when accumulated so we want something easy to expire, and realistically only the details for the last few sync jobs are relevant. Additionally, the records store and reader in package synclogs can easily be rewritten without too much impact to write to a database in the future.

The goal is to eventually allow us to render this data via the GraphQL API similarly to the external service sync status API (https://github.com/sourcegraph/sourcegraph/pull/44387), hence the placement of the struct definitions in a shared package

Part of https://github.com/sourcegraph/customer/issues/1534
Stacked on https://github.com/sourcegraph/sourcegraph/pull/44257

Test plan

Add a GitHub code host with "authorization": {}, and add a private repo to the list

Wait a few minutes

➜  redis-cli KEYS 'v2:authz/sync-job-records:*'
1) "v2:authz/sync-job-records:1668564699013272000"                 
v2:authz/sync-job-records:repo-8
➜  redis-cli GET v2:authz/sync-job-records:1668564699013272000 | jq
{
  "request_type": "repo",
  "request_id": 8,
  "completed": "2022-11-16T02:11:39.013272Z",
  "status": "SUCCESS",
  "message": "",
  "providers": [
    {
      "provider_id": "https://github.com/",
      "provider_type": "github",
      "status": "SUCCESS",
      "message": "FetchRepoPerms"
    }
  ]
}

@bobheadxi
Copy link
Member Author

bobheadxi commented Nov 11, 2022

@bobheadxi bobheadxi requested review from 0xnmn and unknwon November 11, 2022 01:20
Copy link
Contributor

@unknwon unknwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach does seem to make sense to me, will do a thorough review once marked as ready 👍

@0xnmn
Copy link
Contributor

0xnmn commented Nov 11, 2022

I have reviewed all the PRs. Will pushing the Provider Status to Redis be a very high load task that we need to do it via a separate routine? Can't it be done at each point where we are recording the provider status in first place? We do push many items to prometheus during the process and pushing things to redis should be quick.

I am worried about the many places and functions this touches as it is passed up to so many levels making it a bit cumbersome to refactor later on.

@bobheadxi
Copy link
Member Author

Will pushing the Provider Status to Redis be a very high load task that we need to do it via a separate routine? Can't it be done at each point where we are recording the provider status in first place?

I may be reading this wrong, but sum(increase(src_repoupdater_perms_syncer_success_syncs{type="user"}[1m])) on s2 indicates we have 20-30 syncs per minute for user-centric perms sync (this metric doesn't seem to record repo-centric perms sync, which would account for more), so I thought it might be safer to do this async - that said, it is similar to the load we place on the DB so maybe a more frequent write at this rate could be fine, got a bit ahead of myself here 😛

We do push many items to prometheus during the process

note that generating prometheus metrics does not push items to prometheus - prometheus will scrape the service for aggregated metrics on a regular time interval, every 30 seconds - and is not a comparable process to making a network request to set a record in Redis (though it is comparable to a DB write as I mentioned above)

@bobheadxi
Copy link
Member Author

I am worried about the many places and functions this touches as it is passed up to so many levels making it a bit cumbersome to refactor later on.

IMO some of these are improvements (e.g. https://github.com/sourcegraph/sourcegraph/pull/44251 reducing number of return values), but happy to discuss the other changes further on their respective PRs if you have any ideas or thoughts on the need for this! The core issue is outlined in https://github.com/sourcegraph/sourcegraph/pull/44257 (we have no way of telling the health of a provider programmatically)

@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_report_exact_state_of_each_provider_in_each_job branch from aa65fb7 to 764cca8 Compare November 14, 2022 18:37
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_track_recent_job_states_in_redis branch from b08b48f to 13fbda3 Compare November 14, 2022 18:37
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_report_exact_state_of_each_provider_in_each_job branch from 764cca8 to 1e27ce1 Compare November 14, 2022 19:45
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_track_recent_job_states_in_redis branch 2 times, most recently from 6e3a7db to 512228b Compare November 14, 2022 19:52
@bobheadxi bobheadxi marked this pull request as ready for review November 14, 2022 21:38
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_track_recent_job_states_in_redis branch from 512228b to d0e7fca Compare November 14, 2022 21:39
@bobheadxi bobheadxi requested review from a team, 0xnmn and unknwon November 14, 2022 21:39
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Nov 14, 2022

Codenotify: Notifying subscribers in CODENOTIFY files for diff 2bdefd3...0d4cdd2.

Notify File(s)
@indradhanush enterprise/cmd/repo-updater/internal/authz/integration_test.go
enterprise/cmd/repo-updater/internal/authz/perms_syncer.go
enterprise/cmd/repo-updater/internal/authz/perms_syncer_test.go
enterprise/cmd/repo-updater/internal/authz/provider_state.go
@unknwon enterprise/cmd/repo-updater/internal/authz/integration_test.go
enterprise/cmd/repo-updater/internal/authz/perms_syncer.go
enterprise/cmd/repo-updater/internal/authz/perms_syncer_test.go
enterprise/cmd/repo-updater/internal/authz/provider_state.go
enterprise/internal/authz/syncjobs/records_store.go
enterprise/internal/authz/syncjobs/records_store_test.go
enterprise/internal/authz/syncjobs/status.go

@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_report_exact_state_of_each_provider_in_each_job branch from 638f992 to 63c8511 Compare November 14, 2022 23:40
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_track_recent_job_states_in_redis branch from c96b9b1 to ba45511 Compare November 14, 2022 23:40
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_track_recent_job_states_in_redis branch from ba45511 to 3cde488 Compare November 15, 2022 16:43
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_report_exact_state_of_each_provider_in_each_job branch from 18c457a to c408d9c Compare November 15, 2022 16:45
@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_track_recent_job_states_in_redis branch from 3cde488 to fa813d6 Compare November 15, 2022 16:45
@bobheadxi bobheadxi requested a review from a team November 16, 2022 02:24
@mrnugget
Copy link
Contributor

This feels like a temporary solution to get a quick win, IMHO. Why not build the proper thing? Can't we migrate the sync jobs to be actual database-backed worker jobs? That would be the ideal outcome. And if that's our goal then this seems like we make it harder for ourselves in the future to migrate to the proper solution.

@bobheadxi
Copy link
Member Author

bobheadxi commented Nov 16, 2022

This feels like a temporary solution to get a quick win, IMHO. Why not build the proper thing? Can't we migrate the sync jobs to be actual database-backed worker jobs?

Unfortunately, I don't have the capacity to drive such an extensive migration yet, but we do need an endpoint available as part of https://github.com/sourcegraph/customer/issues/1534

Additionally, I'm not very convinced a database-backed job framework is a good choice here. Perms sync seems designed to queue and run ephemerally, such that if you just run it enough times it'll eventually get you to the right state, unless there's a major issue, in which case those will be surfaced in the last X jobs. There appears to be no real need to go back in time, nor for the jobs to be resilient because they get queued all the time and the point in time in which they get queued isn't very important as long as they run eventually

@bobheadxi bobheadxi force-pushed the 11-10-PermsSyncer_track_recent_job_states_in_redis branch from 09bc56d to e6bf445 Compare November 16, 2022 16:27
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Nov 16, 2022

Codenotify: Notifying subscribers in OWNERS files for diff 2bdefd3...0d4cdd2.

No notifications.

@bobheadxi bobheadxi merged commit d2e4f01 into main Nov 16, 2022
@bobheadxi bobheadxi deleted the 11-10-PermsSyncer_track_recent_job_states_in_redis branch November 16, 2022 18:11
bobheadxi added a commit that referenced this pull request Nov 16, 2022
Adds a new permissionsSyncJobs query that returns the most recent outcomes of permissions sync jobs, with per-provider details included (see #44258). This will be used to validate permissions syncing is starting correctly and the desired providers are being involved and in a healthy state.

We don't implement pagination since the endpoint is intended as a stream of recent events and retention is low, but the query schema is set up so that it can be extended to support pagination in the future.

Co-authored-by: Erik Seliger <erikseliger@me.com>
@mrnugget
Copy link
Contributor

Additionally, I'm not very convinced a database-backed job framework is a good choice here. Perms sync seems designed to queue and run ephemerally, such that if you just run it enough times it'll eventually get you to the right state, unless there's a major issue, in which case those will be surfaced in the last X jobs. There appears to be no real need to go back in time, nor for the jobs to be resilient because they get queued all the time and the point in time in which they get queued isn't very important as long as they run eventually

Disagree. We want to display in the UI: last sync at, next sync at. We want to enqueue a next sync. We want to see the errors of the last sync. We want to have different priorities for syncs. All of that can be done with a database and a database-backed worker.

Querying the list of last sync jobs is another argument for that, IMHO.

@bobheadxi
Copy link
Member Author

True, some of those components are in place already - for example we track the outcome of the last sync for a particular entity already - but next sync, priority, state of the queue etc isn't transparent at all.

Only discussion I've found is https://github.com/sourcegraph/sourcegraph/issues/16197 - I'll open up an issue for IAM

@bobheadxi
Copy link
Member Author

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants