Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Fix Github public repos to properly tag archived repos#41461

Merged
varsanojidan merged 11 commits intomainfrom
iv/fix-github-public-archived-repos-bug
Oct 13, 2022
Merged

Fix Github public repos to properly tag archived repos#41461
varsanojidan merged 11 commits intomainfrom
iv/fix-github-public-archived-repos-bug

Conversation

@varsanojidan
Copy link
Contributor

@varsanojidan varsanojidan commented Sep 7, 2022

Fixes https://github.com/sourcegraph/customer/issues/1285

The gist of this change is that the Github API for listing public repos doesn't include an archived field (see this issue), so we have no way of knowing which repo is archived.

I had made a change to this previously where if {"exclude":[{"archived": true}]} was set, that it would use the Github search API instead which could tell us all of the public non-archived repos.

My oversight here was that its still important to know whether a repo is archived or not even if we aren't just outright excluding them (see linked issue up top), therefore I have changed how list all public repos works for the Github code host.

First it will use the search API to query for all of the public repos that are archived. It will use this to generate a hashmap. Then it will query the regular Github API to list all of the public repos, and before sending the result back, check each repo against the hashmap to get it's true archived state.

Test plan

Tested locally against dogfood github to validate search functionality properly filters Github public archived repos now.

Config:

{
  "GITHUB": [
    {
      "url": "https://ghe.sgdev.org",
      "token": "redacted",
      "repositoryQuery": [
        "public"
      ]
    }
  ]
}

image

@varsanojidan varsanojidan requested a review from a team September 7, 2022 19:55
@cla-bot cla-bot bot added the cla-signed label Sep 7, 2022
@varsanojidan varsanojidan marked this pull request as ready for review September 7, 2022 19:57
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Sep 7, 2022

Codenotify: Notifying subscribers in CODENOTIFY files for diff 49e3285...2535f9f.

Notify File(s)
@indradhanush internal/repos/github.go
internal/repos/github_test.go
@ryanslade internal/repos/github.go
internal/repos/github_test.go
@sashaostrikov internal/repos/github.go
internal/repos/github_test.go

@varsanojidan varsanojidan marked this pull request as draft September 7, 2022 20:13
Copy link
Contributor

@sashaostrikov sashaostrikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, approved in advance, just have a question about s.excludeArchived case

@varsanojidan varsanojidan marked this pull request as ready for review September 19, 2022 20:00
@varsanojidan
Copy link
Contributor Author

cc: @keegancsmith @mrnugget tagging both of you in to see if this bug fix is good to go in for 4.0

Copy link
Contributor

@mrnugget mrnugget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we need to handle the goroutine being blocked and leaking, as pointed out by @ryanslade: https://github.com/sourcegraph/sourcegraph/pull/41461/files#r965618434

As for getting this into 4.0: are you confident in this change? does the additional request here add significant load to GitHub hosts? did you test this manually and trigger the code paths here manually?

@varsanojidan
Copy link
Contributor Author

varsanojidan commented Sep 20, 2022

@mrnugget

As for getting this into 4.0: are you confident in this change? does the additional request here add significant load to GitHub hosts? did you test this manually and trigger the code paths here manually?

  • I have tested this manually against our dogfood instance with several hundreds of thousands of repos, and was able to see it able to label the two public archived repos correctly.
  • As for increased stress on the code host, it would depend on how many public archived repos they have. I was thinking there were 2 ways to structure this, either the way above, or just using the Github search API for everything (not just the archived repos). The search API is not as efficient as the regular list public repos API, which is why I ultimately thought it was better to go with the solution above. If the amount of public archived repos is enormous, this would add the extra stress of listing those archived repos twice (once through search and once through the list public repos API). We do something similar for bitbucket.

Im open to thoughts/suggestions, lmk what you think.

@mrnugget
Copy link
Contributor

So in case the customer doesn't have any archived public repositories, we'd only have to do a single search request more when syncing the external service, right?

Sounds like you're confident in this, so: sounds good to me.

@varsanojidan
Copy link
Contributor Author

So in case the customer doesn't have any archived public repositories, we'd only have to do a single search request more when syncing the external service, right?

Yep exactly, for example, in the case of our dogfood instance, since we only had 2 public-archived repos, it was only a single extra API call. Thanks for all the feedback!

@varsanojidan varsanojidan removed the request for review from keegancsmith September 20, 2022 16:59
Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @keegancsmith @mrnugget tagging both of you in to see if this bug fix is good to go in for 4.0

How long has this been a bug? If I am not mistaken this is a very old issue, which makes me unsure of the safety of cherry-picking this in. GitHub API acts differently on github.com vs enterprise vs enterprise versions. I worry that changing something core (like listing all GitHub repos on an enterprise instance) is risky.

To further spell out the risk in practical terms, all instances which only ever synced all Github repos from an instance will now hit the GitHub search API for the first time. This should be fine, but is a risk. Normally we aren't as risk averse, except this is a major version increase.

The main benefit of this change is those instances now will correctly have the archived filter working. This has not been working for a long time though for them?

Now that I have been a grumpy release engineer, let me say I've reviewed the code and I feel comfortable with this change. Just wanted to be explicit about risks and how release engineering would think about this. The funny part is if you had just merged this in before the 4.0 branch cut we wouldn't of had any of these discussions :P

@varsanojidan
Copy link
Contributor Author

Yeah this was my concern as well, though the scope is a bit smaller, the change to hitting the search API would just be for if the "public" keyword is specified in repositoryQuery.

The main benefit of this change is those instances now will correctly have the archived filter working. This has not been working for a long time though for them?

Yeah, it's been quite a pain in the butt for a few customers for a long time now unfortunately 😅.

I think its probably fine to also go in the release after 4.0, I'll double check with PM.

@varsanojidan varsanojidan enabled auto-merge (squash) October 13, 2022 15:31
@varsanojidan varsanojidan merged commit 7c94209 into main Oct 13, 2022
@varsanojidan varsanojidan deleted the iv/fix-github-public-archived-repos-bug branch October 13, 2022 15:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants