gitserver: Introduce FS layer to encapsulate repo name conversions#60627
gitserver: Introduce FS layer to encapsulate repo name conversions#60627
Conversation
| repoName := protocol.NormalizeRepo(c.repo) | ||
|
|
There was a problem hiding this comment.
Normalizing the repoName here did nothing good. ClientForRepo already applied it, so that was a noop, and the Repo arg in the ExecRequest basically got corrupted, meaning that gitserver could never find the corresponding DB entry for the repo sent over (because we corrupt the de-facto PK).
Removing it here doesn't actually change anything other than not sending a potentially wrong name over. The ClientForRepo function didn't benefit from this line anyways.
| }) | ||
| defer endObservation(1, observation.Args{}) | ||
|
|
||
| repoName := protocol.NormalizeRepo(args.Repo) |
There was a problem hiding this comment.
See the same description as for the Exec request:
Normalizing the repoName here did nothing good. ClientForRepo already applied it, so that was a noop, and the Repo arg in the ExecRequest basically got corrupted, meaning that gitserver could never find the corresponding DB entry for the repo sent over (because we corrupt the de-facto PK).
Removing it here doesn't actually change anything other than not sending a potentially wrong name over. The ClientForRepo function didn't benefit from this line anyways.
| // In case the repo has already been deleted from the database we need to pass | ||
| // the old name in order to land on the correct gitserver instance | ||
| undeletedName := api.UndeletedRepoName(repo) | ||
|
|
||
| client, err := c.ClientForRepo(ctx, undeletedName) |
There was a problem hiding this comment.
We actually cannot do that. It means that gitserver can not correctly set the DB state of this repo to uncloned after it removed the repo.
Scenario:
repo name is DELETED-123123-github.com/src-cli, in the DB the repo.name column contains exactly that value. The 123123 is the timestamp of the time it was deleted at. It cannot be reconstructed.
Now we tell gitserver to delete this repo, it can do that alright, because it correctly finds the undeleted version github.com/src-cli on disk (the new version can still do that, because we now Undelete the repo name at the name->fs path conversion layer).
It now attempts to update github.com/src-cli, which fails. We have not correctly set the state in the DB.
Now, since we don't scramble the name, this works as expected.
| // Normalizing the name in case the caller didn't. | ||
| name := string(protocol.NormalizeRepo(repoName)) |
There was a problem hiding this comment.
Normalizing is an operation to make the reponame safe to map to the filesystem on disk. It is not needed here. It didn't hurt that it was here, but it is certainly confusing. It also means that the addrforrepo logic is more complex than it has to be, making it harder to potentially reimplement in in the DB layer or elsewhere.
What we do need to do though is undeleting the RepoName here. We want to map the requests to the same shard for DELETED-123123-github.com/src-cli and github.com/src-cli, so we can delete a repo from disk after the latter has been marked as deleted.
Note that this only affects the sharding decision, NOT the repo name sent over.
|
|
||
| // Prepare the file system. | ||
| if err := gitserverfs.InitGitserverFileSystem(logger, config.ReposDir); err != nil { | ||
| fs := gitserverfs.New(observationCtx, config.ReposDir) |
There was a problem hiding this comment.
Feels nice to not have to pass the reposDir around everywhere, and instead have a subsystem that manages it.
| // See https://github.com/sourcegraph/sourcegraph/issues/54317 for details. | ||
| s.logger.Warn("Disabling 'echo' metric") | ||
| } | ||
|
|
There was a problem hiding this comment.
These live on the FS implementation now.
| // search handles the core logic of the search. It is passed a matchesBuf so it doesn't need to | ||
| // concern itself with event types, and all instrumentation is handled in the calling function. | ||
| func (s *Server) search(ctx context.Context, args *protocol.SearchRequest, onMatch func(*protocol.CommitMatch) error) (limitHit bool, err error) { | ||
| args.Repo = protocol.NormalizeRepo(args.Repo) |
There was a problem hiding this comment.
No need to "normalize" here, it just scrambles the repo name so we cannot properly map it back to the DB entity.
The FS handles getting the path correctly.
| repo := string(protocol.NormalizeRepo(req.Repo)) | ||
| repoDir := filepath.Join(s.reposDir, repo) | ||
| repoGitDir := filepath.Join(repoDir, ".git") |
There was a problem hiding this comment.
this actually didn't use the helper method at all, which could've caused an inconsistency. We now properly use RepoDir on the fs here.
| } else { | ||
| cmd = exec.CommandContext(ctx, "git", "push", "--force", remoteURL.String(), fmt.Sprintf("%s:%s", cmtHash, ref)) | ||
| cmd.Dir = repoGitDir | ||
| repoGitDir.Set(cmd) |
There was a problem hiding this comment.
This is important as it also sets some addtl env vars on the git command.
| if req.PushRef == nil { | ||
| cmd = exec.CommandContext(ctx, "git", "update-ref", "--", ref, cmtHash) | ||
| cmd.Dir = repoGitDir | ||
| repoGitDir.Set(cmd) |
There was a problem hiding this comment.
This is important as it also sets some addtl env vars on the git command.
| // ReposDir is the directory where the repositories are stored. | ||
| ReposDir string | ||
| // P4Home is the path to the directory that 'p4' will use as $HOME | ||
| // and where it will store cache data. | ||
| P4Home string | ||
|
|
There was a problem hiding this comment.
Ideally, we would move to the same structure that we adopted for git recently: Have a backend interface that proxies all these calls. That way, we can feed the FS at backend creation time and don't need to make it a concern of the caller to bring the right FS as an argument of the function.
Anyhow, that's for later. For now, we pass the FS down explicitly.
| func (s *Server) RepoUpdate(req *protocol.RepoUpdateRequest) protocol.RepoUpdateResponse { | ||
| logger := s.logger.Scoped("handleRepoUpdate") | ||
| var resp protocol.RepoUpdateResponse | ||
| req.Repo = protocol.NormalizeRepo(req.Repo) |
There was a problem hiding this comment.
Normalizing the repo here means that the CloneRepo below would not be able to get the VCSSyncer for the repo, failing the clone. We now keep the name as-is.
cmd/gitserver/internal/server.go
Outdated
| // the actual running clone for the DB state of last_error. | ||
| defer func() { | ||
| // Use a different context in case we failed because the original context failed. | ||
| // TODO: Should not store anything when tryacquire returned false. |
There was a problem hiding this comment.
Unrelated but wanted to leave a note for myself :D
cmd/gitserver/internal/server.go
Outdated
| return err | ||
| } | ||
|
|
||
| repo = protocol.NormalizeRepo(repo) |
There was a problem hiding this comment.
Normalizing here means that getRemoteURL right below would fail. We don't do that anymore.
| // We may have a deleted repo, we need to extract the original name both to | ||
| // ensure that the shard check is correct and also so that we can find the | ||
| // directory. | ||
| repo.Name = api.UndeletedRepoName(repo.Name) |
There was a problem hiding this comment.
The addrforrepo func handles that. Altering the name means the state syncer will not correctly update the record in the DB.
| _, cloning := locker.Status(dir) | ||
| cloned, err := fs.RepoCloned(repo.Name) | ||
| if err != nil { | ||
| // Failed to determine cloned state, we have to skip this record for now. |
There was a problem hiding this comment.
q: does this make sense as the action to take, skip over it?
| @@ -14,33 +14,5 @@ func NormalizeRepo(input api.RepoName) api.RepoName { | |||
| repo = path.Clean("/" + repo) | |||
There was a problem hiding this comment.
Ideally, this function doesn't need to exist at all. Whenever it makes a transformation, it will break the reponame->dir->reponame mapping.
For now, until we spend more time on a better mapping table, for security reasons we will keep this simplified version, but the goal really is to get rid of it.
| // TODO: For soft-deleted, it might be nice to attempt to update the clone status, | ||
| // but that can only work when we can map a repo on disk back to a repo in DB | ||
| // when the name has been modified to have the DELETED- prefix. | ||
| err = fs.RemoveRepo(repoName) |
| // | ||
| // The main use of RepositoryLocker is to prevent concurrent clones. However, | ||
| // it is also used during maintenance tasks such as recloning/migrating/etc. | ||
| type RepositoryLocker interface { |
There was a problem hiding this comment.
As the locker currently doesn't live in the FS layer, it seems improper to make it talk in terms of repo dirs, which are only normalized at the point we access the FS. Since we're going much more high level for the locking, switched to reponame for now.
Ultimately, we might want to merge FS and locker into one thing (and lose repo clone progress reporting, but can use it for locking cleanup operations as well).
| { | ||
| Name: "NonexistingRepo", | ||
| Request: &v1.ExecRequest{ | ||
| Repo: "github.com/gorilla/doesnotexist", | ||
| Args: [][]byte{[]byte("diff")}, | ||
| }, | ||
| ExpectedCode: codes.NotFound, | ||
| ExpectedError: "repo not found", | ||
| ExpectedDetails: []any{&v1.RepoNotFoundPayload{ | ||
| Repo: "github.com/gorilla/doesnotexist", | ||
| CloneInProgress: false, | ||
| }}, | ||
| }, | ||
| { | ||
| Name: "UnclonedRepo", | ||
| Request: &v1.ExecRequest{ | ||
| Repo: "github.com/nicksnyder/go-i18n", | ||
| Args: [][]byte{[]byte("diff")}, | ||
| }, | ||
| ExpectedCode: codes.NotFound, | ||
| ExpectedError: "repo not found", | ||
| ExpectedDetails: []any{&v1.RepoNotFoundPayload{ | ||
| Repo: "github.com/nicksnyder/go-i18n", | ||
| CloneInProgress: true, | ||
| }}, | ||
| }, |
There was a problem hiding this comment.
These are gRPC layer concerns and with the new FS mocks needed reimplementing but didn't actually test anything important, so skipped over.
defc5ff to
212c4ad
Compare
As posted on slack yesterday: When talking to gitserver, in various random layers we call a function called `protocol.NormalizeRepo`. This function potentially “normalizes” the repo name, aka. it changes it. Gitserver, historically (years ago) was a simple git proxy. So you ask it for repo content, and it eventually delivers. It’s been a write-through cache, basically. Then, we transformed it a bunch of times, first to keep a fixed set of repos always (code host connection syncing), then later to be a stateful service that persists it’s state in the DB. And here comes the thing: Unlike years ago, gitserver has to map what it has on disk to the DB. The key we use for that mapping is the *repo name*. (Which is not a great key, because we reuse the name when a repo is deleted and created under another external ID, but that’s for later..). So when we ask gitserver to clone a repo, it first asks the DB: “Hey how do I clone this repo with name X?“. During regular operations, we try to keep the gitserver_repos table up to date, by saying “repo with name X is cloned now”. So it all depends on the repo name, it _has_ to match something in the db, otherwise a clone cannot work. Otherwise, the repo state cannot be updated in the DB. When we talk to gitserver, we currently sometimes use this “normalization” function. Sometimes not. You’ll notice that when you add a repo called `https://bitbucket.sgdev.org/SOUR/src-cli.git.git` (notice the double .git, one of them is part of the name, only the latter is part of the git url scheme), it fails cloning initially, but eventually works. Because some request along the way does not do that “normalization”, thus letting gitserver finally clone the repo alright. But it takes a bunch of failed attempts. Recloning the repo from the UI doesn’t work, it complains that the repo was not found (because that request does corrupt the name somewhere in the layers). The same issue applies for `api.UndeletedRepoName`, it tries to do the right thing, but does it at a too high level, thus making that we lose track of the repo in the DB. In gitserver we do the following: When we get a repo name, we normalize to get the dir on disk. So `[bitbucket.sgdev.org/SOUR/src-cli.git](http://bitbucket.sgdev.org/SOUR/src-cli.git)` becomes `/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git`. Some other normalizations make sure that the dir is a valid path and no malicious paths can be used (../../etc/passwd and so forth). When we traverse over the repos on disk to do maintenance tasks etc, we don’t have a repo name, we have a repo dir instead. So for those, we do this process the other way around. `/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git` becomes `[bitbucket.sgdev.org/SOUR/src-cli](http://bitbucket.sgdev.org/SOUR/src-cli)`. Notice? The name is NOT the same name as it was initially, because this conversion is not lossless back and forth. So we are no longer able to map the repository on disk to the corresponding database entry. Bad. Even worse, we cannot clone repos that end in `.git`, and repos called `bitbucket/x` and `bitBucket/x` are suddenly not updating the right repo in the DB. --- There’s a few things we can do here, what I went for in this PR is to push those normalizations further down into the FS layer, but keep it untouched in the layers where we deal with `api.RepoName` which is often passed to the DB. Ultimately, as soon as we have to make a conversion or when a repo name changes, this mapping will still break. It’ll require a reclone, or make it so that we cannot update DB state in janitor anymore once a repo was soft-deleted. This seems like a harder problem to solve, but I think we can get into a much better state by pushing these normalizations into the FS layer for now. To make this abstraction as obvious as possible, I decided to do a migration I intended for a while anyways: Encapsulate the FS activities in an interface. This also makes it easier to mock things out, for example the check for IsRepoCloned in one of the tests. **Considerations when reviewing this code:** - Please triple check my inline comments where I state why I think the normalization in that place is not required - Please quadruple check my gut that (given normalization broke cloning before) there should be no major disruption changing the AddrForRepo logic Closes https://github.com/sourcegraph/sourcegraph/issues/35960 ## Test plan Pruned gitserver disk locally, let it reclone all the things, made sure clones work as expected, even for repos that end in `.git`. Also verified that deleting a repo now properly finds the correct repo in the DB to update.
| @@ -48,39 +255,39 @@ func RepoNameFromDir(reposDir string, dir common.GitDir) api.RepoName { | |||
| return protocol.NormalizeRepo(api.RepoName(name)) | |||
| } | ||
|
|
||
| func RepoNameFromDir(reposDir string, dir common.GitDir) api.RepoName { | ||
| func repoNameFromDir(reposDir string, dir common.GitDir) api.RepoName { |
There was a problem hiding this comment.
Note: This reverse lookup is still flawed and needs more work in another PR, because when we need to normalize a repo name to disk, we can't retrieve the original name ever.
We'll need to introduce a map or something that keeps the normalizations stored.
| func New(observationCtx *observation.Context, reposDir string) FS { | ||
| return &realGitserverFS{ | ||
| logger: observationCtx.Logger.Scoped("gitserverfs"), | ||
| observationCtx: observationCtx, | ||
| reposDir: reposDir, | ||
| clonedState: make(map[api.RepoName]struct{}), | ||
| lastCloneStateReset: time.Now(), | ||
| } | ||
| } |
There was a problem hiding this comment.
run initialize in here instead of exposing it
pjlast
left a comment
There was a problem hiding this comment.
Went over this in a pairing session. Spotted one possible out of date comment?
| dir := gitserverfs.RepoDirFromName(s.reposDir, repo) | ||
| if repoCloned(dir) { | ||
| return nil, true | ||
| func (s *Server) MaybeStartClone(ctx context.Context, repo api.RepoName) (cloned bool, status CloneStatus, _ error) { |
There was a problem hiding this comment.
It seems like the comment of "If disableAutoGitUpdates is set in the site config, no operation is taken and a NotFound error is returned." is no longer true? Is this intentional?
There was a problem hiding this comment.
It's still true, isn't it? maybe during merge conflicts this diff was scrambled though 🙈 https://github.com/sourcegraph/sourcegraph/pull/60627/files#diff-fbe02db9cf392abbd5452e8c6dee945bdee1eea7044a5264485b6a4ffe704e08R32-R34
As posted on slack yesterday:
When talking to gitserver, in various random layers we call a function called
protocol.NormalizeRepo. This function potentially “normalizes” the repo name, aka. it changes it.Gitserver, historically (years ago) was a simple git proxy. So you ask it for repo content, and it eventually delivers. It’s been a write-through cache, basically.
Then, we transformed it a bunch of times, first to keep a fixed set of repos always (code host connection syncing), then later to be a stateful service that persists it’s state in the DB.
And here comes the thing: Unlike years ago, gitserver has to map what it has on disk to the DB. The key we use for that mapping is the repo name. (Which is not a great key, because we reuse the name when a repo is deleted and created under another external ID, but that’s for later..).
So when we ask gitserver to clone a repo, it first asks the DB: “Hey how do I clone this repo with name X?“. During regular operations, we try to keep the gitserver_repos table up to date, by saying “repo with name X is cloned now”.
So it all depends on the repo name, it has to match something in the db, otherwise a clone cannot work. Otherwise, the repo state cannot be updated in the DB.
When we talk to gitserver, we currently sometimes use this “normalization” function. Sometimes not. You’ll notice that when you add a repo called
https://bitbucket.sgdev.org/SOUR/src-cli.git.git(notice the double .git, one of them is part of the name, only the latter is part of the git url scheme), it fails cloning initially, but eventually works. Because some request along the way does not do that “normalization”, thus letting gitserver finally clone the repo alright. But it takes a bunch of failed attempts. Recloning the repo from the UI doesn’t work, it complains that the repo was not found (because that request does corrupt the name somewhere in the layers). The same issue applies forapi.UndeletedRepoName, it tries to do the right thing, but does it at a too high level, thus making that we lose track of the repo in the DB.In gitserver we do the following:
When we get a repo name, we normalize to get the dir on disk. So
[bitbucket.sgdev.org/SOUR/src-cli.git](http://bitbucket.sgdev.org/SOUR/src-cli.git)becomes/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git. Some other normalizations make sure that the dir is a valid path and no malicious paths can be used (../../etc/passwd and so forth).When we traverse over the repos on disk to do maintenance tasks etc, we don’t have a repo name, we have a repo dir instead. So for those, we do this process the other way around.
/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.gitbecomes[bitbucket.sgdev.org/SOUR/src-cli](http://bitbucket.sgdev.org/SOUR/src-cli). Notice? The name is NOT the same name as it was initially, because this conversion is not lossless back and forth.So we are no longer able to map the repository on disk to the corresponding database entry. Bad. Even worse, we cannot clone repos that end in
.git, and repos calledbitbucket/xandbitBucket/xare suddenly not updating the right repo in the DB.There’s a few things we can do here, what I went for in this PR is to push those normalizations further down into the FS layer, but keep it untouched in the layers where we deal with
api.RepoNamewhich is often passed to the DB.Ultimately, as soon as we have to make a conversion or when a repo name changes, this mapping will still break. It’ll require a reclone, or make it so that we cannot update DB state in janitor anymore once a repo was soft-deleted.
This seems like a harder problem to solve, but I think we can get into a much better state by pushing these normalizations into the FS layer for now.
To make this abstraction as obvious as possible, I decided to do a migration I intended for a while anyways: Encapsulate the FS activities in an interface. This also makes it easier to mock things out, for example the check for IsRepoCloned in one of the tests.
Considerations when reviewing this code:
Closes https://github.com/sourcegraph/sourcegraph/issues/35960
Closes https://github.com/sourcegraph/sourcegraph/issues/56712
Test plan
Pruned gitserver disk locally, let it reclone all the things, made sure clones work as expected, even for repos that end in
.git. Also verified that deleting a repo now properly finds the correct repo in the DB to update.