Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

gitserver: Introduce FS layer to encapsulate repo name conversions#60627

Merged
eseliger merged 7 commits intomainfrom
es/gitserverfs
Mar 28, 2024
Merged

gitserver: Introduce FS layer to encapsulate repo name conversions#60627
eseliger merged 7 commits intomainfrom
es/gitserverfs

Conversation

@eseliger
Copy link
Member

@eseliger eseliger commented Feb 19, 2024

As posted on slack yesterday:

When talking to gitserver, in various random layers we call a function called protocol.NormalizeRepo. This function potentially “normalizes” the repo name, aka. it changes it.

Gitserver, historically (years ago) was a simple git proxy. So you ask it for repo content, and it eventually delivers. It’s been a write-through cache, basically.
Then, we transformed it a bunch of times, first to keep a fixed set of repos always (code host connection syncing), then later to be a stateful service that persists it’s state in the DB.

And here comes the thing: Unlike years ago, gitserver has to map what it has on disk to the DB. The key we use for that mapping is the repo name. (Which is not a great key, because we reuse the name when a repo is deleted and created under another external ID, but that’s for later..).

So when we ask gitserver to clone a repo, it first asks the DB: “Hey how do I clone this repo with name X?“. During regular operations, we try to keep the gitserver_repos table up to date, by saying “repo with name X is cloned now”.
So it all depends on the repo name, it has to match something in the db, otherwise a clone cannot work. Otherwise, the repo state cannot be updated in the DB.

When we talk to gitserver, we currently sometimes use this “normalization” function. Sometimes not. You’ll notice that when you add a repo called https://bitbucket.sgdev.org/SOUR/src-cli.git.git (notice the double .git, one of them is part of the name, only the latter is part of the git url scheme), it fails cloning initially, but eventually works. Because some request along the way does not do that “normalization”, thus letting gitserver finally clone the repo alright. But it takes a bunch of failed attempts. Recloning the repo from the UI doesn’t work, it complains that the repo was not found (because that request does corrupt the name somewhere in the layers). The same issue applies for api.UndeletedRepoName, it tries to do the right thing, but does it at a too high level, thus making that we lose track of the repo in the DB.

In gitserver we do the following:
When we get a repo name, we normalize to get the dir on disk. So [bitbucket.sgdev.org/SOUR/src-cli.git](http://bitbucket.sgdev.org/SOUR/src-cli.git) becomes /data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git. Some other normalizations make sure that the dir is a valid path and no malicious paths can be used (../../etc/passwd and so forth).
When we traverse over the repos on disk to do maintenance tasks etc, we don’t have a repo name, we have a repo dir instead. So for those, we do this process the other way around. /data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git becomes [bitbucket.sgdev.org/SOUR/src-cli](http://bitbucket.sgdev.org/SOUR/src-cli). Notice? The name is NOT the same name as it was initially, because this conversion is not lossless back and forth.

So we are no longer able to map the repository on disk to the corresponding database entry. Bad. Even worse, we cannot clone repos that end in .git, and repos called bitbucket/x and bitBucket/x are suddenly not updating the right repo in the DB.


There’s a few things we can do here, what I went for in this PR is to push those normalizations further down into the FS layer, but keep it untouched in the layers where we deal with api.RepoName which is often passed to the DB.

Ultimately, as soon as we have to make a conversion or when a repo name changes, this mapping will still break. It’ll require a reclone, or make it so that we cannot update DB state in janitor anymore once a repo was soft-deleted.
This seems like a harder problem to solve, but I think we can get into a much better state by pushing these normalizations into the FS layer for now.

To make this abstraction as obvious as possible, I decided to do a migration I intended for a while anyways: Encapsulate the FS activities in an interface. This also makes it easier to mock things out, for example the check for IsRepoCloned in one of the tests.

Considerations when reviewing this code:

  • Please triple check my inline comments where I state why I think the normalization in that place is not required
  • Please quadruple check my gut that (given normalization broke cloning before) there should be no major disruption changing the AddrForRepo logic

Closes https://github.com/sourcegraph/sourcegraph/issues/35960

Closes https://github.com/sourcegraph/sourcegraph/issues/56712

Test plan

Pruned gitserver disk locally, let it reclone all the things, made sure clones work as expected, even for repos that end in .git. Also verified that deleting a repo now properly finds the correct repo in the DB to update.

@cla-bot cla-bot bot added the cla-signed label Feb 19, 2024
repoName := protocol.NormalizeRepo(c.repo)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalizing the repoName here did nothing good. ClientForRepo already applied it, so that was a noop, and the Repo arg in the ExecRequest basically got corrupted, meaning that gitserver could never find the corresponding DB entry for the repo sent over (because we corrupt the de-facto PK).

Removing it here doesn't actually change anything other than not sending a potentially wrong name over. The ClientForRepo function didn't benefit from this line anyways.

})
defer endObservation(1, observation.Args{})

repoName := protocol.NormalizeRepo(args.Repo)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the same description as for the Exec request:

Normalizing the repoName here did nothing good. ClientForRepo already applied it, so that was a noop, and the Repo arg in the ExecRequest basically got corrupted, meaning that gitserver could never find the corresponding DB entry for the repo sent over (because we corrupt the de-facto PK).

Removing it here doesn't actually change anything other than not sending a potentially wrong name over. The ClientForRepo function didn't benefit from this line anyways.

Comment on lines -877 to -878
// In case the repo has already been deleted from the database we need to pass
// the old name in order to land on the correct gitserver instance
undeletedName := api.UndeletedRepoName(repo)

client, err := c.ClientForRepo(ctx, undeletedName)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually cannot do that. It means that gitserver can not correctly set the DB state of this repo to uncloned after it removed the repo.

Scenario:
repo name is DELETED-123123-github.com/src-cli, in the DB the repo.name column contains exactly that value. The 123123 is the timestamp of the time it was deleted at. It cannot be reconstructed.

Now we tell gitserver to delete this repo, it can do that alright, because it correctly finds the undeleted version github.com/src-cli on disk (the new version can still do that, because we now Undelete the repo name at the name->fs path conversion layer).
It now attempts to update github.com/src-cli, which fails. We have not correctly set the state in the DB.

Now, since we don't scramble the name, this works as expected.

Comment on lines -177 to -178
// Normalizing the name in case the caller didn't.
name := string(protocol.NormalizeRepo(repoName))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalizing is an operation to make the reponame safe to map to the filesystem on disk. It is not needed here. It didn't hurt that it was here, but it is certainly confusing. It also means that the addrforrepo logic is more complex than it has to be, making it harder to potentially reimplement in in the DB layer or elsewhere.

What we do need to do though is undeleting the RepoName here. We want to map the requests to the same shard for DELETED-123123-github.com/src-cli and github.com/src-cli, so we can delete a repo from disk after the latter has been marked as deleted.
Note that this only affects the sharding decision, NOT the repo name sent over.


// Prepare the file system.
if err := gitserverfs.InitGitserverFileSystem(logger, config.ReposDir); err != nil {
fs := gitserverfs.New(observationCtx, config.ReposDir)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels nice to not have to pass the reposDir around everywhere, and instead have a subsystem that manages it.

// See https://github.com/sourcegraph/sourcegraph/issues/54317 for details.
s.logger.Warn("Disabling 'echo' metric")
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These live on the FS implementation now.

// search handles the core logic of the search. It is passed a matchesBuf so it doesn't need to
// concern itself with event types, and all instrumentation is handled in the calling function.
func (s *Server) search(ctx context.Context, args *protocol.SearchRequest, onMatch func(*protocol.CommitMatch) error) (limitHit bool, err error) {
args.Repo = protocol.NormalizeRepo(args.Repo)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to "normalize" here, it just scrambles the repo name so we cannot properly map it back to the DB entity.
The FS handles getting the path correctly.

Comment on lines -48 to -50
repo := string(protocol.NormalizeRepo(req.Repo))
repoDir := filepath.Join(s.reposDir, repo)
repoGitDir := filepath.Join(repoDir, ".git")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this actually didn't use the helper method at all, which could've caused an inconsistency. We now properly use RepoDir on the fs here.

} else {
cmd = exec.CommandContext(ctx, "git", "push", "--force", remoteURL.String(), fmt.Sprintf("%s:%s", cmtHash, ref))
cmd.Dir = repoGitDir
repoGitDir.Set(cmd)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important as it also sets some addtl env vars on the git command.

if req.PushRef == nil {
cmd = exec.CommandContext(ctx, "git", "update-ref", "--", ref, cmtHash)
cmd.Dir = repoGitDir
repoGitDir.Set(cmd)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important as it also sets some addtl env vars on the git command.

Comment on lines -20 to -25
// ReposDir is the directory where the repositories are stored.
ReposDir string
// P4Home is the path to the directory that 'p4' will use as $HOME
// and where it will store cache data.
P4Home string

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we would move to the same structure that we adopted for git recently: Have a backend interface that proxies all these calls. That way, we can feed the FS at backend creation time and don't need to make it a concern of the caller to bring the right FS as an argument of the function.
Anyhow, that's for later. For now, we pass the FS down explicitly.

func (s *Server) RepoUpdate(req *protocol.RepoUpdateRequest) protocol.RepoUpdateResponse {
logger := s.logger.Scoped("handleRepoUpdate")
var resp protocol.RepoUpdateResponse
req.Repo = protocol.NormalizeRepo(req.Repo)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalizing the repo here means that the CloneRepo below would not be able to get the VCSSyncer for the repo, failing the clone. We now keep the name as-is.

// the actual running clone for the DB state of last_error.
defer func() {
// Use a different context in case we failed because the original context failed.
// TODO: Should not store anything when tryacquire returned false.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated but wanted to leave a note for myself :D

return err
}

repo = protocol.NormalizeRepo(repo)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalizing here means that getRemoteURL right below would fail. We don't do that anymore.

Comment on lines -192 to -195
// We may have a deleted repo, we need to extract the original name both to
// ensure that the shard check is correct and also so that we can find the
// directory.
repo.Name = api.UndeletedRepoName(repo.Name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addrforrepo func handles that. Altering the name means the state syncer will not correctly update the record in the DB.

_, cloning := locker.Status(dir)
cloned, err := fs.RepoCloned(repo.Name)
if err != nil {
// Failed to determine cloned state, we have to skip this record for now.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: does this make sense as the action to take, skip over it?

@@ -14,33 +14,5 @@ func NormalizeRepo(input api.RepoName) api.RepoName {
repo = path.Clean("/" + repo)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this function doesn't need to exist at all. Whenever it makes a transformation, it will break the reponame->dir->reponame mapping.
For now, until we spend more time on a better mapping table, for security reasons we will keep this simplified version, but the goal really is to get rid of it.

Comment on lines +403 to +402
// TODO: For soft-deleted, it might be nice to attempt to update the clone status,
// but that can only work when we can map a repo on disk back to a repo in DB
// when the name has been modified to have the DELETED- prefix.
err = fs.RemoveRepo(repoName)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For later.

//
// The main use of RepositoryLocker is to prevent concurrent clones. However,
// it is also used during maintenance tasks such as recloning/migrating/etc.
type RepositoryLocker interface {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the locker currently doesn't live in the FS layer, it seems improper to make it talk in terms of repo dirs, which are only normalized at the point we access the FS. Since we're going much more high level for the locking, switched to reponame for now.
Ultimately, we might want to merge FS and locker into one thing (and lose repo clone progress reporting, but can use it for locking cleanup operations as well).

@eseliger eseliger changed the title gitserver: Introduce FS layer gitserver: Introduce FS layer to encapsulate repo name conversions Feb 20, 2024
Comment on lines -77 to -103
{
Name: "NonexistingRepo",
Request: &v1.ExecRequest{
Repo: "github.com/gorilla/doesnotexist",
Args: [][]byte{[]byte("diff")},
},
ExpectedCode: codes.NotFound,
ExpectedError: "repo not found",
ExpectedDetails: []any{&v1.RepoNotFoundPayload{
Repo: "github.com/gorilla/doesnotexist",
CloneInProgress: false,
}},
},
{
Name: "UnclonedRepo",
Request: &v1.ExecRequest{
Repo: "github.com/nicksnyder/go-i18n",
Args: [][]byte{[]byte("diff")},
},
ExpectedCode: codes.NotFound,
ExpectedError: "repo not found",
ExpectedDetails: []any{&v1.RepoNotFoundPayload{
Repo: "github.com/nicksnyder/go-i18n",
CloneInProgress: true,
}},
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are gRPC layer concerns and with the new FS mocks needed reimplementing but didn't actually test anything important, so skipped over.

@eseliger eseliger force-pushed the es/gitserverfs branch 3 times, most recently from defc5ff to 212c4ad Compare February 20, 2024 15:41
@eseliger eseliger marked this pull request as ready for review March 4, 2024 15:33
@eseliger eseliger requested a review from a team March 4, 2024 15:33
As posted on slack yesterday:

When talking to gitserver, in various random layers we call a function called `protocol.NormalizeRepo`. This function potentially “normalizes” the repo name, aka. it changes it.

Gitserver, historically (years ago) was a simple git proxy. So you ask it for repo content, and it eventually delivers. It’s been a write-through cache, basically.
Then, we transformed it a bunch of times, first to keep a fixed set of repos always (code host connection syncing), then later to be a stateful service that persists it’s state in the DB.

And here comes the thing: Unlike years ago, gitserver has to map what it has on disk to the DB. The key we use for that mapping is the *repo name*. (Which is not a great key, because we reuse the name when a repo is deleted and created under another external ID, but that’s for later..).

So when we ask gitserver to clone a repo, it first asks the DB: “Hey how do I clone this repo with name X?“. During regular operations, we try to keep the gitserver_repos table up to date, by saying “repo with name X is cloned now”.
So it all depends on the repo name, it _has_ to match something in the db, otherwise a clone cannot work. Otherwise, the repo state cannot be updated in the DB.

When we talk to gitserver, we currently sometimes use this “normalization” function. Sometimes not. You’ll notice that when you add a repo called `https://bitbucket.sgdev.org/SOUR/src-cli.git.git` (notice the double .git, one of them is part of the name, only the latter is part of the git url scheme), it fails cloning initially, but eventually works. Because some request along the way does not do that “normalization”, thus letting gitserver finally clone the repo alright. But it takes a bunch of failed attempts. Recloning the repo from the UI doesn’t work, it complains that the repo was not found (because that request does corrupt the name somewhere in the layers). The same issue applies for `api.UndeletedRepoName`, it tries to do the right thing, but does it at a too high level, thus making that we lose track of the repo in the DB.

In gitserver we do the following:
When we get a repo name, we normalize to get the dir on disk. So `[bitbucket.sgdev.org/SOUR/src-cli.git](http://bitbucket.sgdev.org/SOUR/src-cli.git)` becomes `/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git`. Some other normalizations make sure that the dir is a valid path and no malicious paths can be used (../../etc/passwd and so forth).
When we traverse over the repos on disk to do maintenance tasks etc, we don’t have a repo name, we have a repo dir instead. So for those, we do this process the other way around. `/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git` becomes `[bitbucket.sgdev.org/SOUR/src-cli](http://bitbucket.sgdev.org/SOUR/src-cli)`. Notice? The name is NOT the same name as it was initially, because this conversion is not lossless back and forth.

So we are no longer able to map the repository on disk to the corresponding database entry. Bad. Even worse, we cannot clone repos that end in `.git`, and repos called `bitbucket/x` and `bitBucket/x` are suddenly not updating the right repo in the DB.

---

There’s a few things we can do here, what I went for in this PR is to push those normalizations further down into the FS layer, but keep it untouched in the layers where we deal with `api.RepoName` which is often passed to the DB.

Ultimately, as soon as we have to make a conversion or when a repo name changes, this mapping will still break. It’ll require a reclone, or make it so that we cannot update DB state in janitor anymore once a repo was soft-deleted.
This seems like a harder problem to solve, but I think we can get into a much better state by pushing these normalizations into the FS layer for now.

To make this abstraction as obvious as possible, I decided to do a migration I intended for a while anyways: Encapsulate the FS activities in an interface. This also makes it easier to mock things out, for example the check for IsRepoCloned in one of the tests.

**Considerations when reviewing this code:**

- Please triple check my inline comments where I state why I think the normalization in that place is not required
- Please quadruple check my gut that (given normalization broke cloning before) there should be no major disruption changing the AddrForRepo logic

Closes https://github.com/sourcegraph/sourcegraph/issues/35960

## Test plan

Pruned gitserver disk locally, let it reclone all the things, made sure clones work as expected, even for repos that end in `.git`. Also verified that deleting a repo now properly finds the correct repo in the DB to update.
@@ -48,39 +255,39 @@ func RepoNameFromDir(reposDir string, dir common.GitDir) api.RepoName {
return protocol.NormalizeRepo(api.RepoName(name))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why normalize here?

}

func RepoNameFromDir(reposDir string, dir common.GitDir) api.RepoName {
func repoNameFromDir(reposDir string, dir common.GitDir) api.RepoName {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This reverse lookup is still flawed and needs more work in another PR, because when we need to normalize a repo name to disk, we can't retrieve the original name ever.

We'll need to introduce a map or something that keeps the normalizations stored.

Comment on lines +42 to +50
func New(observationCtx *observation.Context, reposDir string) FS {
return &realGitserverFS{
logger: observationCtx.Logger.Scoped("gitserverfs"),
observationCtx: observationCtx,
reposDir: reposDir,
clonedState: make(map[api.RepoName]struct{}),
lastCloneStateReset: time.Now(),
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run initialize in here instead of exposing it

Copy link
Contributor

@pjlast pjlast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went over this in a pairing session. Spotted one possible out of date comment?

dir := gitserverfs.RepoDirFromName(s.reposDir, repo)
if repoCloned(dir) {
return nil, true
func (s *Server) MaybeStartClone(ctx context.Context, repo api.RepoName) (cloned bool, status CloneStatus, _ error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the comment of "If disableAutoGitUpdates is set in the site config, no operation is taken and a NotFound error is returned." is no longer true? Is this intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still true, isn't it? maybe during merge conflicts this diff was scrambled though 🙈 https://github.com/sourcegraph/sourcegraph/pull/60627/files#diff-fbe02db9cf392abbd5452e8c6dee945bdee1eea7044a5264485b6a4ffe704e08R32-R34

@eseliger eseliger merged commit 58aee4e into main Mar 28, 2024
@eseliger eseliger deleted the es/gitserverfs branch March 28, 2024 23:24
eseliger added a commit that referenced this pull request Mar 29, 2024
eseliger added a commit that referenced this pull request Mar 29, 2024
eseliger added a commit that referenced this pull request Apr 2, 2024
eseliger added a commit that referenced this pull request Apr 2, 2024
…rsions (#60627)" (#61487) (#61523)

This reverts commit 0e8cbca.

After the initial version didn't work for case-sensitive filesystems, this is the next attempt at shipping this.

## Test plan

Tests and verified that the previous issue is resolved.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gitserver: Improve mapping of paths to repository Repos that have .git in the name do not sync properly

2 participants