gitserver: Introduce FS layer to encapsulate repo name conversions#60627

eseliger · 2024-02-19T23:35:39Z

As posted on slack yesterday:

When talking to gitserver, in various random layers we call a function called protocol.NormalizeRepo. This function potentially “normalizes” the repo name, aka. it changes it.

Gitserver, historically (years ago) was a simple git proxy. So you ask it for repo content, and it eventually delivers. It’s been a write-through cache, basically.
Then, we transformed it a bunch of times, first to keep a fixed set of repos always (code host connection syncing), then later to be a stateful service that persists it’s state in the DB.

And here comes the thing: Unlike years ago, gitserver has to map what it has on disk to the DB. The key we use for that mapping is the repo name. (Which is not a great key, because we reuse the name when a repo is deleted and created under another external ID, but that’s for later..).

So when we ask gitserver to clone a repo, it first asks the DB: “Hey how do I clone this repo with name X?“. During regular operations, we try to keep the gitserver_repos table up to date, by saying “repo with name X is cloned now”.
So it all depends on the repo name, it has to match something in the db, otherwise a clone cannot work. Otherwise, the repo state cannot be updated in the DB.

When we talk to gitserver, we currently sometimes use this “normalization” function. Sometimes not. You’ll notice that when you add a repo called https://bitbucket.sgdev.org/SOUR/src-cli.git.git (notice the double .git, one of them is part of the name, only the latter is part of the git url scheme), it fails cloning initially, but eventually works. Because some request along the way does not do that “normalization”, thus letting gitserver finally clone the repo alright. But it takes a bunch of failed attempts. Recloning the repo from the UI doesn’t work, it complains that the repo was not found (because that request does corrupt the name somewhere in the layers). The same issue applies for api.UndeletedRepoName, it tries to do the right thing, but does it at a too high level, thus making that we lose track of the repo in the DB.

In gitserver we do the following:
When we get a repo name, we normalize to get the dir on disk. So [bitbucket.sgdev.org/SOUR/src-cli.git](http://bitbucket.sgdev.org/SOUR/src-cli.git) becomes /data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git. Some other normalizations make sure that the dir is a valid path and no malicious paths can be used (../../etc/passwd and so forth).
When we traverse over the repos on disk to do maintenance tasks etc, we don’t have a repo name, we have a repo dir instead. So for those, we do this process the other way around. /data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git becomes [bitbucket.sgdev.org/SOUR/src-cli](http://bitbucket.sgdev.org/SOUR/src-cli). Notice? The name is NOT the same name as it was initially, because this conversion is not lossless back and forth.

So we are no longer able to map the repository on disk to the corresponding database entry. Bad. Even worse, we cannot clone repos that end in .git, and repos called bitbucket/x and bitBucket/x are suddenly not updating the right repo in the DB.

There’s a few things we can do here, what I went for in this PR is to push those normalizations further down into the FS layer, but keep it untouched in the layers where we deal with api.RepoName which is often passed to the DB.

Ultimately, as soon as we have to make a conversion or when a repo name changes, this mapping will still break. It’ll require a reclone, or make it so that we cannot update DB state in janitor anymore once a repo was soft-deleted.
This seems like a harder problem to solve, but I think we can get into a much better state by pushing these normalizations into the FS layer for now.

To make this abstraction as obvious as possible, I decided to do a migration I intended for a while anyways: Encapsulate the FS activities in an interface. This also makes it easier to mock things out, for example the check for IsRepoCloned in one of the tests.

Considerations when reviewing this code:

Please triple check my inline comments where I state why I think the normalization in that place is not required
Please quadruple check my gut that (given normalization broke cloning before) there should be no major disruption changing the AddrForRepo logic

Closes https://github.com/sourcegraph/sourcegraph/issues/35960

Closes https://github.com/sourcegraph/sourcegraph/issues/56712

Test plan

Pruned gitserver disk locally, let it reclone all the things, made sure clones work as expected, even for repos that end in .git. Also verified that deleting a repo now properly finds the correct repo in the DB to update.

eseliger · 2024-02-20T00:39:31Z

internal/gitserver/client.go

-	repoName := protocol.NormalizeRepo(c.repo)
-


Normalizing the repoName here did nothing good. ClientForRepo already applied it, so that was a noop, and the Repo arg in the ExecRequest basically got corrupted, meaning that gitserver could never find the corresponding DB entry for the repo sent over (because we corrupt the de-facto PK).

Removing it here doesn't actually change anything other than not sending a potentially wrong name over. The ClientForRepo function didn't benefit from this line anyways.

eseliger · 2024-02-20T00:39:44Z

internal/gitserver/client.go

 	})
 	defer endObservation(1, observation.Args{})

-	repoName := protocol.NormalizeRepo(args.Repo)


See the same description as for the Exec request:

Normalizing the repoName here did nothing good. ClientForRepo already applied it, so that was a noop, and the Repo arg in the ExecRequest basically got corrupted, meaning that gitserver could never find the corresponding DB entry for the repo sent over (because we corrupt the de-facto PK).

Removing it here doesn't actually change anything other than not sending a potentially wrong name over. The ClientForRepo function didn't benefit from this line anyways.

eseliger · 2024-02-20T00:42:38Z

internal/gitserver/client.go

-	// In case the repo has already been deleted from the database we need to pass
-	// the old name in order to land on the correct gitserver instance
-	undeletedName := api.UndeletedRepoName(repo)
-
-	client, err := c.ClientForRepo(ctx, undeletedName)


We actually cannot do that. It means that gitserver can not correctly set the DB state of this repo to uncloned after it removed the repo.

Scenario:
repo name is DELETED-123123-github.com/src-cli, in the DB the repo.name column contains exactly that value. The 123123 is the timestamp of the time it was deleted at. It cannot be reconstructed.

Now we tell gitserver to delete this repo, it can do that alright, because it correctly finds the undeleted version github.com/src-cli on disk (the new version can still do that, because we now Undelete the repo name at the name->fs path conversion layer).
It now attempts to update github.com/src-cli, which fails. We have not correctly set the state in the DB.

Now, since we don't scramble the name, this works as expected.

eseliger · 2024-02-20T00:44:42Z

internal/gitserver/addrs.go

-	// Normalizing the name in case the caller didn't.
-	name := string(protocol.NormalizeRepo(repoName))


Normalizing is an operation to make the reponame safe to map to the filesystem on disk. It is not needed here. It didn't hurt that it was here, but it is certainly confusing. It also means that the addrforrepo logic is more complex than it has to be, making it harder to potentially reimplement in in the DB layer or elsewhere.

What we do need to do though is undeleting the RepoName here. We want to map the requests to the same shard for DELETED-123123-github.com/src-cli and github.com/src-cli, so we can delete a repo from disk after the latter has been marked as deleted.
Note that this only affects the sharding decision, NOT the repo name sent over.

eseliger · 2024-02-20T00:46:26Z

cmd/gitserver/shared/shared.go


 	// Prepare the file system.
-	if err := gitserverfs.InitGitserverFileSystem(logger, config.ReposDir); err != nil {
+	fs := gitserverfs.New(observationCtx, config.ReposDir)


Feels nice to not have to pass the reposDir around everywhere, and instead have a subsystem that manages it.

eseliger · 2024-02-20T00:47:12Z

cmd/gitserver/internal/servermetrics.go

 		// See https://github.com/sourcegraph/sourcegraph/issues/54317 for details.
 		s.logger.Warn("Disabling 'echo' metric")
 	}
-


These live on the FS implementation now.

eseliger · 2024-02-20T00:49:18Z

cmd/gitserver/internal/search.go

 // search handles the core logic of the search. It is passed a matchesBuf so it doesn't need to
 // concern itself with event types, and all instrumentation is handled in the calling function.
 func (s *Server) search(ctx context.Context, args *protocol.SearchRequest, onMatch func(*protocol.CommitMatch) error) (limitHit bool, err error) {
-	args.Repo = protocol.NormalizeRepo(args.Repo)


No need to "normalize" here, it just scrambles the repo name so we cannot properly map it back to the DB entity.
The FS handles getting the path correctly.

eseliger · 2024-02-20T01:17:19Z

cmd/gitserver/internal/patch.go

-	repo := string(protocol.NormalizeRepo(req.Repo))
-	repoDir := filepath.Join(s.reposDir, repo)
-	repoGitDir := filepath.Join(repoDir, ".git")


this actually didn't use the helper method at all, which could've caused an inconsistency. We now properly use RepoDir on the fs here.

eseliger · 2024-02-20T01:17:42Z

cmd/gitserver/internal/patch.go

 		} else {
 			cmd = exec.CommandContext(ctx, "git", "push", "--force", remoteURL.String(), fmt.Sprintf("%s:%s", cmtHash, ref))
-			cmd.Dir = repoGitDir
+			repoGitDir.Set(cmd)


This is important as it also sets some addtl env vars on the git command.

eseliger · 2024-02-20T01:17:47Z

cmd/gitserver/internal/patch.go

 	if req.PushRef == nil {
 		cmd = exec.CommandContext(ctx, "git", "update-ref", "--", ref, cmtHash)
-		cmd.Dir = repoGitDir
+		repoGitDir.Set(cmd)


This is important as it also sets some addtl env vars on the git command.

eseliger · 2024-02-20T01:19:16Z

cmd/gitserver/internal/perforce/changelist.go

-	// ReposDir is the directory where the repositories are stored.
-	ReposDir string
-	// P4Home is the path to the directory that 'p4' will use as $HOME
-	// and where it will store cache data.
-	P4Home string
-


Ideally, we would move to the same structure that we adopted for git recently: Have a backend interface that proxies all these calls. That way, we can feed the FS at backend creation time and don't need to make it a concern of the caller to bring the right FS as an argument of the function.
Anyhow, that's for later. For now, we pass the FS down explicitly.

eseliger · 2024-02-20T01:25:11Z

cmd/gitserver/internal/server.go

 func (s *Server) RepoUpdate(req *protocol.RepoUpdateRequest) protocol.RepoUpdateResponse {
 	logger := s.logger.Scoped("handleRepoUpdate")
 	var resp protocol.RepoUpdateResponse
-	req.Repo = protocol.NormalizeRepo(req.Repo)


Normalizing the repo here means that the CloneRepo below would not be able to get the VCSSyncer for the repo, failing the clone. We now keep the name as-is.

eseliger · 2024-02-20T01:26:02Z

cmd/gitserver/internal/server.go

 	// the actual running clone for the DB state of last_error.
 	defer func() {
 		// Use a different context in case we failed because the original context failed.
+		// TODO: Should not store anything when tryacquire returned false.


Unrelated but wanted to leave a note for myself :D

eseliger · 2024-02-20T01:29:13Z

cmd/gitserver/internal/server.go

 		return err
 	}

-	repo = protocol.NormalizeRepo(repo)


Normalizing here means that getRemoteURL right below would fail. We don't do that anymore.

eseliger · 2024-02-20T01:31:38Z

cmd/gitserver/internal/statesyncer.go

-			// We may have a deleted repo, we need to extract the original name both to
-			// ensure that the shard check is correct and also so that we can find the
-			// directory.
-			repo.Name = api.UndeletedRepoName(repo.Name)


The addrforrepo func handles that. Altering the name means the state syncer will not correctly update the record in the DB.

eseliger · 2024-02-20T01:32:11Z

cmd/gitserver/internal/statesyncer.go

-			_, cloning := locker.Status(dir)
+			cloned, err := fs.RepoCloned(repo.Name)
+			if err != nil {
+				// Failed to determine cloned state, we have to skip this record for now.


q: does this make sense as the action to take, skip over it?

eseliger · 2024-02-20T01:42:52Z

internal/gitserver/protocol/util.go

@@ -14,33 +14,5 @@ func NormalizeRepo(input api.RepoName) api.RepoName {
 	repo = path.Clean("/" + repo)


Ideally, this function doesn't need to exist at all. Whenever it makes a transformation, it will break the reponame->dir->reponame mapping.
For now, until we spend more time on a better mapping table, for security reasons we will keep this simplified version, but the goal really is to get rid of it.

eseliger · 2024-02-20T13:42:46Z

cmd/gitserver/internal/cleanup.go

+		// TODO: For soft-deleted, it might be nice to attempt to update the clone status,
+		// but that can only work when we can map a repo on disk back to a repo in DB
+		// when the name has been modified to have the DELETED- prefix.
+		err = fs.RemoveRepo(repoName)


eseliger · 2024-02-20T13:47:50Z

cmd/gitserver/internal/lock.go

 //
 // The main use of RepositoryLocker is to prevent concurrent clones. However,
 // it is also used during maintenance tasks such as recloning/migrating/etc.
 type RepositoryLocker interface {


As the locker currently doesn't live in the FS layer, it seems improper to make it talk in terms of repo dirs, which are only normalized at the point we access the FS. Since we're going much more high level for the locking, switched to reponame for now.
Ultimately, we might want to merge FS and locker into one thing (and lose repo clone progress reporting, but can use it for locking cleanup operations as well).

eseliger · 2024-02-20T14:22:52Z

cmd/gitserver/internal/server_test.go

-		{
-			Name: "NonexistingRepo",
-			Request: &v1.ExecRequest{
-				Repo: "github.com/gorilla/doesnotexist",
-				Args: [][]byte{[]byte("diff")},
-			},
-			ExpectedCode:  codes.NotFound,
-			ExpectedError: "repo not found",
-			ExpectedDetails: []any{&v1.RepoNotFoundPayload{
-				Repo:            "github.com/gorilla/doesnotexist",
-				CloneInProgress: false,
-			}},
-		},
-		{
-			Name: "UnclonedRepo",
-			Request: &v1.ExecRequest{
-				Repo: "github.com/nicksnyder/go-i18n",
-				Args: [][]byte{[]byte("diff")},
-			},
-			ExpectedCode:  codes.NotFound,
-			ExpectedError: "repo not found",
-			ExpectedDetails: []any{&v1.RepoNotFoundPayload{
-				Repo:            "github.com/nicksnyder/go-i18n",
-				CloneInProgress: true,
-			}},
-		},


These are gRPC layer concerns and with the new FS mocks needed reimplementing but didn't actually test anything important, so skipped over.

As posted on slack yesterday: When talking to gitserver, in various random layers we call a function called `protocol.NormalizeRepo`. This function potentially “normalizes” the repo name, aka. it changes it. Gitserver, historically (years ago) was a simple git proxy. So you ask it for repo content, and it eventually delivers. It’s been a write-through cache, basically. Then, we transformed it a bunch of times, first to keep a fixed set of repos always (code host connection syncing), then later to be a stateful service that persists it’s state in the DB. And here comes the thing: Unlike years ago, gitserver has to map what it has on disk to the DB. The key we use for that mapping is the *repo name*. (Which is not a great key, because we reuse the name when a repo is deleted and created under another external ID, but that’s for later..). So when we ask gitserver to clone a repo, it first asks the DB: “Hey how do I clone this repo with name X?“. During regular operations, we try to keep the gitserver_repos table up to date, by saying “repo with name X is cloned now”. So it all depends on the repo name, it _has_ to match something in the db, otherwise a clone cannot work. Otherwise, the repo state cannot be updated in the DB. When we talk to gitserver, we currently sometimes use this “normalization” function. Sometimes not. You’ll notice that when you add a repo called `https://bitbucket.sgdev.org/SOUR/src-cli.git.git` (notice the double .git, one of them is part of the name, only the latter is part of the git url scheme), it fails cloning initially, but eventually works. Because some request along the way does not do that “normalization”, thus letting gitserver finally clone the repo alright. But it takes a bunch of failed attempts. Recloning the repo from the UI doesn’t work, it complains that the repo was not found (because that request does corrupt the name somewhere in the layers). The same issue applies for `api.UndeletedRepoName`, it tries to do the right thing, but does it at a too high level, thus making that we lose track of the repo in the DB. In gitserver we do the following: When we get a repo name, we normalize to get the dir on disk. So `[bitbucket.sgdev.org/SOUR/src-cli.git](http://bitbucket.sgdev.org/SOUR/src-cli.git)` becomes `/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git`. Some other normalizations make sure that the dir is a valid path and no malicious paths can be used (../../etc/passwd and so forth). When we traverse over the repos on disk to do maintenance tasks etc, we don’t have a repo name, we have a repo dir instead. So for those, we do this process the other way around. `/data/repos/bitbucket.sgdev.org/SOUR/src-cli/.git` becomes `[bitbucket.sgdev.org/SOUR/src-cli](http://bitbucket.sgdev.org/SOUR/src-cli)`. Notice? The name is NOT the same name as it was initially, because this conversion is not lossless back and forth. So we are no longer able to map the repository on disk to the corresponding database entry. Bad. Even worse, we cannot clone repos that end in `.git`, and repos called `bitbucket/x` and `bitBucket/x` are suddenly not updating the right repo in the DB. --- There’s a few things we can do here, what I went for in this PR is to push those normalizations further down into the FS layer, but keep it untouched in the layers where we deal with `api.RepoName` which is often passed to the DB. Ultimately, as soon as we have to make a conversion or when a repo name changes, this mapping will still break. It’ll require a reclone, or make it so that we cannot update DB state in janitor anymore once a repo was soft-deleted. This seems like a harder problem to solve, but I think we can get into a much better state by pushing these normalizations into the FS layer for now. To make this abstraction as obvious as possible, I decided to do a migration I intended for a while anyways: Encapsulate the FS activities in an interface. This also makes it easier to mock things out, for example the check for IsRepoCloned in one of the tests. **Considerations when reviewing this code:** - Please triple check my inline comments where I state why I think the normalization in that place is not required - Please quadruple check my gut that (given normalization broke cloning before) there should be no major disruption changing the AddrForRepo logic Closes https://github.com/sourcegraph/sourcegraph/issues/35960 ## Test plan Pruned gitserver disk locally, let it reclone all the things, made sure clones work as expected, even for repos that end in `.git`. Also verified that deleting a repo now properly finds the correct repo in the DB to update.

eseliger · 2024-03-27T13:09:35Z

cmd/gitserver/internal/gitserverfs/fs.go

@@ -48,39 +255,39 @@ func RepoNameFromDir(reposDir string, dir common.GitDir) api.RepoName {
 	return protocol.NormalizeRepo(api.RepoName(name))


why normalize here?

eseliger · 2024-03-27T13:12:59Z

cmd/gitserver/internal/gitserverfs/fs.go

 }

-func RepoNameFromDir(reposDir string, dir common.GitDir) api.RepoName {
+func repoNameFromDir(reposDir string, dir common.GitDir) api.RepoName {


Note: This reverse lookup is still flawed and needs more work in another PR, because when we need to normalize a repo name to disk, we can't retrieve the original name ever.

We'll need to introduce a map or something that keeps the normalizations stored.

eseliger · 2024-03-27T13:22:22Z

cmd/gitserver/internal/gitserverfs/fs.go

+func New(observationCtx *observation.Context, reposDir string) FS {
+	return &realGitserverFS{
+		logger:              observationCtx.Logger.Scoped("gitserverfs"),
+		observationCtx:      observationCtx,
+		reposDir:            reposDir,
+		clonedState:         make(map[api.RepoName]struct{}),
+		lastCloneStateReset: time.Now(),
+	}
+}


run initialize in here instead of exposing it

pjlast

Went over this in a pairing session. Spotted one possible out of date comment?

pjlast · 2024-03-28T12:54:35Z

cmd/gitserver/internal/clone.go

-	dir := gitserverfs.RepoDirFromName(s.reposDir, repo)
-	if repoCloned(dir) {
-		return nil, true
+func (s *Server) MaybeStartClone(ctx context.Context, repo api.RepoName) (cloned bool, status CloneStatus, _ error) {


It seems like the comment of "If disableAutoGitUpdates is set in the site config, no operation is taken and a NotFound error is returned." is no longer true? Is this intentional?

It's still true, isn't it? maybe during merge conflicts this diff was scrambled though 🙈 https://github.com/sourcegraph/sourcegraph/pull/60627/files#diff-fbe02db9cf392abbd5452e8c6dee945bdee1eea7044a5264485b6a4ffe704e08R32-R34

…sions (#60627)" This reverts commit 58aee4e.

…sions (#60627)" (#61487) This reverts commit 58aee4e.

…rsions (#60627)" (#61487) This reverts commit 0e8cbca.

…rsions (#60627)" (#61487) (#61523) This reverts commit 0e8cbca. After the initial version didn't work for case-sensitive filesystems, this is the next attempt at shipping this. ## Test plan Tests and verified that the previous issue is resolved.

cla-bot bot added the cla-signed label Feb 19, 2024

eseliger commented Feb 20, 2024

View reviewed changes

eseliger changed the title ~~gitserver: Introduce FS layer~~ gitserver: Introduce FS layer to encapsulate repo name conversions Feb 20, 2024

eseliger commented Feb 20, 2024

View reviewed changes

eseliger force-pushed the es/gitserverfs branch 3 times, most recently from defc5ff to 212c4ad Compare February 20, 2024 15:41

eseliger marked this pull request as ready for review March 4, 2024 15:33

eseliger requested a review from a team March 4, 2024 15:33

eseliger force-pushed the es/gitserverfs branch from 212c4ad to 55d4aa0 Compare March 4, 2024 15:34

eseliger commented Mar 27, 2024

View reviewed changes

pjlast approved these changes Mar 28, 2024

View reviewed changes

eseliger added 6 commits March 28, 2024 19:50

Merge branch 'main' into es/gitserverfs

a461ea9

Fixup again

eb17689

Fixup

116f4fe

Fixup

37c0174

Add to changelog

243a25a

Merge branch 'main' into es/gitserverfs

3da65ac

eseliger merged commit 58aee4e into main Mar 28, 2024

eseliger deleted the es/gitserverfs branch March 28, 2024 23:24

eseliger added a commit that referenced this pull request Mar 29, 2024

Revert "gitserver: Introduce FS layer to encapsulate repo name conver…

c861a16

…sions (#60627)" This reverts commit 58aee4e.

eseliger added a commit that referenced this pull request Mar 29, 2024

Revert "gitserver: Introduce FS layer to encapsulate repo name conver…

0e8cbca

…sions (#60627)" (#61487) This reverts commit 58aee4e.

eseliger added a commit that referenced this pull request Apr 2, 2024

Reapply "gitserver: Introduce FS layer to encapsulate repo name conve…

a750595

…rsions (#60627)" (#61487) This reverts commit 0e8cbca.

		// Normalizing the name in case the caller didn't.
		name := string(protocol.NormalizeRepo(repoName))

		@@ -14,33 +14,5 @@ func NormalizeRepo(input api.RepoName) api.RepoName {
		repo = path.Clean("/" + repo)

		@@ -48,39 +255,39 @@ func RepoNameFromDir(reposDir string, dir common.GitDir) api.RepoName {
		return protocol.NormalizeRepo(api.RepoName(name))

Conversation

eseliger commented Feb 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pjlast left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eseliger commented Feb 19, 2024 •

edited

Loading