Create tags for potentially dirty git repository state

Question

Note: The below is similar to this question (i.e. same motivation & general idea), but tries to be more careful: My version should never touch the repository state (to preserve staged files etc), and should properly handle submodules.

The goal of the script below is to create a tag for the current state of the git repository. If the repository or any of its submodules is dirty, it should capture the file content in a "detached commit" without modifying the current state of the index, the working tree, or anything else. The motivation is to have a tag describing the source code that went into a build. If I later want to figure out what code was used for a build, I have something to git checkout, even if the repository was dirty at the time of the build (as often happens while implementing new stuff).

The code is split into two files:

get_git_tag.sh

This is the main script to call - it generates the tag name, and calls the inner script to do the work

#!/bin/bash

set -euo pipefail

tag=$(date +"dev-%Y%m%d-%H%M%S")

cd "$(dirname "$0")/../../"

bash "$(dirname "$0")/iget_git_tag.sh" "$tag" "main"

iget_git_tag.sh

The script that handles the actual tag creation & recursively calls itself

#!/bin/bash


set -euo pipefail

if [ -z "$(git status --porcelain)" ]; then
    # if not dirty, use any vXXX type tag if available, otherwise the commit SHA
    git describe --tags --exact-match --match "v[0-9]*" 2>/dev/null || git rev-parse --verify --short HEAD
else
    # if dirty, create a "detached commit"
    tmp_git_index=$(mktemp)
    cp $(git rev-parse --git-path index) "$tmp_git_index"
    export GIT_INDEX_FILE="$tmp_git_index"

    # stage all changes to the temporary index file
    git add --all

    # recursively call the script on submodules to produce tags, and add the relevant commit SHAs to the index
    git submodule foreach --quiet "GIT_INDEX_FILE=$GIT_INDEX_FILE git -C \$toplevel update-index --cacheinfo 160000,\$(git rev-parse \$(bash \"$(realpath "$0")\" \"$1\" sub)),\$sm_path"

    build_sha=$(GIT_AUTHOR_DATE="1.1.1970 0:0:0" GIT_COMMITTER_DATE="1.1.1970 0:0:0" git commit-tree $(git write-tree) -p HEAD -m "Build commit")

    # if the build was already tagged, use that tag, otherwise make a new one
    git describe --tags --exact-match --match "dev-*" "$build_sha" 2>/dev/null || (git tag "$1" "$build_sha"; echo "$1")
fi

Questions

What I am mainly interested in:

Any fundamental issues with this approach?
I did quite a few tests of different scenarios, but I probably missed even more. Are there scenarios or repository states that are likely to break this script? A few invalid tags would not be great, but it would be far worse if the script modified the repository state itself (e.g. staged files, file contents) rather than only creating the detached commits.
Does the code properly handle nested submodules? I don't plan on having them in my repository any time soon, but ideally it would be future-proof in that regard.

Fundamental question: why avoid committing? Commit completed (small) changes. Stash incomplete ones. Branch and tag for reference by humans. Not every one of your branches or tags needs to be shared with others, and could be deleted if abandoned.
– John Mahowald, commented 3 hours ago

J_H · Accepted Answer · 2025-12-11 19:11:16Z

contract

In the Review Context the "never mutate the repository state" requirement is extremely clear. We should see a script comment to that effect, so maintainers don't forget it and so users know what to expect.

parameters

From the iget_git_tag.sh $tag main call, I was expecting to see

tag="$1"
branch="$2"

But in the OP code it appears the second parameter is ignored, being overwritten with "sub" on recursive calls.

The anonymous \"$1\" is correct, but naming it $tag would be a boon to future maintainers.
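A minimal sketch of what the top of the inner script could look like with the contract written down and the positional parameters named. The helper name `parse_args` and the variable name `role` are illustrative assumptions; since the second argument is "main" or "sub" in the OP code, "role" may describe it better than "branch".

```shell
#!/bin/bash
#
# Contract: this script must never modify the working tree, the real index,
# or HEAD. It may add objects and tags to the repository, nothing else.

# Hypothetical helper: give the positional parameters names up front.
parse_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: iget_git_tag.sh <tag> <main|sub>" >&2
        return 2
    fi
    tag="$1"    # tag name to create if the tree is dirty
    role="$2"   # "main" for the top-level call, "sub" for recursive calls
}

parse_args "dev-20250101-120000" "main"
printf 'tag=%s role=%s\n' "$tag" "$role"
```

The rest of the script would then refer to `$tag` instead of `$1`, including inside the `git submodule foreach` command string.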

output

I guess this script is supposed to print a tag on stdout and nothing on stderr, but it's not obvious it always does that and that behavior is not written down in any # comment.

It's especially important to document this so callers know what to expect. I can imagine that someone calls this dozens of times in a "boring" context, gets a single tag back, and assumes that's the expected behavior. And then is surprised when N tags on N lines come back, fails to loop through them properly, and does the Wrong Thing to some poor repo.
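One way a caller could protect itself is to insist on exactly one non-empty line. This is a sketch only; `require_single_tag` is a hypothetical helper, and the sample tag stands in for the script's real output.

```shell
# Hypothetical caller-side guard: accept the script's output only if it is
# exactly one non-empty line.
require_single_tag() {
    output="$1"
    # grep -c '' counts lines, including a final line without a newline.
    line_count=$(printf '%s\n' "$output" | grep -c '')
    if [ -z "$output" ] || [ "$line_count" -ne 1 ]; then
        echo "expected exactly one tag, got $line_count line(s)" >&2
        return 1
    fi
    printf '%s\n' "$output"
}

# In real use the argument would be "$(bash get_git_tag.sh)".
build_tag=$(require_single_tag "dev-20250101-120000")
printf 'building %s\n' "$build_tag"
```

A guard like this turns an undocumented assumption into a loud failure instead of a silently wrong build label.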

Kudos on the comments you did include; each one was helpful.

env vars

    export GIT_INDEX_FILE="$tmp_git_index"
    ...
    git submodule foreach --quiet "GIT_INDEX_FILE=$GIT_INDEX_FILE ...

Given that we've already exported it in the parent, I'm a little surprised that we would need to keep transmitting it to children and descendants.
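Plain process inheritance is easy to confirm, as in the sketch below, where `sh -c` stands in for any child command and the path is a made-up placeholder. It does not show what `git submodule foreach` does to the variable before running its command, which may be why the original repeats the assignment; that part would need a test in a real repository with a submodule.

```shell
# Demo only: a throwaway path stands in for the temporary index file.
export GIT_INDEX_FILE="/tmp/example-index"

# A child process sees the exported value without it being passed again.
child_view=$(sh -c 'printf "%s" "$GIT_INDEX_FILE"')

# A grandchild sees it too.
grandchild_view=$(sh -c 'sh -c "printf %s \"\$GIT_INDEX_FILE\""')

printf 'child: %s\ngrandchild: %s\n' "$child_view" "$grandchild_view"

# Clean up so later git commands in this shell use the real index again.
unset GIT_INDEX_FILE
```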

epoch

GIT_AUTHOR_DATE="1.1.1970 0:0:0"

I'm a little surprised that Linus accepts m.d.Y, but ok, you learn something every day.

Consider adopting the more conventional spelling of 00:00:00 for midnight.

Consider assigning that to a variable:
GIT_AUTHOR_DATE="$epoch" GIT_COMMITTER_DATE="$epoch" git commit-tree ...

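I understand the desire to scope down where those dates appear. Consider doing the exports inside a subshell. The $( cmds ) command substitution already creates one, so the exports can live inside it while the parent still captures the output:

    build_sha=$(
        export GIT_AUTHOR_DATE="$epoch"
        export GIT_COMMITTER_DATE="$epoch"
        git commit-tree "$(git write-tree)" -p HEAD -m "Build commit"
    )

After the closing ), the child shell disappears, along with both env var settings, while build_sha stays set in the parent. (Assigning build_sha inside a plain ( cmds ) subshell would lose that value too.)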
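The scoping behaviour can be checked without git. In this minimal sketch, `sh -c` stands in for `git commit-tree` and simply echoes the dates it was given; `result` and `leaked` are illustrative names.

```shell
# Start clean so the demo does not depend on the caller's environment.
unset GIT_AUTHOR_DATE GIT_COMMITTER_DATE

epoch="1.1.1970 00:00:00"

# The exports happen inside the command substitution's subshell.
# `sh -c` stands in for `git commit-tree`; it just echoes what it was given.
result=$(
    export GIT_AUTHOR_DATE="$epoch"
    export GIT_COMMITTER_DATE="$epoch"
    sh -c 'printf "%s|%s" "$GIT_AUTHOR_DATE" "$GIT_COMMITTER_DATE"'
)

# Back in the parent shell neither variable is set.
leaked="${GIT_AUTHOR_DATE:-}${GIT_COMMITTER_DATE:-}"

printf 'child saw: %s\n' "$result"
printf 'leaked into parent: %s\n' "${leaked:-nothing}"
```

The captured output survives in `result`, while the exported dates do not escape into the parent shell.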

magic number

This is pretty terrible: ... update-index --cacheinfo 160000,...

Give us a clue what that mode means. The documentation does a poor job of that, so this code really ought to. And assign it a symbolic name, perhaps $submodule_mode.

While you're at it, the clunky \"$(realpath "$0")\" deserves a symbolic name.

And it's unclear where $sm_path was defined. Give us # a hint. The submodule docs do explain about

access to the variables $name, $sm_path, $displaypath, $sha1 and $toplevel

Rather than iterate with git submodule foreach, consider using a bash for loop. Then we wouldn't need to overload a single cryptic source line with so many details, as they could be spread over several lines.
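As a rough sketch of that restructuring: the mode gets a name, the `--cacheinfo` argument is built in one small function, and the loop body is spread over several lines. The names `gitlink_mode`, `cacheinfo_for_submodule` and `record_submodules` are assumptions. The loop records each submodule's current HEAD for simplicity, whereas the original records the commit produced by the recursive call.

```shell
# Mode 160000 marks a "gitlink": a tree entry that points at a commit in
# another repository, which is how git records a submodule.
gitlink_mode=160000

# Build the <mode>,<object>,<path> argument for `git update-index --cacheinfo`.
cacheinfo_for_submodule() {
    sm_commit="$1"
    sm_path="$2"
    printf '%s,%s,%s\n' "$gitlink_mode" "$sm_commit" "$sm_path"
}

# Record every direct submodule's commit in the index named by GIT_INDEX_FILE.
# Simplification: uses the submodule's HEAD, not a freshly built commit.
record_submodules() {
    # $sm_path inside the single quotes is set by `git submodule foreach`.
    git submodule foreach --quiet 'printf "%s\n" "$sm_path"' |
    while IFS= read -r sm_path; do
        sm_commit=$(git -C "$sm_path" rev-parse HEAD)
        git update-index --cacheinfo \
            "$(cacheinfo_for_submodule "$sm_commit" "$sm_path")"
    done
}

cacheinfo_for_submodule "0123abc" "libs/example"
```

Because the loop runs in the parent's shell, the exported `GIT_INDEX_FILE` applies to `git update-index` directly and the quoting becomes much lighter.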

submodules

did quite a few tests of different scenarios, ...

Does the code properly handle nested submodules?

git is complex. You really need to have some automated tests involving submodules if that's part of the usage you wish to support. For example, I have no idea why you didn't ask foreach to visit all the submodules in a --recursive manner.

fundamental issues with this approach?

Generally it is complex, hard to test, and does not instill confidence in its correctness, given git's many corner cases.

It seems to need N+1 tags when N submodules are used?

Here's what would give me far greater confidence in the implementation:

Produce a tag (set of tags?) for current repo state.
Verify that git diff $tag produces zero diffs.
Report success status.

That way even with lots of complex stuff leading up to it there's a simple guarantee on the result, which is easily checked.
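The verification step could be as small as the following sketch. `verify_tag_matches_worktree` is a hypothetical name; it relies on `git diff --quiet`, which exits non-zero when differences exist.

```shell
# Hypothetical post-check: succeed only if the tracked files in the working
# tree are identical to the tree recorded under the given tag.
verify_tag_matches_worktree() {
    tag="$1"
    if git diff --quiet "$tag" --; then
        echo "ok: working tree matches $tag"
    else
        echo "MISMATCH: working tree differs from $tag" >&2
        return 1
    fi
}
```

It would be called right after tagging, for example `verify_tag_matches_worktree "$tag" || exit 1`. Note that `git diff` does not report untracked files, so a state containing untracked files deserves its own test case.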

internal audit trail

The "do not change git's state" requirement seems a little odd, given that git is a great audit engine, and your "reproduce the build conditions" is all about auditing what happened.

A natural solution would be git commit -a -m "$tag", but I suppose you don't want to mix prose messages and tag messages in git log output. Note that you could verify that there are unpushed commits and then do something like git commit -a --amend --no-edit.

It's not a good fit for the main branch the OP code mentions, but on a feature branch you might produce a mix of manual and automated commit messages. And then git merge --squash tidies up that history.

Or you might have an integration build branch, similar to main, which has a "dirty" history full of automated tag names in the commit log.

external audit trail

The span of your audit requirement is probably just a few weeks, since after that a feature will have been merged down (or abandoned). Consider having your tool append audit records to a file that git does not manage. Perhaps a level up, in the parent folder containing the repo, or perhaps in a .gitignore'd file or folder. You could dump info from git log, git diff, and so on.
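A minimal sketch of such an audit record, assuming it is run from inside the repository. The function name `append_audit_record` and the record layout are illustrative choices, not a fixed format.

```shell
# Hypothetical helper: append one audit record to a file that git does not
# track. Run it from inside the repository.
append_audit_record() {
    audit_file="$1"
    {
        printf '=== build at %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        git log -1 --format='commit %H %s'   # which commit the build sits on
        git status --porcelain               # which files were dirty
        git diff HEAD                        # what the tracked changes were
    } >> "$audit_file"
}
```

A call such as `append_audit_record ../build-audit.log` keeps the record one level above the repository. The content of untracked files is not captured by `git diff HEAD`, so they only appear by name in the status listing.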
