Skip to content

fix(ado): resolve az binary via shutil.which for Windows az.cmd (closes #1430)#1432

Merged
danielmeppiel merged 3 commits into
mainfrom
danielmeppiel/fix-ado-az-cmd-windows-1430
May 21, 2026
Merged

fix(ado): resolve az binary via shutil.which for Windows az.cmd (closes #1430)#1432
danielmeppiel merged 3 commits into
mainfrom
danielmeppiel/fix-ado-az-cmd-windows-1430

Conversation

@danielmeppiel

Copy link
Copy Markdown
Collaborator

TL;DR

apm update against private Azure DevOps deps fails on Windows with Authentication failed for dev.azure.com and the diagnostic "az present but not logged in" — even when the user IS signed in via az login. The root cause is that subprocess.run(["az", ...]) calls CreateProcessW, which does not honor PATHEXT for non-.exe executables, so the Windows az.cmd wrapper cannot be invoked despite shutil.which("az") resolving it. This PR resolves az via shutil.which once at construction and widens the ADO preflight carve-out so Git Credential Manager survives as a fallback. Closes #1430.

Problem (WHY)

Important

The user is shown a remediation message instructing them to run az login — the exact action they have already taken. Worst-possible-failure-mode for diagnostics.

The bug surfaces only on apm update, not apm install, because the preflight is gated on update_refs. The Windows-only cascade:

  1. AzureCliBearerProvider.is_available() returns True (shutil.which honors PATHEXT).
  2. Every subprocess.run([self._az_command, ...]) raises FileNotFoundError (CreateProcessW does not honor PATHEXT).
  3. _run_get_access_token wraps the OSError as AzureCliBearerError(kind="subprocess_error").
  4. _resolve_token in core/auth.py swallows it -> (None, "none", "basic").
  5. _preflight_auth_check builds a probe env that strips GIT_CONFIG_GLOBAL / GIT_CONFIG_NOSYSTEM, killing Git Credential Manager — the only remaining channel that could have rescued the probe with an Entra-cached credential.
  6. git ls-remote returns 401. build_error_context re-hits the same broken subprocess in get_current_tenant_id(), swallows the failure, and renders Case 3: "az present but not logged in".

PROSE anchor: "Grounding outputs in deterministic tool execution transforms probabilistic generation into verifiable action." The fix replaces a probabilistic PATH lookup (left to CreateProcessW) with a deterministic resolution (shutil.which).

Approach (WHAT)

Two layers, both required:

Layer Where What
Root cause src/apm_cli/core/azure_cli.py Resolve az via shutil.which once in __init__; pass the absolute path to every subprocess.run.
Defense in depth src/apm_cli/install/pipeline.py Widen the existing generic-host carve-out so ADO preflight probes also keep the user's ~/.gitconfig available, preserving Git Credential Manager as a fallback channel.

Implementation (HOW)

src/apm_cli/core/azure_cli.py__init__ resolves az_command via shutil.which once and stores either the absolute path or None. Absolute paths supplied by callers (test injection) are trusted verbatim; everything else flows through shutil.which so a CWD-relative token like subdir/az cannot bypass resolution. is_available() becomes state-only (no per-call shutil.which), so it can no longer disagree with get_bearer_token()'s pre-check. get_current_tenant_id() gains an explicit early-return when _az_command is None to avoid passing None as argv[0] (previously masked by a broad except).

src/apm_cli/install/pipeline.py — the carve-out at line 160 changes from if is_generic: to if is_generic or is_azure_devops_hostname(host):. _dl.git_env (built by GitAuthEnvBuilder.setup_environment) still forces an empty global gitconfig for the actual clone path; only the single ls-remote probe leg gains GCM access. The bearer-fallback _bearer_op builds its own clean env via _build_git_env(scheme="bearer") and is unaffected.

Diagrams

Cascade before and after the fix — every step is a real call site in the diff.

flowchart LR
    subgraph Before["Before (Windows, az.cmd installed, user logged in)"]
        A1["is_available()<br/>shutil.which → True"] --> B1["subprocess.run(['az', ...])<br/>CreateProcessW: FileNotFoundError"]
        B1 --> C1["AzureCliBearerError<br/>kind=subprocess_error"]
        C1 --> D1["_resolve_token<br/>→ (None, 'none', 'basic')"]
        D1 --> E1["preflight strips<br/>GIT_CONFIG_GLOBAL"]
        E1 --> F1["GCM disabled<br/>ls-remote 401"]
        F1 --> G1["Case 3:<br/>'az present, not logged in'"]
    end
    subgraph After["After"]
        A2["__init__<br/>shutil.which → C:\\...\\az.cmd"] --> B2["subprocess.run(['C:\\...\\az.cmd', ...])<br/>CreateProcessW OK"]
        B2 --> C2["bearer token<br/>preflight passes"]
    end
    style G1 stroke:#d00,stroke-width:2px
    style C2 stroke:#070,stroke-width:2px
Loading

Trade-offs

  • is_available() is now construction-time only. PATH changes mid-process won't be observed; restart the CLI if you install az after APM is running. Acceptable: tokens are cached anyway and the provider is a process-wide singleton.
  • ADO preflight now consults ~/.gitconfig. A user's insteadOf rewrite could redirect the ls-remote probe to a different host. Mitigations: the probe is read-only; the actual clone path retains full gitconfig isolation; "attacker controls ~/.gitconfig" implies the attacker already owns the git client. Generic hosts already had this carve-out ([BUG] apm install --update fails for GHES/generic hosts due to pipeline probe blocking credential helpers #1082); the security argument applies equally to ADO.
  • Unconditional carve-out vs feature flag. Considered an opt-in env var; rejected. The surface is narrow (one read-only probe), GCM-for-ADO is the expected corporate setup, and a flag adds maintenance for negligible upside.

Benefits

  1. Windows ADO users on the documented az login path can apm update again.
  2. The diagnostic no longer lies — is_available() and get_bearer_token() now agree.
  3. GCM remains a viable fallback when bearer acquisition fails for any reason (sandbox, proxy, future PATH quirks).
  4. Test injection contract preserved: callers can still pass an absolute path verbatim.

Validation

Note

CI lint mirror green; affected suites all pass locally.

$ uv run --extra dev ruff check src/ tests/ && uv run --extra dev ruff format --check src/ tests/ \
    && uv run --extra dev python -m pylint --disable=all --enable=R0801 --min-similarity-lines=10 \
       --fail-on=R0801 src/apm_cli/ && bash scripts/lint-auth-signals.sh
All checks passed!
1037 files already formatted
Your code has been rated at 10.00/10
[+] auth-signal lint clean

$ uv run --extra dev pytest tests/unit/test_azure_cli.py \
    tests/unit/install/test_pipeline_auth_preflight.py \
    tests/integration/test_ado_preflight_bearer_fallback_e2e.py -q
41 passed in 1.65s

Scenario evidence

User-promise scenario Test Principle
Windows az.cmd resolves to subprocess argv[0] tests/unit/test_azure_cli.py::TestWindowsAzCmdResolution::test_get_bearer_token_invokes_resolved_az_cmd_path Portability-by-manifest
Pre-fix FileNotFoundError shape no longer surfaces as subprocess_error tests/unit/test_azure_cli.py::TestWindowsAzCmdResolution::test_bare_az_would_raise_filenotfound_but_resolved_path_succeeds Determinism
is_available() is consistent with get_bearer_token() pre-check tests/unit/test_azure_cli.py::TestIsAvailable::test_is_available_stable_after_construction DevX (no misleading diagnostics)
ADO preflight no longer disables GCM tests/unit/install/test_pipeline_auth_preflight.py::TestPreflightGenericHostAllowsCredentialHelpers::test_ado_host_omits_credential_blocking_env Auth-fallback contract
Existing ADO PAT→bearer fallback still works tests/integration/test_ado_preflight_bearer_fallback_e2e.py (unchanged, passes) Regression-trap
get_current_tenant_id short-circuits cleanly when az missing tests/unit/test_azure_cli.py::TestWindowsAzCmdResolution::test_get_current_tenant_id_returns_none_without_subprocess_when_az_missing Explicit > swallowed exception

How to test

  1. On Windows with az.cmd installed and az login completed: clone a repo with a private ADO dep in apm.yml, run apm update. Pre-fix: fails with "az present but not logged in". Post-fix: succeeds.
  2. On Windows without az installed but with GCM configured for dev.azure.com: run apm update. Post-fix: GCM resolves the credential and the probe passes.
  3. On Linux/macOS: run the focused suite — uv run --extra dev pytest tests/unit/test_azure_cli.py tests/unit/install/test_pipeline_auth_preflight.py tests/integration/test_ado_preflight_bearer_fallback_e2e.py.
  4. Lint mirror — uv run --extra dev ruff check src/ tests/ && uv run --extra dev ruff format --check src/ tests/.

Closes #1430.

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

danielmeppiel and others added 3 commits May 21, 2026 16:47
#1430)

On Windows the Azure CLI ships as az.cmd (a batch wrapper). Python's
subprocess.run -> CreateProcessW does NOT honor PATHEXT for non-.exe
executables, so a bare "az" token cannot be resolved and raises
FileNotFoundError. shutil.which("az") DOES honor PATHEXT and finds
az.cmd, so AzureCliBearerProvider.is_available() returned True while
every actual subprocess call failed. The error then cascaded through
the ADO --update preflight, killing Git Credential Manager along the
way, and surfaced as the misleading 'az present but not logged in'
diagnostic -- even when the user WAS logged in via az login.

Layer 1 (root cause): resolve the az executable once in __init__ via
shutil.which and store the absolute path. Use it in every subprocess
call (_run_get_access_token, get_current_tenant_id). is_available()
now mirrors the same construction-time resolution so it can no longer
disagree with get_bearer_token()'s pre-check. Explicit absolute /
separator-bearing paths from callers (tests) are trusted verbatim.

Layer 2 (defense in depth): pipeline.py _preflight_auth_check now
strips GIT_CONFIG_GLOBAL / GIT_CONFIG_NOSYSTEM / GIT_ASKPASS from the
probe env for ADO hosts too (previously only generic). This lets Git
Credential Manager answer for Entra-cached ADO credentials when az
bearer acquisition is unavailable for any reason -- the #1430 PATH
quirk, sandbox, proxy, etc. The actual clone path remains isolated
(it builds its own clean env via GitAuthEnvBuilder.setup_environment),
so the security rationale for empty global gitconfig during clone is
preserved. Only the single ls-remote probe leg gains GCM access.

Tests:
- New TestWindowsAzCmdResolution class with the exact regression
  shape: shutil.which returns a Windows az.cmd path; verify subprocess
  receives it as argv[0] for both get_bearer_token and tenant probe.
- New is_available stability test (resolution is one-shot).
- Init absolute-path bypass test (preserves test injection contract).
- Existing test_ado_host_retains_credential_blocking_env renamed and
  inverted to assert the new contract: ADO preflight no longer kills
  GCM. Existing ADO bearer-fallback E2E unchanged (the fake-git path
  has no GCM installed so behavior is unaffected).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three nits surfaced by auth-expert + supply-chain-security panel:

1. Tighten the explicit-path bypass in AzureCliBearerProvider.__init__
   from 'isabs OR contains os.sep' to 'isabs only'. The os.sep heuristic
   accepted relative-with-separator tokens like 'subdir/az' verbatim,
   handing a CWD-relative path to subprocess.run without going through
   shutil.which. Only absolute paths are now trusted verbatim; everything
   else flows through resolution. New test
   test_init_relative_with_separator_still_resolves locks the contract.

2. Add an explicit early-return in get_current_tenant_id when
   _az_command is None. Previously this case relied on the broad
   'except Exception' to swallow a TypeError from passing None as
   argv[0]. The new guard mirrors get_bearer_token's is_available()
   pre-check. New test
   test_get_current_tenant_id_returns_none_without_subprocess_when_az_missing
   asserts subprocess.run is never invoked in that path.

3. Add an explicit FileNotFoundError regression-trap test
   (test_bare_az_would_raise_filenotfound_but_resolved_path_succeeds)
   that simulates the exact Windows CreateProcessW behaviour: argv[0]
   == 'az' raises FileNotFoundError, argv[0] == resolved .cmd path
   succeeds. Pins both halves of the #1430 contract in a single test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 21, 2026 14:55
@danielmeppiel danielmeppiel merged commit 66a9a11 into main May 21, 2026
18 checks passed
@danielmeppiel danielmeppiel deleted the danielmeppiel/fix-ado-az-cmd-windows-1430 branch May 21, 2026 14:59

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a Windows-specific Azure DevOps authentication failure during apm update by making Azure CLI invocation deterministic (resolving az via shutil.which so az.cmd works) and by relaxing the ADO preflight probe environment to preserve Git Credential Manager as a fallback channel.

Changes:

  • Resolve and store the az executable path at AzureCliBearerProvider construction time and reuse it for all subprocess calls.
  • Expand the preflight probe environment carve-out to include Azure DevOps hosts so credential helpers (notably GCM) can participate in the ls-remote probe.
  • Add/adjust unit tests covering Windows az.cmd resolution behavior and the updated preflight environment expectations; add a changelog entry.
Show a summary per file
File Description
src/apm_cli/core/azure_cli.py Resolves az once at construction and uses the resolved path for subprocess calls; makes is_available() state-based and adds a guard in get_current_tenant_id().
src/apm_cli/install/pipeline.py Widens the preflight env carve-out to also apply to Azure DevOps hosts to preserve credential helper behavior for the probe.
tests/unit/test_azure_cli.py Adds regression tests for Windows az.cmd resolution and stability of is_available() after construction.
tests/unit/install/test_pipeline_auth_preflight.py Updates/renames the ADO preflight env test to assert credential-blocking vars are omitted.
CHANGELOG.md Adds an Unreleased Fixed entry documenting the behavior change.

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 3

Comment thread CHANGELOG.md

### Fixed

- `apm update` against private Azure DevOps deps no longer fails on Windows with a misleading "az present but not logged in" diagnostic when the user IS signed in via `az login`. Root cause: Python's `subprocess.run(["az", ...])` -> `CreateProcessW` does not honor `PATHEXT` for non-`.exe` executables, so the Windows `az.cmd` wrapper could not be invoked even though `shutil.which("az")` resolved it. `AzureCliBearerProvider` now resolves the `az` binary via `shutil.which` once at construction and passes the absolute path to every subprocess call. As a defense-in-depth measure, the ADO `--update` preflight probe no longer strips `GIT_CONFIG_GLOBAL` / `GIT_CONFIG_NOSYSTEM` / `GIT_ASKPASS`, so Git Credential Manager can answer for Entra-cached ADO credentials whenever bearer acquisition is unavailable for any reason (sandbox, proxy, future PATH quirks). The actual clone path keeps its full gitconfig isolation. (#1430)
Comment on lines 114 to +124
def is_available(self) -> bool:
"""Return True iff the ``az`` binary is on PATH.
"""Return True iff the ``az`` binary was found on PATH at construction.

Resolution is one-shot (performed in :meth:`__init__`). PATH changes
made after the provider was constructed will NOT be observed; restart
the CLI if you have installed ``az`` mid-session.

Does NOT check whether the user is logged in -- that requires a
subprocess call and is deferred to :meth:`get_bearer_token`.
"""
return shutil.which(self._az_command) is not None
return self._az_command is not None
Comment on lines +160 to 176
# GIT_CONFIG_GLOBAL / GIT_CONFIG_NOSYSTEM carve-out: GitAuthEnvBuilder
# forces an empty global gitconfig for ALL hosts to prevent a user's
# ~/.gitconfig insteadOf rewrites or credential helpers from leaking
# tokens during a clone. But for preflight probes (a single ls-remote
# against the same host the dep targets), the redirection surface is
# nil and killing the user's global config kills Git Credential
# Manager along with it -- the helper most Windows ADO users rely on
# for Entra-cached credentials. For ADO specifically that matters
# because bearer acquisition can fail for reasons unrelated to login
# state (sandbox, proxy, microsoft/apm#1430-style PATH quirks), and
# GCM is the only remaining channel that can save us. Generic hosts
# have the same logic; widening the carve-out to ADO keeps the
# actual clone path isolated (it builds its own clean env) while
# giving the preflight probe the best chance to succeed.
if is_generic or is_azure_devops_hostname(host):
for _key in ("GIT_CONFIG_GLOBAL", "GIT_CONFIG_NOSYSTEM", "GIT_ASKPASS"):
probe_env.pop(_key, None)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] apm update ADO bearer auth fails on Windows: subprocess.run can't resolve az.cmd

2 participants