Skip to content

Conversation

@cpjet64
Copy link
Contributor

@cpjet64 cpjet64 commented Oct 24, 2025

Preserve CRLF/LF newline consistency on Windows edits

Problem & intent

Edits on Windows sometimes converted CRLF files to LF or produced mixed endings, creating noisy diffs and IDE warnings. This PR ensures newline preservation during file writes.


Before / After

Before
The text writer emitted LF regardless of a file’s original newline convention.
CRLF files could flip to LF or end up mixed.

After

  • Writer preserves the original file’s EOL.
  • If the on‑disk file used CRLF → edits keep CRLF.
  • If LF → edits keep LF.
  • New files follow the existing global default (unchanged).

Scope of change

codex-rs/apply-patch/src/lib.rs

  • Detect original EOL from file contents.
  • Normalize final text to match at write point.
  • Introduce tiny helper functions: Eol, detect_eol, normalize_to_eol.
  • Single call at the final text write site.

No changes to binary paths, new-file adds, TUI paste logic, or PATH/PATHEXT/env behavior.


Security effects

None. Only affects how edited text is serialized to disk.


Testing evidence

Static

  • cargo fmt
  • cargo clippy -p codex-apply-patch -- -D warningsclean
  • cargo test -p codex-apply-patch → added CRLF/LF preservation tests pass; full suite green.

Manual

  • CRLF originals remain CRLF after edits.
  • LF originals remain LF.

Notes

  • Diff/preview logic remains LF to avoid snapshot churn; preservation occurs only at final write.
  • New files keep default EOL behavior by design.

Checklist

@dylan-hurd-oai dylan-hurd-oai self-requested a review October 28, 2025 05:34
Copy link
Collaborator

@dylan-hurd-oai dylan-hurd-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cpjet64 thanks for digging into this! Really appreciate the investigation here. My primary concern is whether this change will be sufficiently deterministic.

If you're interested in implementing follow-ups to the review, we're happy to continue the discussion as you iterate, or we can just build off the work you've contributed so far. But we're eager to land a fix here to improve the Windows experience!

Crlf,
}

fn detect_eol(s: &str) -> Eol {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will sufficiently cover all cases, e.g. how do we expect to handle newly created files in a repo that uses CRLF?

We might want to consider a more deterministic approach for detect_eol. We could check git configuration.

Curious if you have thoughts on adding logic to collect_git_info and passing using a cli flag to inform this setting instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylan-hurd-oai Thanks for the review. I kept this PR intentionally small to limit the review surface. If you’re okay with a little more LOC I’ll add a follow-up commit on this same branch (fix/win-crlf-preserve) to make new-file EOL behavior deterministic while leaving edits to existing files unchanged.

Proposal (follow-up commit)

Behavior

  • Existing files: unchanged — detect EOL from on-disk contents and preserve (fast, backward compatible).

  • New files (deterministic precedence):

    1. .gitattributes for the path (use git check-attr eol -- <path> where available)
    2. git config (core.eol then core.autocrlf)
    3. CLI/env override: --assume-eol=(git|crlf|lf|detect) or APPLY_PATCH_ASSUME_EOL
    4. Fallback to the current default behavior (what this writer already uses)

Wiring

  • Add a decide_eol() helper that implements the precedence and call it at the final write site before normalizing text.
  • Add a --assume-eol CLI flag + optional env var for explicit overrides.
  • Keep diff/preview logic unchanged (previews remain LF); normalization only affects final disk writes.

Common edge cases & how this follow-up handles them

  1. Repo not a git repo / git unavailable

    • Fall back to existing behavior. git is best-effort; failures are non-fatal and logged at debug level.
  2. .gitattributes pattern matching complexity

    • Prefer git check-attr to avoid reimplementing matching. If git is missing, fall back to a best-effort .gitattributes parse (documented).
  3. core.autocrlf semantics / cross-machine variance

    • Precedence: .gitattributes > core.eol > core.autocrlf.
    • Interpret core.autocrlf=true → CRLF, input → LF, false → unknown. .gitattributes is the repo-level guarantee.
  4. Binary files / -text attribute

    • If git check-attr reports binary or -text, skip normalization — don’t rewrite binaries.
  5. Mixed line endings in existing files

    • Preserve current behavior: detect dominant EOL and preserve. No auto-fixing of mixed files.
  6. Large files / performance

    • We only read content for existing files (current behavior). A streaming heuristic can be added later if needed.
  7. Paths with spaces / special characters

    • Use std::process::Command (no shell) and pass path args directly to git to avoid escaping issues.
  8. Files outside repo / submodules / worktrees

    • If file sits outside the repo root passed to the writer, skip repo checks and fall back. Submodule/worktree behavior depends on repo root provided; failures fall back safely.
  9. CI vs local dev differences

    • .gitattributes is authoritative for CI; that’s why it’s first in precedence.
  10. Old git without check-attr

    • Fall back to parsing .gitattributes if git check-attr isn’t available; document this as weaker behavior.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough proposal and consideration here! At first glance this feels reasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also considered adding a TUI slash command to auto-detect CRLF/LF, force CRLF, or force LF, but it would have been a much larger addition. I think most users would never use it, so it risked becoming bloat.

Context:
* Windows edits sometimes flipped CRLF files to LF or created mixed endings.

Change:
* Detect the original file EOL (treat any \r\n as CRLF; else LF) and normalize the final text to match only for Update File writes.
* Added small helper (detect_eol/normalize_to_eol) and a single call at the final write site.
* Added focused tests to verify CRLF/LF preservation.

Risk:
* Minimal and localized. Only affects the text update write path; binary paths and new-file adds are unchanged.

Tests:
* cargo fmt
* cargo clippy -p codex-apply-patch -- -D warnings (clean)
* cargo test -p codex-apply-patch (all tests passed; includes new EOL tests)

Verification:
* Verified on local runs that updating CRLF files preserves CRLF and LF files preserve LF. New-file behavior remains unchanged.
@cpjet64 cpjet64 force-pushed the fix/win-crlf-preserve branch from 1822f8e to 0c56804 Compare October 28, 2025 14:40
Copy link
Contributor Author

@cpjet64 cpjet64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylan-hurd-oai Commit pushed. Please let me know if you have any questions or want any changes! Super happy to help resolve any and all Windows issues!

Copy link
Collaborator

@dylan-hurd-oai dylan-hurd-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like the direction here! 1 strong opinion on the precedence, otherwise mostly structural requestsx

AssumeEol::Lf => return Eol::Lf,
AssumeEol::Crlf => return Eol::Crlf,
_ => {}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO cli args should take precedence, even for existing files. If the client of this command explicitly sets their preferene, we should consistently adhere to it. e.g. if we wanted to use this option in core, we would also want it to flow through and ignore the split on existing files.

// Existing files: detect from on-disk bytes
if !is_new_file {
if let Ok(bytes) = std::fs::read(path) {
let e = detect_eol_from_bytes(&bytes);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this fallback should just be based on the patch contents here (i.e. use the previous logic). Re-reading the entire file is going to be too much of a performance hit, and if a file is already mixed then adding more based on the line endings near the edit seems reasonable / equivalently deterministic.

let _argv0 = args.next();
// Allow optional flags, then a single positional PATCH argument; otherwise read from stdin.
let args = std::env::args().skip(1).collect::<Vec<String>>();
let mut patch_arg: Option<String> = None;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have more than 1 arg, I would prefer to switch to using clap for arg parsing here

.arg("check-attr")
.arg("eol")
.arg("text")
.arg("binary")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function call here suggests it's generic, but it parses out specific attributes for line endings. Can we make either move this into git_check_attr_eol to simplify, or make this logic generic?


fn os_native_eol() -> Eol {
if cfg!(windows) { Eol::Crlf } else { Eol::Lf }
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib.rs has now crossed 2000 lines. We should have done this already, but can we start splitting this up? Let's move the eol functions into their own eol mod, please

_ => {}
}

// 2) .gitattributes, core.eol, core.autocrlf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documenting that I'm a bit conflicted about whether this config or patch contents should take precedence for existing files. I think the performance of not executing git commands on every patch is likely worth it, but given how core apply_patch is I'd like to be exceedingly thoughtful here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dylan-hurd-oai Would you have a issue with some simple caching? Heres what I am thinking:

Precedence and rationale:
For existing files, CLI/env wins; otherwise we use the original file’s EOL from the patch/original buffer (no git), which preserves intent and avoids I/O.
For new files, CLI/env → .gitattributes (path‑specific) → core.eol → core.autocrlf → OS native → detect; this matches repo policy and CI while staying deterministic.

Performance:
Hot path for existing files does zero git calls and no extra file re‑reads (we infer from bytes already in memory or the hunk’s adjacent lines).
New‑file policy does at most two cheap git queries per path (attrs + config), memoized per run.

Edge handling:
Mixed EOL in existing files keeps the dominant EOL (we don’t auto‑“fix” mixed files); binaries or -text skip normalization; trailing newline presence is preserved exactly; previews stay LF‑normalized—only the final write normalizes.

Caching safety:
Per‑process, in‑memory caches keyed by (repo_root, rel_path) with invalidation if the patch touches any .gitattributes; on any git error we fall back by precedence (no risk of stale/corrupt state).

If this works I can easily roll it in.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this cache approach is overkill for how apply_patch is used. The primary codex harness starts a new apply_patch process for every call, so this cache will be restarted every time. I agree that we want an efficient approach here - i.e. we should perform the minimum number of git calls - but I think we can do so with a simpler approach that simply loads the config for all files to be modified at the start of the process.

…ted caching

Context:
* Address Dylan’s review on EOL handling: extract logic into eol.rs, add a clear CLI/env override, and ensure precedence covers .gitattributes, core.eol, core.autocrlf, and OS native. Preserve trailing newline presence exactly and skip normalization for binary/-text. Add an opt‑in caching layer to reduce repeated git calls per run.

Change:
* Extracted EOL types + helpers into src/eol.rs (detect, normalize, git helpers, OS native).
* Implemented precedence: CLI/env > .gitattributes (typed; binary/-text -> Unknown) > core.eol > core.autocrlf > OS native; plus detect‑from‑patch for new files in non‑git dirs or when --assume-eol=detect.
* Standalone binary now uses clap; new -E/--assume-eol {lf|crlf|git|detect}; CLI wins over env.
* Added process‑local, feature‑gated caches for git lookups (core.eol, core.autocrlf, check‑attr) with invalidation on .gitattributes Add/Update/Delete.
* Strengthened Windows handling: repo root canonicalization, forward‑slash rel paths, and resilient rel calculation; ensured attrs win absent overrides.
* Tests: added/updated unit + integration tests for detection, normalization, precedence, detect fallback, CLI/env interaction, Windows path behavior, and caching hit/miss + invalidation.

Risk:
* Low; behavior identical except where explicitly specified (detect fallback for new files). Caches are in‑memory and feature‑gated (default on). No sandbox or shared crates touched.

Tests:
* cargo test -p codex-apply-patch (with cache): 54 passed.
* cargo test -p codex-apply-patch --no-default-features (cache off): 52 passed.
* Workspace run shows an existing app-server user-agent test failure; unrelated to apply‑patch.

Verification:
* Ran cargo fmt, cargo clippy (scoped), and tests as above. Confirmed invalidation triggers on .gitattributes changes and that CLI ‘detect’ overrides repo policy as intended.
@cpjet64
Copy link
Contributor Author

cpjet64 commented Oct 28, 2025

@dylan-hurd-oai Commit pushed with requested changes. Please let me know if I am on the right track or if theres anything else you would prefer.

EOL Policy Implementation

  • Single source of truth: All EOL policy and git logic consolidated into eol.rs, centralizing decisions for easier testing and evolution.

  • Explicit, test-backed precedence:

    • CLI/env.gitattributes (typed; binary/-text → skip) → core.eolcore.autocrlf → OS native
    • Narrow detect-from-patch fallback only for new files (non-git or when explicitly requested)
  • Attribute parsing kept specific: .gitattributes handling is localized and only interprets eol/text/binary, per the "make it specific" guidance.

  • No re-reads for existing files: EOL is inferred from original in-memory bytes, preserving trailing newline presence exactly.

  • Binary/-text respected: When attributes indicate binary or -text, normalization is skipped entirely.

  • Windows resilience:

    • Normalize relative paths (forward slashes, lowercased on Windows)
    • Canonicalize repo roots
    • Ensure .gitattributes wins absent overrides
  • Performance without behavior change:

    • Cache is behind a compile-time feature gate for opt-in/out flexibility
    • Memory-only with scoped invalidation on .gitattributes changes
    • Does not alter precedence or outputs
    • Enabled by default for this branch; disable via --no-default-features to compare behavior/performance side-by-side
    • Highly recommend implementing the cache functionality as during my local testing on my large rappct project using cargo run it was a significant reduction in process time though it wasnt as noticeable on any of my small projects

@etraut-openai
Copy link
Collaborator

Looks like this PR also has a minor formatting issue to address.

@etraut-openai etraut-openai added the needs-response Additional information is requested label Nov 5, 2025
…heck

Context:
* rust-ci Format job runs cargo fmt --check on stable; grouped use statements failed under this config.

Change:
* Replace grouped imports in apply-patch/src/lib.rs with per-item imports at a few call sites.

Risk:
* None; no semantic changes.

Tests:
* cargo test -p codex-apply-patch: all tests pass locally.
* cargo clippy -p codex-apply-patch: clean.

Verification:
* cargo fmt --check passes locally in codex-rs.

(cherry picked from commit 0310b4c)
@cpjet64 cpjet64 force-pushed the fix/win-crlf-preserve branch from 57aac23 to 8e6d8bf Compare November 6, 2025 04:43
@cpjet64
Copy link
Contributor Author

cpjet64 commented Nov 6, 2025

Looks like this PR also has a minor formatting issue to address.

Applied the formatting and a additional fix. @etraut-openai

@etraut-openai
Copy link
Collaborator

Thanks. Looks like some of the tests are still failing.

cpjet64 and others added 11 commits November 6, 2025 10:01
…h across OSes; recheck attributes for AddFile when rel missing; serialize EOL cache tests
… paths; exact parsing; prefer attributes for AddFile; add parser/cross-platform tests
…o core.eol/core.autocrlf; normalize with chosen EOL

Context:
* New-file EOL selection on macOS CI was nondeterministic due to path normalization and system/global Git settings leaking into attribute checks.

Change:
* Add choose_eol_for_new_file(..) with precedence: explicit override (lf/crlf) > .gitattributes > core.eol > core.autocrlf > LF; for git/detect, fall back to repo policy.
* Normalize repo-relative keys for git check-attr; harden with -c core.attributesfile=(NUL|/dev/null) and env guards (GIT_ATTR_NOSYSTEM=1, GIT_CONFIG_NOSYSTEM=1, GIT_CONFIG_GLOBAL=(NUL|/dev/null)).
* Use chooser in add-file path and normalize content via normalize_to_eol_preserve_eof before write.
* Delegate decide_eol(new-file) to chooser for one policy path.
* Tests: add chooser unit tests; update gitattributes/core.eol expectations; remove masking override in CRLF attr test.

Risk:
* Scoped to codex-apply-patch new-file EOL behavior; existing-file handling unchanged. Behavior becomes deterministic across hosts.

Tests:
* Unit tests for gitattributes, core.eol, core.autocrlf, env override; CLI tests for add-file path.

Verification:
* cargo fmt
* cargo clippy -p codex-apply-patch --all-features -- -D warnings (clean)
* cargo nextest run -p codex-apply-patch --all-features --no-fail-fast: 67/67 passed.
@cpjet64
Copy link
Contributor Author

cpjet64 commented Nov 7, 2025

@etraut-openai Sorry for the delay. All tests seem to be passing now.

@etraut-openai etraut-openai removed the needs-response Additional information is requested label Nov 7, 2025
@etraut-openai
Copy link
Collaborator

@cpjet64, thanks for the update. Looking at the diff, it looks much larger than it should be. I'm guessing that it's because you've modified the line endings (LF vs CRLF) on some files. Please fix that. It will make the code review much easier.

@etraut-openai etraut-openai added needs-response Additional information is requested and removed needs-response Additional information is requested labels Nov 7, 2025
@etraut-openai
Copy link
Collaborator

@cpjet64, looking good. We'll probably wait until early next week to merge this because we're juggling a bunch of other changes right now and are trying to manage overall risk. Thanks again for this change. It's going to be a good quality-of-life improvement for many codex users.

@cpjet64
Copy link
Contributor Author

cpjet64 commented Nov 7, 2025

No worries/thank you! Sorry I haven't been able to get this completed yet today has been crazy for me. Just a heads up this PR does contain a lot of code but I did get hit with a EOL issue so from my initial look over I'd say about half the changed LOC is actual code and the rest is EOL stuff I need to go through and fix. I appreciate the patience!

Copy link
Collaborator

@dylan-hurd-oai dylan-hurd-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delayed review! Excited by the progress, but I think we have a couple things to address here:

  • I left a comment about the cache in the specific thread, but I think we should simplify the implementation here. Let's discuss there!
  • Agree with @etraut-openai about line ending artifacts, please address
  • I'm seeing several comment artifacts about code changes, please remove them
  • For checking existing files, I think the original approach of inferring from just the apply_patch original content is sufficient, rather than reading entire files.

// repo_root_from_cwd removed; we compute repo root relative to each path.

// legacy_decide_eol_do_not_use removed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this comment here

/// - CLI override Lf/Crlf wins
/// - CLI override Git => consult Git
/// - Otherwise consult .gitattributes → core.eol → core.autocrlf
/// - Fall back to OS native; detection from patch bytes should be handled by caller
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're repeating this comment a few times - we should specify it in one place, so it is more likely to stay up to date

}
// Only used by tests; no non-test variant needed.

// Note: git_core_autocrlf_* no longer used in production code; omitted outside tests.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments look like artifacts - please remove. If functions are only used by tests, they should be in the test module

_ => {}
}

// 2) .gitattributes, core.eol, core.autocrlf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this cache approach is overkill for how apply_patch is used. The primary codex harness starts a new apply_patch process for every call, so this cache will be restarted every time. I agree that we want an efficient approach here - i.e. we should perform the minimum number of git calls - but I think we can do so with a simpler approach that simply loads the config for all files to be modified at the start of the process.

echo '```';
git config --list --show-origin | sort || true;
echo '```';
} >> "$GITHUB_STEP_SUMMARY"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here explaining why the default git behavior is insufficient?

assert_eq!(fs::read_to_string(&absolute_path)?, "hello\n");
{
let s = fs::read_to_string(&absolute_path)?;
assert_eq!(s.replace("\r\n", "\n"), "hello\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is leaking our eol change to other parts of the codebase, and I don't think existing tests should have to change to continue passing

@cpjet64
Copy link
Contributor Author

cpjet64 commented Nov 11, 2025

Just checking in. I am working through the comments and changes as I get the time. Thank you for your input.

@etraut-openai
Copy link
Collaborator

Closing because of lack of response. If you want to continue to work on this, we can reopen.

@etraut-openai etraut-openai removed the needs-response Additional information is requested label Nov 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Patched files have mixed line endings on Windows

3 participants