-
Notifications
You must be signed in to change notification settings - Fork 6.8k
fix[apply-patch]: preserve original CRLF/LF line endings on text updates #5587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dylan-hurd-oai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cpjet64 thanks for digging into this! Really appreciate the investigation here. My primary concern is whether this change will be sufficiently deterministic.
If you're interested in implementing follow-ups to the review, we're happy to continue the discussion as you iterate, or we can just build off the work you've contributed so far. But we're eager to land a fix here to improve the Windows experience!
codex-rs/apply-patch/src/lib.rs
Outdated
| Crlf, | ||
| } | ||
|
|
||
| fn detect_eol(s: &str) -> Eol { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this will sufficiently cover all cases, e.g. how do we expect to handle newly created files in a repo that uses CRLF?
We might want to consider a more deterministic approach for detect_eol. We could check git configuration.
Curious if you have thoughts on adding logic to collect_git_info and passing using a cli flag to inform this setting instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dylan-hurd-oai Thanks for the review. I kept this PR intentionally small to limit the review surface. If you’re okay with a little more LOC I’ll add a follow-up commit on this same branch (fix/win-crlf-preserve) to make new-file EOL behavior deterministic while leaving edits to existing files unchanged.
Proposal (follow-up commit)
Behavior
-
Existing files: unchanged — detect EOL from on-disk contents and preserve (fast, backward compatible).
-
New files (deterministic precedence):
.gitattributesfor the path (usegit check-attr eol -- <path>where available)git config(core.eolthencore.autocrlf)- CLI/env override:
--assume-eol=(git|crlf|lf|detect)orAPPLY_PATCH_ASSUME_EOL - Fallback to the current default behavior (what this writer already uses)
Wiring
- Add a
decide_eol()helper that implements the precedence and call it at the final write site before normalizing text. - Add a
--assume-eolCLI flag + optional env var for explicit overrides. - Keep diff/preview logic unchanged (previews remain LF); normalization only affects final disk writes.
Common edge cases & how this follow-up handles them
-
Repo not a git repo /
gitunavailable- Fall back to existing behavior.
gitis best-effort; failures are non-fatal and logged at debug level.
- Fall back to existing behavior.
-
.gitattributespattern matching complexity- Prefer
git check-attrto avoid reimplementing matching. Ifgitis missing, fall back to a best-effort.gitattributesparse (documented).
- Prefer
-
core.autocrlfsemantics / cross-machine variance- Precedence:
.gitattributes>core.eol>core.autocrlf. - Interpret
core.autocrlf=true→ CRLF,input→ LF,false→ unknown..gitattributesis the repo-level guarantee.
- Precedence:
-
Binary files /
-textattribute- If
git check-attrreportsbinaryor-text, skip normalization — don’t rewrite binaries.
- If
-
Mixed line endings in existing files
- Preserve current behavior: detect dominant EOL and preserve. No auto-fixing of mixed files.
-
Large files / performance
- We only read content for existing files (current behavior). A streaming heuristic can be added later if needed.
-
Paths with spaces / special characters
- Use
std::process::Command(no shell) and pass path args directly togitto avoid escaping issues.
- Use
-
Files outside repo / submodules / worktrees
- If file sits outside the repo root passed to the writer, skip repo checks and fall back. Submodule/worktree behavior depends on repo root provided; failures fall back safely.
-
CI vs local dev differences
.gitattributesis authoritative for CI; that’s why it’s first in precedence.
-
Old
gitwithoutcheck-attr- Fall back to parsing
.gitattributesifgit check-attrisn’t available; document this as weaker behavior.
- Fall back to parsing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thorough proposal and consideration here! At first glance this feels reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also considered adding a TUI slash command to auto-detect CRLF/LF, force CRLF, or force LF, but it would have been a much larger addition. I think most users would never use it, so it risked becoming bloat.
Context: * Windows edits sometimes flipped CRLF files to LF or created mixed endings. Change: * Detect the original file EOL (treat any \r\n as CRLF; else LF) and normalize the final text to match only for Update File writes. * Added small helper (detect_eol/normalize_to_eol) and a single call at the final write site. * Added focused tests to verify CRLF/LF preservation. Risk: * Minimal and localized. Only affects the text update write path; binary paths and new-file adds are unchanged. Tests: * cargo fmt * cargo clippy -p codex-apply-patch -- -D warnings (clean) * cargo test -p codex-apply-patch (all tests passed; includes new EOL tests) Verification: * Verified on local runs that updating CRLF files preserves CRLF and LF files preserve LF. New-file behavior remains unchanged.
1822f8e to
0c56804
Compare
…it config > CLI; preserve EOF newline)
cpjet64
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dylan-hurd-oai Commit pushed. Please let me know if you have any questions or want any changes! Super happy to help resolve any and all Windows issues!
dylan-hurd-oai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like the direction here! 1 strong opinion on the precedence, otherwise mostly structural requestsx
codex-rs/apply-patch/src/lib.rs
Outdated
| AssumeEol::Lf => return Eol::Lf, | ||
| AssumeEol::Crlf => return Eol::Crlf, | ||
| _ => {} | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO cli args should take precedence, even for existing files. If the client of this command explicitly sets their preferene, we should consistently adhere to it. e.g. if we wanted to use this option in core, we would also want it to flow through and ignore the split on existing files.
codex-rs/apply-patch/src/lib.rs
Outdated
| // Existing files: detect from on-disk bytes | ||
| if !is_new_file { | ||
| if let Ok(bytes) = std::fs::read(path) { | ||
| let e = detect_eol_from_bytes(&bytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this fallback should just be based on the patch contents here (i.e. use the previous logic). Re-reading the entire file is going to be too much of a performance hit, and if a file is already mixed then adding more based on the line endings near the edit seems reasonable / equivalently deterministic.
| let _argv0 = args.next(); | ||
| // Allow optional flags, then a single positional PATCH argument; otherwise read from stdin. | ||
| let args = std::env::args().skip(1).collect::<Vec<String>>(); | ||
| let mut patch_arg: Option<String> = None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have more than 1 arg, I would prefer to switch to using clap for arg parsing here
codex-rs/apply-patch/src/lib.rs
Outdated
| .arg("check-attr") | ||
| .arg("eol") | ||
| .arg("text") | ||
| .arg("binary") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function call here suggests it's generic, but it parses out specific attributes for line endings. Can we make either move this into git_check_attr_eol to simplify, or make this logic generic?
codex-rs/apply-patch/src/lib.rs
Outdated
|
|
||
| fn os_native_eol() -> Eol { | ||
| if cfg!(windows) { Eol::Crlf } else { Eol::Lf } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib.rs has now crossed 2000 lines. We should have done this already, but can we start splitting this up? Let's move the eol functions into their own eol mod, please
codex-rs/apply-patch/src/lib.rs
Outdated
| _ => {} | ||
| } | ||
|
|
||
| // 2) .gitattributes, core.eol, core.autocrlf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documenting that I'm a bit conflicted about whether this config or patch contents should take precedence for existing files. I think the performance of not executing git commands on every patch is likely worth it, but given how core apply_patch is I'd like to be exceedingly thoughtful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dylan-hurd-oai Would you have a issue with some simple caching? Heres what I am thinking:
Precedence and rationale:
For existing files, CLI/env wins; otherwise we use the original file’s EOL from the patch/original buffer (no git), which preserves intent and avoids I/O.
For new files, CLI/env → .gitattributes (path‑specific) → core.eol → core.autocrlf → OS native → detect; this matches repo policy and CI while staying deterministic.
Performance:
Hot path for existing files does zero git calls and no extra file re‑reads (we infer from bytes already in memory or the hunk’s adjacent lines).
New‑file policy does at most two cheap git queries per path (attrs + config), memoized per run.
Edge handling:
Mixed EOL in existing files keeps the dominant EOL (we don’t auto‑“fix” mixed files); binaries or -text skip normalization; trailing newline presence is preserved exactly; previews stay LF‑normalized—only the final write normalizes.
Caching safety:
Per‑process, in‑memory caches keyed by (repo_root, rel_path) with invalidation if the patch touches any .gitattributes; on any git error we fall back by precedence (no risk of stale/corrupt state).
If this works I can easily roll it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this cache approach is overkill for how apply_patch is used. The primary codex harness starts a new apply_patch process for every call, so this cache will be restarted every time. I agree that we want an efficient approach here - i.e. we should perform the minimum number of git calls - but I think we can do so with a simpler approach that simply loads the config for all files to be modified at the start of the process.
…ted caching
Context:
* Address Dylan’s review on EOL handling: extract logic into eol.rs, add a clear CLI/env override, and ensure precedence covers .gitattributes, core.eol, core.autocrlf, and OS native. Preserve trailing newline presence exactly and skip normalization for binary/-text. Add an opt‑in caching layer to reduce repeated git calls per run.
Change:
* Extracted EOL types + helpers into src/eol.rs (detect, normalize, git helpers, OS native).
* Implemented precedence: CLI/env > .gitattributes (typed; binary/-text -> Unknown) > core.eol > core.autocrlf > OS native; plus detect‑from‑patch for new files in non‑git dirs or when --assume-eol=detect.
* Standalone binary now uses clap; new -E/--assume-eol {lf|crlf|git|detect}; CLI wins over env.
* Added process‑local, feature‑gated caches for git lookups (core.eol, core.autocrlf, check‑attr) with invalidation on .gitattributes Add/Update/Delete.
* Strengthened Windows handling: repo root canonicalization, forward‑slash rel paths, and resilient rel calculation; ensured attrs win absent overrides.
* Tests: added/updated unit + integration tests for detection, normalization, precedence, detect fallback, CLI/env interaction, Windows path behavior, and caching hit/miss + invalidation.
Risk:
* Low; behavior identical except where explicitly specified (detect fallback for new files). Caches are in‑memory and feature‑gated (default on). No sandbox or shared crates touched.
Tests:
* cargo test -p codex-apply-patch (with cache): 54 passed.
* cargo test -p codex-apply-patch --no-default-features (cache off): 52 passed.
* Workspace run shows an existing app-server user-agent test failure; unrelated to apply‑patch.
Verification:
* Ran cargo fmt, cargo clippy (scoped), and tests as above. Confirmed invalidation triggers on .gitattributes changes and that CLI ‘detect’ overrides repo policy as intended.
|
@dylan-hurd-oai Commit pushed with requested changes. Please let me know if I am on the right track or if theres anything else you would prefer. EOL Policy Implementation
|
|
Looks like this PR also has a minor formatting issue to address. |
…heck Context: * rust-ci Format job runs cargo fmt --check on stable; grouped use statements failed under this config. Change: * Replace grouped imports in apply-patch/src/lib.rs with per-item imports at a few call sites. Risk: * None; no semantic changes. Tests: * cargo test -p codex-apply-patch: all tests pass locally. * cargo clippy -p codex-apply-patch: clean. Verification: * cargo fmt --check passes locally in codex-rs. (cherry picked from commit 0310b4c)
…e CRLF/LF preservation on Windows
57aac23 to
8e6d8bf
Compare
Applied the formatting and a additional fix. @etraut-openai |
|
Thanks. Looks like some of the tests are still failing. |
…rktree EOL on modify
…h across OSes; recheck attributes for AddFile when rel missing; serialize EOL cache tests
… paths; exact parsing; prefer attributes for AddFile; add parser/cross-platform tests
…variant test; tighten .gitattributes lookup
…; keep new-file EOL CRLF/LF behavior stable
…o core.eol/core.autocrlf; normalize with chosen EOL Context: * New-file EOL selection on macOS CI was nondeterministic due to path normalization and system/global Git settings leaking into attribute checks. Change: * Add choose_eol_for_new_file(..) with precedence: explicit override (lf/crlf) > .gitattributes > core.eol > core.autocrlf > LF; for git/detect, fall back to repo policy. * Normalize repo-relative keys for git check-attr; harden with -c core.attributesfile=(NUL|/dev/null) and env guards (GIT_ATTR_NOSYSTEM=1, GIT_CONFIG_NOSYSTEM=1, GIT_CONFIG_GLOBAL=(NUL|/dev/null)). * Use chooser in add-file path and normalize content via normalize_to_eol_preserve_eof before write. * Delegate decide_eol(new-file) to chooser for one policy path. * Tests: add chooser unit tests; update gitattributes/core.eol expectations; remove masking override in CRLF attr test. Risk: * Scoped to codex-apply-patch new-file EOL behavior; existing-file handling unchanged. Behavior becomes deterministic across hosts. Tests: * Unit tests for gitattributes, core.eol, core.autocrlf, env override; CLI tests for add-file path. Verification: * cargo fmt * cargo clippy -p codex-apply-patch --all-features -- -D warnings (clean) * cargo nextest run -p codex-apply-patch --all-features --no-fail-fast: 67/67 passed.
…py (uninlined_format_args)
…; stabilize Windows ARM attrs
…ttr; unify /dev/null usage
…e exec approval test
|
@etraut-openai Sorry for the delay. All tests seem to be passing now. |
|
@cpjet64, thanks for the update. Looking at the diff, it looks much larger than it should be. I'm guessing that it's because you've modified the line endings (LF vs CRLF) on some files. Please fix that. It will make the code review much easier. |
|
@cpjet64, looking good. We'll probably wait until early next week to merge this because we're juggling a bunch of other changes right now and are trying to manage overall risk. Thanks again for this change. It's going to be a good quality-of-life improvement for many codex users. |
|
No worries/thank you! Sorry I haven't been able to get this completed yet today has been crazy for me. Just a heads up this PR does contain a lot of code but I did get hit with a EOL issue so from my initial look over I'd say about half the changed LOC is actual code and the rest is EOL stuff I need to go through and fix. I appreciate the patience! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delayed review! Excited by the progress, but I think we have a couple things to address here:
- I left a comment about the cache in the specific thread, but I think we should simplify the implementation here. Let's discuss there!
- Agree with @etraut-openai about line ending artifacts, please address
- I'm seeing several comment artifacts about code changes, please remove them
- For checking existing files, I think the original approach of inferring from just the apply_patch original content is sufficient, rather than reading entire files.
| // repo_root_from_cwd removed; we compute repo root relative to each path. | ||
|
|
||
| // legacy_decide_eol_do_not_use removed. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this comment here
| /// - CLI override Lf/Crlf wins | ||
| /// - CLI override Git => consult Git | ||
| /// - Otherwise consult .gitattributes → core.eol → core.autocrlf | ||
| /// - Fall back to OS native; detection from patch bytes should be handled by caller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're repeating this comment a few times - we should specify it in one place, so it is more likely to stay up to date
| } | ||
| // Only used by tests; no non-test variant needed. | ||
|
|
||
| // Note: git_core_autocrlf_* no longer used in production code; omitted outside tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments look like artifacts - please remove. If functions are only used by tests, they should be in the test module
codex-rs/apply-patch/src/lib.rs
Outdated
| _ => {} | ||
| } | ||
|
|
||
| // 2) .gitattributes, core.eol, core.autocrlf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this cache approach is overkill for how apply_patch is used. The primary codex harness starts a new apply_patch process for every call, so this cache will be restarted every time. I agree that we want an efficient approach here - i.e. we should perform the minimum number of git calls - but I think we can do so with a simpler approach that simply loads the config for all files to be modified at the start of the process.
| echo '```'; | ||
| git config --list --show-origin | sort || true; | ||
| echo '```'; | ||
| } >> "$GITHUB_STEP_SUMMARY" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here explaining why the default git behavior is insufficient?
| assert_eq!(fs::read_to_string(&absolute_path)?, "hello\n"); | ||
| { | ||
| let s = fs::read_to_string(&absolute_path)?; | ||
| assert_eq!(s.replace("\r\n", "\n"), "hello\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is leaking our eol change to other parts of the codebase, and I don't think existing tests should have to change to continue passing
|
Just checking in. I am working through the comments and changes as I get the time. Thank you for your input. |
|
Closing because of lack of response. If you want to continue to work on this, we can reopen. |
Preserve CRLF/LF newline consistency on Windows edits
Problem & intent
Edits on Windows sometimes converted CRLF files to LF or produced mixed endings, creating noisy diffs and IDE warnings. This PR ensures newline preservation during file writes.
Before / After
Before
The text writer emitted LF regardless of a file’s original newline convention.
CRLF files could flip to LF or end up mixed.
After
Scope of change
codex-rs/apply-patch/src/lib.rsEol,detect_eol,normalize_to_eol.No changes to binary paths, new-file adds, TUI paste logic, or PATH/PATHEXT/env behavior.
Security effects
None. Only affects how edited text is serialized to disk.
Testing evidence
Static
cargo fmtcargo clippy -p codex-apply-patch -- -D warnings→ cleancargo test -p codex-apply-patch→ added CRLF/LF preservation tests pass; full suite green.Manual
Notes
Checklist
cargo fmt/clippyclean