sample: review-driven cleanup, fix streaming bernoulli header bug, dead retry loop, cluster pre-alloc#3774

jqnatividad · 2026-04-28T13:43:08Z

Summary

Review-driven cleanup of src/cmd/sample.rs (net -44 lines, 8 sampling methods touched). Two commits:

98693bdff — initial review pass
d6ba2ea2a — addresses roborev build(deps): bump simple-expand-tilde from 0.1.4 to 0.1.5 #1767 follow-up review

Bugs fixed

Streaming Bernoulli over URL corrupted output in two ways:
- The header was split on the first raw \n, so an RFC 4180 header with a quoted newline truncated the headers and shifted every record. Fixed by csv-driven boundary detection (Reader::position().byte()); same fix applied to the data loop's rposition('\n'). Trailing partial records are held back until either more data arrives or the stream ends naturally.
- --no-headers was passed straight into has_headers(...) (inverted), discarding the first data row. Fixed.
- --bernoulli 1.5 https://… reached random_bool() and panicked because validation lived after the URL dispatch. Validation hoisted to the top of run() (also rejects NaN/Inf sample sizes).
- --max-size was silently ignored on the streaming path. Now honored; capped buffers do not flush a final partial record.
Weighted sampling: outer while loop in do_weighted_sampling was dead code (the inner for consumes the iterator, no retry possible). Removed; single clear pass with the existing under-fill warning.
Cluster sampling: pre-sized unique_clusters / all_clusters with the requested sample size instead of the cardinality, causing the HashMap to rehash its way up. Now uses cardinality (and falls back to the estimate when cardinality is unknown).
parse_timestamp: a millisecond Unix timestamp like 1_704_067_200_000 parsed as seconds (year ~55,899). Disambiguated by magnitude (10^11 cutoff). Range-form check ((-SEC_LIMIT..SEC_LIMIT).contains(&ts_val)) so i64::MIN doesn't panic.
do_reservoir_sampling: tightened bounds (reservoir.len().min(sample_size)) so an undersized reservoir slice is defensible. Removed unnecessary unsafe { get_unchecked_mut(...) } blocks here, in do_stratified_sampling, and in sample_indices.

Cleanups

get_column_index: direct &[u8] byte comparison; no more String::from_utf8_lossy allocation per field.
check_stats_cache: dropped the chars().all(char::is_numeric) Unicode-numeric pre-check — just parse::<usize>().ok().or_else(...). flag_force synced via debug_assert! instead of the misleading false literal.
sample_stratified: collapsed the parallel strata_counts HashMap (its count was never read) and the parallel records_seen map into a single HashMap<Vec<u8>, (Vec<ByteRecord>, usize)>. Halves per-record lookups in the hot loop and removes the awkward "lookup-or-insert" dance.
Time-series aggregate_records: numeric-column detection hoisted out of the per-interval loop. sample_timeseries does one O(records × cols) scan after sorting (with early exit once every column is ruled out). aggregate_records now takes a precomputed &[bool] mask and &[&ByteRecord] (no per-interval ByteRecord clones).
with_rng! macro: replaces the recurring three-arm match rng_kind { Standard => …, Faster => …, Cryptosecure => … } blocks across sample_systematic, sample_stratified, sample_weighted, sample_cluster, sample_timeseries (random-offset branch), and the default INDEXED path in run(). Net: ~80 LOC deleted, "did I update all three branches?" trip-hazards removed, log lines unified to {:?} <METHOD> sampling….
stream_bernoulli_sampling: lazy RNG creation (only the chosen kind, was eagerly building all three including slow-init HC128).
Stale comment on do_reservoir_sampling ("constant memory" → O(k)).

Test plan

cargo build --locked --bin qsv -F all_features — clean
cargo t sample -F all_features — 73 / 73 pass
cargo clippy --locked --bin qsv -F all_features -- -D warnings — clean
(reviewer) sanity-check streaming Bernoulli against a remote URL with embedded quoted newlines in a header, and against --max-size mid-record truncation
(reviewer) sanity-check qsv stats cache integration on stratified / weighted / cluster paths

🤖 Generated with Claude Code

…ad retry loop, cluster pre-alloc P0: - run() Bernoulli probability validation now happens before the URL/streaming dispatch, so `--bernoulli 1.5 https://…` no longer reaches random_bool() and panics. Also added is_finite() guard against NaN/Inf sample sizes. - stream_bernoulli_sampling rewrite: single buffer with manual header split on first newline (no more lost data records co-located with the header chunk), fixed inverted has_headers flag in the data loop, lazy RNG creation (only the chosen kind, was eagerly building all three including slow-init HC128), honors --max-size like the download path. - do_weighted_sampling: removed dead outer while-loop. The inner for-loop consumes the &mut Iterator, so retries were impossible — single clear pass now, dropped unused HashSet/attempts/records_exhausted bookkeeping. - sample_cluster: pre-size unique_clusters/all_clusters with cardinality (the total number of unique clusters), not requested_clusters (the sample size). Also dropped a per-record to_vec() on the second pass via Borrow<[u8]>. P1: - parse_timestamp: disambiguate Unix sec vs ms by magnitude (10^11 cutoff) so millisecond timestamps don't silently parse as far-future dates. - do_reservoir_sampling: bound writes against reservoir.len().min(sample_size) rather than just sample_size, defensible against an undersized slice; fixed misleading "constant memory" comment (it's O(k)). - check_stats_cache: dropped the redundant chars().all(char::is_numeric) pre-check (Unicode-numeric foot-gun); just parse::<usize>().ok().or_else. P2: - get_column_index: direct &[u8] byte comparison; no more from_utf8_lossy allocation per field. - check_stats_cache: dead flag_force write into SchemaArgs (we already returned early when set) replaced with literal false + comment. - sample_stratified: dropped the parallel strata_counts HashMap whose count was never read; reservoirs map built in one pass. - do_stratified_sampling: lookup via Borrow<[u8]> first, only allocate the Vec<u8> stratum key once per stratum instead of per record. - Removed unnecessary unsafe get_unchecked_mut blocks in do_reservoir_sampling, do_stratified_sampling, sample_indices. - aggregate_records: takes a precomputed &[bool] numeric mask; sample_timeseries scans the dataset once after sorting (with early exit once every column is ruled out) instead of repeating the per-column "are all values numeric?" probe inside every interval. Also pass refs into aggregate_records to avoid per-interval ByteRecord clones. - with_rng! macro: collapses the recurring three-arm dispatch (Standard/Faster/Cryptosecure) into a single macro invocation across sample_systematic, sample_stratified, sample_weighted, sample_cluster, sample_timeseries, and the default INDEXED path in run(). Tests: 73/73 sample tests pass; clippy -D warnings clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fer flush, i64::MIN abs, stratified clarity Medium: - stream_bernoulli_sampling: header was split on the first raw `\n`, which mis-handles RFC 4180 headers whose fields contain a quoted newline (the csv reader gets a slice that ends inside the quote and either errors or truncates). Replaced with csv-driven boundary detection: parse one record via csv::Reader and use position().byte() to drain exactly that many bytes. The same bug existed in the data loop's rposition('\n') split, so the data path now uses the same approach — only commit records whose terminator lies INSIDE the buffer; hold back trailing partials until more data arrives or the stream closes naturally. Low: - stream_bernoulli_sampling: with --max-size truncation, the post-loop flush could emit one malformed half-record at the cap boundary. The new loop tracks `parser_eof = stream_done && !size_capped` and only treats trailing bytes as a real record on a NATURAL EOS. - parse_timestamp: replaced `ts_val.abs() < SEC_LIMIT` with a range form (`(-SEC_LIMIT..SEC_LIMIT).contains(&ts_val)`) so an input of `i64::MIN` no longer panics in debug / wraps in release. - do_stratified_sampling: replaced the awkward two-lookup records_seen.get_mut().or-insert pattern by unifying the per-stratum state into a single HashMap<Vec<u8>, (Vec<ByteRecord>, usize)>. Halves per-record HashMap lookups and reads more naturally. - check_stats_cache: dropped the `flag_force: false` literal in favor of passing `args.flag_force` plus a `debug_assert!(!args.flag_force)`, keeping the early-return guard and the SchemaArgs construction in sync if the guard ever moves. Tests: 73/73 sample tests pass; clippy -D warnings clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codacy-production · 2026-04-28T13:44:18Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 106 complexity

Metric Results

Complexity 106

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

to avoid typos lint error

Copilot

Pull request overview

This PR refactors and tightens qsv sample by removing duplicated RNG dispatch code, improving performance in several sampling methods, and fixing multiple correctness issues—most notably around streaming Bernoulli sampling over remote URLs.

Changes:

Fixes streaming Bernoulli sampling correctness (quoted-newline-safe record boundaries, --no-headers handling, early probability validation, --max-size enforcement) and reduces RNG init overhead.
Simplifies sampling implementations (removes dead retry loop in weighted sampling, removes several unsafe indexing sites, bounds reservoir writes defensively).
Performance/maintainability cleanups (byte-based column lookup, prealloc tweaks for cluster sampling, precomputed numeric-column mask for time-series aggregation, with_rng! macro to unify RNG branching).

…ror status) stream_bernoulli_sampling: - Honor --delimiter for header AND data probes. Previously it always used default_delim derived from QSV_DEFAULT_DELIMITER (or comma), so a user-supplied --delimiter was silently ignored when sampling a remote CSV. Resolved once at function entry: --delimiter > QSV_DEFAULT_DELIMITER > comma. - Fail fast on non-2xx HTTP responses by chaining .error_for_status() onto .send().await — reqwest's send() does not error on HTTP error status, so without this a 404/500 HTML body would have been streamed straight into the csv parser and surfaced as confusing record errors. Tests: 73/73 sample tests pass; clippy -D warnings clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…3775) * test/sample: add integration tests for streaming Bernoulli URL path Closes the test-coverage gap flagged in PR #3774. Stands up a local actix-web fixture (port 8082, distinct from test_fetch's 8081) and exercises the boundary detection and validation guards added there: - sample_bernoulli_url_quoted_newline_header: header field 0 contains a literal `\n` inside an RFC-4180 quote. Asserts the header arrives intact (3 fields, embedded newline preserved) and that every emitted data row also has 3 fields. Old code would have split on the raw byte and corrupted every following record. - sample_bernoulli_url_max_size_truncation: serves a ~1.2 MiB CSV with fixed 100-byte records so `--max-size 1` cuts deterministically inside record 10486. Asserts max id <= 10485 (no half-record at the cap) and that every emitted row is well-formed. - sample_bernoulli_url_404_fails_fast: hits an unmapped path on the fixture server. Asserts qsv exits with error instead of feeding the HTML 404 body into the csv parser (regression for the missing `error_for_status()`). - sample_bernoulli_url_custom_delimiter: serves a TSV and passes `--delimiter '\t'`. Reads raw stdout and splits on tab (the writer also honors --delimiter, so read_stdout's comma parser would collapse rows). Asserts header and data rows split into 3 fields. Tests use #[serial] so they don't race on the port. 77/77 sample tests pass; clippy --bin qsv clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * typo: mis-split->split incorrectly * test/sample: address review feedback on streaming Bernoulli tests - (#1 #2) Replace SAMPLE_TEST_PORT + SAMPLE_TEST_HOST (which duplicated the port and could drift) with a single SAMPLE_TEST_PORT + SAMPLE_TEST_BIND_HOST literal. URL-builder and bind() both derive from the same source — no more brittle .split(':').next().unwrap() that would also panic on IPv6 hosts. - (#3) Wrap the ServerHandle in a SampleWebServer RAII guard. The server now stops in Drop, so a panic inside read_stdout / stdout doesn't leak the port and cascade into "Address already in use" on the next #[serial] test. - (#4) Call wrk.assert_success(&mut cmd) before reading stdout in the success-path tests, so a regression that makes qsv exit non-zero surfaces qsv's stderr instead of a generic CSV-parse error. 77/77 sample tests pass; clippy --bin qsv clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test/sample: address Copilot review on PR #3775 — single-run cmd, server start timeout - Replace the assert_success-then-read_stdout double-run pattern with a single capture-and-parse helper. The previous shape ran qsv twice per test, doubling fixture-server requests (and the ~1.2 MiB max-size download) and meaning the parsed stdout came from a different execution than the one whose status was asserted. - Added run_and_assert_success(): runs once, asserts status, returns Output (with stderr surfaced on failure). - Added parse_csv_stdout(): mirrors wrk.read_stdout's Vec<Vec<String>> shape but reads from a captured buffer. - All three success-path tests (quoted newline header, max-size truncation, custom delimiter) now use these helpers. - Switch the SampleWebServer startup channel to send Result<ServerHandle, String> and use recv_timeout(10s) instead of recv(). A failed bind (e.g., port already in use) used to leave start() blocked forever; it now panics fast with the bind error surfaced from the server thread. 77/77 sample tests pass; clippy --bin qsv clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jqnatividad and others added 2 commits April 28, 2026 08:20

jqnatividad requested a review from Copilot April 28, 2026 13:43

Copilot started reviewing on behalf of jqnatividad April 28, 2026 13:46 View session

typo: mis-split -> incorrectly split

d6010fb

to avoid typos lint error

Copilot AI reviewed Apr 28, 2026

View reviewed changes

Comment thread src/cmd/sample.rs

Comment thread src/cmd/sample.rs

Comment thread src/cmd/sample.rs Outdated

Comment thread src/cmd/sample.rs

jqnatividad merged commit 02e0a98 into master Apr 28, 2026
15 of 16 checks passed

jqnatividad deleted the sample-review-202604 branch April 28, 2026 13:59

jqnatividad mentioned this pull request Apr 28, 2026

test/sample: add integration tests for streaming Bernoulli URL path #3775

Merged

4 tasks

BrewTestBot mentioned this pull request May 3, 2026

qsv 20.0.0 Homebrew/homebrew-core#280687

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sample: review-driven cleanup, fix streaming bernoulli header bug, dead retry loop, cluster pre-alloc#3774