Bump polars py 1.30.0 beta 1#2747
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull Request Overview
This pull request updates the dependency information for Polars to the new beta version and adjusts related tests and documentation accordingly.
- Updated the expected output in tests/test_sqlp.rs to match changes in Polars' behavior.
- Revised the Polars version badge in README.md.
- Modified Cargo.toml to reference the new Polars and Polars-ops revision.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/test_sqlp.rs | Updated expected_end string in test to reflect the new output format. |
| README.md | Polars version badge updated to include py-1.30.0-beta.1 details. |
| Cargo.toml | Dependency revisions updated for both polars and polars-ops. |
Comments suppressed due to low confidence (1)
tests/test_sqlp.rs:945
- Verify that the updated expected_end string accurately reflects the required output format, ensuring that the intentional removal of 'boston311-100.csv]' and any whitespace changes do not lead to unintended test failures.
+ let expected_end = r#"PROJECT 4/29 COLUMNS
jqnatividad
added a commit
that referenced
this pull request
Jun 5, 2026
Address three roborev findings on the `get`/diskcache subsystem: - Medium: fetching an already-cached HTTP URL under a different --name failed because a fresh cache hit never calls CacheManager::put (so no alias for the new name was written). The manager now records the cache key it operated on (in get or put) and get_resource recovers the entry by that key, binding the requested name to it. - Medium: local entries stored the (possibly relative) given path as the refresh source, so a stale dc: refresh from another directory could read the wrong file. ingest_local now stores the canonicalized absolute path. - Low: reusing a logical name for a different source orphaned the old entry JSON/blob (duplicate names, inflated metadata). write_entry now writes the new entry first, removes the previously-aliased different entry, then writes the alias (content-addressed dedup keeps shared blobs intact). Adds regression tests: get_http_same_url_different_name (fresh-hit recovery) and get_name_reuse_replaces_entry (orphan removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jqnatividad
added a commit
that referenced
this pull request
Jun 6, 2026
…2263) (#3953) * feat(get): new command for getting tabular data from various sources (#2263) Add a `get` command + `dc:` disk-cache subsystem that fetches tabular data from local files, http(s) URLs, dathere:// and ckan:// into a managed, queryable cache reusable by any qsv command via the `dc:` prefix. - src/diskcache.rs: two-tier cache module. The always-compiled core shares the source resolver (dathere/ckan/http) incl. the same-origin CKAN token-strip security check; lookup.rs is unified onto it so there is a single, audited resolver path. The get-gated rich cache is a content-addressed zstd blob store + per-entry JSON index (BLAKE3, ETag, Last-Modified, compressed/uncompressed sizes, record count, TTL), with HTTP cache semantics via http-cache-reqwest + reqwest-middleware (custom CacheManager) and auto-indexing of cached files for instant dc: counts / random access. - src/cmd/get.rs: `qsv get <source>...` plus cache-list/info/clear/prune. - Config::new resolves `dc:<name>` to a decompressed temp CSV + sibling .idx, routing failures through format_error (get-gated). - Registered in qsv, qsvmcp and qsvdp; the `get` feature joins distrib_features, qsvmcp and datapusher_plus. - tests/test_get.rs: mock-server ETag 304 revalidation, content-addressed dedup, dc: indexed read, and cache clear/prune. Phase 1 of #2263; cloud storage (S3/Azure/GCS) and sftp are deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): harden cache alias/update paths (roborev #2747) Address three roborev findings on the `get`/diskcache subsystem: - Medium: fetching an already-cached HTTP URL under a different --name failed because a fresh cache hit never calls CacheManager::put (so no alias for the new name was written). The manager now records the cache key it operated on (in get or put) and get_resource recovers the entry by that key, binding the requested name to it. - Medium: local entries stored the (possibly relative) given path as the refresh source, so a stale dc: refresh from another directory could read the wrong file. ingest_local now stores the canonicalized absolute path. - Low: reusing a logical name for a different source orphaned the old entry JSON/blob (duplicate names, inflated metadata). write_entry now writes the new entry first, removes the previously-aliased different entry, then writes the alias (content-addressed dedup keeps shared blobs intact). Adds regression tests: get_http_same_url_different_name (fresh-hit recovery) and get_name_reuse_replaces_entry (orphan removal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): separate name aliases from URL-keyed entries (roborev #2748) Follow-up to the previous alias hardening. Properly model the cache as URL-keyed *entries* (own the blob + metadata) with many name *aliases* pointing at them, with two-level reference counting: - Medium: the fresh-cache-hit recovery no longer mutates the shared entry's logical_name (which corrupted metadata for other aliases and left the newest name the only one cache-list showed). It now binds the requested name via a dedicated alias without touching the entry; cache-list enumerates aliases so every name is listed, and get_resource reports the requested name only in its returned metadata. - Low: re-fetching the same cache key after its content or compression changed now reclaims the previous blob/index when no other entry references it (write_entry compares prev vs new blob and frees the orphan). Cleanup paths are now alias-aware: deleting/pruning an entry removes all names pointing at it and frees the blob only when unreferenced; cache-info de-dupes by content hash and reports names + unique blobs. Adds regression tests for multi-name listing and changed-content blob reclamation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): correct force, fresh-hit repoint & compression-variant cleanup (roborev #2749) - Medium: --force / --refresh always now use http-cache CacheMode::Reload (always hit the origin, then update the cache) instead of deleting the requested alias. Deleting only the alias left a multi-aliased entry behind that could still be served as a fresh hit, so --force didn't actually re-fetch shared entries. - Medium: after every successful fetch, the requested name is now bound unconditionally to the entry the middleware actually used (via the observed cache key), repointing it when it previously pointed at a different (stale) entry — not only when the alias was missing. Previously a fresh hit under a name that already pointed elsewhere kept resolving the old data. - Low: delete_entry_by_keyhash now frees the data blob on an exact blob-path match (content hash AND compression) and reserves the content-hash-only check for the shared {blake3}.idx.zst index, so a compression variant of a shared hash is no longer left orphaned. Removes the now-unused delete_entry_by_name helper. Adds regression tests: get_force_refetches_shared_entry, get_fresh_hit_repoints_stale_name and get_compression_variant_blob_reclaimed (11 get tests pass). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(get): add `get` to feature-bundle enumerations (CI docs-drift) The `get` feature was added to all_features, distrib_features, qsvmcp and datapusher_plus in Cargo.toml; update the docs that enumerate those bundles so the docs-drift-check CI gate passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * address review: harden get disk-cache edge cases (Copilot + DevSkim) Copilot findings: - set_qsv_cache_dir: propagate CliError instead of expand_tilde().unwrap() panicking when the home dir can't be determined (both branches). - resolve_uri_prefix: URL-encode the ckan://<name>? value before building the resource_search query (spaces/&/# no longer corrupt it). - resolve_ckan_resource: same-origin token-strip now compares against the effective API base (default when None), so the token isn't dropped for resources on the default CKAN origin. - atomic_write / ensure_indexed: use process+call-unique temp filenames so concurrent writers/indexers of the same target/content can't collide; clean up the temp on rename failure. - alias filenames: switch from the lossy safe_name() to a reversible, injective hex encoding (encode_name/decode_name) so distinct logical names never collide and cache-list shows the original name. - resolve_dc_path: give the dc: temp file a guaranteed tabular extension so Config's format check accepts extension-less handles. - cache-info: de-dupe on-disk totals by (blake3, compression), content totals by hash. DevSkim: add inline ignores (DS137138, DS162092) for the test-only local mock HTTP server (http://127.0.0.1). Adds regression test get_alias_names_do_not_collide. All 12 get tests + the lookup consumer suite pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): bounded, migration-aware alias filenames (roborev #2752) The hex-encoded alias filename scheme had two problems: - Medium: hex doubles the filename length, so logical names longer than ~127 bytes exceeded the 255-byte filename limit and failed to cache. Alias filenames are now BLAKE3(name) — a fixed 64-char key — keeping them bounded regardless of name length while staying injective in practice. The original name is preserved inside the alias file content ("{keyhash}\n{name}") and used for cache-list display. - Medium: switching schemes left aliases written by the previous scheme unreachable. alias_keyhash() now falls back to the prior hex (encode_name) and safe_name alias files and migrates them to the canonical hashed alias on lookup; list_entries recovers the name from content, falling back to decoding legacy filenames. Adds regression test get_long_logical_name (200-char name). All 13 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): guard lossy safe_name legacy alias migration (roborev #2753) The safe_name legacy-alias fallback is lossy: a different logical name can sanitize to the same filename. alias_keyhash now only adopts (migrates/removes) a safe_name legacy alias when the pointed entry's primary name actually matches the requested name, so resolving e.g. `dc:a b.csv` can't steal an existing `a_b.csv` legacy alias. The injective encode_name (hex) fallback is unaffected. Adds regression test get_legacy_safe_name_alias_not_stolen. All 14 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(get): drop legacy alias-migration fallbacks `get` is unreleased, so no cache exists that uses the prior `encode_name`/ `safe_name` alias filename schemes. Remove the legacy-migration machinery and the whole class of lossy-fallback edge cases it introduced: - alias_keyhash now resolves a name only via its canonical BLAKE3(name) alias. - Remove the now-unused encode_name/decode_name helpers. - list_entries reads the original name from the alias file content only. - Drop the legacy-only regression test. Alias filenames remain bounded BLAKE3 keys with the original name stored in the file content. 13 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(get): Phase 2 — cloud storage (s3/gs/az) + cache management (#2263) Adds cloud object-storage sources and cache-management subcommands to `qsv get`, completing the originally-scoped feature set for #2263. Cloud storage (behind opt-in `get_cloud` sub-feature): - Fetch s3://, gs://, az:// (and s3a/adl/azure/abfs/abfss) via the object_store crate into the same content-addressed disk cache, readable by any command via the `dc:` prefix. ETag-based conditional revalidation (If-None-Match -> 304) gives the same "don't re-download unchanged" guarantee as the HTTP path. - Credentials come from the standard AWS_*/AZURE_*/GOOGLE_* environment (and IAM roles); a repeatable `--cloud-opt key=value` flag overrides region/endpoint/etc. - object_store is already compiled transitively via polars in distrib/qsvmcp/ datapusher_plus, so the marginal dependency weight there is ~zero (Cargo.lock gains no new crates). A bare `-F get` build stays lean — get_cloud is opt-in. Cache management: - `cache-set-ttl <name> --ttl` / `cache-set-policy <name> --refresh` mutate an existing entry's TTL / refresh policy in place (entry-level: affects every alias pointing at the entry). - `cache-list --verify` recomputes each blob's BLAKE3 and reports OK/FAIL per name, exiting non-zero on any integrity failure. `get_cloud` added to distrib_features, qsvmcp and datapusher_plus; the feature enumerations in FEATURES.md/README.md/PROJECT_TECHNICAL_OVERVIEW.md were updated and docs/help/get.md regenerated (docs-drift clean). Tests: +7 integration tests (S3 fetch + dc read, S3 ETag revalidation, S3 404, cache-set-ttl, cache-set-policy, verify-ok, verify-detects-corruption) using the actix mock with a path-style S3 endpoint override. All 20 get tests pass with get_cloud; 17 under plain get (the S3 tests are feature-gated). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): scope cloud cache entries by store identity (roborev #2756) Cloud cache entries were keyed only by `CLOUD:{source}`, ignoring the store identity supplied via `--cloud-opt` / environment (endpoint, region, account). The same `s3://bucket/key` against two different S3-compatible endpoints (or Azure accounts) therefore collided in the cache, so a second fetch could revalidate against — or `--refresh never` could serve — data from the wrong backing store. Fix: - Derive a non-secret store identity (endpoint/region/account/… lower-cased, sorted, de-duplicated) from the resolved cloud options and fold it into the cache key (`CLOUD:{source}#k=v&...`). Credentials are excluded via a secret-key denylist so they never reach the on-disk cache key. - Persist that identity on the cache entry (new `cloud_identity` field, `#[serde(default)]`) and replay it on `dc:` auto-refresh so the refresh rebuilds the SAME store and resolves to the SAME entry. Credentials still come from the ambient environment and are never persisted. Adds `get_s3_identity_scopes_cache_key`: two fetches of one s3:// URL with different regions must download independently (a collision would make the second a 304 revalidation). All 21 get tests pass; clippy clean; no-cloud build unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): collapse cloud identity opts last-wins per key (roborev #2757) The previous fix (#2756) built the cloud store-identity by sorting and de-duplicating whole `(key, value)` pairs, so a key supplied with two different values — e.g. `AWS_REGION=us-east-1` in the environment plus `--cloud-opt aws_region=eu-west-1` — kept BOTH. object_store, by contrast, builds the store with last-wins precedence (`--cloud-opt` overrides env), so the cache key diverged from the effective store, and a later `dc:` auto-refresh (which re-merges ambient env with the persisted identity) recomputed a different key and re-downloaded instead of revalidating the same entry. Fix: reduce the identity to a single value per (lower-cased) key using the same last-wins precedence object_store applies — env first, then `--cloud-opt` — via a BTreeMap (also yielding deterministic key-sorted order). Because the persisted identity is replayed as high-precedence `--cloud-opt` on refresh, it now wins over any ambient env var (even a changed one), so the recomputed key matches the original. Adds `get_s3_env_cloud_opt_override_refresh_stable`: an env-vs-`--cloud-opt` region override followed by a stale `dc:` refresh under a CHANGED `AWS_REGION` must revalidate the same entry (one download). All 22 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(get): assert dc: refresh issues a real 304 revalidation (roborev #2758) `get_s3_env_cloud_opt_override_refresh_stable` asserted only count==4 and body_sends==1, which `resolve_dc_path`'s best-effort fallback-to-stale would also satisfy on a *failed* refresh — so a broken `cloud_identity` replay could have passed the test without any successful refresh request. Add a `revalidations` (304) counter to the mock `GetWebServer` alongside the existing body-send counter, and assert the stale `dc:` read makes exactly one conditional request that the server answers with 304. This proves the identity replay rebuilt the correct store and resolved to the same entry, rather than silently serving the stale copy. The accessor is gated to get_cloud to avoid a dead-code warning in non-cloud builds. All 22 get tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * address review: assert test exit status, fix docs, persist ckan-api (Copilot #3953) Addresses 10 unresolved Copilot review comments on PR #3953: - tests/test_get.rs (7): the `cache-list` invocations that used `wrk.output(&mut list)` then inspected stdout now assert `out.status.success()` first (printing stderr on failure), so a non-zero exit can't be masked by a stdout substring match. - tests/test_get.rs: corrected a stale comment — alias filenames are BLAKE3 hashes and the original name is read from the alias file *content*, not "decoded from a reversible alias". - src/diskcache.rs: `set_qsv_cache_dir` doc no longer claims to "canonicalize" (it expands `~` and creates the dir; the path is otherwise returned as given). - src/diskcache.rs: `dc:` auto-refresh of a `ckan://` entry now re-resolves against the SAME CKAN instance it was fetched from. The effective `--ckan-api` base URL is persisted on the entry (new `ckan_api_url` field, `#[serde(default)]`) and replayed on refresh, falling back to `QSV_CKAN_API`/default only for pre-existing entries. Previously a non-default CKAN entry would silently refresh against the wrong instance. All 22 get tests pass; lookup suite non-regressing; clippy clean; fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(get): persist ckan_api_url only for ckan:// sources (roborev #2760) The persisted-CKAN-API fix (Copilot #3953) set `ckan_api_url` on the cache entry for every HTTP fetch. Because `--ckan-api` always carries a default value, plain `http(s)://` entries were recorded — and surfaced by `cache-list --json` — with a CKAN API URL, contradicting the field's `None for non-CKAN entries` contract and making the metadata misleading. Fix: only populate `ckan_api_url` on the manager when `resolved.is_ckan` is true; non-CKAN HTTP entries persist `None`. Adds `get_http_entry_has_no_ckan_api_metadata`, asserting a plain http:// entry serializes `"ckan_api_url": null` and never the default CKAN URL. All 23 get tests pass; clippy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.