Adjust how converted records are copied to GCS #2023

Merged
andrewpollock merged 4 commits into google:master from andrewpollock:remove_stale_conversions_from_gcs
Mar 4, 2024
Conversation

@andrewpollock
Contributor

The previous copying technique optimised for getting the records to GCS as fast as possible for opportunistic ingestion by combine-to-osv.

The problem with this approach is that when an improvement to the conversion code stops it from generating a record that should not have been generated, the most recently copied version of that record lingers in GCS indefinitely. This is causing situations like the one described in #1961.

Instead, collect all of the generated records and copy them to GCS at the end of the complete run, deleting anything in GCS that wasn't just copied.
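
The "copy everything at the end, delete what wasn't copied" approach amounts to a mirror-style sync. A minimal sketch of that idea follows, using plain dictionaries to stand in for the local output directory and the GCS bucket. The function names `plan_sync` and `apply_sync` are hypothetical and are not taken from the PR, which performs the copy with an rsync-style command rather than Python code.

```python
from typing import Dict, List, Tuple


def plan_sync(generated: Dict[str, bytes],
              remote: Dict[str, bytes]) -> Tuple[List[str], List[str]]:
    """Return (names to upload, names to delete) for a mirror-style sync.

    `generated` models the records produced by one complete conversion run;
    `remote` models what is currently in the bucket. Both are illustrative
    stand-ins, not real GCS objects.
    """
    # Upload anything that is new or whose content changed.
    to_upload = sorted(
        name for name, data in generated.items() if remote.get(name) != data)
    # Delete anything in the destination that this run did not generate.
    to_delete = sorted(name for name in remote if name not in generated)
    return to_upload, to_delete


def apply_sync(generated: Dict[str, bytes],
               remote: Dict[str, bytes]) -> Dict[str, bytes]:
    """Apply the plan in place so that `remote` mirrors `generated`."""
    to_upload, to_delete = plan_sync(generated, remote)
    for name in to_upload:
        remote[name] = generated[name]
    for name in to_delete:
        del remote[name]
    return remote
```

With this shape, a record that the conversion code has stopped generating falls into `to_delete` on the next complete run instead of lingering in the destination.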

@andrewpollock andrewpollock enabled auto-merge (squash) February 29, 2024 05:57
@andrewpollock andrewpollock merged commit de0024a into google:master Mar 4, 2024
andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request Mar 5, 2024
Detect Bugs that are no longer present in GCS and plumb them through to
the worker for deletion from Datastore.

Related changes (google#2023, google#2029) to the NVD CVE generation cause CVEs
that are no longer being generated (due to heuristic changes like the one
made in google#1707) to remain in GCS.

This PR addresses cases like this and the need identified in google#1467 by
adding a deletion phase to the importing of new/updated records. The
functionality is flag-protected: it won't go live in Production until
a new `--delete` flag is included in the execution of the importer.

Incidental changes:

- use the GCS bucket `directory_path` to efficiently filter the blobs
  returned when listing bucket contents
- make blob retrieval resilient to blob generation change between blob
  listing and blob retrieval (this can happen if `combine-to-osv`
  happens to have run in between these two points in time)
- fix a behavior inconsistency with schema validation not being
  performed when `ignore_last_import_time` is in effect (addressing
  head-scratching TODO from @michaelkedar)
- tidy up the existing tests, making them more readable and debuggable
- add a slow manual test against live data in staging to validate
  real-world behavior and run time (this adds ~13 minutes to an import
  run on just the CVE GCS bucket)
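
The deletion phase described in this commit message can be sketched as a set difference between the IDs already known to the importer and the IDs still present under the bucket's `directory_path`. This is a minimal illustration, not the importer's actual code: the helper names `ids_under_prefix`, `ids_to_delete` and `build_parser`, and the assumption that each record is stored as `<ID>.json`, are hypothetical. Only the `--delete` flag and `directory_path` come from the text above.

```python
import argparse
import posixpath
from typing import Iterable, List, Set


def ids_under_prefix(blob_names: Iterable[str], directory_path: str) -> Set[str]:
    """Return record IDs for JSON blobs found under `directory_path`.

    Assumes (for illustration only) that each record is named `<ID>.json`.
    """
    prefix = directory_path.rstrip("/") + "/" if directory_path else ""
    ids = set()
    for name in blob_names:
        if name.startswith(prefix) and name.endswith(".json"):
            ids.add(posixpath.basename(name)[:-len(".json")])
    return ids


def ids_to_delete(known_ids: Iterable[str], blob_names: Iterable[str],
                  directory_path: str) -> List[str]:
    """IDs that are known to the importer but no longer present in the bucket."""
    return sorted(set(known_ids) - ids_under_prefix(blob_names, directory_path))


def build_parser() -> argparse.ArgumentParser:
    """Flag-protect the deletion phase: it is off unless --delete is passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--delete", action="store_true", default=False)
    return parser
```

Against a real bucket, the filtering by `directory_path` would more likely be pushed to the server, for example via the `prefix` argument of `list_blobs` in the `google-cloud-storage` client, rather than done client-side as in this sketch.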
andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request Mar 7, 2024
andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request Mar 11, 2024
Pre-copying the CSV summary to GCS before the final rsync with
`--delete-unmatched` (as introduced in google#2023) just means that they don't
survive :-(
andrewpollock added a commit that referenced this pull request Mar 12, 2024
CharlyReux pushed a commit to CharlyReux/osv.dev that referenced this pull request May 1, 2024
andrewpollock added a commit that referenced this pull request May 2, 2024
Signed-off-by: Andrew Pollock <apollock@google.com>
@andrewpollock andrewpollock deleted the remove_stale_conversions_from_gcs branch May 23, 2024 05:06