Adjust how converted records are copied to GCS #2023
Merged
andrewpollock merged 4 commits on Mar 4, 2024
The previous copying technique optimised for getting the records to GCS as fast as possible, for opportunistic ingestion by combine-to-osv. The problem with this approach is that when improvements to the conversion code stop generating records that shouldn't be generated, the most recent copy of each such record lingers in GCS indefinitely. This is causing situations like the one described in google#1961. So collect all of the records generated and copy them to GCS at the end of the complete run, deleting anything in GCS that isn't still being generated.
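The "copy everything at the end, delete what is no longer generated" idea can be sketched as a set difference. This is a minimal illustration only, not the code in this PR: `SyncPlan`, `plan_sync`, and the file names are hypothetical, and the real change presumably delegates this to an rsync-style mirror of the output directory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    to_upload: list[str]  # generated in this run; copied (or overwritten) in GCS
    to_delete: list[str]  # present in GCS but no longer generated


def plan_sync(generated: set[str], in_bucket: set[str]) -> SyncPlan:
    """Mirror the run's output into the bucket: upload all, delete leftovers."""
    return SyncPlan(
        to_upload=sorted(generated),
        to_delete=sorted(in_bucket - generated),
    )


# Hypothetical record names: one record is stale and should be removed.
plan = plan_sync(
    generated={"CVE-2024-0001.json", "CVE-2024-0002.json"},
    in_bucket={"CVE-2024-0001.json", "CVE-2023-9999.json"},
)
print(plan.to_delete)  # ['CVE-2023-9999.json']
```

The plan can only be computed once the run has finished, which is why the copy moves to the end of the complete run: deleting against a partial set of generated records would remove records that are still valid.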
another-rex approved these changes on Mar 4, 2024
andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request on Mar 5, 2024
Detect Bugs that are no longer present in GCS and plumb them through to the worker for deletion from Datastore.

Related changes (google#2023, google#2029) to the NVD CVE generation cause CVEs that are no longer being generated (due to changes in the heuristics, like the one made in google#1707) to remain in GCS.

This PR addresses cases like this, and the need identified in google#1467, by adding a deletion phase to the importing of new/updated records.

The functionality is flag-protected: it won't go live in Production until a new `--delete` flag is included in the execution of the importer.

Incidental changes:

- use the GCS bucket `directory_path` to efficiently filter the blobs returned when listing bucket contents
- make blob retrieval resilient to a blob generation change between blob listing and blob retrieval (this can happen if `combine-to-osv` happens to have run between these two points in time)
- fix a behavior inconsistency where schema validation was not performed when `ignore_last_import_time` is in effect (addressing a head-scratching TODO from @michaelkedar)
- tidy up the existing tests, making them more readable and debuggable
- add a slow manual test against live data in staging to validate real-world behavior and run time (this adds ~13 minutes to an import run on just the CVE GCS bucket)
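As a rough sketch of the deletion phase described in this commit message: compare the Bug IDs known to Datastore with the IDs still present under the source's `directory_path` in GCS, and only act when deletion is enabled. All names here (`bug_id_from_blob`, `blobs_in_directory`, `bugs_to_delete`, the `.json` naming scheme) are assumptions for illustration; the importer's actual code is not shown in this conversation.

```python
def bug_id_from_blob(name: str) -> str:
    """Map 'cve/CVE-2024-0001.json' to 'CVE-2024-0001' (assumed naming scheme)."""
    base = name.rsplit("/", 1)[-1]
    return base.removesuffix(".json")


def blobs_in_directory(blob_names: list[str], directory_path: str) -> list[str]:
    """Keep only blobs under directory_path, mimicking a prefix-filtered listing."""
    prefix = directory_path.rstrip("/") + "/"
    return [name for name in blob_names if name.startswith(prefix)]


def bugs_to_delete(
    datastore_ids: list[str],
    blob_names: list[str],
    directory_path: str,
    delete_enabled: bool,
) -> list[str]:
    """Return Bug IDs in Datastore with no matching blob; empty unless enabled."""
    if not delete_enabled:  # stands in for the `--delete` flag
        return []
    present = {
        bug_id_from_blob(name)
        for name in blobs_in_directory(blob_names, directory_path)
    }
    return sorted(set(datastore_ids) - present)


blobs = ["cve/CVE-2024-0001.json", "other/CVE-2023-9999.json"]
known = ["CVE-2024-0001", "CVE-2023-9999"]
print(bugs_to_delete(known, blobs, "cve", delete_enabled=True))
print(bugs_to_delete(known, blobs, "cve", delete_enabled=False))
```

In this sketch the prefix filter is applied client-side to keep the example self-contained; filtering at listing time, as the commit message describes, avoids fetching unrelated blob names in the first place.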
andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request on Mar 7, 2024
andrewpollock added a commit to andrewpollock/osv.dev that referenced this pull request on Mar 11, 2024
Pre-copying the CSV summary to GCS before the final rsync with `--delete-unmatched` (as introduced in google#2023) just means that it doesn't survive :-(
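The ordering problem can be shown with a toy in-memory model of a mirroring sync. This is only a model of delete-unmatched semantics, not the real tool: `rsync_delete_unmatched` and the file names (including `summary.csv`) are hypothetical.

```python
def rsync_delete_unmatched(local: dict[str, str], bucket: dict[str, str]) -> None:
    """Make bucket mirror local: drop objects missing locally, then copy the rest."""
    for name in list(bucket):
        if name not in local:
            del bucket[name]
    bucket.update(local)


records = {"CVE-2024-0001.json": "{}"}

# Summary copied first: the mirroring sync then deletes it as "unmatched".
copied_before = {"summary.csv": "id,status"}
rsync_delete_unmatched(records, copied_before)

# Summary copied after the sync: it survives.
copied_after: dict[str, str] = {}
rsync_delete_unmatched(records, copied_after)
copied_after["summary.csv"] = "id,status"

print("summary.csv" in copied_before)  # False
print("summary.csv" in copied_after)   # True
```

Under this model, anything that should outlive the sync either has to be copied after it or has to live in the local directory being mirrored.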
andrewpollock added a commit that referenced this pull request on Mar 12, 2024
Pre-copying the CSV summary to GCS before the final rsync with `--delete-unmatched` (as introduced in #2023) just means that it doesn't survive :-(
CharlyReux pushed a commit to CharlyReux/osv.dev that referenced this pull request on May 1, 2024
andrewpollock added a commit that referenced this pull request on May 2, 2024
Detect Bugs that are no longer present in GCS and plumb them through to the worker for deletion from Datastore.

Related changes (#2023, #2029) to the NVD CVE generation cause CVEs that are no longer being generated (due to changes in the heuristics, like the one made in #1707) to remain in GCS.

This PR addresses cases like this, and the need identified in #1467, by adding a deletion phase to the importing of new/updated records.

The functionality is flag-protected: it won't go live in Production until a new `--delete` flag is included in the execution of the importer.

Incidental changes:

- use the GCS bucket `directory_path` to efficiently filter the blobs returned when listing bucket contents
- make blob retrieval resilient to a blob generation change between blob listing and blob retrieval (this can happen if `combine-to-osv` happens to have run between these two points in time)
- fix a behavior inconsistency where schema validation was not performed when `ignore_last_import_time` is in effect (addressing a head-scratching TODO from @michaelkedar)
- tidy up the existing tests, making them more readable and debuggable
- add a slow manual test against live data in staging to validate real-world behavior and run time (this adds ~13 minutes to an import run on just the CVE GCS bucket)

---------

Signed-off-by: Andrew Pollock <apollock@google.com>
The previous copying technique optimised for getting the records to GCS as fast as possible, for opportunistic ingestion by combine-to-osv.
The problem with this approach is that when improvements to the conversion code stop generating records that shouldn't be generated, the most recent copy of each such record lingers in GCS indefinitely. This is causing situations like the one described in #1961.
So collect all of the records generated and copy them to GCS at the end of the complete run, deleting anything in GCS that wasn't just copied.