Add /ia/sync.json POST handler#11230
Conversation
|
I admit I don't recall perfectly where we left off; Is there a way we can try a working example on testing.openlibrary.org where we manually make a API call w/ a curl and a This is annoying to test because it will require us manually minting an hmac token to send with the request based on some shared secret (and we may want to add/stage this secret for now in testing olsystem) |
Updates function to take the full secret key name as a parameter. Sets the default message delimiter to a pipe to make parsing the expiry timestamp easier. Fixes time comparison bug.
The three outstanding tasks from this comment still remain:
I was able to test <?php
$msg = "hello there";
$secret_key = "placeholderkey";
$expiry_time = date('c', time() + 60); // 'c' = ISO 8601 format
$data = $msg . "|" . $expiry_time;
$hash = hash_hmac('md5', $data, $secret_key);
echo "Hash: " . $hash . "\n";
echo "Data: " . $data . "\n";
?>Have verified that an |
|
As reference, we have One Time Password logic in this PR that is similar (living in core/auth.py) that could a home + model for how we might want to centralize our HMAC logic Rather than waiting for 11288 to land, maybe the takeaway is we could use core/auth.py and have some sort of hmac class/utility to centralize how we either generate or verify an hmac. |
|
We have a new key |
Updates function to take the full secret key name as a parameter. Sets the default message delimiter to a pipe to make parsing the expiry timestamp easier. Fixes time comparison bug.
openlibrary/plugins/admin/code.py
Outdated
| except (ValueError, ExpiredTokenError): | ||
| raise web.HTTPError("401 Unauthorized") | ||
|
|
||
| op, ocaid, _ = msg.split("|") |
There was a problem hiding this comment.
Including the op in the encrypted message isn't necessary. This handler's path will be updated to something like /ia/unlink, and will handle all disassociation requests. Later, a complementary /ia/link handler will be created for association requests.
There was a problem hiding this comment.
msg is now expected to by in <ocaid>:<timestamp format.
Updates the endpoint to `/ia/unlink.json`, removing the need for an operation to be included in request payloads.
There was a problem hiding this comment.
Pull request overview
This PR introduces a new endpoint for syncing data between Internet Archive (IA) and Open Library (OL), specifically for "unlinking" items by removing OCAIDs from editions. The implementation includes HMAC-based authentication with timestamp validation to secure the endpoint.
Key Changes
- Creates new
HMACTokenauthentication class inopenlibrary/core/auth.pywith HMAC verification and expiry checking - Adds
/ia/unlinkPOST endpoint inopenlibrary/plugins/admin/code.pyto remove OCAIDs from editions - Implements
make_darkoperation to delete ocaid and ia: source records from editions
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
openlibrary/core/auth.py |
New file containing HMACToken class for HMAC digest verification with timestamp validation and timing-attack mitigation |
openlibrary/plugins/admin/code.py |
Adds unlink_ia_ol endpoint class with POST handler for authenticated OCAID removal, including edition lookup and dark-making logic |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| class HMACToken: | ||
| @staticmethod | ||
| def verify(digest: str, msg: str, secret_key_name: str, delimiter: str = "|"): | ||
| """ | ||
| Verify an HMAC digest against a message with timestamp validation. | ||
|
|
||
| This method validates that the provided HMAC digest matches the expected | ||
| digest for the given message, and that the token has not expired. The | ||
| message is expected to contain an ISO format timestamp as its last | ||
| delimiter-separated component. | ||
|
|
||
| To mitigate timing attacks, all cryptographic operations are performed | ||
| before raising any exceptions, ensuring constant-time execution regardless | ||
| of which validation fails first. | ||
|
|
||
| Args: | ||
| digest: The HMAC digest to verify (hexadecimal string). | ||
| msg: The message to verify, which must end with a delimiter-separated | ||
| ISO format timestamp (e.g., "data|2024-12-31T23:59:59+00:00"). | ||
| secret_key_name: The configuration key name used to retrieve the | ||
| secret key for HMAC computation. | ||
| delimiter: The character used to separate message components. | ||
| Defaults to "|". | ||
|
|
||
| Returns: | ||
| bool: True if the digest is valid and the token has not expired. | ||
|
|
||
| Raises: | ||
| ExpiredTokenError: If the timestamp in the message indicates the | ||
| token has expired (raised after digest comparison). | ||
| ValueError: If the secret key cannot be found in the configuration | ||
| (raised after digest comparison). | ||
| """ | ||
| current_time = datetime.datetime.now(datetime.UTC) | ||
| expiry_str = msg.split(delimiter)[-1] | ||
| expiry = datetime.datetime.fromisoformat(expiry_str) | ||
|
|
||
| err: Exception | None = None | ||
|
|
||
| if current_time > expiry: | ||
| err = ExpiredTokenError() | ||
|
|
||
| if not (key := config.get(secret_key_name, '')): | ||
| err = ValueError("No key found") | ||
|
|
||
| mac = '' | ||
| if key: | ||
| mac = hmac.new( | ||
| key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5 | ||
| ).hexdigest() | ||
|
|
||
| result = hmac.compare_digest(mac, digest) | ||
| if err: | ||
| raise err | ||
| return result |
There was a problem hiding this comment.
The new HMACToken class lacks test coverage. Tests should be added to verify:
- Successful verification with valid digest and non-expired timestamp
- Rejection of expired tokens
- Rejection of invalid digests
- Handling of missing secret keys
- Constant-time execution for timing attack mitigation
- Proper parsing of delimiter-separated messages
- Handling of malformed timestamps
openlibrary/plugins/admin/code.py
Outdated
| except (ValueError, ExpiredTokenError): | ||
| raise web.HTTPError("401 Unauthorized") |
There was a problem hiding this comment.
[nitpick] The authentication error handling catches both ValueError and ExpiredTokenError but treats them identically with a 401 response. This makes it difficult for clients to distinguish between expired tokens (which could be retried with a new token) and other authentication failures. Consider handling ExpiredTokenError separately with a more specific error message.
| except (ValueError, ExpiredTokenError): | |
| raise web.HTTPError("401 Unauthorized") | |
| except ExpiredTokenError: | |
| raise web.HTTPError( | |
| "401 Unauthorized", | |
| data=json.dumps({"error": "Token expired"}) | |
| ) | |
| except ValueError: | |
| raise web.HTTPError( | |
| "401 Unauthorized", | |
| data=json.dumps({"error": "Invalid token"}) | |
| ) |
| delimiter: The character used to separate message components. | ||
| Defaults to "|". |
There was a problem hiding this comment.
The delimiter default value is documented as "|" but the PR description states the message format should use colon (:) as the delimiter. This inconsistency between documentation and the intended API design could lead to integration issues. Ensure the delimiter matches the agreed-upon API specification.
openlibrary/plugins/admin/code.py
Outdated
| ocaid, ts = msg.split("|") | ||
|
|
||
| if not ts or not ocaid: | ||
| raise web.HTTPError( | ||
| "400 Bad Request", data=json.dumps({"error": "Invalid inputs"}) | ||
| ) | ||
|
|
There was a problem hiding this comment.
The PR description states the message format should be colon-delimited ({operation}:{ocaid}:{expiry}), but the code uses pipe delimiter (|). This mismatch means the implementation doesn't follow the documented API design. Additionally, the code doesn't parse or use an operation field at all - it only parses ocaid and ts and hardcodes the make_dark operation.
| ocaid, ts = msg.split("|") | |
| if not ts or not ocaid: | |
| raise web.HTTPError( | |
| "400 Bad Request", data=json.dumps({"error": "Invalid inputs"}) | |
| ) | |
| try: | |
| operation, ocaid, ts = msg.split(":") | |
| except ValueError: | |
| raise web.HTTPError( | |
| "400 Bad Request", data=json.dumps({"error": "Invalid message format; expected '{operation}:{ocaid}:{expiry}'"}) | |
| ) | |
| if not operation or not ocaid or not ts: | |
| raise web.HTTPError( | |
| "400 Bad Request", data=json.dumps({"error": "Invalid inputs"}) | |
| ) | |
| if operation != "make_dark": | |
| raise web.HTTPError( | |
| "400 Bad Request", data=json.dumps({"error": f"Unsupported operation: {operation}"}) | |
| ) |
openlibrary/plugins/admin/code.py
Outdated
| ocaid, ts = msg.split("|") | ||
|
|
||
| if not ts or not ocaid: | ||
| raise web.HTTPError( | ||
| "400 Bad Request", data=json.dumps({"error": "Invalid inputs"}) | ||
| ) |
There was a problem hiding this comment.
The validation if not ts or not ocaid happens after splitting but doesn't handle potential parsing errors. If msg doesn't contain a pipe delimiter, the split will succeed but won't produce two values, causing an unpacking error on line 269. Consider adding error handling for the split operation or validating the number of components.
| mac = '' | ||
| if key: | ||
| mac = hmac.new( | ||
| key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5 | ||
| ).hexdigest() |
There was a problem hiding this comment.
The timing-attack mitigation logic is incomplete. When the secret key is missing (line 54-55), mac remains an empty string, which will always fail the compare_digest check on line 63. However, the error is only raised after the comparison (line 64-65). This means an attacker could distinguish between a missing key and an invalid digest by timing the response, since compare_digest on empty strings is faster than on actual hashes. Consider generating a dummy MAC value even when the key is missing to maintain constant-time execution.
| mac = '' | |
| if key: | |
| mac = hmac.new( | |
| key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5 | |
| ).hexdigest() | |
| # Always produce a MAC of the correct length to mitigate timing attacks | |
| digest_size = hashlib.md5().digest_size * 2 # hexdigest is twice the digest size | |
| if key: | |
| mac = hmac.new( | |
| key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5 | |
| ).hexdigest() | |
| else: | |
| # Use a dummy value of the correct length (all zeros) | |
| mac = '0' * digest_size |
openlibrary/plugins/admin/code.py
Outdated
| class unlink_ia_ol(delegate.page): | ||
| path = '/ia/unlink' |
There was a problem hiding this comment.
The endpoint path /ia/unlink and class name unlink_ia_ol don't match the PR description which mentions /ia/sync.json endpoint. This inconsistency could cause confusion. Consider renaming to match the intended API design described in the PR.
| class unlink_ia_ol(delegate.page): | |
| path = '/ia/unlink' | |
| class sync_ia_ol(delegate.page): | |
| path = '/ia/sync.json' |
| mac = '' | ||
| if key: | ||
| mac = hmac.new( | ||
| key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5 |
There was a problem hiding this comment.
MD5 is considered cryptographically weak and should not be used for HMAC in security-sensitive contexts. Consider using a stronger hash algorithm like SHA-256 (hashlib.sha256) instead. While HMAC-MD5 is not as critically vulnerable as plain MD5, using SHA-256 is the modern standard and provides better security guarantees.
| key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5 | |
| key.encode('utf-8'), msg.encode('utf-8'), hashlib.sha256 |
There was a problem hiding this comment.
Strong agree + blocker
Let's use digest = hmac.new(seed.encode('utf-8'), payload, hashlib.sha256).digest()
| result = hmac.compare_digest(mac, digest) | ||
| if err: | ||
| raise err | ||
| return result |
There was a problem hiding this comment.
The verify method returns the result of hmac.compare_digest() but this return value is never checked in the calling code (line 265 in code.py). If the digest comparison fails but no error is set, the method returns False, but the caller doesn't handle this case - it only catches exceptions. This means an invalid digest would pass through without raising an error. Consider raising a ValueError when result is False before returning.
| return result | |
| if not result: | |
| raise ValueError("Invalid digest") | |
| return True |
| expiry = datetime.datetime.fromisoformat(expiry_str) | ||
|
|
||
| err: Exception | None = None | ||
|
|
||
| if current_time > expiry: |
There was a problem hiding this comment.
The fromisoformat method can raise a ValueError if expiry_str is not a valid ISO format timestamp. This exception is not caught, which could lead to unclear error messages for the caller. Consider wrapping this in a try-except block to raise a more specific error or ensure the timing-attack mitigation logic still executes.
| expiry = datetime.datetime.fromisoformat(expiry_str) | |
| err: Exception | None = None | |
| if current_time > expiry: | |
| expiry = None | |
| err: Exception | None = None | |
| try: | |
| expiry = datetime.datetime.fromisoformat(expiry_str) | |
| except ValueError: | |
| err = ValueError(f"Invalid ISO format timestamp: {expiry_str!r}") | |
| if expiry is not None and current_time > expiry: |
openlibrary/plugins/admin/code.py
Outdated
| if not data['source_records']: | ||
| del data['source_records'] |
There was a problem hiding this comment.
If the source_records list is empty after the MARC record is unlinked, source_records is deleted from the edition object.
Changes disassociation endpoint to `/api/unlink.json`. Moves handler to `plugins/openlibrary/api.py`.
for more information, see https://pre-commit.ci
|
Merged but please let's (a) move to sha2 as per PR notes and (b) communicate with Hank and Jude re: the hmac format, timestamps, and delimiter |
Addresses #7539
Creates new
/api/unlink.jsonendpoint for the purpose of syncing data between IA and OL.POSTrequests to this endpoint are expected to have the following query parameters:msg: A colon delimited string in the format{ocaid}|{expiry}digest: The hashed value ofmsgAll inputs required for the operation are expected to be sent in the
msgstring.The
POSThandler will, in order:digestand check the expiryocaidgiven in themsgoperationon the selected editionOutstanding tasks
verify_hmactocore/auth.pyTechnical
Special Deployment Instructions
Ensure that new shared key entry is added to our production configurations before deploying.
Testing
Screenshot
Stakeholders