Skip to content

Add /ia/sync.json POST handler#11230

Merged
mekarpeles merged 10 commits intointernetarchive:masterfrom
jimchamp:sync-ia-ol
Dec 10, 2025
Merged

Add /ia/sync.json POST handler#11230
mekarpeles merged 10 commits intointernetarchive:masterfrom
jimchamp:sync-ia-ol

Conversation

@jimchamp
Copy link
Copy Markdown
Collaborator

@jimchamp jimchamp commented Sep 3, 2025

Addresses #7539

Creates new /api/unlink.json endpoint for the purpose of syncing data between IA and OL. POST requests to this endpoint are expected to have the following query parameters:

  • msg : A colon delimited string in the format {ocaid}|{expiry}
  • digest : The hashed value of msg

All inputs required for the operation are expected to be sent in the msg string.

The POST handler will, in order:

  1. Verify the digest and check the expiry
  2. Validate the request inputs
  3. Query for the OL edition that matches the ocaid given in the msg
  4. Perform the operation on the selected edition

Outstanding tasks

  • Create new shared key for IA <-> OL sync operations
  • Move verify_hmac to core/auth.py
  • Disassociate MARC records

Technical

Special Deployment Instructions

Ensure that new shared key entry is added to our production configurations before deploying.

Testing

Screenshot

Stakeholders

@mekarpeles mekarpeles self-assigned this Sep 12, 2025
@mekarpeles
Copy link
Copy Markdown
Member

I admit I don't recall perfectly where we left off;

Is there a way we can try a working example on testing.openlibrary.org where we manually make a API call w/ a curl and a ocaid that @seabelis recommends, where we disassociate the item from Open Library?

This is annoying to test because it will require us manually minting an hmac token to send with the request based on some shared secret (and we may want to add/stage this secret for now in testing olsystem)

@mekarpeles mekarpeles added this to the Sprint 2025-11 milestone Oct 8, 2025
@jimchamp jimchamp added the Needs: Special Deploy This PR will need a non-standard deploy to production label Nov 19, 2025
Updates function to take the full secret key name as
a parameter.

Sets the default message delimiter to a pipe to make
parsing the expiry timestamp easier.

Fixes time comparison bug.
@jimchamp
Copy link
Copy Markdown
Collaborator Author

I admit I don't recall perfectly where we left off[...]

The three outstanding tasks from this comment still remain:

  • A key must be generated and added to each application's vault/configurations.
  • The verify_hmac function should be moved to a more appropriate place, especially if we plan on reusing it. Maybe one of our utils.py file would be a good place?
  • The update_marc method remains unimplemented.

I was able to test verify_hmac locally by generating a hash with the following php:

<?php

$msg = "hello there";
$secret_key = "placeholderkey";

$expiry_time = date('c', time() + 60); // 'c' = ISO 8601 format
$data = $msg . "|" . $expiry_time;
$hash = hash_hmac('md5', $data, $secret_key);

echo "Hash: " . $hash . "\n";
echo "Data: " . $data . "\n";

?>

Have verified that an ExiredTokenError is thrown if the current time is greater than the expiry.

@mekarpeles
Copy link
Copy Markdown
Member

mekarpeles commented Nov 26, 2025

As reference, we have One Time Password logic in this PR that is similar (living in core/auth.py) that could a home + model for how we might want to centralize our HMAC logic
#11288

Rather than waiting for 11288 to land, maybe the takeaway is we could use core/auth.py and have some sort of hmac class/utility to centralize how we either generate or verify an hmac.

@mekarpeles
Copy link
Copy Markdown
Member

We have a new key ia_sync_secret https://github.com/internetarchive/olsystem/pull/287 that can be used for HMAC

Updates function to take the full secret key name as
a parameter.

Sets the default message delimiter to a pipe to make
parsing the expiry timestamp easier.

Fixes time comparison bug.
except (ValueError, ExpiredTokenError):
raise web.HTTPError("401 Unauthorized")

op, ocaid, _ = msg.split("|")
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including the op in the encrypted message isn't necessary. This handler's path will be updated to something like /ia/unlink, and will handle all disassociation requests. Later, a complementary /ia/link handler will be created for association requests.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

msg is now expected to by in <ocaid>:<timestamp format.

Updates the endpoint to `/ia/unlink.json`, removing the
need for an operation to be included in request payloads.
@jimchamp jimchamp marked this pull request as ready for review December 2, 2025 23:55
Copilot AI review requested due to automatic review settings December 2, 2025 23:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new endpoint for syncing data between Internet Archive (IA) and Open Library (OL), specifically for "unlinking" items by removing OCAIDs from editions. The implementation includes HMAC-based authentication with timestamp validation to secure the endpoint.

Key Changes

  • Creates new HMACToken authentication class in openlibrary/core/auth.py with HMAC verification and expiry checking
  • Adds /ia/unlink POST endpoint in openlibrary/plugins/admin/code.py to remove OCAIDs from editions
  • Implements make_dark operation to delete ocaid and ia: source records from editions

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.

File Description
openlibrary/core/auth.py New file containing HMACToken class for HMAC digest verification with timestamp validation and timing-attack mitigation
openlibrary/plugins/admin/code.py Adds unlink_ia_ol endpoint class with POST handler for authenticated OCAID removal, including edition lookup and dark-making logic

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +12 to +66
class HMACToken:
@staticmethod
def verify(digest: str, msg: str, secret_key_name: str, delimiter: str = "|"):
"""
Verify an HMAC digest against a message with timestamp validation.

This method validates that the provided HMAC digest matches the expected
digest for the given message, and that the token has not expired. The
message is expected to contain an ISO format timestamp as its last
delimiter-separated component.

To mitigate timing attacks, all cryptographic operations are performed
before raising any exceptions, ensuring constant-time execution regardless
of which validation fails first.

Args:
digest: The HMAC digest to verify (hexadecimal string).
msg: The message to verify, which must end with a delimiter-separated
ISO format timestamp (e.g., "data|2024-12-31T23:59:59+00:00").
secret_key_name: The configuration key name used to retrieve the
secret key for HMAC computation.
delimiter: The character used to separate message components.
Defaults to "|".

Returns:
bool: True if the digest is valid and the token has not expired.

Raises:
ExpiredTokenError: If the timestamp in the message indicates the
token has expired (raised after digest comparison).
ValueError: If the secret key cannot be found in the configuration
(raised after digest comparison).
"""
current_time = datetime.datetime.now(datetime.UTC)
expiry_str = msg.split(delimiter)[-1]
expiry = datetime.datetime.fromisoformat(expiry_str)

err: Exception | None = None

if current_time > expiry:
err = ExpiredTokenError()

if not (key := config.get(secret_key_name, '')):
err = ValueError("No key found")

mac = ''
if key:
mac = hmac.new(
key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5
).hexdigest()

result = hmac.compare_digest(mac, digest)
if err:
raise err
return result
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new HMACToken class lacks test coverage. Tests should be added to verify:

  1. Successful verification with valid digest and non-expired timestamp
  2. Rejection of expired tokens
  3. Rejection of invalid digests
  4. Handling of missing secret keys
  5. Constant-time execution for timing attack mitigation
  6. Proper parsing of delimiter-separated messages
  7. Handling of malformed timestamps

Copilot uses AI. Check for mistakes.
Comment on lines +266 to +267
except (ValueError, ExpiredTokenError):
raise web.HTTPError("401 Unauthorized")
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The authentication error handling catches both ValueError and ExpiredTokenError but treats them identically with a 401 response. This makes it difficult for clients to distinguish between expired tokens (which could be retried with a new token) and other authentication failures. Consider handling ExpiredTokenError separately with a more specific error message.

Suggested change
except (ValueError, ExpiredTokenError):
raise web.HTTPError("401 Unauthorized")
except ExpiredTokenError:
raise web.HTTPError(
"401 Unauthorized",
data=json.dumps({"error": "Token expired"})
)
except ValueError:
raise web.HTTPError(
"401 Unauthorized",
data=json.dumps({"error": "Invalid token"})
)

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +34
delimiter: The character used to separate message components.
Defaults to "|".
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delimiter default value is documented as "|" but the PR description states the message format should use colon (:) as the delimiter. This inconsistency between documentation and the intended API design could lead to integration issues. Ensure the delimiter matches the agreed-upon API specification.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimchamp please talk with @judec + @hbromley about how we want the hmc to be

Comment on lines +269 to +275
ocaid, ts = msg.split("|")

if not ts or not ocaid:
raise web.HTTPError(
"400 Bad Request", data=json.dumps({"error": "Invalid inputs"})
)

Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states the message format should be colon-delimited ({operation}:{ocaid}:{expiry}), but the code uses pipe delimiter (|). This mismatch means the implementation doesn't follow the documented API design. Additionally, the code doesn't parse or use an operation field at all - it only parses ocaid and ts and hardcodes the make_dark operation.

Suggested change
ocaid, ts = msg.split("|")
if not ts or not ocaid:
raise web.HTTPError(
"400 Bad Request", data=json.dumps({"error": "Invalid inputs"})
)
try:
operation, ocaid, ts = msg.split(":")
except ValueError:
raise web.HTTPError(
"400 Bad Request", data=json.dumps({"error": "Invalid message format; expected '{operation}:{ocaid}:{expiry}'"})
)
if not operation or not ocaid or not ts:
raise web.HTTPError(
"400 Bad Request", data=json.dumps({"error": "Invalid inputs"})
)
if operation != "make_dark":
raise web.HTTPError(
"400 Bad Request", data=json.dumps({"error": f"Unsupported operation: {operation}"})
)

Copilot uses AI. Check for mistakes.
Comment on lines +269 to +274
ocaid, ts = msg.split("|")

if not ts or not ocaid:
raise web.HTTPError(
"400 Bad Request", data=json.dumps({"error": "Invalid inputs"})
)
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation if not ts or not ocaid happens after splitting but doesn't handle potential parsing errors. If msg doesn't contain a pipe delimiter, the split will succeed but won't produce two values, causing an unpacking error on line 269. Consider adding error handling for the split operation or validating the number of components.

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +61
mac = ''
if key:
mac = hmac.new(
key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5
).hexdigest()
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timing-attack mitigation logic is incomplete. When the secret key is missing (line 54-55), mac remains an empty string, which will always fail the compare_digest check on line 63. However, the error is only raised after the comparison (line 64-65). This means an attacker could distinguish between a missing key and an invalid digest by timing the response, since compare_digest on empty strings is faster than on actual hashes. Consider generating a dummy MAC value even when the key is missing to maintain constant-time execution.

Suggested change
mac = ''
if key:
mac = hmac.new(
key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5
).hexdigest()
# Always produce a MAC of the correct length to mitigate timing attacks
digest_size = hashlib.md5().digest_size * 2 # hexdigest is twice the digest size
if key:
mac = hmac.new(
key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5
).hexdigest()
else:
# Use a dummy value of the correct length (all zeros)
mac = '0' * digest_size

Copilot uses AI. Check for mistakes.
Comment on lines +254 to +255
class unlink_ia_ol(delegate.page):
path = '/ia/unlink'
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The endpoint path /ia/unlink and class name unlink_ia_ol don't match the PR description which mentions /ia/sync.json endpoint. This inconsistency could cause confusion. Consider renaming to match the intended API design described in the PR.

Suggested change
class unlink_ia_ol(delegate.page):
path = '/ia/unlink'
class sync_ia_ol(delegate.page):
path = '/ia/sync.json'

Copilot uses AI. Check for mistakes.
mac = ''
if key:
mac = hmac.new(
key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MD5 is considered cryptographically weak and should not be used for HMAC in security-sensitive contexts. Consider using a stronger hash algorithm like SHA-256 (hashlib.sha256) instead. While HMAC-MD5 is not as critically vulnerable as plain MD5, using SHA-256 is the modern standard and provides better security guarantees.

Suggested change
key.encode('utf-8'), msg.encode('utf-8'), hashlib.md5
key.encode('utf-8'), msg.encode('utf-8'), hashlib.sha256

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong agree + blocker

https://github.com/internetarchive/openlibrary/pull/11288/files#diff-7ff81128d3b27b26d5cf60f2d19a44042debc4b8c9b3788903cd1fb584104ea4R38

Let's use digest = hmac.new(seed.encode('utf-8'), payload, hashlib.sha256).digest()

result = hmac.compare_digest(mac, digest)
if err:
raise err
return result
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verify method returns the result of hmac.compare_digest() but this return value is never checked in the calling code (line 265 in code.py). If the digest comparison fails but no error is set, the method returns False, but the caller doesn't handle this case - it only catches exceptions. This means an invalid digest would pass through without raising an error. Consider raising a ValueError when result is False before returning.

Suggested change
return result
if not result:
raise ValueError("Invalid digest")
return True

Copilot uses AI. Check for mistakes.
Comment on lines +47 to +51
expiry = datetime.datetime.fromisoformat(expiry_str)

err: Exception | None = None

if current_time > expiry:
Copy link

Copilot AI Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fromisoformat method can raise a ValueError if expiry_str is not a valid ISO format timestamp. This exception is not caught, which could lead to unclear error messages for the caller. Consider wrapping this in a try-except block to raise a more specific error or ensure the timing-attack mitigation logic still executes.

Suggested change
expiry = datetime.datetime.fromisoformat(expiry_str)
err: Exception | None = None
if current_time > expiry:
expiry = None
err: Exception | None = None
try:
expiry = datetime.datetime.fromisoformat(expiry_str)
except ValueError:
err = ValueError(f"Invalid ISO format timestamp: {expiry_str!r}")
if expiry is not None and current_time > expiry:

Copilot uses AI. Check for mistakes.
Comment on lines +312 to +313
if not data['source_records']:
del data['source_records']
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the source_records list is empty after the MARC record is unlinked, source_records is deleted from the edition object.

Changes disassociation endpoint to `/api/unlink.json`.
Moves handler to `plugins/openlibrary/api.py`.
@mekarpeles mekarpeles merged commit d44ecab into internetarchive:master Dec 10, 2025
4 checks passed
@mekarpeles
Copy link
Copy Markdown
Member

Merged but please let's (a) move to sha2 as per PR notes and (b) communicate with Hank and Jude re: the hmac format, timestamps, and delimiter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Needs: Special Deploy This PR will need a non-standard deploy to production

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants