Welcome to Software Development on Codidact!
Regex for simplifying Amazon product URLs
Amazon product URLs are often very long, but some experimentation shows that the following pattern is sufficient:
https://www.amazon.com/{dp|gp}/$ID
ID is a 10-char string, which I'm guessing is the ASIN. I want to parse out the long URLs into the minimal form. What exactly is the regex for identifying the ID?
2 answers
Actually, an Amazon product URL can also include a product title and have extra params after the ASIN. Example:
https://www.amazon.com/Hitchhikers-Guide-Galaxy/dp/B0009JKV9W/ref=sr_1_1?crid=13ECW_etc.....
                       ^^^^^^^^^^^^^^^^^^^^^^^^               ^^^^^^^^^^^^^^^
                         product description                    extra stuff
But that part is optional. For example, the same product above can be accessed by the URL https://www.amazon.com/dp/B0009JKV9W.
Anyway, the regex could be like this:
https://www\.amazon\.com(?:/[^/]+)?/[dg]p/([a-zA-Z0-9]{10})
The (?:/[^/]+)? part matches the optional product title:
- [^/] means "any character that's not a slash"
- + means "one or more occurrences"
Therefore, /[^/]+ is a slash followed by one or more characters that are not a slash. And it's all inside parentheses, followed by ? (which means "zero or one occurrence", AKA "optional").
Parentheses normally create a capturing group, but the (?: syntax makes it a non-capturing group. I prefer to use this to indicate that I'm not interested in whatever matches that part of the regex, and it prevents the engine from creating groups that I don't want to look at.
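To see the difference between the two kinds of group, here is a small sketch using Python's re module. The variable names are just for illustration, and the URL is a shortened version of the example above:

```python
import re

url = "https://www.amazon.com/Hitchhikers-Guide-Galaxy/dp/B0009JKV9W/ref=sr_1_1"

# Capturing version: the optional title becomes group 1 and the ID becomes group 2
capturing = re.match(r"https://www\.amazon\.com(/[^/]+)?/[dg]p/([a-zA-Z0-9]{10})", url)
print(capturing.groups())

# Non-capturing version: only the ID is captured, as group 1
non_capturing = re.match(r"https://www\.amazon\.com(?:/[^/]+)?/[dg]p/([a-zA-Z0-9]{10})", url)
print(non_capturing.groups())
```

With the capturing version the ID ends up in group 2 instead of group 1, so the group numbers depend on a part of the URL you don't care about.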
Then [dg] matches either d or g, and it's followed by p.
And finally, [a-zA-Z0-9] matches any lowercase or uppercase ASCII letter, or any digit between 0 and 9. {10} means "exactly 10 occurrences". And everything is inside parentheses, to create a capturing group, whose value you can easily get by using any engine/library.
For example, in Python:
import re
url = ... # some URL
if match := re.match(r'https://www\.amazon\.com(?:/[^/]+)?/[dg]p/([a-zA-Z0-9]{10})', url):
    # get capturing group 1
    asin = match[1]
As groups are indexed in the order they appear in the expression, the information you want to extract will be in group 1 (as this is the first capturing group of the regex).
You can also check here an example of this regex working.
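Since the goal in the question is to turn a long URL into the minimal form, the regex can be wrapped in a small helper. This is only a sketch: the name simplify_amazon_url is hypothetical, and I added a second capturing group so that the dp or gp segment is kept exactly as it appeared in the input.

```python
import re

# Same pattern as above, with an extra group so the dp/gp segment is preserved
PRODUCT_URL = re.compile(r"https://www\.amazon\.com(?:/[^/]+)?/([dg]p)/([a-zA-Z0-9]{10})")

def simplify_amazon_url(url):
    """Return the minimal product URL, or None if the URL doesn't match."""
    match = PRODUCT_URL.match(url)
    if match is None:
        return None
    kind, asin = match.groups()
    return f"https://www.amazon.com/{kind}/{asin}"

print(simplify_amazon_url("https://www.amazon.com/Hitchhikers-Guide-Galaxy/dp/B0009JKV9W/ref=sr_1_1"))
# https://www.amazon.com/dp/B0009JKV9W
```

Returning None for non-matching input leaves it to the caller to decide what to do with URLs that aren't product pages.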
The other answer used A-z to match ASCII letters, but this interval also matches the characters [, \, ], ^, _ and `. Using a-zA-Z, you guarantee that only ASCII letters are matched.
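A quick way to check this is to list the characters in the A-z range that aren't letters, and to try both character classes on a string that contains one of them. The bogus_id value is made up for the test; it is not a real ASIN.

```python
import re

# Characters between "A" and "z" (inclusive) that are not ASCII letters
extras = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]
print(extras)

bogus_id = "B0009_KV9W"  # hypothetical 10-char string containing an underscore
print(bool(re.fullmatch(r"[A-z0-9]{10}", bogus_id)))     # True
print(bool(re.fullmatch(r"[a-zA-Z0-9]{10}", bogus_id)))  # False
```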
That said, regex isn't exactly the best tool for URL parsing. Of course for specific, controlled cases (such as "I know it's always an Amazon product URL, which has a well defined structure"), it works fine. But for more general cases, there are specialized libraries/modules/functions that can parse, validate, handle tons of corner cases, and so on (things that can be harder to solve with a regular expression).
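As a rough sketch of that approach in Python, urllib.parse.urlsplit from the standard library can separate the host, path and query string, leaving only the path segments to inspect. The function name extract_asin and the exact host check are my own assumptions, and this version is a bit looser than the regex because it accepts dp or gp anywhere in the path.

```python
import re
from urllib.parse import urlsplit

def extract_asin(url):
    """Return the 10-character ID after a dp or gp path segment, or None."""
    parts = urlsplit(url)
    # Assumption: only this exact host counts as an Amazon product URL
    if parts.hostname != "www.amazon.com":
        return None
    # Query string and fragment are already separated, so only the path is split
    segments = [s for s in parts.path.split("/") if s]
    # Look for "dp" or "gp" immediately followed by a 10-char alphanumeric segment
    for marker, candidate in zip(segments, segments[1:]):
        if marker in ("dp", "gp") and re.fullmatch(r"[a-zA-Z0-9]{10}", candidate):
            return candidate
    return None

print(extract_asin("https://www.amazon.com/Hitchhikers-Guide-Galaxy/dp/B0009JKV9W/ref=sr_1_1?crid=13ECW"))
# B0009JKV9W
```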
The following expression should capture the {dp|gp}/$ID part:
https://www\.amazon\.com/([gd]p/[A-z0-9]{10})
A quick explanation:
- the \. are there to match periods only (otherwise . would match any symbol),
- [gd]p matches either gp or dp,
- [A-z0-9]{10} matches exactly 10 alphanumeric characters
- The parentheses around these last two components capture them. This is not really necessary, but the matching algorithm will be able to export only this part of the string as well.
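For illustration, this is how the expression might be used from Python's re module. It is only a sketch, and the variable names are arbitrary. Note that this pattern expects dp or gp right after the domain, so a URL with a product title in front of it does not match.

```python
import re

pattern = r"https://www\.amazon\.com/([gd]p/[A-z0-9]{10})"

match = re.match(pattern, "https://www.amazon.com/dp/B0009JKV9W")
print(match[1])  # dp/B0009JKV9W

# A URL with a product title before /dp/ is not matched by this pattern
with_title = re.match(pattern, "https://www.amazon.com/Hitchhikers-Guide-Galaxy/dp/B0009JKV9W")
print(with_title)  # None
```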
