Regex for simplifying Amazon product URLs

−3

Amazon product URLs are often very long, but experimentation shows that the following pattern is sufficient:

https://www.amazon.com/{dp|gp}/$ID

$ID is a 10-character string, which I'm guessing is the ASIN. I want to reduce long URLs to this minimal form. What regex would identify the ID?

regex url

posted over 2 years ago

CC BY-SA 4.0

2y ago by Alexei‭

matthewsnyder‭


2 answers

−0

Actually, an Amazon product URL can also include a product title and have extra params after the ASIN. Example:

https://www.amazon.com/Hitchhikers-Guide-Galaxy/dp/B0009JKV9W/ref=sr_1_1?crid=13ECW_etc.....
                       ^^^^^^^^^^^^^^^^^^^^^^^^               ^^^^^^^^^^^^^^^
                         product description                  extra stuff

But that part is optional. For example, the same product above can be accessed by the URL https://www.amazon.com/dp/B0009JKV9W.

Anyway, the regex could be like this:

https://www\.amazon\.com(?:/[^/]+)?/[dg]p/([a-zA-Z0-9]{10})

The (?:/[^/]+)? part matches the optional product title:

[^/] means "any character that's not a slash"
+ means "one or more occurrences"

Therefore, /[^/]+ is a slash followed by one or more characters that are not a slash. And it's all inside parentheses, followed by ? (which means "zero or one occurrence", AKA "optional").

Parentheses normally create a capturing group, but the (?: syntax makes it a non-capturing group. I prefer to use this to indicate that I'm not interested in whatever matches that part of the regex, and it prevents the engine from creating groups that I don't want to look at.

Then [dg] matches either d or g, and it's followed by p.

And finally, [a-zA-Z0-9] matches any lowercase or uppercase ASCII letter, or any digit between 0 and 9. {10} means "exactly 10 occurrences". And everything is inside parentheses, to create a capturing group, whose value you can easily get by using any engine/library.
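The pieces described above can also be written as one commented pattern. This is only a sketch using Python's re.VERBOSE flag, which ignores whitespace and allows # comments inside the pattern; the name ASIN_URL is made up for illustration.

```python
import re

# Same pattern as above, split into commented pieces via re.VERBOSE.
# ASIN_URL is a hypothetical name chosen for this sketch.
ASIN_URL = re.compile(r"""
    https://www\.amazon\.com   # scheme and host, with the dots escaped
    (?:/[^/]+)?                # optional product title (non-capturing)
    /[dg]p/                    # either /dp/ or /gp/
    ([a-zA-Z0-9]{10})          # group 1: the 10-character ID
""", re.VERBOSE)

match = ASIN_URL.match("https://www.amazon.com/dp/B0009JKV9W")
asin = match[1] if match else None
```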

For example, in Python:

import re

# a short product URL, as in the example above
url = 'https://www.amazon.com/dp/B0009JKV9W'

if match := re.match(r'https://www\.amazon\.com(?:/[^/]+)?/[dg]p/([a-zA-Z0-9]{10})', url):
    # get capturing group 1
    asin = match[1]

As groups are indexed in the order they appear in the expression, the information you want to extract will be in group 1 (as this is the first capturing group of the regex).
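Since the goal in the question is to get back the minimal URL, here is a rough sketch of how that could look. It is a variation on the pattern above, not the only way to do it: the dp/gp segment gets its own capturing group so the rebuilt URL keeps whichever form the original used. The names PRODUCT_URL and simplify_amazon_url are hypothetical, and the long URL is a shortened version of the earlier example.

```python
import re

# Variation of the pattern above: the dp/gp segment is captured too,
# so the rebuilt URL keeps whichever form the original used.
PRODUCT_URL = re.compile(
    r'https://www\.amazon\.com(?:/[^/]+)?/([dg]p)/([a-zA-Z0-9]{10})'
)

def simplify_amazon_url(url):
    """Return the minimal product URL, or None if the pattern does not match."""
    match = PRODUCT_URL.match(url)
    if match is None:
        return None
    kind, asin = match.groups()  # group 1 is dp/gp, group 2 is the ID
    return f'https://www.amazon.com/{kind}/{asin}'

long_url = 'https://www.amazon.com/Hitchhikers-Guide-Galaxy/dp/B0009JKV9W/ref=sr_1_1'
print(simplify_amazon_url(long_url))  # https://www.amazon.com/dp/B0009JKV9W
```

Returning None for non-matching input leaves it to the caller to decide whether to keep the original URL or report an error.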

You can also check here an example of this regex working.

The other answer used A-z to match ASCII letters, but this interval also matches the characters [, \, ], ^, _ and `. Using a-zA-Z, you guarantee that only ASCII letters are matched.
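If you want to check this yourself, a quick sketch like the one below lists every ASCII character that [A-z] accepts but [a-zA-Z] rejects (the variable name extras is just for illustration):

```python
import re

# Collect every ASCII character that [A-z] accepts but [a-zA-Z] rejects.
extras = [
    chr(code) for code in range(128)
    if re.fullmatch(r'[A-z]', chr(code))
    and not re.fullmatch(r'[a-zA-Z]', chr(code))
]
print(extras)  # ['[', '\\', ']', '^', '_', '`']
```

These are the six characters that sit between Z (code 90) and a (code 97) in the ASCII table.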

That said, regex isn't exactly the best tool for URL parsing. Of course for specific, controlled cases (such as "I know it's always an Amazon product URL, which has a well defined structure"), it works fine. But for more general cases, there are specialized libraries/modules/functions that can parse, validate, handle tons of corner cases, and so on (things that can be harder to solve with a regular expression).
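As an illustration of that alternative, here is a minimal sketch using urlsplit from Python's standard urllib.parse module. The function name extract_asin is hypothetical, and the rules it applies (https scheme, www.amazon.com host, a dp or gp segment directly followed by a 10-character alphanumeric segment) are assumptions taken from the pattern in the question, not an official description of Amazon URLs.

```python
from urllib.parse import urlsplit

def extract_asin(url):
    """Return the 10-character ID after a dp/gp path segment, or None."""
    parts = urlsplit(url)
    if parts.scheme != 'https' or parts.hostname != 'www.amazon.com':
        return None
    # Query string and fragment are already separated from the path.
    segments = [s for s in parts.path.split('/') if s]
    # Look for a 'dp' or 'gp' segment immediately followed by the ID.
    for marker, candidate in zip(segments, segments[1:]):
        if (marker in ('dp', 'gp') and len(candidate) == 10
                and candidate.isascii() and candidate.isalnum()):
            return candidate
    return None

print(extract_asin('https://www.amazon.com/dp/B0009JKV9W'))  # B0009JKV9W
```

Note that this is not strictly equivalent to the regex: it accepts any number of path segments before dp/gp, and it requires the ID segment to be exactly 10 characters long.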

posted 27 days ago

CC BY-SA 4.0

hkotsubo‭


1 comment thread

[A-Z] may match way more than ASCII letters! (1 comment)

−2

The following expression should capture the {dp|gp}/$ID part:

https://www\.amazon\.com/([gd]p/[A-z0-9]{10})

A quick explanation:

the \. are there to match periods only (otherwise it would match any symbol),
[gd]p matches either gp or dp,
[A-z0-9]{10} matches exactly 10 alphanumeric characters
The parentheses around these last two components capture them. This is not strictly necessary, but it lets the regex engine return only this part of the string.
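To show what the capturing group gives you, here is a small Python sketch. The sample URL is only an example of the short form described in the question, and the variable names are made up for illustration.

```python
import re

pattern = r'https://www\.amazon\.com/([gd]p/[A-z0-9]{10})'
match = re.match(pattern, 'https://www.amazon.com/dp/B0009JKV9W')
if match:
    captured = match.group(1)   # 'dp/B0009JKV9W'
    # Prepend the fixed prefix to rebuild the minimal URL.
    minimal = 'https://www.amazon.com/' + captured
```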

posted over 2 years ago

CC BY-SA 4.0

mr Tsjolder‭
