Python BeautifulSoup: how to correctly parse quote symbols in link titles (broken HTML)?
I'm learning some basic web scraping so I can download part of a static website for offline viewing. In the process, I'm running into link tags that look something like this:
<a href='https://example.com' title='Example web author's link to another page'>Page X</a>
Or, that's what I'd expect to get when looking at the page source. When I run html = urlopen(url).read() and then soup = BeautifulSoup(html, 'lxml'), BeautifulSoup somehow mangles the link title so it ends up looking more like this, depending on the text and number of quotes:
<a href="https://example.com" s'=" link to another page" title="Example web author">Page X</a>
<a page'="" href="https://example.com" s="" title="Author">Page X</a>
I'm certain the odd number of quote symbols is throwing the thing off by making it think the title ends earlier than it really does. It's a bit of a problem since link.get('title') only returns the part of the title before any quote symbols. Changing the parser didn't work and search engines are only giving me generic BeautifulSoup articles when I look this up, so I've no idea how to fix it. I'm running BeautifulSoup version 4.14.2 on Arch Linux if that helps.
1 answer
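In this case, if what you've included in your post is accurate, the parser is behaving correctly. Look at where the quotes pair up in your example: the apostrophe in "author's" closes the title value, and everything after it is read as further attributes.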
<a href='https://example.com' title='Example web author's link to another page'>Page X</a>
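The href and title attributes both use single quotes. Single and double quotes are interchangeable as attribute delimiters in HTML, but when the attribute value itself contains one of those characters, either the other kind of quote must be used as the delimiter or the character must be escaped as an entity (for example &#39; for an apostrophe).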
For example, this is valid:
<span title="Can't connect foo to bar"></span>
This is not:
<span title='Can't connect foo to bar'></span>
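To see the difference without involving BeautifulSoup at all, here is a minimal sketch using Python's standard-library html.parser. The TitleCollector class and first_title helper are names made up for this illustration. It feeds the two spans above, plus an entity-escaped variant, through the parser and prints the title each one yields. The single-quoted, unescaped variant should come back cut off at the apostrophe.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical helper: records the title attribute of each start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values are already unescaped.
        for name, value in attrs:
            if name == "title":
                self.titles.append(value)


def first_title(markup):
    """Return the first title attribute found in markup, or None."""
    collector = TitleCollector()
    collector.feed(markup)
    collector.close()
    return collector.titles[0] if collector.titles else None


# Double-quoted delimiter: the apostrophe is ordinary text inside the value.
double_quoted = '<span title="Can\'t connect foo to bar"></span>'
# Single-quoted delimiter: the apostrophe ends the value early.
single_quoted = "<span title='Can't connect foo to bar'></span>"
# Single-quoted delimiter with the apostrophe escaped as an entity.
escaped = "<span title='Can&#39;t connect foo to bar'></span>"

for markup in (double_quoted, single_quoted, escaped):
    print(repr(first_title(markup)))
```

The leftover words after the early closing quote are not lost; the parser treats them as separate, valueless attributes, which matches the stray s'= and page'= attributes shown in the question.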
If what you've written here is accurate, then the website you got this HTML from is serving malformed HTML, and any standards-compliant parser will read it the same way: the title ends at the first apostrophe.
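If the site cannot be fixed, one possible workaround is to repair the markup as text before handing it to the parser. The sketch below is a heuristic, not a general solution: it assumes the title attribute is always the last attribute in its tag (so the real closing quote is the one immediately before the closing bracket) and that the title text never contains ">". The function name repair_single_quoted_titles is invented for this example.

```python
import re

# ASSUMPTION: title is the LAST attribute in the tag, so the true closing
# quote is the one directly followed by optional whitespace, "/" and ">".
# ASSUMPTION: the title text itself never contains ">".
TITLE_RE = re.compile(r"title='([^>]*)'(\s*/?>)")


def repair_single_quoted_titles(html):
    """Re-quote single-quoted title attributes with double quotes."""

    def requote(match):
        # Apostrophes are harmless inside a double-quoted value; only
        # literal double quotes need escaping.
        value = match.group(1).replace('"', "&quot;")
        return 'title="' + value + '"' + match.group(2)

    return TITLE_RE.sub(requote, html)


broken = ("<a href='https://example.com' "
          "title='Example web author's link to another page'>Page X</a>")
repaired = repair_single_quoted_titles(broken)
print(repaired)
```

The repaired string can then be passed to BeautifulSoup(repaired, 'lxml') in place of the raw page, after which link.get('title') should see the whole title. If the downloaded page is bytes, as returned by urlopen(url).read(), it needs decoding to str first. Tags where title is followed by other attributes would be mangled by this pattern, so it is worth checking a sample of the real pages before relying on it.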
