
[Proposal]: Add HTMLEntry & related #1912

@stloyd

Describe the Proposal

To make it easier to work with the new \DOM\HTMLDocument, it would be good to introduce a new entry, type, and cast for it. This would be a nice addition to the existing XML type and would make site scraping and data extraction from HTML documents much easier.

In theory, we could extend the existing XMLEntry, but that one is more specific, and a dedicated one sounds like a better idea.

API Adjustments

Entry:

/**
 * @implements Entry<?\DOM\HTMLDocument>
 */
final class HTMLEntry implements Entry
{
    use EntryRef;

    private readonly ?HTMLDocument $value;

    public function __construct(
        private readonly string $name,
        HTMLDocument|string|null $value,
    ) {
        if (\is_string($value)) {
            try {
                $value = \DOM\HTMLDocument::createFromString($value, \LIBXML_COMPACT | \LIBXML_NOERROR);
            } catch (\ValueError $e) {
                throw new InvalidArgumentException(\sprintf('Given string "%s" is not valid HTML', $value), $e->getCode(), $e);
            }
        }

        // Keep the parsed (or already parsed) document as the entry value.
        $this->value = $value;
    }

    //...
}
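
One thing worth keeping in mind for the constructor above: as far as I can tell, createFromString() follows HTML5 parsing rules, so malformed markup gets repaired rather than rejected. The ValueError branch would then mostly cover invalid arguments, not broken HTML. A minimal sketch using only the built-in PHP 8.4 API (the sample markup is made up for illustration):

```php
<?php

declare(strict_types=1);

// Requires PHP 8.4+ with ext-dom. Deliberately malformed: <p> and <b> are never closed.
$malformed = '<p>Hello <b>world';

// The HTML5 parser repairs the markup instead of rejecting it.
$doc = \Dom\HTMLDocument::createFromString($malformed, \LIBXML_COMPACT | \LIBXML_NOERROR);

// The paragraph is still reachable through a CSS selector.
$paragraph = $doc->querySelector('p');
$text = $paragraph?->textContent;

// Without LIBXML_HTML_NOIMPLIED the parser also adds the implied <html>/<head>/<body>.
$hasBody = $doc->querySelector('body') !== null;
```

So an HTMLEntry built from a scraped fragment should normally succeed even when the fragment is incomplete.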

Cast:

/**
 * @implements Type<HTMLDocument>
 */
final readonly class HTMLType implements Type
{
    // ...

    public function cast(mixed $value): HTMLDocument
    {
        if ($this->isValid($value)) {
            return $value;
        }

        if (\is_string($value)) {
            return HTMLDocument::createFromString($value, \LIBXML_COMPACT | \LIBXML_NOERROR);
        }

        try {
            $stringValue = type_string()->cast($value);

            return HTMLDocument::createFromString($stringValue, \LIBXML_COMPACT | \LIBXML_NOERROR);
        } catch (CastingException $e) {
            throw new CastingException($value, $this, $e);
        }
    }

    // ...
}
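
To illustrate the intended cast flow without any Flow classes, here is a rough standalone sketch. castToHtmlDocument() is a hypothetical helper, and it uses plain \InvalidArgumentException where the real type would use type_string() and CastingException:

```php
<?php

declare(strict_types=1);

// Hypothetical helper mirroring the proposed HTMLType::cast() with plain PHP 8.4 only.
function castToHtmlDocument(mixed $value): \Dom\HTMLDocument
{
    // Already a document: return it untouched.
    if ($value instanceof \Dom\HTMLDocument) {
        return $value;
    }

    // Anything that cannot be safely turned into a string is rejected.
    if (!\is_scalar($value) && !$value instanceof \Stringable) {
        throw new \InvalidArgumentException(
            \sprintf('Cannot cast %s to HTMLDocument', \get_debug_type($value))
        );
    }

    return \Dom\HTMLDocument::createFromString((string) $value, \LIBXML_COMPACT | \LIBXML_NOERROR);
}

$fromString = castToHtmlDocument('<h1>Title</h1>');
$same = castToHtmlDocument($fromString);
$heading = $fromString->querySelector('h1')?->textContent;

$rejected = false;

try {
    castToHtmlDocument(['not', 'html']);
} catch (\InvalidArgumentException) {
    $rejected = true;
}
```

The idea is the same as in the proposal: documents pass through, strings are parsed, and everything else goes through a string cast first or fails.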

Query function:

final class DomQueryAll extends ScalarFunctionChain
{
    public function __construct(
        private readonly mixed $value,
        private readonly ScalarFunction|string $path,
    ) {
    }

    /**
     * @return null|array<Element>
     */
    public function eval(Row $row) : ?array
    {
        $value = (new Parameter($this->value))->asInstanceOf($row, \DOM\HTMLDocument::class);
        $path = (new Parameter($this->path))->asString($row);

        if ($value === null || $path === null) {
            return null;
        }

        $result = $value->querySelectorAll($path);

        if ($result->count() === 0) {
            return null;
        }

        // ...
    }
}
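
For reference, the core of eval() can be tried with plain PHP 8.4, without Row or Parameter. queryAllOrNull() is a hypothetical helper name, and the sample HTML is made up. Converting the NodeList to an array is just one possible way to fill in the elided part:

```php
<?php

declare(strict_types=1);

// Hypothetical helper: returns matching elements, or null when nothing matches.
function queryAllOrNull(\Dom\HTMLDocument $doc, string $selector): ?array
{
    // querySelectorAll() returns a Dom\NodeList, which is countable and iterable.
    $result = $doc->querySelectorAll($selector);

    if ($result->count() === 0) {
        return null;
    }

    return \iterator_to_array($result, false);
}

$html = '<!DOCTYPE html><html><body><ul><li class="item">A</li><li class="item">B</li></ul></body></html>';
$doc = \Dom\HTMLDocument::createFromString($html, \LIBXML_COMPACT | \LIBXML_NOERROR);

$items = queryAllOrNull($doc, 'li.item');
$texts = \array_map(static fn (\Dom\Element $el): string => (string) $el->textContent, $items ?? []);

// No <table> in the sample, so this yields null instead of an empty array.
$missing = queryAllOrNull($doc, 'table');
```

Returning null for an empty result keeps the function consistent with the null handling for missing parameters above.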

Are you intending to also work on proposed change?

Yes

Are you interested in sponsoring this change?

No

Integration & Dependencies

Requires PHP 8.4+ with ext-dom enabled.
