Guides

Custom Rulesets

ua-parser defaults to the latest stable release of uap-core, via a precompiled version of its regexes.yaml.

That is a suitable default, but there are plenty of reasons to use custom rulesets:

  • trim down the default ruleset to only the most current or relevant rules for efficiency, e.g. you might not care about CalDav or podcast applications (a trimming sketch follows the loading example below)

  • add new rules relevant to your own traffic which aren’t (and possibly can’t be) in the main project

  • experiment with the creation of new rules

  • use a completely bespoke ruleset to track UA-identified API clients

  • use “experimental” rules which haven’t been released yet (although ua-parser-builtins provides regular prerelease versions which may be suitable for this)

ua-parser provides easy ways to load custom rulesets:

from ua_parser import Parser
from ua_parser.loaders import load_yaml  # requires PyYAML

parser = Parser.from_matchers(load_yaml("regexes.yaml"))
parser.parse(some_ua)
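
For instance, here is a minimal sketch of the trimming use case, assuming the standard uap-core file layout (top-level user_agent_parsers, os_parsers, and device_parsers lists):

import yaml  # PyYAML

from ua_parser import Parser
from ua_parser.loaders import load_data

with open("regexes.yaml") as f:
    rules = yaml.safe_load(f)

# keep only the user agent rules, dropping OS and device matching
parser = Parser.from_matchers(load_data((
    rules["user_agent_parsers"],
    [],
    [],
)))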

Custom Global Parser

The global utility functions parse(), parse_user_agent(), parse_os(), and parse_device() simply delegate to the global ua_parser.parser internally.

This means it’s possible to customise their behaviour by setting the global parser, although that obviously affects all users of the module, which is both the advantage and the risk:

>>> import ua_parser
>>> import ua_parser.loaders

>>> ua_parser.parse("foo")
Result(user_agent=None, os=None, device=None, string='foo')
>>> ua_parser.parser = ua_parser.Parser.from_matchers(
...    ua_parser.loaders.load_data((
...        [{"regex": "(foo)"}],
...        [],
...        [],
...    ))
... )
>>> ua_parser.parse("foo") 
Result(user_agent=UserAgent(family='foo',
                            major=None,
                            minor=None,
                            patch=None,
                            patch_minor=None),
       os=None,
       device=None,
       string='foo')

Cache And Other Advanced Parser Customisation

While loading custom rulesets has built-in support, other forms of parser customisation don’t, and require manually instantiating and composing Resolver objects.

The most basic such customisation is simply configuring caching away from the default setup.

As an example, in the default configuration the RE2-based resolver is not cached if google-re2 is available; a user might consider the memory investment worth it, and want to reconfigure the stack with a cached base.

The process is uncomplicated as the APIs are designed to compose together.

The first step is to instantiate a base resolver with the relevant Matchers data:

import ua_parser.loaders
import ua_parser.re2
base = ua_parser.re2.Resolver(
    ua_parser.loaders.load_lazy_builtins())

The next step is to instantiate the cache, suitably configured:

cache = ua_parser.Cache(1000)

And compose the base resolver and cache together:

resolver = ua_parser.caching.CachingResolver(
    base,
    cache
)

Finally, for convenience a ua_parser.Parser can be wrapped around the resolver; that can either be used as-is, or set as the global parser so that all users of the library pick up the new configuration from here on:

ua_parser.parser = ua_parser.Parser(resolver)
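
With the global parser swapped in, the module-level helpers now go through the cached RE2 stack, and repeated lookups of the same string are served from the cache (the user agent below is made up for illustration):

import ua_parser

ua = "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0"
ua_parser.parse(ua)  # resolved by re2, result stored in the cache
ua_parser.parse(ua)  # same string, served straight from the cache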

Note

To be honest, aside from configuring the presence, algorithm, and size of caches there currently isn’t much built in to compose. The only remaining member of the cast is Local, which is also caching-related, and serves to use thread-local caches rather than a shared cache.
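
As a sketch, reusing the base resolver from above and assuming Local wraps a zero-argument cache factory (check the API reference for the exact signature), the thread-local variant slots in where the shared cache did:

import ua_parser
from ua_parser.caching import CachingResolver, Local

# each thread lazily gets its own cache instance (assumed
# factory-based construction, verify against the API reference)
resolver = CachingResolver(base, Local(lambda: ua_parser.Cache(1000)))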

Builtin Resolvers

           speed     portability  memory use  safety
  regex    great     good         bad         great
  re2      good      bad          good        good
  basic    terrible  great        great       great

regex

The regex resolver is a bespoke effort, part of the uap-rust sibling project, built on rust-regex and a bespoke regex-prefiltering implementation. It:

  • Is the fastest available resolver, usually edging out re2 by a significant margin (when that is even available).

  • Is fully controlled by the project, and thus can be built for all interpreters and platforms supported by pyo3 (currently: cpython, pypy, and graalpy, on linux, macos, and windows, intel and arm). It is also built as a cpython abi3 wheel and should thus suffer no compatibility issues with new releases.

  • Is built entirely out of safe rust code, so its safety risks are confined to regex and pyo3.

  • Its biggest drawback is that it is a lot more memory intensive than the other resolvers, because regex tends to trade memory for speed (~155MB high water mark on a real-world dataset).

If available, it is the default resolver, without a cache.

re2

The re2 resolver is built atop the widely used google-re2 via its built-in Python bindings. It:

  • Is extremely fast, though around 80% slower than regex on real-world data.

  • Is only compatible with CPython, and ships version-specific (non-abi3) wheels, so it needs a different release for each cpython version, for each OS, for each architecture.

  • Is built entirely in C++, but by experienced Google developers.

  • Is more memory intensive than the pure-python basic resolver, but quite slim all things considered (~55MB high water mark on a real-world dataset).

If available, it is the second-preferred resolver, without a cache.

basic

The basic resolver is a naive linear traversal of all rules, using the standard library’s re. It:

  • Is extremely slow: about 10x slower than re2 on cpython, and worse still on pypy and graal, whose regex implementations do not like the workload and run 3x-4x slower than cpython.

  • Has perfect compatibility, with the caveat above, by virtue of being built entirely out of standard library code.

  • Is basically as safe as Python software can be by virtue of being just Python, with the native code being the standard library’s.

  • Is the slimmest resolver at about 40MB.

This comes with a hard requirement on caching, which makes it workably faster on real-world datasets (if still nowhere near uncached re2 or regex) but increases its memory requirements significantly: e.g. using “sieve” with a cache size of 20000 on a real-world dataset, it is about 4x slower than re2 for about the same memory requirements.

It is the fallback and least preferred resolver, with a medium (currently 2000 entries) cache by default.
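
For reference, the default basic setup can be reproduced explicitly using the composition APIs from the previous section; this sketch assumes the basic resolver lives at ua_parser.basic.Resolver, mirroring ua_parser.re2.Resolver:

import ua_parser
from ua_parser.basic import Resolver as BasicResolver  # assumed module path
from ua_parser.caching import CachingResolver
from ua_parser.loaders import load_lazy_builtins

# linear resolver fronted by a default-sized (2000 entries) cache
parser = ua_parser.Parser(
    CachingResolver(
        BasicResolver(load_lazy_builtins()),
        ua_parser.Cache(2000),
    )
)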

Writing Custom Resolvers

It is unclear whether there would be any fun or profit in it, but an express goal of the new API is to allow writing and composing resolvers. So what is a resolver?

Resolver is a structural typing.Protocol for implementation convenience (nothing to inherit, and not even a class to write). Here it is in full:

class Resolver(Protocol):
    @abc.abstractmethod
    def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
        ...

So a Resolver is just a callable which takes a string and a Domain, and returns a PartialResult.

For our first resolver, let’s say that we have an API and a mobile application, and since we expect the mobile application to be the main caller we want to special-case it. This could be done in many ways, but the way we’re doing it here is a bespoke Resolver which matches the application’s user agent and performs trivial parsing:

from ua_parser import Device, Domain, PartialResult, UserAgent

def foo_resolver(ua: str, domain: Domain, /) -> PartialResult:
    if not ua.startswith('fooapp/'):
        # not our application, match failure
        return PartialResult(domain, None, None, None, ua)

    # we've defined our UA as $appname/$version/$user-token, so
    # two splits yield the three segments even if the user token
    # itself contains slashes
    app, version, user = ua.split('/', 2)
    major, minor = version.split('.')
    return PartialResult(
        domain,
        UserAgent(app, major, minor),
        None,
        Device(user),
        ua,
    )

This resolver is not hugely interesting as it resolves a very limited number of user agent strings and fails everything else, although it does demonstrate two important requirements of the protocol:

  • If a domain is requested, a value must be returned for it, even if that value is None (signaling a match failure).

  • If it’s efficient to do so, there is nothing wrong with returning data for domains which were not requested; at worst it will be ignored.
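
As a quick sanity check, the resolver can be wrapped in a Parser directly, just like the built-in ones (the user agent string is made up for illustration):

from ua_parser import Parser

parser = Parser(foo_resolver)

result = parser.parse("fooapp/1.2/user-123")
assert result.user_agent.family == "fooapp"
assert result.device.family == "user-123"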

For a more interesting resolver, we can write a fallback resolver: a higher-order resolver which calls multiple sub-resolvers in sequence until one resolves the UA. This means we could then use something like:

Parser(FallbackResolver([
    foo_resolver,
    re2.Resolver(load_lazy_builtins()),
]))

to prioritise cheap resolving of our application while still resolving third party user agents:

from typing import List

from ua_parser import Domain, PartialResult, Resolver

class FallbackResolver:
    def __init__(self, resolvers: List[Resolver]) -> None:
        self.resolvers = resolvers

    def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
        if domain:
            for resolver in self.resolvers:
                r = resolver(ua, domain)
                # if any value is non-None the resolver found a match
                if r.user_agent is not None \
                  or r.os is not None \
                  or r.device is not None:
                    return r

        # if no resolver found a match (or nothing was requested),
        # resolve to failure
        return PartialResult(domain, None, None, None, ua)
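
And since resolvers compose, nothing prevents fronting the fallback stack with a cache, reusing the pieces from the caching section; a final sketch with the names defined above:

import ua_parser
import ua_parser.re2
from ua_parser.caching import CachingResolver
from ua_parser.loaders import load_lazy_builtins

parser = ua_parser.Parser(
    CachingResolver(
        FallbackResolver([
            foo_resolver,
            ua_parser.re2.Resolver(load_lazy_builtins()),
        ]),
        ua_parser.Cache(1000),
    )
)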