Guides¶
Custom Rulesets¶
ua-parser defaults to the latest stable release of ua-core via precompiled regexes.yaml.
That is a suitable default, but there are plenty of reasons to use custom rulesets:
trim down the default ruleset to only the most current or relevant rules for efficiency, e.g. you might not care about CalDAV or podcast applications
add new rules relevant to your own traffic but which aren’t (possibly can’t be) in the main project
experiment with the creation of new rules
use a completely bespoke ruleset to track UA-identified API clients
use “experimental” rules which haven’t been released yet (although ua-parser-builtins provides regular prerelease versions which may be suitable for this)
ua-parser provides easy ways to load custom rulesets:
ua_parser.loaders converts whatever external storage format the rules are in to the internal format
ua_parser.Parser.from_matchers() can directly create a parser from the loaded data, using the default resolver stack
from ua_parser import Parser
from ua_parser.loaders import load_yaml # requires PyYAML
parser = Parser.from_matchers(load_yaml("regexes.yaml"))
parser.parse(some_ua)
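As a sketch of the first use case above (trimming), the filter below drops rules by keyword before they are handed to a loader. It assumes rules are plain dictionaries with a `regex` key and an optional `family_replacement` key, as in regexes.yaml. The `trim_rules` helper and the sample rules are invented for illustration and are not part of ua-parser.

```python
import re
from typing import Any, Dict, List

Rule = Dict[str, Any]


def trim_rules(rules: List[Rule], unwanted: str) -> List[Rule]:
    """Keep only the rules which do not mention an unwanted keyword.

    `unwanted` is a regular expression matched case-insensitively against
    each rule's pattern source and its family replacement (if any).
    """
    pattern = re.compile(unwanted, re.IGNORECASE)
    return [
        rule
        for rule in rules
        if not pattern.search(rule["regex"])
        and not pattern.search(rule.get("family_replacement", ""))
    ]


# Made-up stand-ins for a list of user agent rules.
sample_rules: List[Rule] = [
    {"regex": r"(CalDAV-Sync)/(\d+)\.(\d+)"},
    {"regex": r"(Firefox)/(\d+)\.(\d+)"},
    {"regex": r"(iTunes)/(\d+)", "family_replacement": "Podcasts"},
]

trimmed = trim_rules(sample_rules, "caldav|podcast")
```

Filtering on keywords is crude; depending on the traffic, an explicit allow-list of families may be easier to audit.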
Custom Global Parser¶
The global utility functions parse(),
parse_user_agent(), parse_os(), and
parse_device() simply call the global
ua_parser.parser internally.
This means their behaviour can be customised by setting the global parser. Doing so affects every user of the library in the process, which is both the advantage and the risk.
>>> import ua_parser
>>> import ua_parser.loaders
>>> ua_parser.parse("foo")
Result(user_agent=None, os=None, device=None, string='foo')
>>> ua_parser.parser = ua_parser.Parser.from_matchers(
...     ua_parser.loaders.load_data((
...         [{"regex": "(foo)"}],
...         [],
...         [],
...     ))
... )
>>> ua_parser.parse("foo")
Result(user_agent=UserAgent(family='foo',
                            major=None,
                            minor=None,
                            patch=None,
                            patch_minor=None),
       os=None,
       device=None,
       string='foo')
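Because replacing the global parser affects the whole process, tests in particular may want to restore the previous value afterwards. The context manager below is a minimal sketch of that pattern. It is demonstrated on a stand-in namespace rather than on the real `ua_parser` module, the `swapped_parser` name is hypothetical, and the swap is not thread-safe.

```python
import contextlib
import types
from typing import Any, Iterator


@contextlib.contextmanager
def swapped_parser(module: Any, replacement: Any) -> Iterator[None]:
    """Temporarily replace `module.parser`, restoring it on exit."""
    previous = module.parser
    module.parser = replacement
    try:
        yield
    finally:
        # Restore even if the body raised.
        module.parser = previous


# Stand-in for a module exposing a global `parser` attribute.
fake_module = types.SimpleNamespace(parser="default parser")

with swapped_parser(fake_module, "custom parser"):
    inside = fake_module.parser
after = fake_module.parser
```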
Cache And Other Advanced Parser Customisation¶
While loading custom rulesets has built-in support, other forms of
parser customisations don’t and require manually instantiating and
composing Resolver objects.
The most basic such customisation is simply configuring caching away from the default setup.
As an example, in the default configuration the RE2-based resolver is
not cached when google-re2 is available. A user might consider the
memory investment worth it and want to reconfigure the stack for a
cached base.
The process is uncomplicated as the APIs are designed to compose together.
The first step is to instantiate a base resolver with
the relevant Matchers data:
import ua_parser.loaders
import ua_parser.re2
base = ua_parser.re2.Resolver(
    ua_parser.loaders.load_lazy_builtins())
The next step is to instantiate the cache [1] suitably configured:
cache = ua_parser.Cache(1000)
And compose the base resolver and cache together:
resolver = ua_parser.caching.CachingResolver(
    base,
    cache
)
Finally, for convenience a ua_parser.Parser can be wrapped
around the resolver, and that can either be used as-is, or set as the
global parser for all the library users to use this new configuration
from here on:
ua_parser.parser = ua_parser.Parser(resolver)
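To illustrate what composing a resolver with a cache amounts to, here is a self-contained sketch of a caching wrapper around an arbitrary callable. It is not the library's implementation: it uses a plain LRU policy, keys on the user agent string alone, and the `LruCached` and `slow_base` names are made up for the example.

```python
from collections import OrderedDict
from typing import Callable, List


class LruCached:
    """Wrap `base` so repeated inputs are served from a bounded LRU cache."""

    def __init__(self, base: Callable[[str], str], maxsize: int) -> None:
        self.base = base
        self.maxsize = maxsize
        self.entries: "OrderedDict[str, str]" = OrderedDict()

    def __call__(self, ua: str) -> str:
        if ua in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(ua)
            return self.entries[ua]
        # Cache miss: delegate to the wrapped callable and remember the result.
        result = self.base(ua)
        self.entries[ua] = result
        if len(self.entries) > self.maxsize:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return result


calls: List[str] = []


def slow_base(ua: str) -> str:
    """Toy stand-in for an expensive resolver; records every real call."""
    calls.append(ua)
    return ua.split("/", 1)[0]


cached = LruCached(slow_base, maxsize=2)
first = cached("curl/8.0")
second = cached("curl/8.0")  # served from the cache, slow_base is not called
```

The wrapper has the same call signature as the callable it wraps, which is what lets the two be stacked freely.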
Note
To be honest aside from configuring the presence, algorithm, and
size of caches there currently isn’t much to compose that’s built
in. The only remaining member of the cast is
Local, which is also caching-related,
and serves to use thread-local caches rather than a shared cache.
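The difference between a shared cache and thread-local caches can be sketched with the standard library's `threading.local`. The `PerThread` class below is a hypothetical illustration of the idea, not the library's `Local`: each thread lazily gets its own cache from a factory, so threads never see each other's entries and need no lock to access them.

```python
import threading
from typing import Callable, Dict


class PerThread:
    """Hand out one cache per thread, created on first use."""

    def __init__(self, factory: Callable[[], Dict[str, str]]) -> None:
        self._factory = factory
        self._local = threading.local()

    def get(self) -> Dict[str, str]:
        cache = getattr(self._local, "cache", None)
        if cache is None:
            # First access from this thread: build its private cache.
            cache = self._local.cache = self._factory()
        return cache


caches = PerThread(dict)
caches.get()["curl/8.0"] = "curl"  # stored in the main thread's cache

seen_in_worker: Dict[str, str] = {}


def worker() -> None:
    # The worker thread gets a fresh, empty cache of its own.
    seen_in_worker.update(caches.get())


thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

The trade-off is memory: every thread pays for its own cache, and an entry warmed by one thread does not help the others.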
Builtin Resolvers¶
| | speed | portability | memory use | safety |
|---|---|---|---|---|
| regex | great | good | bad | great |
| re2 | good | bad | good | good |
| basic | terrible | great | great | great |
regex¶
The regex resolver is a bespoke effort as part of the uap-rust sibling project, built on
rust-regex and a bespoke regex-prefiltering implementation. It:
Is the fastest available resolver, usually beating re2 by a significant margin (when that is even available).
Is fully controlled by the project, and thus can be built for all interpreters and platforms supported by pyo3 (currently: cpython, pypy, and graalpy, on linux, macos and windows, intel and arm). It is also built as a cpython abi3 wheel and should thus suffer from no compatibility issues with new releases.
Is built entirely out of safe rust code, so its safety risks are entirely in regex and pyo3.
Its biggest drawback is that it is a lot more memory intensive than the other resolvers, because regex tends to trade memory for speed (~155MB high water mark on a real-world dataset).
If available, it is the default resolver, without a cache.
re2¶
The re2 resolver is built atop the widely used google-re2 via its built-in Python bindings.
It:
Is extremely fast, though around 80% slower than regex on real-world data.
Is only compatible with CPython, and uses pure API wheels, so needs a different release for each cpython version, for each OS, for each architecture.
Is built entirely in C++, but by experienced Google developers.
Is more memory intensive than the pure-python basic resolver, but quite slim all things considered (~55MB high water mark on a real-world dataset).
If available, it is the second-preferred resolver, without a cache.
basic¶
The basic resolver is a naive linear traversal of all rules, using
the standard library’s re. It:
Is extremely slow: about 10x slower than re2 on cpython, and pypy and graal’s regex implementations do not like the workload and are 3x-4x slower than cpython.
Has perfect compatibility, with the caveat above, by virtue of being built entirely out of standard library code.
Is basically as safe as Python software can be by virtue of being just Python, with the native code being the standard library’s.
Is the slimmest resolver at about 40MB.
This is caveated by a hard requirement to use caches which makes it
workably faster on real-world datasets (if still nowhere near
uncached re2 or regex) but increases its memory requirement
significantly e.g. using “sieve” and a cache size of 20000 on a
real-world dataset, it is about 4x slower than re2 for about the
same memory requirements.
It is the fallback and least preferred resolver, with a medium (currently 2000 entries) cache by default.
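The preference order described in the three sections above can be summarised as a small lookup. This is only a restatement of the prose as code, not how the library selects its default. The `pick_default` function and the `DEFAULTS` table are hypothetical, and the cache size of 2000 is the "currently" figure quoted above.

```python
from typing import Optional, Set, Tuple

# (resolver name, default cache size), most preferred first.
# None means the resolver is used without a cache.
DEFAULTS: Tuple[Tuple[str, Optional[int]], ...] = (
    ("regex", None),
    ("re2", None),
    ("basic", 2000),
)


def pick_default(available: Set[str]) -> Tuple[str, Optional[int]]:
    """Return the most preferred available resolver and its cache size."""
    for name, cache_size in DEFAULTS:
        if name in available:
            return name, cache_size
    raise LookupError("no resolver available")


# basic only needs the standard library, so it is always a candidate.
choice = pick_default({"re2", "basic"})
```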
Writing Custom Resolvers¶
It is unclear if there would be any fun or profit to it, but an express goal of the new API is to allow writing and composing resolvers, so what is a resolver?
Resolver is a structural typing.Protocol for
implementation convenience (nothing to inherit, and not even a class
to write). Here it is in full:
class Resolver(Protocol):
    @abc.abstractmethod
    def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
        ...
So a Resolver is just a callable which takes a
string and a Domain, and returns a
PartialResult.
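The practical consequence of structural typing is that a plain function and an object with `__call__` are equally valid resolvers. The sketch below shows this with simplified stand-ins: the `Domain` and `PartialResult` defined here only mimic the shape of the library's types (fields hold plain strings), and `never_matches` and `ConstantFamily` are invented for the example.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import List, Optional, Protocol


class Domain(Flag):
    """Stand-in for the library's Domain flag."""
    USER_AGENT = auto()
    OS = auto()
    DEVICE = auto()


@dataclass(frozen=True)
class PartialResult:
    """Stand-in for the library's PartialResult, with string fields."""
    domains: Domain
    user_agent: Optional[str]
    os: Optional[str]
    device: Optional[str]
    string: str


class Resolver(Protocol):
    def __call__(self, ua: str, domain: Domain, /) -> PartialResult: ...


def never_matches(ua: str, domain: Domain, /) -> PartialResult:
    """A plain function: always reports a match failure."""
    return PartialResult(domain, None, None, None, ua)


class ConstantFamily:
    """An object with __call__: reports the same family for every input."""

    def __init__(self, family: str) -> None:
        self.family = family

    def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
        return PartialResult(domain, self.family, None, None, ua)


# Both are accepted where a Resolver is expected; neither inherits from it.
resolvers: List[Resolver] = [never_matches, ConstantFamily("curl")]
results = [resolve("curl/8.0", Domain.USER_AGENT) for resolve in resolvers]
```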
For our first resolver, let’s say that we have an API and a mobile
application, and since we expect the mobile application to be the main
caller we want to special-case it. This could be done in many ways;
here we use a bespoke Resolver which
matches the application’s user agent and performs trivial parsing:
from ua_parser import Device, Domain, PartialResult, UserAgent

def foo_resolver(ua: str, domain: Domain, /) -> PartialResult:
    if not ua.startswith('fooapp/'):
        # not our application, match failure
        return PartialResult(domain, None, None, None, ua)

    # we've defined our UA as $appname/$version/$user-token;
    # maxsplit=2 yields exactly three parts even if the token contains '/'
    app, version, user = ua.split('/', 2)
    # the version is assumed to be exactly $major.$minor
    major, minor = version.split('.')
    return PartialResult(
        domain,
        UserAgent(app, major, minor),
        None,
        Device(user),
        ua,
    )
This resolver is not hugely interesting as it resolves a very limited number of user agent strings and fails everything else, although it does demonstrate two important requirements of the protocol:
If a domain is requested, it must be returned, even if None (signaling a matching failure).
If it’s efficient, there is nothing wrong with returning data for domains which were not requested; at worst they will be ignored.
For a more interesting resolver, we can write a fallback resolver: it’s a higher-order resolver which tries to call multiple sub-resolvers in sequence until the UA is resolved. This means we could then use something like:
Parser(FallbackResolver([
    foo_resolver,
    re2.Resolver(load_lazy_builtins()),
]))
to prioritise cheap resolving of our application while still resolving third party user agents:
from typing import List

from ua_parser import Domain, PartialResult, Resolver

class FallbackResolver:
    def __init__(self, resolvers: List[Resolver]) -> None:
        self.resolvers = resolvers

    def __call__(self, ua: str, domain: Domain, /) -> PartialResult:
        if domain:
            for resolver in self.resolvers:
                r = resolver(ua, domain)
                # if any value is non-none the resolver found a match
                if (
                    r.user_agent is not None
                    or r.os is not None
                    or r.device is not None
                ):
                    return r

        # if no resolver found a match (or nothing was requested),
        # resolve to failure
        return PartialResult(domain, None, None, None, ua)
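Since a resolver is just a callable, the same fallback idea can also be written as a closure rather than a class. The sketch below shows that variant on deliberately simplified stand-ins so that it runs without the library: `ToyResult` carries a single `family` field, toy resolvers take only the user agent string, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToyResult:
    """Simplified stand-in for a partial result: one family, or None."""
    family: Optional[str]
    string: str


ToyResolver = Callable[[str], ToyResult]


def first_match(*resolvers: ToyResolver) -> ToyResolver:
    """Build a resolver returning the first sub-resolver match, in order."""

    def resolve(ua: str) -> ToyResult:
        for resolver in resolvers:
            result = resolver(ua)
            if result.family is not None:
                return result
        # No sub-resolver matched (or none were given): resolve to failure.
        return ToyResult(None, ua)

    return resolve


def app_only(ua: str) -> ToyResult:
    """Cheap special case: only recognises our own application."""
    return ToyResult("fooapp" if ua.startswith("fooapp/") else None, ua)


def catch_all(ua: str) -> ToyResult:
    """Stand-in for a general-purpose resolver which always matches."""
    return ToyResult("other", ua)


chain = first_match(app_only, catch_all)
ours = chain("fooapp/1.2/abc")
theirs = chain("curl/8.0")
```

Order matters: the cheap special case goes first so that the general-purpose resolver only runs for user agents it did not recognise.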