YUKI-based Next-generation Async-cache
Yukina works in three stages:

- Stage 1: Get nginx logs for some days (7 days by default), filter for the interesting requests, and collect their "popularity" (vote count; see the sketch below).
- Stage 2: Get metadata of the interesting local files.
- Stage 3: Remove files that are not "popular", and try to fetch new files while staying under the size limit.

TODO:

- Add examples in yuki configuration
- Eliminate the need for lstat()s in stage 2 (a bit slow for nix-channels)
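To make stage 1 concrete, here is a minimal, hypothetical sketch of the vote-counting idea (not yukina's actual code): tally how often each path shows up in combined-format access log lines.

```rust
use std::collections::HashMap;

/// Tally requests per path from nginx "combined" access log lines.
/// The real stage 1 would also apply the time window, status checks
/// and the user-supplied --filter before counting.
fn count_votes(lines: impl Iterator<Item = String>) -> HashMap<String, u64> {
    let mut votes: HashMap<String, u64> = HashMap::new();
    for line in lines {
        // combined format: ip - user [time] "METHOD /path HTTP/x" status size ...
        if let Some(request) = line.split('"').nth(1) {
            if let Some(path) = request.split_whitespace().nth(1) {
                *votes.entry(path.to_owned()).or_default() += 1;
            }
        }
    }
    votes
}
```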
An example, assuming that `access_log` is in the default combined format:

```nginx
location /pypi/web/ {
    rewrite ^/pypi/web/(.*)$ /pypi/$1 permanent;
}

# Don't normalize these special index files
location = /pypi/web/simple/index.html {}
location = /pypi/web/simple/index.v1_html {}
location = /pypi/web/simple/index.v1_json {}

location ~ ^/pypi/simple/[^/]*([A-Z]|_|\.)[^/]* {
    # local package_name = ngx.var.uri:match("/pypi/simple/(.+)")
    # if package_name and package_name ~= "index.html" then
    #     -- Normalize the package name per PEP 503
    #     local normalized = package_name:gsub("[-_.]+", "-"):lower()
    #     return ngx.redirect("/pypi/simple/" .. normalized, ngx.HTTP_MOVED_TEMPORARILY)
    # end
    rewrite_by_lua_file /etc/nginx/lua/pypi_normalize.lua;
}

location ~ ^/pypi/[^/]*([A-Z]|_|\.)[^/]*/json {
    # local package_name = ngx.var.uri:match("/pypi/(.+)/json")
    # if package_name then
    #     -- Normalize the package name per PEP 503
    #     local normalized = package_name:gsub("[-_.]+", "-"):lower()
    #     return ngx.redirect("/pypi/" .. normalized .. "/json", ngx.HTTP_MOVED_TEMPORARILY)
    # end
    rewrite_by_lua_file /etc/nginx/lua/pypi_normalize.lua;
}

location ~ ^/pypi/[^/]+/json$ {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    rewrite ^/pypi/([^/]+)/json$ /pypi/json/$1 break;
    types { }
    default_type "application/json; charset=utf-8";
}

location ~ ^/pypi/simple {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    # conf.d/pypi.conf:
    # map $http_accept $pypi_mirror_suffix {
    #     default ".html";
    #     "~*application/vnd\.pypi\.simple\.v1\+json" ".v1_json";
    #     "~*application/vnd\.pypi\.simple\.v1\+html" ".v1_html";
    #     "~*text/html" ".html";
    # }
    index index$pypi_mirror_suffix index.html;
    types {
        application/vnd.pypi.simple.v1+json v1_json;
        application/vnd.pypi.simple.v1+html v1_html;
        text/html html;
    }
    default_type "text/html";
    # try_files $uri$pypi_mirror_suffix $uri $uri/ @pypi_302;
    try_files $uri$pypi_mirror_suffix $uri $uri/ =404;
}

location /pypi/packages/ {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    try_files $uri $uri/ @pypi_302;
}

location /pypi/json/ {
    autoindex off;
}

location @pypi_302 {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    # -> $scheme://mirrors.example.com/pypi/...
    rewrite ^/pypi/(.*)$ $scheme://mirrors.example.com/pypi/web/$1 redirect;
}

location /anaconda/cloud/ {
    access_log /var/log/nginx/cacheproxy/anaconda.log;
    try_files $uri $uri/ @anaconda_302;
}

location @anaconda_302 {
    access_log /var/log/nginx/cacheproxy/anaconda.log;
    # -> $scheme://mirrors.example.com/anaconda/...
    return 302 $scheme://mirrors.example.com$request_uri;
}

location /nix-channels/store/ {
    access_log /var/log/nginx/cacheproxy/nix-channels.log;
    # disable autoindex, there are TOO MANY files
    autoindex off;
    try_files $uri $uri/ @nixchannels_404;
}

location @nixchannels_404 {
    access_log /var/log/nginx/cacheproxy/nix-channels.log;
    # just return 404, nix knows how to handle it
    return 404;
}

location ~ ^/flathub/(objects|deltas|delta-indexes)/ {
    access_log /var/log/nginx/cacheproxy/flathub.log;
    autoindex off;
    try_files $uri $uri/ @flathub_302;
}

location @flathub_302 {
    access_log /var/log/nginx/cacheproxy/flathub.log;
    rewrite ^/flathub/(.*)$ $scheme://dl.flathub.org/repo/$1 redirect;
}

location /freebsd-pkg/ {
    location ~ ^/freebsd-pkg/.+\.pkg {
        access_log /var/log/nginx/cacheproxy/freebsd-pkg.log;
        try_files $uri $uri/ @freebsd_pkg_302;
    }
}

location @freebsd_pkg_302 {
    access_log /var/log/nginx/cacheproxy/freebsd-pkg.log;
    # FreeBSD pkg server only supports HTTP.
    rewrite ^/freebsd-pkg/(.*)$ http://pkg.freebsd.org/$1 redirect;
}
```

See examples.
```console
$ cargo run -- --help
YUKI-based Next-generation Async-cache
Usage: yukina [OPTIONS] --name <NAME> --log-path <LOG_PATH> --repo-path <REPO_PATH> --size-limit <SIZE_LIMIT> --url <URL>
Options:
      --name <NAME>
          Repo name, used for finding the log file and downloading from remote
      --log-path <LOG_PATH>
          Directory of nginx logs
      --repo-path <REPO_PATH>
          Directory of the repo
      --dry-run
          Don't really download or remove anything, just show what would be done. (HEAD requests are still sent.)
      --log-duration <LOG_DURATION>
          Log items to check. Access log entries older than log_duration will be ignored [default: 7d]
      --user-agent <USER_AGENT>
          User agent to use [default: "yukina (https://github.com/taoky/yukina)"]
      --size-limit <SIZE_LIMIT>
          Size limit of your repo
      --filter <FILTER>
          Filter for URLs and file paths you are interested in (usually blobs of the repo). Relative to repo_path
      --url <URL>
          URL of the remote repo. A URL must still be given (it will not be used) when --gc-only is set
      --strip-prefix <STRIP_PREFIX>
          Optional prefix to strip from the path after the repo name. Access URLs must match strip_prefix if set
      --remote-sizedb <REMOTE_SIZEDB>
          A kv database of file sizes to speed up stage 3 when yukina runs frequently
      --local-sizedb <LOCAL_SIZEDB>
          Another kv database of file sizes, but for local files, to skip lstat()s
      --size-database-ttl <SIZE_DATABASE_TTL>
          Size database miss TTL [default: 2d]
      --filesize-limit <FILESIZE_LIMIT>
          Single-file size limit; files larger than this will NOT be counted or downloaded [default: 4g]
      --min-vote-count <MIN_VOTE_COUNT>
          Minimum vote count to consider a file as a candidate [default: 2]
      --retry <RETRY>
          Retry count for each request [default: 3]
      --extension <EXTENSION>
          Extension for specific repo types [possible values: nix-channels, freebsd-pkg]
      --aggressive-removal
          Aggressively remove all files not accessed during log_duration, instead of just keeping the repo within the size threshold
      --gc-only
          Don't download anything, just remove unpopular files
      --download-error-threshold <DOWNLOAD_ERROR_THRESHOLD>
          Error threshold for downloads. If the number of download errors exceeds this threshold, yukina will exit with error code 1. Setting this to 0 disables the early exit [default: 5]
      --log-format <LOG_FORMAT>
          Format of the log file. If not set, the combined log format (nginx's default) is used [default: combined] [possible values: combined, mirror-json]
  -h, --help
          Print help
  -V, --version
          Print version
```
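A hypothetical invocation matching the pypi nginx example above (every path, the size limit, and the filter pattern here are illustrative assumptions; adjust them to your deployment):

```console
$ yukina --name pypi \
    --log-path /var/log/nginx/cacheproxy \
    --repo-path /srv/repo/pypi \
    --size-limit 500g \
    --filter '^packages/' \
    --url https://mirrors.example.com/pypi/web/
```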
"Extension" is a special option for specific repo types:

- nix-channels: This extension parses narinfo files and adds the referenced nar blob URLs to the download list (see the sketch after this list).
- freebsd-pkg: This extension parses packagesite.txz, checks the sha256 of downloaded files, and creates hard links in the .by-hash folder to keep compatibility with our sync script.
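For context on the nix-channels extension: a narinfo file is a small key-value text file whose `URL:` line names the nar blob to fetch. A minimal sketch of extracting that field (not yukina's actual code):

```rust
/// Extract the `URL:` field from .narinfo contents, e.g.
/// "URL: nar/0abc...xyz.nar.xz" yields Some("nar/0abc...xyz.nar.xz").
fn nar_url(narinfo: &str) -> Option<&str> {
    narinfo.lines().find_map(|line| line.strip_prefix("URL: "))
}
```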
"Log format" option is used to specify the format of your nginx access log. "mirror-json" is configured like this:
set $proxied 0; # manually set to 1 inside 302/404 proxy_pass locations
log_format mirror_json escape=json '{'
    '"timestamp":$msec,'
    '"clientip":"$remote_addr",'
    '"serverip":"$server_addr",'
    '"method":"$request_method",'
    '"scheme":"$scheme",'
    '"url":"$request_uri",'
    '"status":$status,'
    '"size":$body_bytes_sent,'
    '"resp_time":$request_time,'
    '"http_host":"$host",'
    '"referer":"$http_referer",'
    '"user_agent":"$http_user_agent",'
    '"request_id":"$request_id",'
    '"proto":"$server_protocol",'
    '"proxied":"$proxied"'
    '}';
```
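Each mirror_json log line is a single JSON object. As a sketch, a matching Rust struct for reading such lines with serde (assuming serde with the derive feature; the field types are inferred from the log_format above and are not yukina's actual definitions):

```rust
use serde::Deserialize;

/// One mirror_json access-log line. Field names mirror the log_format
/// above; nginx emits the numeric fields unquoted and the rest quoted.
#[derive(Debug, Deserialize)]
struct MirrorJsonEntry {
    timestamp: f64, // $msec: seconds with millisecond resolution
    clientip: String,
    serverip: String,
    method: String,
    scheme: String,
    url: String,
    status: u16,
    size: u64,      // $body_bytes_sent
    resp_time: f64, // $request_time in seconds
    http_host: String,
    referer: String,
    user_agent: String,
    request_id: String,
    proto: String,
    proxied: String, // logged as the string "0" or "1"
}
```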
kv is a very simple wrapper around sled (the same version yukina uses) for inspecting the size databases yukina creates:

```console
$ cargo run --bin kv -- --help
Usage: kv <COMMAND>

Commands:
  get
  remove
  scan
  help    Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
```

See stats.
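Since kv is only a thin wrapper, you can also peek into a database directly with the sled crate. A hypothetical dump loop follows; the path and the value encoding (little-endian u64) are assumptions, so check them against real data:

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Example path: point this at one of yukina's size databases.
    let db = sled::open("remote-size.db")?;
    for entry in db.iter() {
        let (key, value) = entry?;
        // Assumption: sizes are stored as little-endian u64; if the
        // database uses another encoding, print the raw bytes instead.
        let size = <[u8; 8]>::try_from(value.as_ref())
            .ok()
            .map(u64::from_le_bytes);
        println!("{} -> {:?}", String::from_utf8_lossy(&key), size);
    }
    Ok(())
}
```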
"Yukina" means "YUKI-based Next-generation Async-cache"... OK, you might not find that very convincing, neither do I. Actually, this name comes from Yukina Minato, the vocalist of Roselia from BanG Dream! series.
And yuki is a mirror management tool used in-house at USTC Mirrors (another choice besides tunasync; try it if you need one!). This program does not actually require yuki (the yuki examples are given for your convenience), but I just wanted to make the pun.
I would like to give special thanks to SeanChao/mirror-cache, which was sponsored in OSPP 2021. Though we did not end up using that project, it "forced" me to rethink the design of a repository caching system, and I have learned a lot from it. If you need a more general-purpose cache, you might want to try it.