YUKI-based Next-generation Async-cache
Yukina works in three stages:

- Stage 1: Get nginx logs for some days (7 days by default), filter for the interesting requests, and collect their "popularity" (vote count; see the sketch below).
- Stage 2: Get metadata of the interesting local files.
- Stage 3: Remove files that are not "popular", and try to fetch new files while staying under the size limit.

TODO:

- Add examples in yuki configuration
- Eliminate the need for lstat()s in stage 2 (a bit slow for nix-channels)
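To make stage 1 concrete, here is a minimal, hypothetical sketch of the vote-counting idea (not yukina's actual code): tally how often each path shows up in combined-format access log lines.

```rust
use std::collections::HashMap;

/// Tally requests per path from nginx "combined" access log lines.
/// The real stage 1 would also apply the time window, status checks
/// and the user-supplied --filter before counting.
fn count_votes(lines: impl Iterator<Item = String>) -> HashMap<String, u64> {
    let mut votes: HashMap<String, u64> = HashMap::new();
    for line in lines {
        // combined format: ip - user [time] "METHOD /path HTTP/x" status size ...
        if let Some(request) = line.split('"').nth(1) {
            if let Some(path) = request.split_whitespace().nth(1) {
                *votes.entry(path.to_owned()).or_default() += 1;
            }
        }
    }
    votes
}
```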
An example, assuming that `access_log` is in the default combined format:

```nginx
location /pypi/web/ {
    rewrite ^/pypi/web/(.*)$ /pypi/$1 permanent;
}

# Don't normalize these special index files
location = /pypi/web/simple/index.html {}
location = /pypi/web/simple/index.v1_html {}
location = /pypi/web/simple/index.v1_json {}

location ~ ^/pypi/simple/[^/]*([A-Z]|_|\.)[^/]* {
    # local package_name = ngx.var.uri:match("/pypi/simple/(.+)")
    # if package_name and package_name ~= "index.html" then
    #     -- Normalize the package name per PEP 503
    #     local normalized = package_name:gsub("[-_.]+", "-"):lower()
    #     return ngx.redirect("/pypi/simple/" .. normalized, ngx.HTTP_MOVED_TEMPORARILY)
    # end
    rewrite_by_lua_file /etc/nginx/lua/pypi_normalize.lua;
}

location ~ ^/pypi/[^/]*([A-Z]|_|\.)[^/]*/json {
    # local package_name = ngx.var.uri:match("/pypi/(.+)/json")
    # if package_name then
    #     -- Normalize the package name per PEP 503
    #     local normalized = package_name:gsub("[-_.]+", "-"):lower()
    #     return ngx.redirect("/pypi/" .. normalized .. "/json", ngx.HTTP_MOVED_TEMPORARILY)
    # end
    rewrite_by_lua_file /etc/nginx/lua/pypi_normalize.lua;
}

location ~ ^/pypi/[^/]+/json$ {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    rewrite ^/pypi/([^/]+)/json$ /pypi/json/$1 break;
    types { }
    default_type "application/json; charset=utf-8";
}

location ~ ^/pypi/simple {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    # conf.d/pypi.conf:
    # map $http_accept $pypi_mirror_suffix {
    #     default ".html";
    #     "~*application/vnd\.pypi\.simple\.v1\+json" ".v1_json";
    #     "~*application/vnd\.pypi\.simple\.v1\+html" ".v1_html";
    #     "~*text/html" ".html";
    # }
    index index$pypi_mirror_suffix index.html;
    types {
        application/vnd.pypi.simple.v1+json v1_json;
        application/vnd.pypi.simple.v1+html v1_html;
        text/html html;
    }
    default_type "text/html";
    # try_files $uri$pypi_mirror_suffix $uri $uri/ @pypi_302;
    try_files $uri$pypi_mirror_suffix $uri $uri/ =404;
}

location /pypi/packages/ {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    try_files $uri $uri/ @pypi_302;
}

location /pypi/json/ {
    autoindex off;
}

location @pypi_302 {
    access_log /var/log/nginx/cacheproxy/pypi.log;
    # -> $scheme://mirrors.example.com/pypi/...
    rewrite ^/pypi/(.*)$ $scheme://mirrors.example.com/pypi/web/$1 redirect;
}

location /anaconda/cloud/ {
    access_log /var/log/nginx/cacheproxy/anaconda.log;
    try_files $uri $uri/ @anaconda_302;
}

location @anaconda_302 {
    access_log /var/log/nginx/cacheproxy/anaconda.log;
    # -> $scheme://mirrors.example.com/anaconda/...
    return 302 $scheme://mirrors.example.com$request_uri;
}

location /nix-channels/store/ {
    access_log /var/log/nginx/cacheproxy/nix-channels.log;
    # disable autoindex, there are TOO MANY files
    autoindex off;
    try_files $uri $uri/ @nixchannels_404;
}

location @nixchannels_404 {
    access_log /var/log/nginx/cacheproxy/nix-channels.log;
    # just return 404, nix knows how to handle it
    return 404;
}

location ~ ^/flathub/(objects|deltas|delta-indexes)/ {
    access_log /var/log/nginx/cacheproxy/flathub.log;
    autoindex off;
    try_files $uri $uri/ @flathub_302;
}

location @flathub_302 {
    access_log /var/log/nginx/cacheproxy/flathub.log;
    rewrite ^/flathub/(.*)$ $scheme://dl.flathub.org/repo/$1 redirect;
}

location /freebsd-pkg/ {
    location ~ ^/freebsd-pkg/.+\.pkg {
        access_log /var/log/nginx/cacheproxy/freebsd-pkg.log;
        try_files $uri $uri/ @freebsd_pkg_302;
    }
}

location @freebsd_pkg_302 {
    access_log /var/log/nginx/cacheproxy/freebsd-pkg.log;
    # FreeBSD pkg server only supports HTTP.
    rewrite ^/freebsd-pkg/(.*)$ http://pkg.freebsd.org/$1 redirect;
}
```

See examples.
```console
$ cargo run -- --help
YUKI-based Next-generation Async-cache
Usage: yukina [OPTIONS] --name <NAME> --log-path <LOG_PATH> --repo-path <REPO_PATH> --size-limit <SIZE_LIMIT> --url <URL>
Options:
      --name <NAME>
          Repo name, used for finding the log file and downloading from remote
      --log-path <LOG_PATH>
          Directory of nginx logs
      --repo-path <REPO_PATH>
          Directory of the repo
      --dry-run
          Don't really download or remove anything, just show what would be done. (HEAD requests are still sent.)
      --log-duration <LOG_DURATION>
          Log items to check. Access log entries older than log_duration will be ignored [default: 7d]
      --user-agent <USER_AGENT>
          User agent to use [default: "yukina (https://github.com/taoky/yukina)"]
      --size-limit <SIZE_LIMIT>
          Size limit of your repo
      --filter <FILTER>
          Filter for URLs and file paths you are interested in (usually blobs of the repo). Relative to repo_path
      --url <URL>
          URL of the remote repo. A URL must still be given (it will not be used) when --gc-only is set
      --strip-prefix <STRIP_PREFIX>
          Optional prefix to strip from the path after the repo name. Access URLs must match strip_prefix if set
      --remote-sizedb <REMOTE_SIZEDB>
          A kv database of file sizes to speed up stage 3 when yukina runs frequently
      --local-sizedb <LOCAL_SIZEDB>
          Another kv database of file sizes, but for local files, to skip lstat()s
      --size-database-ttl <SIZE_DATABASE_TTL>
          Size database miss TTL [default: 2d]
      --filesize-limit <FILESIZE_LIMIT>
          Single-file size limit; files larger than this will NOT be counted or downloaded [default: 4g]
      --min-vote-count <MIN_VOTE_COUNT>
          Minimum vote count to consider a file as a candidate [default: 2]
      --retry <RETRY>
          Retry count for each request [default: 3]
      --extension <EXTENSION>
          Extension for specific repo types [possible values: nix-channels, freebsd-pkg]
      --aggressive-removal
          Aggressively remove all files not accessed during log_duration, instead of just keeping the repo within the size threshold
      --gc-only
          Don't download anything, just remove unpopular files
      --download-error-threshold <DOWNLOAD_ERROR_THRESHOLD>
          Error threshold for downloads. If the number of download errors exceeds this threshold, yukina will exit with error code 1. Setting this to 0 disables the early exit [default: 5]
      --log-format <LOG_FORMAT>
          Format of the log file. If not set, the combined log format (nginx's default) is used [default: combined] [possible values: combined, mirror-json]
  -h, --help
          Print help
  -V, --version
          Print version
```
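A hypothetical invocation matching the pypi nginx example above (every path, the size limit, and the filter pattern here are illustrative assumptions; adjust them to your deployment):

```console
$ yukina --name pypi \
    --log-path /var/log/nginx/cacheproxy \
    --repo-path /srv/repo/pypi \
    --size-limit 500g \
    --filter '^packages/' \
    --url https://mirrors.example.com/pypi/web/
```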
"Extension" is a special option for specific repo types:

- nix-channels: This extension parses narinfo files and adds the referenced nar blob URLs to the download list (see the sketch after this list).
- freebsd-pkg: This extension parses packagesite.txz, checks the sha256 of downloaded files, and creates hard links in the .by-hash folder to keep compatibility with our sync script.
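For context on the nix-channels extension: a narinfo file is a small key-value text file whose `URL:` line names the nar blob to fetch. A minimal sketch of extracting that field (not yukina's actual code):

```rust
/// Extract the `URL:` field from .narinfo contents, e.g.
/// "URL: nar/0abc...xyz.nar.xz" yields Some("nar/0abc...xyz.nar.xz").
fn nar_url(narinfo: &str) -> Option<&str> {
    narinfo.lines().find_map(|line| line.strip_prefix("URL: "))
}
```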
"Log format" option is used to specify the format of your nginx access log. "mirror-json" is configured like this:
set $proxied 0; # manually set to 1 inside 302/404 proxy_pass locations
log_format mirror_json escape=json '{'
    '"timestamp":$msec,'
    '"clientip":"$remote_addr",'
    '"serverip":"$server_addr",'
    '"method":"$request_method",'
    '"scheme":"$scheme",'
    '"url":"$request_uri",'
    '"status":$status,'
    '"size":$body_bytes_sent,'
    '"resp_time":$request_time,'
    '"http_host":"$host",'
    '"referer":"$http_referer",'
    '"user_agent":"$http_user_agent",'
    '"request_id":"$request_id",'
    '"proto":"$server_protocol",'
    '"proxied":"$proxied"'
    '}';
```
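Each mirror_json log line is a single JSON object. As a sketch, a matching Rust struct for reading such lines with serde (assuming serde with the derive feature; the field types are inferred from the log_format above and are not yukina's actual definitions):

```rust
use serde::Deserialize;

/// One mirror_json access-log line. Field names mirror the log_format
/// above; nginx emits the numeric fields unquoted and the rest quoted.
#[derive(Debug, Deserialize)]
struct MirrorJsonEntry {
    timestamp: f64, // $msec: seconds with millisecond resolution
    clientip: String,
    serverip: String,
    method: String,
    scheme: String,
    url: String,
    status: u16,
    size: u64,      // $body_bytes_sent
    resp_time: f64, // $request_time in seconds
    http_host: String,
    referer: String,
    user_agent: String,
    request_id: String,
    proto: String,
    proxied: String, // logged as the string "0" or "1"
}
```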
kv is a very simple wrapper around sled (the same version yukina uses) for inspecting the size databases yukina creates:

```console
$ cargo run --bin kv -- --help
Usage: kv <COMMAND>

Commands:
  get
  remove
  scan
  help    Print this message or the help of the given subcommand(s)

Options:
  -h, --help  Print help
```

See stats.
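Since kv is only a thin wrapper, you can also peek into a database directly with the sled crate. A hypothetical dump loop follows; the path and the value encoding (little-endian u64) are assumptions, so check them against real data:

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Example path: point this at one of yukina's size databases.
    let db = sled::open("remote-size.db")?;
    for entry in db.iter() {
        let (key, value) = entry?;
        // Assumption: sizes are stored as little-endian u64; if the
        // database uses another encoding, print the raw bytes instead.
        let size = <[u8; 8]>::try_from(value.as_ref())
            .ok()
            .map(u64::from_le_bytes);
        println!("{} -> {:?}", String::from_utf8_lossy(&key), size);
    }
    Ok(())
}
```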
"Yukina" means "YUKI-based Next-generation Async-cache"... OK, you might not find that very convincing, neither do I. Actually, this name comes from Yukina Minato, the vocalist of Roselia from BanG Dream! series.
And yuki is a mirror management tool used in-house at USTC Mirrors (another choice besides tunasync; try it if you need one!). This program does not actually require yuki (the yuki examples are given for your convenience), but I just wanted to make the pun.
I would like to give special thanks to SeanChao/mirror-cache, which was sponsored in OSPP 2021. Though we did not end up using that project, it "forced" me to rethink the design of a repository caching system, and I have learned a lot from it. If you need a more general-purpose cache, you might want to try it.