We are seeing increased noise in benchmarks (particularly incr-unchanged). I suspect this is in part because Spans have a hash that depends on the commit hash (since we remap paths to include the commit hash). The statistical significance algorithm is not yet ignoring this noise, because the noise was introduced fairly recently.
I suspect we'll want to do one of these:
- Disable the verification entirely under perf.rlo (needs rustc work)
- Fully enable the verification under perf.rlo (just a flag, but comes at a performance hit)
- Just accept the increased noise levels. They should fairly soon be less impactful, because the last 50 commits will all be after the noise started in earnest, presuming our algorithm works.
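
To illustrate why the third option should resolve itself over time, here is a minimal sketch of a windowed significance check. It is not the actual perf.rlo algorithm: the fence rule (`Q3 + k * IQR`), the value of `k`, the function names, and the synthetic data are all assumptions made for illustration. Only the window of 50 commits comes from the description above.

```python
from statistics import quantiles

WINDOW = 50  # number of most recent commits considered (from the text above)


def significance_threshold(history, k=3.0):
    """Return an upper fence Q3 + k * IQR over the last WINDOW changes.

    `history` holds absolute relative changes between consecutive commits.
    The fence rule and `k` are illustrative assumptions, not the real algorithm.
    """
    recent = history[-WINDOW:]
    q1, _, q3 = quantiles(recent, n=4)
    return q3 + k * (q3 - q1)


def is_significant(change, history, k=3.0):
    """Flag a change only if it exceeds what the recent window treats as noise."""
    return abs(change) > significance_threshold(history, k)


# Synthetic data: a quiet period followed by a noisier period.
quiet = [0.001 * (1 + i % 3) for i in range(50)]  # changes of 0.1% to 0.3%
noisy = [0.004 * (1 + i % 3) for i in range(50)]  # changes of 0.4% to 1.2%

change = 0.011  # a 1.1% change, typical of the noisy period

# Window still holds only quiet commits: the noisy change gets flagged.
flagged_early = is_significant(change, quiet)
# Window now holds only commits from the noisy period: the same change is ignored.
flagged_later = is_significant(change, quiet + noisy)
```

Under these assumptions, the same 1.1% change is flagged while the window is dominated by pre-noise commits and stops being flagged once all 50 commits in the window postdate the start of the noise. The cost is that genuine regressions of similar size would also be hidden.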