Fix CI integration test coldstart issues.#1230
Conversation
- Closes apple#1206. - Closes apple#507. - Addresses existing log messages for apple#642. - Nondeterministic CI errors are resulting from very slow launch times for the first runtime helper, which causes ContainersService to be locked for longer than our 20 sec timeout. Bumping the timeout to 60 seconds addresses this case for now. - Since many log messages needed to be changed to troubleshoot the issue, updated all log messages to use structured logging, and implemented consistent entry/exit logging for all service operations. - Added logging for ContainerService lock acquisition to help with finding root cause for the slow service startup. - Plumbed the `--debug` flag on both `container system start` and `container system logs` so that the flag is actually useful. - Updated the `install-init.sh` script so that can install in a custom app root directory.
| START_ARGS+=(--app-root "$2") | ||
| shift 2 | ||
| ;; | ||
| -l|--log-root) |
There was a problem hiding this comment.
Nothing uses this yet, but it will be used when the CI log collection PR posts.
|
|
||
| SWIFT="/usr/bin/swift" | ||
| IMAGE_NAME="vminit:latest" | ||
| DEST_DIR="${1:-$(git rev-parse --show-toplevel)/bin}" |
There was a problem hiding this comment.
This is dead code, removed it
| bin/container system stop && sleep 3 && bin/container system start && sleep 3 | ||
| bin/container system stop | ||
| sleep 3 | ||
| bin/container --debug system start "${START_ARGS[@]}" |
There was a problem hiding this comment.
This is the essential change to the script
| var args = ["log"] | ||
| args.append(self.follow ? "stream" : "show") | ||
| args.append(contentsOf: ["--info", "--debug"]) | ||
| args.append(contentsOf: ["--info", logOptions.debug ? "--debug" : nil].compactMap { $0 }) |
There was a problem hiding this comment.
Now we only pass debug into log stream for container --debug system logs.
| return .seconds(timeoutSeconds) | ||
| } | ||
| ) | ||
| var timeout: Duration = XPCClient.xpcRegistrationTimeout |
There was a problem hiding this comment.
Key point here is that we use the common timeout value now.
| env[ApplicationRoot.environmentName] = appRoot.path(percentEncoded: false) | ||
| env[InstallRoot.environmentName] = installRoot.path(percentEncoded: false) | ||
|
|
||
| let processedArgs = (args ?? ["start"]) + (debug ? ["--debug"] : []) |
There was a problem hiding this comment.
This is the main change, adds --debug to the plugin startup as necessary.
The implication is that all plugins need to implement a start subcommand that accepts a --debug flag.
| /// requests only have milliseconds of overhead, but in some instances, | ||
| /// macOS can take 5 seconds (or considerably longer) to launch a | ||
| /// service after it has been registered. | ||
| public static let xpcRegistrationTimeout: Duration = .seconds(60) |
There was a problem hiding this comment.
Our common timeout value, chosen to work with our slow service startup in CI.
| let commandName = Self.configuration.commandName ?? "container-apiserver" | ||
| let log = APIServer.setupLogger(debug: debug) | ||
| log.info("starting \(commandName)") | ||
| log.info("starting helper", metadata: ["name": "\(commandName)"]) |
There was a problem hiding this comment.
This pattern is used for startup/shutdown for all services.
| let pluginLoader = try initializePluginLoader(log: log) | ||
| try await initializePlugins(pluginLoader: pluginLoader, log: log, routes: &routes) | ||
| let containersService = try initializeContainerService( | ||
| let containersService = try initializeContainersService( |
There was a problem hiding this comment.
Just some name nitpickery for "containers service" and "networks service".
| containersService: containersService, | ||
| log: log | ||
| log: log, | ||
| debugHelpers: debug |
There was a problem hiding this comment.
Propagate debug settings from API server to any network helpers we create.
| pluginLoader: pluginLoader, | ||
| log: log | ||
| log: log, | ||
| debugHelpers: debug |
There was a problem hiding this comment.
Propagate debug settings from API server to any runtime helpers we create.
| let containerRoot = appRoot.appendingPathComponent("containers") | ||
| try FileManager.default.createDirectory(at: containerRoot, withIntermediateDirectories: true) | ||
| self.exitMonitor = ExitMonitor(log: log) | ||
| self.lock = AsyncLock(log: log) |
There was a problem hiding this comment.
Set up debug logging for lock acquisition/release
| /// List containers matching the given filters. | ||
| public func list(filters: ContainerListFilters = .all) async throws -> [ContainerSnapshot] { | ||
| self.log.debug("\(#function)") | ||
| log.debug( |
There was a problem hiding this comment.
This is the pattern we use everywhere for entry/exit on our service operations.
- All information on entry metadata.
- Only identifying information on exit metadata.
debuglevel, excepttracefor potentially busy calls likeresize.
| try await lock.withLock { context in | ||
| public func withContainerList<T: Sendable>( | ||
| logMetadata: Logger.Metadata? = nil, | ||
| _ operation: @Sendable @escaping ([ContainerSnapshot]) async throws -> T |
There was a problem hiding this comment.
Allows us to trace locks from callers in other services using withContainerList(logMetadata:) { ... }
| /// - Returns: Tuple of (total count, active count, total size, reclaimable size) | ||
| public func calculateDiskUsage() async -> (Int, Int, UInt64, UInt64) { | ||
| await lock.withLock { _ in | ||
| await lock.withLock(logMetadata: ["acquirer": "\(#function)"]) { _ in |
There was a problem hiding this comment.
Our pattern for tracking ContainersService lock usage.
| "--uuid", configuration.id, | ||
| "--debug", | ||
| ] | ||
| debug ? "--debug" : nil, |
There was a problem hiding this comment.
Propagate debug flag from API service to runtime helper.
| "--mode", | ||
| configuration.mode.rawValue, | ||
| ] | ||
| if debugHelpers { |
There was a problem hiding this comment.
Propagate debug flag from API server to network helpers.
| import Foundation | ||
| import Testing | ||
|
|
||
| @Suite(.serialized) |
There was a problem hiding this comment.
Turn these tests back on and run them in parallel again.
| CODESIGN_OPTS ?= --force --sign - --timestamp=none | ||
|
|
||
| # Conditionally use a temporary data directory for integration tests | ||
| ifeq ($(strip $(APP_ROOT)),) |
There was a problem hiding this comment.
Don't do this anymore since we'll want to add other options (--log-root) soon. Make it so we can accumulate startup options.
--debug. #507.--debugflag on bothcontainer system startandcontainer system logsso that the flag is actually useful.install-init.shscript so that can install in a custom app root directory.