Skip to content

Fix CI integration test coldstart issues.#1230

Merged
jglogan merged 4 commits intoapple:mainfrom
jglogan:slow-launchd
Feb 19, 2026
Merged

Fix CI integration test coldstart issues.#1230
jglogan merged 4 commits intoapple:mainfrom
jglogan:slow-launchd

Conversation

@jglogan
Copy link
Contributor

@jglogan jglogan commented Feb 19, 2026

- Closes apple#1206.
- Closes apple#507.
- Addresses existing log messages for apple#642.
- Nondeterministic CI errors are resulting from
  very slow launch times for the first runtime
  helper, which causes ContainersService to be
  locked for longer than our 20 sec timeout.
  Bumping the timeout to 60 seconds addresses
  this case for now.
- Since many log messages needed to be changed
  to troubleshoot the issue, updated all log
  messages to use structured logging, and
  implemented consistent entry/exit logging for
  all service operations.
- Added logging for ContainerService lock
  acquisition to help with finding root cause
  for the slow service startup.
- Plumbed the `--debug` flag on both
  `container system start` and `container system logs`
  so that the flag is actually useful.
- Updated the `install-init.sh` script so that
  can install in a custom app root directory.
@github-actions github-actions bot added the cli label Feb 19, 2026
START_ARGS+=(--app-root "$2")
shift 2
;;
-l|--log-root)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing uses this yet, but it will be used when the CI log collection PR posts.


SWIFT="/usr/bin/swift"
IMAGE_NAME="vminit:latest"
DEST_DIR="${1:-$(git rev-parse --show-toplevel)/bin}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead code, removed it

bin/container system stop && sleep 3 && bin/container system start && sleep 3
bin/container system stop
sleep 3
bin/container --debug system start "${START_ARGS[@]}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the essential change to the script

var args = ["log"]
args.append(self.follow ? "stream" : "show")
args.append(contentsOf: ["--info", "--debug"])
args.append(contentsOf: ["--info", logOptions.debug ? "--debug" : nil].compactMap { $0 })
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we only pass debug into log stream for container --debug system logs.

return .seconds(timeoutSeconds)
}
)
var timeout: Duration = XPCClient.xpcRegistrationTimeout
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key point here is that we use the common timeout value now.

env[ApplicationRoot.environmentName] = appRoot.path(percentEncoded: false)
env[InstallRoot.environmentName] = installRoot.path(percentEncoded: false)

let processedArgs = (args ?? ["start"]) + (debug ? ["--debug"] : [])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main change, adds --debug to the plugin startup as necessary.

The implication is that all plugins need to implement a start subcommand that accepts a --debug flag.

/// requests only have milliseconds of overhead, but in some instances,
/// macOS can take 5 seconds (or considerably longer) to launch a
/// service after it has been registered.
public static let xpcRegistrationTimeout: Duration = .seconds(60)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our common timeout value, chosen to work with our slow service startup in CI.

let commandName = Self.configuration.commandName ?? "container-apiserver"
let log = APIServer.setupLogger(debug: debug)
log.info("starting \(commandName)")
log.info("starting helper", metadata: ["name": "\(commandName)"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is used for startup/shutdown for all services.

let pluginLoader = try initializePluginLoader(log: log)
try await initializePlugins(pluginLoader: pluginLoader, log: log, routes: &routes)
let containersService = try initializeContainerService(
let containersService = try initializeContainersService(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some name nitpickery for "containers service" and "networks service".

containersService: containersService,
log: log
log: log,
debugHelpers: debug
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug settings from API server to any network helpers we create.

pluginLoader: pluginLoader,
log: log
log: log,
debugHelpers: debug
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug settings from API server to any runtime helpers we create.

let containerRoot = appRoot.appendingPathComponent("containers")
try FileManager.default.createDirectory(at: containerRoot, withIntermediateDirectories: true)
self.exitMonitor = ExitMonitor(log: log)
self.lock = AsyncLock(log: log)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set up debug logging for lock acquisition/release

/// List containers matching the given filters.
public func list(filters: ContainerListFilters = .all) async throws -> [ContainerSnapshot] {
self.log.debug("\(#function)")
log.debug(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern we use everywhere for entry/exit on our service operations.

  • All information on entry metadata.
  • Only identifying information on exit metadata.
  • debug level, except trace for potentially busy calls like resize.

try await lock.withLock { context in
public func withContainerList<T: Sendable>(
logMetadata: Logger.Metadata? = nil,
_ operation: @Sendable @escaping ([ContainerSnapshot]) async throws -> T
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allows us to trace locks from callers in other services using withContainerList(logMetadata:) { ... }

/// - Returns: Tuple of (total count, active count, total size, reclaimable size)
public func calculateDiskUsage() async -> (Int, Int, UInt64, UInt64) {
await lock.withLock { _ in
await lock.withLock(logMetadata: ["acquirer": "\(#function)"]) { _ in
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our pattern for tracking ContainersService lock usage.

"--uuid", configuration.id,
"--debug",
]
debug ? "--debug" : nil,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug flag from API service to runtime helper.

"--mode",
configuration.mode.rawValue,
]
if debugHelpers {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug flag from API server to network helpers.

import Foundation
import Testing

@Suite(.serialized)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turn these tests back on and run them in parallel again.

CODESIGN_OPTS ?= --force --sign - --timestamp=none

# Conditionally use a temporary data directory for integration tests
ifeq ($(strip $(APP_ROOT)),)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't do this anymore since we'll want to add other options (--log-root) soon. Make it so we can accumulate startup options.

Copy link
Contributor

@ajemory ajemory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a few change suggestions -- I think it makes sense to make error/warning level logs with multiple metadata params span multiple lines, but not necessarily tied to the change. Let me know what you think. Otherwise, the changes look great. Thanks for tackling this!

// container collection above and a container run command that attaches to a
// network listed in the networksToPrune collection.
log.error("Failed to prune network \(network.id): \(error)")
log.error("failed to prune network", metadata: ["id": "\(network.id)", "error": "\(error)"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.error("failed to prune network", metadata: ["id": "\(network.id)", "error": "\(error)"])
log.error(
"failed to prune network",
metadata: [
"id": "\(network.id)",
"error": "\(error)",
])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reasonable. Let me do it in the follow-up (log collection PR) so we can start unblocking the PR backlog.

try await xpc.listen()
} catch {
log.error("\(commandName) failed", metadata: ["error": "\(error)"])
log.error("helper failed", metadata: ["name": "\(commandName)", "error": "\(error)"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.error("helper failed", metadata: ["name": "\(commandName)", "error": "\(error)"])
log.error(
"helper failed",
metadata: [
"name": "\(commandName)",
"error": "\(error)",
])

try await xpc.listen()
} catch {
log.error("\(commandName) failed", metadata: ["error": "\(error)"])
log.error("helper failed", metadata: ["name": "\(commandName)", "error": "\(error)"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.error("helper failed", metadata: ["name": "\(commandName)", "error": "\(error)"])
log.error(
"helper failed",
metadata: [
"name": "\(commandName)",
"error": "\(error)",
])

}
} catch {
log.error("\(commandName) failed", metadata: ["error": "\(error)"])
log.error("helper failed", metadata: ["name": "\(commandName)", "error": "\(error)"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.error("helper failed", metadata: ["name": "\(commandName)", "error": "\(error)"])
log.error(
"helper failed",
metadata: [
"name": "\(commandName)",
"error": "\(error)",
])

} catch {
try? FileManager.default.removeItem(at: dir)
log.warning("failed to load container at \(dir.path): \(error)")
log.warning("failed to load container", metadata: ["path": "\(dir.path)", "error": "\(error)"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.warning("failed to load container", metadata: ["path": "\(dir.path)", "error": "\(error)"])
log.warning(
"failed to load container",
metadata: [
"path": "\(dir.path)",
"error": "\(error)",
])

@jglogan jglogan merged commit 3ad0914 into apple:main Feb 19, 2026
3 checks passed
@jglogan jglogan deleted the slow-launchd branch February 19, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[chore]: Fix CLINetwork tests. [Bug]: Local vminit:latest install doesn't honor APP_ROOT. [Request]: Consistent, useful behavior for --debug.

2 participants