OpenTelemetry minor improvements#32324
Conversation
…gzip compression and span customization.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #32324 +/- ##
==========================================
- Coverage 64.03% 64.02% -0.02%
==========================================
Files 1987 1987
Lines 194542 194650 +108
Branches 6481 6481
==========================================
+ Hits 124572 124618 +46
- Misses 60254 60311 +57
- Partials 9716 9721 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…se `FleetConfig` for configuration instead of `OsqueryConfig`.
…c tasks and add OpenTelemetry tracing to `FlushHostsLastSeen`. Adjust batch span processor for OTEL exporter with reduced batch size.
|
@ksykulev assigning to you for review since you've been involved in this effort. I will file a separate issue for instrumenting the missing API endpoints. |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
WalkthroughAdds OpenTelemetry tracing to async tasks (labels, policies, host last seen, scheduled query stats), updates HTTP middleware span naming, enables gzip compression for OTLP gRPC, adjusts exporter batch size, and changes async.NewTask to accept a pointer to FleetConfig. Introduces FleetConfig.OTELEnabled(). Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Cron as Async Scheduler
participant Async as Async Task
participant DS as Datastore
participant OTEL as OpenTelemetry
rect rgba(230,240,255,0.5)
note over Async: New/changed
Cron->>Async: Run collect_* task
alt OTEL enabled
Async->>OTEL: Start span "async.collect_*" (attr: async.task=…)
Async->>DS: Read pending work
DS-->>Async: Results
Async->>DS: Write updates
DS-->>Async: Ack
Async->>OTEL: End span
else OTEL disabled
Async->>DS: Read/write without span
end
end
sequenceDiagram
autonumber
participant Client as HTTP Client
participant Mux as Router (otel mw)
participant Handler as API Handler
participant OTEL as OpenTelemetry
Client->>Mux: HTTP request
note over Mux: New span name "METHOD ROUTE"
Mux->>OTEL: Start server span
Mux->>Handler: ServeHTTP(ctx with span)
Handler-->>Mux: Response
Mux->>OTEL: End server span
Mux-->>Client: HTTP response
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Assessment against linked issues
Assessment against linked issues: Out-of-scope changes
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (17)
server/config/config.go (1)
633-636: Add pointer receiver and normalize tracing type checkThis helper looks good, but two small refinements will make it more robust and cheaper to call:
- Avoid copying the large FleetConfig by switching to a pointer receiver.
- Normalize the tracing type to prevent case-sensitivity pitfalls and centralize the literal.
Apply:
-func (f FleetConfig) OTELEnabled() bool { - return f.Logging.TracingEnabled && f.Logging.TracingType == "opentelemetry" -} +func (f *FleetConfig) OTELEnabled() bool { + return f.Logging.TracingEnabled && strings.EqualFold(f.Logging.TracingType, TracingTypeOpenTelemetry) +}Outside this hunk (for clarity), define the constant near the LoggingConfig type:
const TracingTypeOpenTelemetry = "opentelemetry"This also enables consistent reuse across packages.
server/service/handler.go (3)
118-130: Use FleetConfig.OTELEnabled() for consistency and reduce duplicationYou already introduced FleetConfig.OTELEnabled(). Use it here to avoid repeating the enablement logic and string comparisons.
Apply:
- if config.Logging.TracingEnabled { - if config.Logging.TracingType == "opentelemetry" { + if config.OTELEnabled() { r.Use(otmiddleware.Middleware( - "service", + "service", otmiddleware.WithSpanNameFormatter(func(route string, r *http.Request) string { // Use the guideline for span names: {method} {target} // See https://opentelemetry.io/docs/specs/semconv/http/http-spans/ - return r.Method + " " + route + if route == "" { + route = r.URL.Path // fallback if mux route template isn't available + } + return r.Method + " " + route }))) - } else { + } else { apmgorilla.Instrument(r) - } }
120-126: Fallback to URL path when route template is emptySpanNameFormatter receives an empty route for unmatched or unnamed routes. Without a fallback, spans could end up with names like "GET " which hampers analysis.
If you prefer a minimal fix without changing the outer conditions, keep just the inner formatter change:
- return r.Method + " " + route + if route == "" { + route = r.URL.Path + } + return r.Method + " " + route
120-126: Consider a more descriptive instrumentation scope nameThe first argument to otelmux.Middleware is the instrumentation scope name, not service.name. "service" is quite generic. Consider a stable, package-like name (e.g., "github.com/fleetdm/fleet/server/service") to make analytics clearer across libraries.
Example:
- r.Use(otmiddleware.Middleware("service", ...)) + r.Use(otmiddleware.Middleware("github.com/fleetdm/fleet/server/service", ...))server/service/testing_utils.go (1)
97-97: Optionally pass the test FleetConfig to exercise OTEL gating in testsIf you intend to validate OTEL-gated behavior in these tests, consider passing &fleetConfig (or a purpose-built cfg) instead of nil so tasks inherit the correct OTel enablement. Keeping nil is OK if you want to consistently disable spans in tests.
Example:
- task := async.NewTask(ds, nil, c, nil) + task := async.NewTask(ds, nil, c, &fleetConfig)server/service/async/async_policy.go (2)
16-19: Tracer scope naming and reuse (nit)Using otel.Tracer("async") works, but consider defining a package-level tracer with the module path for clearer instrumentation scope and to avoid repeated Tracer lookups.
Add once in this package (or a shared async package file):
var otelTracer = otel.Tracer("github.com/fleetdm/fleet/v4/server/service/async")Then use otelTracer.Start(...) in collectors.
127-137: Add useful span attributes; keep error-free path low overheadGood, minimal OTEL root span. You can cheaply enrich it with the collector stats so traces are more actionable without logs.
Apply this diff to include key counters as span attributes when the span ends:
- defer span.End() + defer func() { + // Attach final counters to aid debugging and SLOs + span.SetAttributes( + attribute.Int("async.stats.keys", stats.Keys), + attribute.Int("async.stats.items", stats.Items), + attribute.Int("async.stats.inserts", stats.Inserts), + attribute.Int("async.stats.updates", stats.Updates), + attribute.Int("async.stats.deletes", stats.Deletes), + attribute.Int("async.stats.redis_cmds", stats.RedisCmds), + ) + span.End() + }()If you want to capture failures as well, consider switching to a named return
errand recording it in that same defer withspan.RecordError(err)andspan.SetStatus(codes.Error, ...)(requires importinggo.opentelemetry.io/otel/codes). This is optional and can be done later.server/service/async/async_label.go (2)
13-16: Tracer scope naming and reuse (nit)Same note as in async_policy.go: prefer a package-level tracer with the package path as the scope.
var otelTracer = otel.Tracer("github.com/fleetdm/fleet/v4/server/service/async")Use otelTracer.Start(...) below.
93-105: Expose collector metrics on the spanNice gated root span. Adding final counters as attributes helps triage hot spots in traces without reading logs.
Apply this diff to attach counters:
- defer span.End() + defer func() { + span.SetAttributes( + attribute.Int("async.stats.keys", stats.Keys), + attribute.Int("async.stats.items", stats.Items), + attribute.Int("async.stats.inserts", stats.Inserts), + attribute.Int("async.stats.updates", stats.Updates), + attribute.Int("async.stats.deletes", stats.Deletes), + attribute.Int("async.stats.redis_cmds", stats.RedisCmds), + ) + span.End() + }()Optional next step (future PR): record errors via a named return and
span.RecordError.server/service/async/async_scheduled_query_stats.go (2)
15-18: Tracer scope naming and reuse (nit)Same suggestion: define and reuse a package-level tracer with the package import path as the scope.
var otelTracer = otel.Tracer("github.com/fleetdm/fleet/v4/server/service/async")
97-107: Attach batch/save counters to the spanThe span wrapper looks good. Adding attributes for processed keys/items and write counts will make traces self-describing when diagnosing exporter backpressure or Redis sizing.
Apply this diff:
- defer span.End() + defer func() { + span.SetAttributes( + attribute.Int("async.stats.keys", stats.Keys), + attribute.Int("async.stats.items", stats.Items), + attribute.Int("async.stats.inserts", stats.Inserts), + attribute.Int("async.stats.updates", stats.Updates), + attribute.Int("async.stats.redis_cmds", stats.RedisCmds), + ) + span.End() + }()As in the other collectors, consider a named return to record errors later.
server/service/async/async_host_seen.go (2)
71-83: Record errors and set span status in the synchronous flush path.If MarkHostsSeen fails, the span currently won’t capture the error. Recommend recording errors and setting status to improve observability.
Apply this diff within FlushHostsLastSeen to capture errors:
- hostIDs := t.seenHostSet.getAndClearHostIDs() - return t.datastore.MarkHostsSeen(ctx, hostIDs, now) + hostIDs := t.seenHostSet.getAndClearHostIDs() + if err := t.datastore.MarkHostsSeen(ctx, hostIDs, now); err != nil { + if t.otelEnabled { + span.RecordError(err) + } + return err + } + return nil
94-105: Capture useful attributes and errors in the async collector span.Consider annotating the span with batch sizes and item counts, and recording errors from downstream calls (ds.MarkHostsSeen, Redis ops). This makes async task health much easier to diagnose.
Example (illustrative; place where values are available):
// After loading hostIDs span.SetAttributes( attribute.Int("async.items_total", len(hostIDs)), attribute.Int("async.insert_batch", int(cfg.InsertBatch)), ) // On error paths: span.RecordError(err)cmd/fleet/serve.go (1)
157-165: Graceful tracer shutdown and resource attribution.Two improvements:
- Ensure tracerProvider.Shutdown is called on process shutdown to flush spans.
- Attach a Resource (service.name, version, environment) so traces are attributed correctly across backends.
Example adjustments (conceptual):
res, _ := resource.New(ctx, resource.WithFromEnv(), resource.WithTelemetrySDK(), resource.WithHost(), resource.WithAttributes( semconv.ServiceNameKey.String("fleet"), semconv.ServiceVersionKey.String(version.Version().Version), ), ) tracerProvider := sdktrace.NewTracerProvider( sdktrace.WithResource(res), sdktrace.WithSpanProcessor(batchSpanProcessor), ) otel.SetTracerProvider(tracerProvider) // In the shutdown goroutine (before srv.Shutdown): _ = tracerProvider.Shutdown(ctx)Optionally, only initialize OTEL when config.OTELEnabled() is true to avoid initializing two tracing systems when TracingType != "opentelemetry".
server/service/async/async.go (2)
26-26: OTEL enablement captured at construction time — acceptable trade-off; consider dynamic read if hot-reloads are expectedStoring a snapshot boolean keeps per-collector decisions cheap. If Fleet supports live OTEL toggling without process restart, instrumentation gates won’t reflect changes until NewTask is reconstructed.
If hot-reloads are a requirement, store a pointer to FleetConfig (or a function) and read OTELEnabled() at use sites instead of copying the value. Otherwise, current approach is fine.
Also applies to: 50-50
30-38: Clarify nil FleetConfig disables async collectors & OpenTelemetry in NewTask docstringVerification confirms:
- A zero-value OsqueryConfig yields Enabled=false in AsyncConfigForTask, so no async collectors start when fleetConfig is nil.
- All call sites to NewTask (both qualified and unqualified) accept *config.FleetConfig (including nil) and remain correct.
- There are no direct Task composite literals elsewhere that would break with this API change.
Apply this docstring tweak for clarity:
-// NewTask configures and returns a Task. +// NewTask configures and returns a Task. // If fleetConfig is nil, async collectors default to disabled (via zero-value OsqueryConfig) // and OpenTelemetry tracing is disabled. func NewTask(ds fleet.Datastore, pool fleet.RedisPool, clck clock.Clock, fleetConfig *config.FleetConfig) *Task {server/service/async/async_scheduled_query_stats_test.go (1)
220-227: Minor consistency: include Update/Delete batch sizes for parityFor parity with earlier setup, consider specifying AsyncHostUpdateBatch and AsyncHostDeleteBatch too. It won’t change behavior here but keeps test intent uniform across cases.
Apply this small addition:
task := NewTask(ds, pool, clock.C, &config.FleetConfig{ Osquery: config.OsqueryConfig{ EnableAsyncHostProcessing: "true", AsyncHostInsertBatch: 3, + AsyncHostUpdateBatch: 3, + AsyncHostDeleteBatch: 3, AsyncHostRedisPopCount: 3, AsyncHostRedisScanKeysCount: 10, }, })
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (17)
changes/32313-otel-improvements(1 hunks)cmd/fleet/serve.go(2 hunks)server/config/config.go(1 hunks)server/service/async/async.go(1 hunks)server/service/async/async_host_seen.go(3 hunks)server/service/async/async_host_seen_test.go(3 hunks)server/service/async/async_label.go(2 hunks)server/service/async/async_label_test.go(4 hunks)server/service/async/async_policy.go(2 hunks)server/service/async/async_policy_test.go(6 hunks)server/service/async/async_scheduled_query_stats.go(2 hunks)server/service/async/async_scheduled_query_stats_test.go(3 hunks)server/service/handler.go(1 hunks)server/service/integration_core_test.go(1 hunks)server/service/osquery_test.go(1 hunks)server/service/osquery_utils/queries_test.go(1 hunks)server/service/testing_utils.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
⚙️ CodeRabbit configuration file
When reviewing SQL queries that are added or modified, ensure that appropriate filtering criteria are applied—especially when a query is intended to return data for a specific entity (e.g., a single host). Check for missing WHERE clauses or incorrect filtering that could lead to incorrect or non-deterministic results (e.g., returning the first row instead of the correct one). Flag any queries that may return unintended results due to lack of precise scoping.
Files:
server/service/integration_core_test.goserver/service/async/async_label.goserver/config/config.goserver/service/async/async_policy.goserver/service/async/async_scheduled_query_stats.goserver/service/osquery_utils/queries_test.goserver/service/testing_utils.goserver/service/handler.goserver/service/async/async_host_seen.goserver/service/osquery_test.goserver/service/async/async_host_seen_test.goserver/service/async/async.goserver/service/async/async_scheduled_query_stats_test.goserver/service/async/async_label_test.goserver/service/async/async_policy_test.gocmd/fleet/serve.go
🔇 Additional comments (22)
server/service/testing_utils.go (1)
97-97: Nil-safety confirmed for async.NewTaskThe implementation in
server/service/async/async.godefaults bothosqueryConfandotelEnabledto zero values, then only reads fromfleetConfiginside anif fleetConfig != nil { … }guard (lines 34–38). No nil-dereference is possible when passingnil, so no change is needed.changes/32313-otel-improvements (1)
1-1: Changelog entry reads clearly and matches PR scopeConcise, accurate summary of the OTEL additions, span naming change, compression, and batch sizing. Good to land as-is.
server/service/osquery_test.go (1)
461-461: I've requested theNewTasksignature to confirm expected parameters. Once we have its definition, we can verify that all call sites (intesting_utils.go,osquery_test.go,integration_core_test.go, andcmd/fleet/serve.go) match the updated signature (accepting*config.FleetConfigas the last argument).Please provide the output of the above script so I can ensure every call site is correct and then update the review comment accordingly.
server/service/async/async_host_seen.go (1)
14-16: LGTM on OTEL imports.Imports are minimal and correct for the added instrumentation.
cmd/fleet/serve.go (1)
407-407: *NewTask now takes config.FleetConfig — looks good.Passing &config aligns with the updated API and enables Task to derive both Osquery and OTEL enablement.
server/service/osquery_utils/queries_test.go (1)
108-111: *Test adjustments match NewTask(config.FleetConfig).Wrapper with FleetConfig{Osquery: …} is correct and backward-compatible for this test.
server/service/async/async_host_seen_test.go (3)
147-155: *Test updates align with NewTask(config.FleetConfig).Batch-related fields under Osquery are preserved; this keeps collector behavior identical under the new API.
219-225: Async path test updated correctly.EnableAsyncHostProcessing and Redis scan/pop config are passed via FleetConfig as expected.
182-182: Nil‐safety Verified in NewTaskThe
NewTaskimplementation inserver/service/async/async.goalready guards against a nilfleetConfig:
- Default values are set up before the conditional (lines 33–34).
- The code checks
if fleetConfig != nilbefore accessingfleetConfig.Osqueryor callingfleetConfig.OTELEnabled()(lines 35–38).Since these nil checks prevent any dereference when
fleetConfigis nil, no further changes are needed here.server/service/async/async_policy_test.go (6)
232-240: Collector test config migrated correctly to FleetConfig.Batch sizing and Redis parameters preserved.
282-290: Follow-up collector test config migration looks good.
362-371: Async-policy test config migration is correct.
474-483: Async “no policies” test updated correctly.Config moves are consistent with the new API.
319-319: NewTask correctly handles nil FleetConfig — no action needed.NewTask in server/service/async/async.go (lines 35–38) explicitly checks
fleetConfig != nilbefore accessing any of its fields, leavingosqueryConfat its zero-value when nil. Passingnilfor FleetConfig is therefore safe and will not cause a panic.
433-433: Nil FleetConfig is handled safely in NewTask—no changes required.NewTask explicitly checks for a nil fleetConfig before touching any of its fields, initializing a zero-value OsqueryConfig and disabling async processing by default. The “no policies sync” test’s use of NewTask(ds, pool, clock.C, nil) is therefore valid and safe.
server/service/async/async.go (1)
40-44: Task configs derived from zero-value OsqueryConfig when fleetConfig is nil — confirm disabled-by-default behaviorPopulating taskCfgs using osqueryConf.AsyncConfigForTask is clear. This relies on AsyncConfigForTask returning cfg.Enabled=false for a zero-value OsqueryConfig. If that assumption ever changes, collectors could start unintentionally when fleetConfig is nil.
Please confirm AsyncConfigForTask returns disabled configs for a zero-value OsqueryConfig with the script in my previous comment (it prints the implementation context). If it doesn’t, consider guarding with an early return when fleetConfig == nil:
- taskCfgs[config.AsyncTaskLabelMembership] = osqueryConf.AsyncConfigForTask(config.AsyncTaskLabelMembership) + if fleetConfig == nil { + // Nothing enabled when no fleetConfig provided. + return &Task{datastore: ds, pool: pool, clock: clck, taskConfigs: taskCfgs, otelEnabled: false} + } + taskCfgs[config.AsyncTaskLabelMembership] = osqueryConf.AsyncConfigForTask(config.AsyncTaskLabelMembership)server/service/async/async_scheduled_query_stats_test.go (2)
147-156: LGTM: Updated NewTask call to pass FleetConfig wrapperThe test now mirrors the production API change and keeps the async host processing knobs explicit. Using "true" (string) aligns with OsqueryConfig’s field type in this codebase.
180-180: LGTM: Sync path exercised with nil FleetConfigGood coverage of the nil-config path to ensure immediate persistence (no Redis fanout) when async is not enabled.
server/service/async/async_label_test.go (4)
208-216: LGTM: Test wiring updated to FleetConfig; leaves behavior unchangedThe wrapper preserves previous async behavior while aligning with the new NewTask signature.
257-265: LGTM: Second collector run uses same FleetConfig wrapperConsistent with the earlier case; good to validate update semantics independently.
294-294: LGTM: Sync path with nil FleetConfigCovers immediate MySQL path and ensures no Redis side-effects; assertions below confirm expected emptiness of Redis keys.
337-346: LGTM: Async path config carried via FleetConfigThe async path is correctly configured and validated via Redis assertions.
ksykulev
left a comment
There was a problem hiding this comment.
Good stuff. 2 non-blocking comments/questions.
| if t.otelEnabled { | ||
| tracer := otel.Tracer("async") | ||
| var span trace.Span | ||
| ctx, span = tracer.Start(ctx, "async.flush_hosts_last_seen", |
There was a problem hiding this comment.
Will the returned/wrapped ctx capture all the code that uses that ctx? Or do we have to manually add spans every place we want to capture data?
There was a problem hiding this comment.
Yes, the idea here is that this is a root span, and the SQL spans created by the library github.com/XSAM/otelsql will automatically be child spans of this span.
Generally, yes, we have add spans where we want them. This is made easier by libraries, middleware, or reusable methods. In this case, I considered factoring this out into its own method/wrapper, but didn't see that much benefit since there aren't that many async methods to instrument.
| otmiddleware.WithSpanNameFormatter(func(route string, r *http.Request) string { | ||
| // Use the guideline for span names: {method} {target} | ||
| // See https://opentelemetry.io/docs/specs/semconv/http/http-spans/ | ||
| return r.Method + " " + route |
There was a problem hiding this comment.
Think maybe fleet version might be valuable to add here? Or maybe we should add those to the spans as an attribute?
There was a problem hiding this comment.
Fleet version, like 4.72.0? Hmm, that would be a lot of traces with the same fleet version. Yes, any extra data can be added as attributes.
Fixes #32313
OpenTelemetry Tracing
Optimizations
NOTE: I tried to improve OTEL instrumentation for cron jobs, but it got too complicated due to goroutines in
schedule.goso that effort should be separate. We do have SQL instrumentation for cron jobs, but we are missing root spans for cron jobs as a whole.Checklist for submitter
If some of the following don't apply, delete the relevant line.
changes/,orbit/changes/oree/fleetd-chrome/changes.Testing
Summary by CodeRabbit
New Features
Bug Fixes