Skip to content

OpenTelemetry minor improvements#32324

Merged
getvictor merged 6 commits into
mainfrom
victor/32313-otel-issues
Aug 29, 2025
Merged

OpenTelemetry minor improvements#32324
getvictor merged 6 commits into
mainfrom
victor/32313-otel-issues

Conversation

@getvictor
Copy link
Copy Markdown
Member

@getvictor getvictor commented Aug 26, 2025

Fixes #32313

OpenTelemetry Tracing

  • Added tracing to async task collectors: FlushHostsLastSeen, collectHostsLastSeen, collectLabelQueryExecutions, collectPolicyQueryExecutions, collectScheduledQueryStats
  • Updated HTTP middleware to use OTEL semantic convention for span names ({method} {route})
  • Added OTELEnabled() helper to FleetConfig

Optimizations

  • Reduced OTEL batch size from 512 to 256 spans to prevent gRPC message size errors
  • Enabled gzip compression for trace exports

NOTE: I tried to improve OTEL instrumentation for cron jobs, but it got too complicated due to goroutines in schedule.go so that effort should be separate. We do have SQL instrumentation for cron jobs, but we are missing root spans for cron jobs as a whole.

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.

Testing

  • QA'd all new/changed functionality manually

Summary by CodeRabbit

  • New Features

    • Expanded OpenTelemetry tracing for async tasks (host last seen, label membership, policy membership, scheduled query stats) to provide richer observability.
    • More descriptive HTTP span names using “METHOD /route” for clearer trace analysis.
  • Bug Fixes

    • Improved OTLP gRPC exporter reliability by enabling gzip compression and reducing export batch size, mitigating intermittent gRPC errors.

@codecov
Copy link
Copy Markdown

codecov Bot commented Aug 26, 2025

Codecov Report

❌ Patch coverage is 25.00000% with 66 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.02%. Comparing base (99705cc) to head (f5f8add).
⚠️ Report is 55 commits behind head on main.

Files with missing lines Patch % Lines
server/service/async/async_host_seen.go 8.69% 19 Missing and 2 partials ⚠️
server/service/async/async_label.go 9.09% 9 Missing and 1 partial ⚠️
server/service/async/async_policy.go 9.09% 9 Missing and 1 partial ⚠️
...erver/service/async/async_scheduled_query_stats.go 9.09% 9 Missing and 1 partial ⚠️
cmd/fleet/serve.go 0.00% 8 Missing ⚠️
server/service/handler.go 0.00% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #32324      +/-   ##
==========================================
- Coverage   64.03%   64.02%   -0.02%     
==========================================
  Files        1987     1987              
  Lines      194542   194650     +108     
  Branches     6481     6481              
==========================================
+ Hits       124572   124618      +46     
- Misses      60254    60311      +57     
- Partials     9716     9721       +5     
Flag Coverage Δ
backend 65.28% <25.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…se `FleetConfig` for configuration instead of `OsqueryConfig`.
…c tasks and add OpenTelemetry tracing to `FlushHostsLastSeen`. Adjust batch span processor for OTEL exporter with reduced batch size.
@getvictor getvictor changed the title Enable OpenTelemetry tracing for cron jobs and HTTP requests, adding … OpenTelemetry minor improvements Aug 26, 2025
@getvictor getvictor marked this pull request as ready for review August 26, 2025 21:05
@getvictor getvictor requested a review from a team as a code owner August 26, 2025 21:05
@getvictor
Copy link
Copy Markdown
Member Author

@ksykulev assigning to you for review since you've been involved in this effort. I will file a separate issue for instrumenting the missing API endpoints.

@getvictor
Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 26, 2025

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 26, 2025

Walkthrough

Adds OpenTelemetry tracing to async tasks (labels, policies, host last seen, scheduled query stats), updates HTTP middleware span naming, enables gzip compression for OTLP gRPC, adjusts exporter batch size, and changes async.NewTask to accept a pointer to FleetConfig. Introduces FleetConfig.OTELEnabled().

Changes

Cohort / File(s) Summary
OTLP exporter tweaks
cmd/fleet/serve.go
Enable gzip compression for OTLP gRPC client; configure BatchSpanProcessor with MaxExportBatchSize=256.
Async task constructor signature
server/service/async/async.go, server/service/testing_utils.go, server/service/integration_core_test.go, server/service/osquery_test.go, server/service/osquery_utils/queries_test.go, server/service/async/async_*_test.go
Change NewTask last arg from config.OsqueryConfig to *config.FleetConfig (nil allowed); propagate to tests.
Config helper
server/config/config.go
Add FleetConfig.OTELEnabled() bool.
Async tracing instrumentation
server/service/async/async_host_seen.go, .../async_label.go, .../async_policy.go, .../async_scheduled_query_stats.go
Add OTEL spans to collect/flush functions; gated by Task.otelEnabled (derived from FleetConfig).
HTTP tracing middleware
server/service/handler.go
Switch to otel mux middleware with custom span name formatter: “METHOD ROUTE”; service name “service”.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Cron as Async Scheduler
  participant Async as Async Task
  participant DS as Datastore
  participant OTEL as OpenTelemetry

  rect rgba(230,240,255,0.5)
  note over Async: New/changed
  Cron->>Async: Run collect_* task
  alt OTEL enabled
    Async->>OTEL: Start span "async.collect_*" (attr: async.task=…)
    Async->>DS: Read pending work
    DS-->>Async: Results
    Async->>DS: Write updates
    DS-->>Async: Ack
    Async->>OTEL: End span
  else OTEL disabled
    Async->>DS: Read/write without span
  end
  end
Loading
sequenceDiagram
  autonumber
  participant Client as HTTP Client
  participant Mux as Router (otel mw)
  participant Handler as API Handler
  participant OTEL as OpenTelemetry

  Client->>Mux: HTTP request
  note over Mux: New span name "METHOD ROUTE"
  Mux->>OTEL: Start server span
  Mux->>Handler: ServeHTTP(ctx with span)
  Handler-->>Mux: Response
  Mux->>OTEL: End server span
  Mux-->>Client: HTTP response
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Assessment against linked issues

Objective Addressed Explanation
Identify/investigate current OTEL shortcomings and document findings (#32313) Code adds instrumentation/config but evidence of investigation or documentation is not visible in code changes.

Assessment against linked issues: Out-of-scope changes

Code Change Explanation
Enable gzip compression and adjust OTLP batch size (cmd/fleet/serve.go: multiple lines) Performance/configuration change; not part of “identify shortcomings” objective.
Add FleetConfig.OTELEnabled() (server/config/config.go: multiple lines) New helper method; not tied to identification of issues.
Add OTEL spans to async tasks (server/service/async/async_host_seen.go, async_label.go, async_policy.go, async_scheduled_query_stats.go: multiple lines) Implements tracing rather than identifying shortcomings.
Change async.NewTask API to accept *FleetConfig (server/service/async/async.go: multiple lines) API refactor to pass broader config; outside scope of identification.
Change HTTP middleware span naming (server/service/handler.go: multiple lines) Instrumentation improvement; not an identification activity.

Suggested reviewers

  • sgress454

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch victor/32313-otel-issues

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (17)
server/config/config.go (1)

633-636: Add pointer receiver and normalize tracing type check

This helper looks good, but two small refinements will make it more robust and cheaper to call:

  • Avoid copying the large FleetConfig by switching to a pointer receiver.
  • Normalize the tracing type to prevent case-sensitivity pitfalls and centralize the literal.

Apply:

-func (f FleetConfig) OTELEnabled() bool {
-	return f.Logging.TracingEnabled && f.Logging.TracingType == "opentelemetry"
-}
+func (f *FleetConfig) OTELEnabled() bool {
+	return f.Logging.TracingEnabled && strings.EqualFold(f.Logging.TracingType, TracingTypeOpenTelemetry)
+}

Outside this hunk (for clarity), define the constant near the LoggingConfig type:

const TracingTypeOpenTelemetry = "opentelemetry"

This also enables consistent reuse across packages.

server/service/handler.go (3)

118-130: Use FleetConfig.OTELEnabled() for consistency and reduce duplication

You already introduced FleetConfig.OTELEnabled(). Use it here to avoid repeating the enablement logic and string comparisons.

Apply:

-	if config.Logging.TracingEnabled {
-		if config.Logging.TracingType == "opentelemetry" {
+	if config.OTELEnabled() {
 			r.Use(otmiddleware.Middleware(
-				"service",
+				"service",
 				otmiddleware.WithSpanNameFormatter(func(route string, r *http.Request) string {
 					// Use the guideline for span names: {method} {target}
 					// See https://opentelemetry.io/docs/specs/semconv/http/http-spans/
-					return r.Method + " " + route
+					if route == "" {
+						route = r.URL.Path // fallback if mux route template isn't available
+					}
+					return r.Method + " " + route
 				})))
-		} else {
+	} else {
 			apmgorilla.Instrument(r)
-		}
 	}

120-126: Fallback to URL path when route template is empty

SpanNameFormatter receives an empty route for unmatched or unnamed routes. Without a fallback, spans could end up with names like "GET " which hampers analysis.

If you prefer a minimal fix without changing the outer conditions, keep just the inner formatter change:

-					return r.Method + " " + route
+					if route == "" {
+						route = r.URL.Path
+					}
+					return r.Method + " " + route

120-126: Consider a more descriptive instrumentation scope name

The first argument to otelmux.Middleware is the instrumentation scope name, not service.name. "service" is quite generic. Consider a stable, package-like name (e.g., "github.com/fleetdm/fleet/server/service") to make analytics clearer across libraries.

Example:

- r.Use(otmiddleware.Middleware("service", ...))
+ r.Use(otmiddleware.Middleware("github.com/fleetdm/fleet/server/service", ...))
server/service/testing_utils.go (1)

97-97: Optionally pass the test FleetConfig to exercise OTEL gating in tests

If you intend to validate OTEL-gated behavior in these tests, consider passing &fleetConfig (or a purpose-built cfg) instead of nil so tasks inherit the correct OTel enablement. Keeping nil is OK if you want to consistently disable spans in tests.

Example:

- task := async.NewTask(ds, nil, c, nil)
+ task := async.NewTask(ds, nil, c, &fleetConfig)
server/service/async/async_policy.go (2)

16-19: Tracer scope naming and reuse (nit)

Using otel.Tracer("async") works, but consider defining a package-level tracer with the module path for clearer instrumentation scope and to avoid repeated Tracer lookups.

Add once in this package (or a shared async package file):

var otelTracer = otel.Tracer("github.com/fleetdm/fleet/v4/server/service/async")

Then use otelTracer.Start(...) in collectors.


127-137: Add useful span attributes; keep error-free path low overhead

Good, minimal OTEL root span. You can cheaply enrich it with the collector stats so traces are more actionable without logs.

Apply this diff to include key counters as span attributes when the span ends:

-    defer span.End()
+    defer func() {
+      // Attach final counters to aid debugging and SLOs
+      span.SetAttributes(
+        attribute.Int("async.stats.keys",       stats.Keys),
+        attribute.Int("async.stats.items",      stats.Items),
+        attribute.Int("async.stats.inserts",    stats.Inserts),
+        attribute.Int("async.stats.updates",    stats.Updates),
+        attribute.Int("async.stats.deletes",    stats.Deletes),
+        attribute.Int("async.stats.redis_cmds", stats.RedisCmds),
+      )
+      span.End()
+    }()

If you want to capture failures as well, consider switching to a named return err and recording it in that same defer with span.RecordError(err) and span.SetStatus(codes.Error, ...) (requires importing go.opentelemetry.io/otel/codes). This is optional and can be done later.

server/service/async/async_label.go (2)

13-16: Tracer scope naming and reuse (nit)

Same note as in async_policy.go: prefer a package-level tracer with the package path as the scope.

var otelTracer = otel.Tracer("github.com/fleetdm/fleet/v4/server/service/async")

Use otelTracer.Start(...) below.


93-105: Expose collector metrics on the span

Nice gated root span. Adding final counters as attributes helps triage hot spots in traces without reading logs.

Apply this diff to attach counters:

-    defer span.End()
+    defer func() {
+      span.SetAttributes(
+        attribute.Int("async.stats.keys",       stats.Keys),
+        attribute.Int("async.stats.items",      stats.Items),
+        attribute.Int("async.stats.inserts",    stats.Inserts),
+        attribute.Int("async.stats.updates",    stats.Updates),
+        attribute.Int("async.stats.deletes",    stats.Deletes),
+        attribute.Int("async.stats.redis_cmds", stats.RedisCmds),
+      )
+      span.End()
+    }()

Optional next step (future PR): record errors via a named return and span.RecordError.

server/service/async/async_scheduled_query_stats.go (2)

15-18: Tracer scope naming and reuse (nit)

Same suggestion: define and reuse a package-level tracer with the package import path as the scope.

var otelTracer = otel.Tracer("github.com/fleetdm/fleet/v4/server/service/async")

97-107: Attach batch/save counters to the span

The span wrapper looks good. Adding attributes for processed keys/items and write counts will make traces self-describing when diagnosing exporter backpressure or Redis sizing.

Apply this diff:

-    defer span.End()
+    defer func() {
+      span.SetAttributes(
+        attribute.Int("async.stats.keys",       stats.Keys),
+        attribute.Int("async.stats.items",      stats.Items),
+        attribute.Int("async.stats.inserts",    stats.Inserts),
+        attribute.Int("async.stats.updates",    stats.Updates),
+        attribute.Int("async.stats.redis_cmds", stats.RedisCmds),
+      )
+      span.End()
+    }()

As in the other collectors, consider a named return to record errors later.

server/service/async/async_host_seen.go (2)

71-83: Record errors and set span status in the synchronous flush path.

If MarkHostsSeen fails, the span currently won’t capture the error. Recommend recording errors and setting status to improve observability.

Apply this diff within FlushHostsLastSeen to capture errors:

-    hostIDs := t.seenHostSet.getAndClearHostIDs()
-    return t.datastore.MarkHostsSeen(ctx, hostIDs, now)
+    hostIDs := t.seenHostSet.getAndClearHostIDs()
+    if err := t.datastore.MarkHostsSeen(ctx, hostIDs, now); err != nil {
+      if t.otelEnabled {
+        span.RecordError(err)
+      }
+      return err
+    }
+    return nil

94-105: Capture useful attributes and errors in the async collector span.

Consider annotating the span with batch sizes and item counts, and recording errors from downstream calls (ds.MarkHostsSeen, Redis ops). This makes async task health much easier to diagnose.

Example (illustrative; place where values are available):

// After loading hostIDs
span.SetAttributes(
  attribute.Int("async.items_total", len(hostIDs)),
  attribute.Int("async.insert_batch", int(cfg.InsertBatch)),
)
// On error paths:
span.RecordError(err)
cmd/fleet/serve.go (1)

157-165: Graceful tracer shutdown and resource attribution.

Two improvements:

  • Ensure tracerProvider.Shutdown is called on process shutdown to flush spans.
  • Attach a Resource (service.name, version, environment) so traces are attributed correctly across backends.

Example adjustments (conceptual):

res, _ := resource.New(ctx,
  resource.WithFromEnv(),
  resource.WithTelemetrySDK(),
  resource.WithHost(),
  resource.WithAttributes(
    semconv.ServiceNameKey.String("fleet"),
    semconv.ServiceVersionKey.String(version.Version().Version),
  ),
)
tracerProvider := sdktrace.NewTracerProvider(
  sdktrace.WithResource(res),
  sdktrace.WithSpanProcessor(batchSpanProcessor),
)
otel.SetTracerProvider(tracerProvider)

// In the shutdown goroutine (before srv.Shutdown):
_ = tracerProvider.Shutdown(ctx)

Optionally, only initialize OTEL when config.OTELEnabled() is true to avoid initializing two tracing systems when TracingType != "opentelemetry".

server/service/async/async.go (2)

26-26: OTEL enablement captured at construction time — acceptable trade-off; consider dynamic read if hot-reloads are expected

Storing a snapshot boolean keeps per-collector decisions cheap. If Fleet supports live OTEL toggling without process restart, instrumentation gates won’t reflect changes until NewTask is reconstructed.

If hot-reloads are a requirement, store a pointer to FleetConfig (or a function) and read OTELEnabled() at use sites instead of copying the value. Otherwise, current approach is fine.

Also applies to: 50-50


30-38: Clarify nil FleetConfig disables async collectors & OpenTelemetry in NewTask docstring

Verification confirms:

  • A zero-value OsqueryConfig yields Enabled=false in AsyncConfigForTask, so no async collectors start when fleetConfig is nil.
  • All call sites to NewTask (both qualified and unqualified) accept *config.FleetConfig (including nil) and remain correct.
  • There are no direct Task composite literals elsewhere that would break with this API change.

Apply this docstring tweak for clarity:

-// NewTask configures and returns a Task.
+// NewTask configures and returns a Task.
// If fleetConfig is nil, async collectors default to disabled (via zero-value OsqueryConfig)
// and OpenTelemetry tracing is disabled.
 func NewTask(ds fleet.Datastore, pool fleet.RedisPool, clck clock.Clock, fleetConfig *config.FleetConfig) *Task {
server/service/async/async_scheduled_query_stats_test.go (1)

220-227: Minor consistency: include Update/Delete batch sizes for parity

For parity with earlier setup, consider specifying AsyncHostUpdateBatch and AsyncHostDeleteBatch too. It won’t change behavior here but keeps test intent uniform across cases.

Apply this small addition:

 task := NewTask(ds, pool, clock.C, &config.FleetConfig{
   Osquery: config.OsqueryConfig{
     EnableAsyncHostProcessing:   "true",
     AsyncHostInsertBatch:        3,
+    AsyncHostUpdateBatch:        3,
+    AsyncHostDeleteBatch:        3,
     AsyncHostRedisPopCount:      3,
     AsyncHostRedisScanKeysCount: 10,
   },
 })
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 99705cc and 8952369.

📒 Files selected for processing (17)
  • changes/32313-otel-improvements (1 hunks)
  • cmd/fleet/serve.go (2 hunks)
  • server/config/config.go (1 hunks)
  • server/service/async/async.go (1 hunks)
  • server/service/async/async_host_seen.go (3 hunks)
  • server/service/async/async_host_seen_test.go (3 hunks)
  • server/service/async/async_label.go (2 hunks)
  • server/service/async/async_label_test.go (4 hunks)
  • server/service/async/async_policy.go (2 hunks)
  • server/service/async/async_policy_test.go (6 hunks)
  • server/service/async/async_scheduled_query_stats.go (2 hunks)
  • server/service/async/async_scheduled_query_stats_test.go (3 hunks)
  • server/service/handler.go (1 hunks)
  • server/service/integration_core_test.go (1 hunks)
  • server/service/osquery_test.go (1 hunks)
  • server/service/osquery_utils/queries_test.go (1 hunks)
  • server/service/testing_utils.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

⚙️ CodeRabbit configuration file

When reviewing SQL queries that are added or modified, ensure that appropriate filtering criteria are applied—especially when a query is intended to return data for a specific entity (e.g., a single host). Check for missing WHERE clauses or incorrect filtering that could lead to incorrect or non-deterministic results (e.g., returning the first row instead of the correct one). Flag any queries that may return unintended results due to lack of precise scoping.

Files:

  • server/service/integration_core_test.go
  • server/service/async/async_label.go
  • server/config/config.go
  • server/service/async/async_policy.go
  • server/service/async/async_scheduled_query_stats.go
  • server/service/osquery_utils/queries_test.go
  • server/service/testing_utils.go
  • server/service/handler.go
  • server/service/async/async_host_seen.go
  • server/service/osquery_test.go
  • server/service/async/async_host_seen_test.go
  • server/service/async/async.go
  • server/service/async/async_scheduled_query_stats_test.go
  • server/service/async/async_label_test.go
  • server/service/async/async_policy_test.go
  • cmd/fleet/serve.go
🔇 Additional comments (22)
server/service/testing_utils.go (1)

97-97: Nil-safety confirmed for async.NewTask

The implementation in server/service/async/async.go defaults both osqueryConf and otelEnabled to zero values, then only reads from fleetConfig inside an if fleetConfig != nil { … } guard (lines 34–38). No nil-dereference is possible when passing nil, so no change is needed.

changes/32313-otel-improvements (1)

1-1: Changelog entry reads clearly and matches PR scope

Concise, accurate summary of the OTEL additions, span naming change, compression, and batch sizing. Good to land as-is.

server/service/osquery_test.go (1)

461-461: I've requested the NewTask signature to confirm expected parameters. Once we have its definition, we can verify that all call sites (in testing_utils.go, osquery_test.go, integration_core_test.go, and cmd/fleet/serve.go) match the updated signature (accepting *config.FleetConfig as the last argument).

Please provide the output of the above script so I can ensure every call site is correct and then update the review comment accordingly.

server/service/async/async_host_seen.go (1)

14-16: LGTM on OTEL imports.

Imports are minimal and correct for the added instrumentation.

cmd/fleet/serve.go (1)

407-407: *NewTask now takes config.FleetConfig — looks good.

Passing &config aligns with the updated API and enables Task to derive both Osquery and OTEL enablement.

server/service/osquery_utils/queries_test.go (1)

108-111: *Test adjustments match NewTask(config.FleetConfig).

Wrapper with FleetConfig{Osquery: …} is correct and backward-compatible for this test.

server/service/async/async_host_seen_test.go (3)

147-155: *Test updates align with NewTask(config.FleetConfig).

Batch-related fields under Osquery are preserved; this keeps collector behavior identical under the new API.


219-225: Async path test updated correctly.

EnableAsyncHostProcessing and Redis scan/pop config are passed via FleetConfig as expected.


182-182: Nil‐safety Verified in NewTask

The NewTask implementation in server/service/async/async.go already guards against a nil fleetConfig:

  • Default values are set up before the conditional (lines 33–34).
  • The code checks if fleetConfig != nil before accessing fleetConfig.Osquery or calling fleetConfig.OTELEnabled() (lines 35–38).

Since these nil checks prevent any dereference when fleetConfig is nil, no further changes are needed here.

server/service/async/async_policy_test.go (6)

232-240: Collector test config migrated correctly to FleetConfig.

Batch sizing and Redis parameters preserved.


282-290: Follow-up collector test config migration looks good.


362-371: Async-policy test config migration is correct.


474-483: Async “no policies” test updated correctly.

Config moves are consistent with the new API.


319-319: NewTask correctly handles nil FleetConfig — no action needed.

NewTask in server/service/async/async.go (lines 35–38) explicitly checks fleetConfig != nil before accessing any of its fields, leaving osqueryConf at its zero-value when nil. Passing nil for FleetConfig is therefore safe and will not cause a panic.


433-433: Nil FleetConfig is handled safely in NewTask—no changes required.

NewTask explicitly checks for a nil fleetConfig before touching any of its fields, initializing a zero-value OsqueryConfig and disabling async processing by default. The “no policies sync” test’s use of NewTask(ds, pool, clock.C, nil) is therefore valid and safe.

server/service/async/async.go (1)

40-44: Task configs derived from zero-value OsqueryConfig when fleetConfig is nil — confirm disabled-by-default behavior

Populating taskCfgs using osqueryConf.AsyncConfigForTask is clear. This relies on AsyncConfigForTask returning cfg.Enabled=false for a zero-value OsqueryConfig. If that assumption ever changes, collectors could start unintentionally when fleetConfig is nil.

Please confirm AsyncConfigForTask returns disabled configs for a zero-value OsqueryConfig with the script in my previous comment (it prints the implementation context). If it doesn’t, consider guarding with an early return when fleetConfig == nil:

- taskCfgs[config.AsyncTaskLabelMembership] = osqueryConf.AsyncConfigForTask(config.AsyncTaskLabelMembership)
+ if fleetConfig == nil {
+   // Nothing enabled when no fleetConfig provided.
+   return &Task{datastore: ds, pool: pool, clock: clck, taskConfigs: taskCfgs, otelEnabled: false}
+ }
+ taskCfgs[config.AsyncTaskLabelMembership] = osqueryConf.AsyncConfigForTask(config.AsyncTaskLabelMembership)
server/service/async/async_scheduled_query_stats_test.go (2)

147-156: LGTM: Updated NewTask call to pass FleetConfig wrapper

The test now mirrors the production API change and keeps the async host processing knobs explicit. Using "true" (string) aligns with OsqueryConfig’s field type in this codebase.


180-180: LGTM: Sync path exercised with nil FleetConfig

Good coverage of the nil-config path to ensure immediate persistence (no Redis fanout) when async is not enabled.

server/service/async/async_label_test.go (4)

208-216: LGTM: Test wiring updated to FleetConfig; leaves behavior unchanged

The wrapper preserves previous async behavior while aligning with the new NewTask signature.


257-265: LGTM: Second collector run uses same FleetConfig wrapper

Consistent with the earlier case; good to validate update semantics independently.


294-294: LGTM: Sync path with nil FleetConfig

Covers immediate MySQL path and ensures no Redis side-effects; assertions below confirm expected emptiness of Redis keys.


337-346: LGTM: Async path config carried via FleetConfig

The async path is correctly configured and validated via Redis assertions.

Comment thread cmd/fleet/serve.go
Comment thread server/service/integration_core_test.go
Copy link
Copy Markdown
Contributor

@ksykulev ksykulev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff. 2 non-blocking comments/questions.

if t.otelEnabled {
tracer := otel.Tracer("async")
var span trace.Span
ctx, span = tracer.Start(ctx, "async.flush_hosts_last_seen",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the returned/wrapped ctx capture all the code that uses that ctx? Or do we have to manually add spans every place we want to capture data?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea here is that this is a root span, and the SQL spans created by the library github.com/XSAM/otelsql will automatically be child spans of this span.

Generally, yes, we have add spans where we want them. This is made easier by libraries, middleware, or reusable methods. In this case, I considered factoring this out into its own method/wrapper, but didn't see that much benefit since there aren't that many async methods to instrument.

Comment thread server/service/handler.go
otmiddleware.WithSpanNameFormatter(func(route string, r *http.Request) string {
// Use the guideline for span names: {method} {target}
// See https://opentelemetry.io/docs/specs/semconv/http/http-spans/
return r.Method + " " + route
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think maybe fleet version might be valuable to add here? Or maybe we should add those to the spans as an attribute?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fleet version, like 4.72.0? Hmm, that would be a lot of traces with the same fleet version. Yes, any extra data can be added as attributes.

@getvictor getvictor merged commit 84e45f6 into main Aug 29, 2025
42 checks passed
@getvictor getvictor deleted the victor/32313-otel-issues branch August 29, 2025 00:32
@coderabbitai coderabbitai Bot mentioned this pull request Feb 3, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Identify current issues with OTEL and fix minor issues

2 participants