
[Feat]: OpenTelemetry metrics aggregation and gap filling #21522 #21585

Open
sarika-03 wants to merge 3 commits into netdata:master from sarika-03:feature


@sarika-03 sarika-03 commented Jan 17, 2026

Summary

This change implements slot-based aggregation and semantic gap filling for OpenTelemetry (OTLP) metrics in the otel-plugin, addressing the mismatch between OTLP’s event-based metric model and Netdata’s fixed-interval, slot-based storage model.

Netdata requires exactly one value per dimension per update interval, while OTLP metrics may arrive multiple times within an interval or not at all. This implementation introduces a deterministic aggregation layer that groups incoming OTLP datapoints into fixed time slots, aggregates them according to metric semantics (temporality and monotonicity), and fills gaps when data is missing.
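As an illustration of the slot grouping described above, here is a minimal sketch of mapping a datapoint timestamp to a fixed slot. The function names `slot_index` and `slot_start_ns` are hypothetical and are not taken from the PR's code; the sketch assumes nanosecond timestamps and treats a zero interval as invalid.

```rust
/// Index of the fixed-width slot that contains `timestamp_ns`.
/// Returns `None` when `interval_ns` is zero, so the caller never divides by zero.
/// (Hypothetical helper, not the plugin's actual API.)
pub fn slot_index(timestamp_ns: u64, interval_ns: u64) -> Option<u64> {
    timestamp_ns.checked_div(interval_ns)
}

/// Inclusive start of the slot that contains `timestamp_ns`, in nanoseconds.
pub fn slot_start_ns(timestamp_ns: u64, interval_ns: u64) -> Option<u64> {
    // index * interval never exceeds the original timestamp, so it cannot overflow.
    slot_index(timestamp_ns, interval_ns).map(|index| index * interval_ns)
}
```

With a 10-second interval, every datapoint stamped between 20 s and just under 30 s lands in slot 2, regardless of when it arrives.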

Key design decisions:

  • Introduced an explicit, configurable collection_interval instead of relying on auto-detection.
  • Implemented slot finalization with a configurable grace period to handle late-arriving data.
  • Applied temporality-aware aggregation and gap filling:
    • Gauges repeat the last known value.
    • Delta counters fill gaps with zero.
    • Cumulative monotonic counters repeat the last cumulative value.
  • Optimized memory usage by pre-aggregating values per slot instead of storing all datapoints.
  • Improved histogram handling by separating bucket/count sums from summary statistics (sum/min/max).
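The gap-filling rules listed above can be sketched as a single decision function. The enum `MetricSemantics` and the function `emit_value` are assumptions made for illustration and may not match the types used in samples_table.rs.

```rust
/// Metric semantics relevant to gap filling (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MetricSemantics {
    Gauge,
    DeltaCounter,
    CumulativeMonotonicCounter,
}

/// Value to emit for one slot.
/// `aggregated` is the slot's aggregate, or `None` if no datapoint arrived.
/// `last` is the value emitted for the previous slot, if any.
pub fn emit_value(
    semantics: MetricSemantics,
    aggregated: Option<f64>,
    last: Option<f64>,
) -> Option<f64> {
    match (aggregated, semantics) {
        // Data arrived: emit the aggregate unchanged.
        (Some(value), _) => Some(value),
        // Empty slot, delta counter: nothing happened, so the increment is zero.
        (None, MetricSemantics::DeltaCounter) => Some(0.0),
        // Empty slot, gauge or cumulative counter: repeat the last known value.
        (None, MetricSemantics::Gauge)
        | (None, MetricSemantics::CumulativeMonotonicCounter) => last,
    }
}
```

Returning `None` when there is neither data nor a previous value leaves it to the caller to decide whether to skip the slot entirely.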

Fixes #21522

Test Plan

Automated tests:

  • Added unit tests in netdata_chart.rs:
    • test_delta_counter_gap_fill: verifies that delta counters emit 0 for empty slots.
    • test_cumulative_counter_gap_fill: verifies that cumulative monotonic counters repeat the last value for empty slots.
  • All tests pass with cargo test -p otel-plugin.

Manual testing:

  • Built the plugin using cargo build --release.
  • Ran the plugin in foreground mode with a fixed collection_interval.
  • Sent OTLP metrics via the OTLP HTTP /v1/metrics endpoint with irregular timing.
  • Verified that:
    • Metrics are emitted only on fixed slot boundaries.
    • Multiple datapoints within a slot result in a single aggregated value.
    • Missing slots are filled according to metric semantics.
    • Late data arriving after slot finalization is dropped (no backfilling).

Additional Information

Previously, OTLP metrics without data in a given interval caused dimensions to be archived, resulting in gaps in charts and unreliable alerting. This change ensures stable visualization and alerting by explicitly filling gaps, while still allowing inactive dimensions to be archived after a configurable timeout.

Backfilling finalized slots is intentionally not supported to avoid reordering complexity and performance overhead, and this behavior is enforced consistently.
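To make the finalization and no-backfill behavior concrete, the following is a minimal sketch, not the PR's implementation. `SlotTable` here is a hypothetical type that sums values per slot, finalizes a slot once its end plus the grace period has passed, and rejects any datapoint that targets an already-finalized slot.

```rust
use std::collections::BTreeMap;

/// Hypothetical slot table: sums per open slot, plus a watermark of finalized slots.
pub struct SlotTable {
    interval_ns: u64,
    grace_ns: u64,
    /// Running sum for each slot that is still open.
    open: BTreeMap<u64, f64>,
    /// Every slot index below this one has been finalized.
    next_open_slot: u64,
}

impl SlotTable {
    pub fn new(interval_ns: u64, grace_ns: u64, start_ns: u64) -> Self {
        assert!(interval_ns > 0, "interval must be non-zero");
        SlotTable {
            interval_ns,
            grace_ns,
            open: BTreeMap::new(),
            next_open_slot: start_ns / interval_ns,
        }
    }

    /// Adds a datapoint. Returns `false` if it was dropped because its slot
    /// is already finalized (no backfilling).
    pub fn insert(&mut self, timestamp_ns: u64, value: f64) -> bool {
        let slot = timestamp_ns / self.interval_ns;
        if slot < self.next_open_slot {
            return false;
        }
        *self.open.entry(slot).or_insert(0.0) += value;
        true
    }

    /// Finalizes, in order, every slot whose end plus grace period is not after `now_ns`.
    /// Empty slots are reported as `None` so the caller can apply gap filling.
    pub fn finalize(&mut self, now_ns: u64) -> Vec<(u64, Option<f64>)> {
        let mut finalized = Vec::new();
        loop {
            let slot = self.next_open_slot;
            let slot_end = (slot + 1).saturating_mul(self.interval_ns);
            if slot_end.saturating_add(self.grace_ns) > now_ns {
                break;
            }
            finalized.push((slot, self.open.remove(&slot)));
            self.next_open_slot += 1;
        }
        finalized
    }
}
```

Keeping a single watermark (`next_open_slot`) is one way to guarantee that late data cannot reopen a slot after it has been emitted, which is the behavior the description says is enforced.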

For users: How does this change affect me?

  • Affected area: OpenTelemetry metrics ingestion via the Netdata otel-plugin.
  • Visibility: Mostly under-the-hood; users will notice smoother charts and fewer gaps.
  • User impact:
    • Charts no longer show gaps when OTLP metrics are temporarily missing.
    • Counters and gauges behave consistently despite asynchronous metric arrival.
    • Late-arriving metrics are handled predictably.
  • Benefits:
    • Improved dashboard and alert reliability.
    • Correct handling of delta versus cumulative counters.
    • Better support for OTLP histograms with correct units and semantics.

Summary by cubic

Adds slot-based aggregation and semantic gap filling to OTLP metrics in the otel-plugin so Netdata emits one value per fixed interval. This stabilizes charts and alerts, and handles late/missing data predictably.

  • New Features

    • Fixed-interval slotting with a configurable collection interval and per-metric overrides.
    • Grace period for late data; finalized slots drop late arrivals (no backfill).
    • Gap fill rules: gauges repeat last value; delta counters fill with 0; cumulative monotonic counters repeat the last value.
    • Improved histogram handling: bucket/count outputs plus sum/min/max as gauges with correct units.
    • Chart algorithms chosen by semantics (incremental for monotonic sums; absolute otherwise).
    • New config flags: --otel-metrics-collection-interval, --otel-metrics-grace-period, --otel-metrics-dimension-archive-timeout; optional per-metric collection_interval.
  • Migration

    • Set --otel-metrics-collection-interval to your desired frequency; use per-metric overrides if needed.
    • Expect emissions only on slot boundaries; late data after the grace period is dropped.
    • Inactive dimensions auto-archive after --otel-metrics-dimension-archive-timeout.
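The algorithm selection rule in the summary above (incremental for monotonic sums, absolute otherwise) could look roughly like this. `MetricKind`, `ChartAlgorithm`, and `chart_algorithm` are illustrative names, not identifiers confirmed by the PR.

```rust
/// Simplified view of an OTLP metric's kind (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricKind {
    Gauge,
    Sum { monotonic: bool },
    Histogram,
}

/// Netdata chart algorithms referenced in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChartAlgorithm {
    Absolute,
    Incremental,
}

/// Picks the algorithm from metric semantics:
/// monotonic sums are incremental, everything else is absolute.
pub fn chart_algorithm(kind: MetricKind) -> ChartAlgorithm {
    match kind {
        MetricKind::Sum { monotonic: true } => ChartAlgorithm::Incremental,
        _ => ChartAlgorithm::Absolute,
    }
}
```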

Written for commit a3a4f2a. Summary will update on new commits.


CLAassistant commented Jan 17, 2026

CLA assistant check
All committers have signed the CLA.

@cubic-dev-ai cubic-dev-ai bot left a comment

3 issues found across 6 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/crates/netdata-otel/otel-plugin/src/samples_table.rs">

<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/samples_table.rs:28">
P1: Last-value aggregation overwrites based on arrival order; older out-of-order samples can replace newer values within a slot.</violation>

<violation number="2" location="src/crates/netdata-otel/otel-plugin/src/samples_table.rs:86">
P1: Division by zero is possible when interval_nano is zero because nothing validates it before the division.</violation>
</file>

<file name="src/crates/netdata-otel/otel-plugin/src/plugin_config.rs">

<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/plugin_config.rs:69">
P1: Newly added metrics duration fields are required in YAML but lack serde defaults, causing older otel.yaml configs to fail deserialization and the plugin to refuse to start after upgrade.</violation>
</file>
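One possible way to address the out-of-order concern in the first samples_table.rs finding is to keep the datapoint timestamp next to the value and only overwrite when the incoming sample is not older. This is a sketch under that assumption; `LastValue` is a hypothetical type, not the aggregator used in the PR.

```rust
/// Last-value aggregator that orders by datapoint timestamp, not arrival order.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, Default)]
pub struct LastValue {
    /// Timestamp and value of the newest sample seen so far.
    latest: Option<(u64, f64)>,
}

impl LastValue {
    /// Records a sample unless a strictly newer one is already stored.
    /// On equal timestamps the later arrival wins.
    pub fn observe(&mut self, timestamp_ns: u64, value: f64) {
        match self.latest {
            Some((stored_ts, _)) if stored_ts > timestamp_ns => {
                // Older, out-of-order sample: ignore it.
            }
            _ => self.latest = Some((timestamp_ns, value)),
        }
    }

    /// Value of the newest sample, if any was observed.
    pub fn value(&self) -> Option<f64> {
        self.latest.map(|(_, value)| value)
    }
}
```

The cost is one extra `u64` per slot, which stays consistent with the goal of pre-aggregating rather than buffering every datapoint.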

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copilot AI left a comment

Pull request overview

This PR implements slot-based aggregation and semantic gap filling for OpenTelemetry metrics to bridge the mismatch between OTLP's event-based model and Netdata's fixed-interval storage model. The implementation introduces explicit collection intervals, grace periods for late data, and metric-specific gap-filling strategies.

Changes:

  • Introduced fixed-interval slot aggregation with configurable collection intervals and grace periods
  • Implemented temporality-aware gap filling (gauges repeat last value, delta counters fill with zero, cumulative counters repeat last value)
  • Enhanced histogram handling to separate bucket counts from summary statistics (sum/min/max) with correct units and semantics
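The histogram split mentioned in the last bullet can be illustrated with a small sketch. The types `HistogramPoint`, `Dimension`, and `DimKind`, and the dimension names such as `bucket_0`, are assumptions for illustration only; the actual flattening lives in flatten_otel/src/metrics.rs and may be structured differently.

```rust
/// How a flattened dimension should be treated downstream (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DimKind {
    Counter,
    Gauge,
}

/// Simplified histogram datapoint; sum, min, and max are optional in OTLP.
pub struct HistogramPoint {
    pub bucket_counts: Vec<u64>,
    pub count: u64,
    pub sum: Option<f64>,
    pub min: Option<f64>,
    pub max: Option<f64>,
}

/// One flattened output dimension.
pub struct Dimension {
    pub name: String,
    pub kind: DimKind,
    pub value: f64,
}

/// Flattens a histogram point: buckets and count as counters,
/// sum/min/max (when present) as gauges.
pub fn flatten_histogram(point: &HistogramPoint) -> Vec<Dimension> {
    let mut dims = Vec::new();
    for (index, &bucket) in point.bucket_counts.iter().enumerate() {
        dims.push(Dimension {
            name: format!("bucket_{}", index),
            kind: DimKind::Counter,
            value: bucket as f64,
        });
    }
    dims.push(Dimension {
        name: "count".to_string(),
        kind: DimKind::Counter,
        value: point.count as f64,
    });
    for (name, stat) in [("sum", point.sum), ("min", point.min), ("max", point.max)] {
        if let Some(value) = stat {
            dims.push(Dimension { name: name.to_string(), kind: DimKind::Gauge, value });
        }
    }
    dims
}
```

Tagging each dimension with its own kind at flattening time is what lets later stages pick different aggregation and gap-fill rules for bucket counts and for summary statistics.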

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| src/crates/netdata-otel/otel-plugin/src/samples_table.rs | Complete rewrite replacing sample buffers with slot-based storage, aggregation types, and gap-fill strategies |
| src/crates/netdata-otel/otel-plugin/src/plugin_config.rs | Added configuration for collection interval, grace period, dimension archive timeout, and per-metric overrides |
| src/crates/netdata-otel/otel-plugin/src/netdata_chart.rs | Refactored chart processing to use slot-based emission with semantic detection and gap filling logic |
| src/crates/netdata-otel/otel-plugin/src/metrics_service.rs | Updated to pass MetricsConfig and current time to chart processing |
| src/crates/netdata-otel/otel-plugin/src/flattened_point.rs | Added aggregation_temporality field and _nd_name_suffix support for histogram metrics |
| src/crates/netdata-otel/flatten_otel/src/metrics.rs | Enhanced histogram flattening to emit count, sum, min, and max as separate dimensions with proper typing |


@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 6 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/crates/netdata-otel/otel-plugin/src/samples_table.rs">

<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/samples_table.rs:100">
P2: Late-arriving data is accepted after finalized slots are popped, allowing backfill of already-finalized intervals</violation>
</file>

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.



@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 6 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/crates/netdata-otel/otel-plugin/src/metrics_service.rs">

<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/metrics_service.rs:135">
P2: Cleanup runs only once every 60 exports, so when traffic is low the time between cleanups can exceed dimension_archive_timeout, leaving stale charts.</violation>
</file>

<file name="src/crates/netdata-otel/flatten_otel/src/metrics.rs">

<violation number="1" location="src/crates/netdata-otel/flatten_otel/src/metrics.rs:295">
P2: Sum/min/max histogram entries are tagged as gauges but later overwritten to histogram type, so they are exported with incorrect histogram semantics.</violation>
</file>
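For the metrics_service.rs finding, one alternative to counting exports is to drive archiving from elapsed time, so a quiet period cannot postpone cleanup. The sketch below assumes a hypothetical `DimensionTracker` that is called with the current time on a timer or on every processing pass; it is not the PR's code.

```rust
use std::collections::HashMap;

/// Tracks when each dimension last received data (hypothetical type).
pub struct DimensionTracker {
    timeout_ns: u64,
    last_seen_ns: HashMap<String, u64>,
}

impl DimensionTracker {
    pub fn new(timeout_ns: u64) -> Self {
        DimensionTracker { timeout_ns, last_seen_ns: HashMap::new() }
    }

    /// Marks `dimension` as active at `now_ns`.
    pub fn touch(&mut self, dimension: &str, now_ns: u64) {
        self.last_seen_ns.insert(dimension.to_string(), now_ns);
    }

    /// Removes and returns (sorted by name) every dimension that has been
    /// idle for at least the timeout. Depends only on time, not on traffic.
    pub fn archive_expired(&mut self, now_ns: u64) -> Vec<String> {
        let timeout = self.timeout_ns;
        let mut expired = Vec::new();
        self.last_seen_ns.retain(|name, seen| {
            let keep = now_ns.saturating_sub(*seen) < timeout;
            if !keep {
                expired.push(name.clone());
            }
            keep
        });
        expired.sort();
        expired
    }
}
```

Calling `archive_expired` from a periodic tick rather than from the export path bounds the archive delay by the tick period instead of by request volume.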


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feat]: OpenTelemetry metrics aggregation and gap filling

2 participants