[Feat]: OpenTelemetry metrics aggregation and gap filling #21522 #21585
sarika-03 wants to merge 3 commits into netdata:master
3 issues found across 6 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/crates/netdata-otel/otel-plugin/src/samples_table.rs">
<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/samples_table.rs:28">
P1: Last-value aggregation overwrites based on arrival order; older out-of-order samples can replace newer values within a slot.</violation>
<violation number="2" location="src/crates/netdata-otel/otel-plugin/src/samples_table.rs:86">
P1: Division by zero is possible when interval_nano is zero, because nothing validates the interval before the division.</violation>
</file>
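Both findings for samples_table.rs can be illustrated with a small sketch. This is not the PR's code: `slot_index` and `LastValue` are hypothetical names, and the sketch assumes timestamps and intervals are nanosecond `u64` values. The idea is to refuse a zero interval instead of dividing by it, and to let the sample timestamp, not the arrival order, decide which value wins within a slot.

```rust
/// Map a timestamp to its slot index.
/// Returns `None` when the interval is zero instead of panicking.
fn slot_index(timestamp_nano: u64, interval_nano: u64) -> Option<u64> {
    // `checked_div` yields `None` for a zero divisor.
    timestamp_nano.checked_div(interval_nano)
}

/// Last-value aggregate that remembers the timestamp of the sample it holds.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct LastValue {
    /// (timestamp_nano, value) of the newest sample seen so far.
    latest: Option<(u64, f64)>,
}

impl LastValue {
    /// Keep the sample only if it is at least as new as the stored one.
    fn observe(&mut self, timestamp_nano: u64, value: f64) {
        match self.latest {
            // An older, out-of-order sample must not replace a newer value.
            Some((stored_ts, _)) if stored_ts > timestamp_nano => {}
            _ => self.latest = Some((timestamp_nano, value)),
        }
    }

    /// The value of the newest sample, if any sample was observed.
    fn value(&self) -> Option<f64> {
        self.latest.map(|(_, v)| v)
    }
}
```

With this shape, a sample stamped 100 that arrives after a sample stamped 200 is ignored, and a caller has to handle the `None` case for a misconfigured interval explicitly.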
<file name="src/crates/netdata-otel/otel-plugin/src/plugin_config.rs">
<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/plugin_config.rs:69">
P1: Newly added metrics duration fields are required in YAML but lack serde defaults, causing older otel.yaml configs to fail deserialization and the plugin to refuse to start after upgrade.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Pull request overview
This PR implements slot-based aggregation and semantic gap filling for OpenTelemetry metrics to bridge the mismatch between OTLP's event-based model and Netdata's fixed-interval storage model. The implementation introduces explicit collection intervals, grace periods for late data, and metric-specific gap-filling strategies.
Changes:
- Introduced fixed-interval slot aggregation with configurable collection intervals and grace periods
- Implemented temporality-aware gap filling (gauges repeat last value, delta counters fill with zero, cumulative counters repeat last value)
- Enhanced histogram handling to separate bucket counts from summary statistics (sum/min/max) with correct units and semantics
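The gap-filling rules listed above can be sketched as follows. The names `MetricKind`, `gap_fill`, and `fill_series` are hypothetical and do not come from the PR. The sketch also assumes that a gauge or cumulative counter with no earlier value stays empty, and that a delta counter fills with zero even before its first sample.

```rust
/// Metric categories that decide how an empty slot is filled.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MetricKind {
    Gauge,
    DeltaCounter,
    CumulativeCounter,
}

/// Value to emit for a slot that received no data.
fn gap_fill(kind: MetricKind, last_value: Option<f64>) -> Option<f64> {
    match kind {
        // No increments were reported, so the delta for the slot is zero.
        MetricKind::DeltaCounter => Some(0.0),
        // Repeat the last known value; stays `None` if nothing was seen yet.
        MetricKind::Gauge | MetricKind::CumulativeCounter => last_value,
    }
}

/// Fill every empty slot of a series according to the metric kind.
fn fill_series(kind: MetricKind, slots: &[Option<f64>]) -> Vec<Option<f64>> {
    let mut last = None;
    slots
        .iter()
        .map(|slot| match slot {
            Some(v) => {
                last = Some(*v);
                Some(*v)
            }
            None => gap_fill(kind, last),
        })
        .collect()
}
```

For example, a gauge series `[1.0, gap, 3.0, gap]` would become `[1.0, 1.0, 3.0, 3.0]`, while a delta counter series `[5.0, gap, 2.0]` would become `[5.0, 0.0, 2.0]`.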
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| src/crates/netdata-otel/otel-plugin/src/samples_table.rs | Complete rewrite replacing sample buffers with slot-based storage, aggregation types, and gap-fill strategies |
| src/crates/netdata-otel/otel-plugin/src/plugin_config.rs | Added configuration for collection interval, grace period, dimension archive timeout, and per-metric overrides |
| src/crates/netdata-otel/otel-plugin/src/netdata_chart.rs | Refactored chart processing to use slot-based emission with semantic detection and gap filling logic |
| src/crates/netdata-otel/otel-plugin/src/metrics_service.rs | Updated to pass MetricsConfig and current time to chart processing |
| src/crates/netdata-otel/otel-plugin/src/flattened_point.rs | Added aggregation_temporality field and _nd_name_suffix support for histogram metrics |
| src/crates/netdata-otel/flatten_otel/src/metrics.rs | Enhanced histogram flattening to emit count, sum, min, and max as separate dimensions with proper typing |
…, and config compatibility
1 issue found across 6 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/crates/netdata-otel/otel-plugin/src/samples_table.rs">
<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/samples_table.rs:100">
P2: Late-arriving data is still accepted after finalized slots are popped, which allows backfilling of already-finalized intervals.</violation>
</file>
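One possible way to address this finding is a watermark that records the highest finalized slot and rejects anything at or below it. The sketch below is an illustration under that assumption, not the PR's implementation. `SlotTable`, `insert`, and `finalize` are hypothetical names, and each slot holds a single `f64` for brevity.

```rust
use std::collections::BTreeMap;

/// Open slots plus a watermark marking what has already been emitted.
#[derive(Debug, Default)]
struct SlotTable {
    /// Slot index -> value for slots that are still open.
    open: BTreeMap<u64, f64>,
    /// Highest slot index that has been finalized, if any.
    finalized_up_to: Option<u64>,
}

impl SlotTable {
    /// Store a sample. Returns `false` if the slot was already finalized
    /// and the sample was dropped.
    fn insert(&mut self, slot: u64, value: f64) -> bool {
        if matches!(self.finalized_up_to, Some(w) if slot <= w) {
            return false;
        }
        self.open.insert(slot, value);
        true
    }

    /// Pop every open slot at or before `up_to` and advance the watermark.
    fn finalize(&mut self, up_to: u64) -> Vec<(u64, f64)> {
        let mut done = Vec::new();
        while let Some(entry) = self.open.first_entry() {
            if *entry.key() > up_to {
                break;
            }
            done.push(entry.remove_entry());
        }
        // The watermark only moves forward.
        self.finalized_up_to = Some(self.finalized_up_to.map_or(up_to, |w| w.max(up_to)));
        done
    }
}
```

Because the watermark never moves backwards, a sample for slot 1 that arrives after `finalize(1)` is dropped, even though slot 1 is no longer present in the map.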
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
2 issues found across 6 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/crates/netdata-otel/otel-plugin/src/metrics_service.rs">
<violation number="1" location="src/crates/netdata-otel/otel-plugin/src/metrics_service.rs:135">
P2: Cleanup runs only once every 60 exports, so under low traffic the time between cleanups can exceed dimension_archive_timeout and leave stale charts.</violation>
</file>
<file name="src/crates/netdata-otel/flatten_otel/src/metrics.rs">
<violation number="1" location="src/crates/netdata-otel/flatten_otel/src/metrics.rs:295">
P2: Sum/min/max histogram entries are tagged as gauges but later overwritten to histogram type, so they are exported with incorrect histogram semantics.</violation>
</file>
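For the cleanup finding in metrics_service.rs, one option is to drive the sweep from elapsed time rather than from an export counter. The sketch below assumes some periodic tick calls `sweep`. `DimensionTracker`, `touch`, and `sweep` are hypothetical names, and the timeout passed to `new` stands in for the configured dimension_archive_timeout.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when each dimension last received data.
struct DimensionTracker {
    last_seen: HashMap<String, Instant>,
    archive_timeout: Duration,
}

impl DimensionTracker {
    fn new(archive_timeout: Duration) -> Self {
        Self {
            last_seen: HashMap::new(),
            archive_timeout,
        }
    }

    /// Record that a dimension received data at `now`.
    fn touch(&mut self, id: &str, now: Instant) {
        self.last_seen.insert(id.to_string(), now);
    }

    /// Remove and return (sorted) the dimensions idle for at least the timeout.
    /// Meant to be called from a timer, independent of export traffic.
    fn sweep(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.archive_timeout;
        let mut stale: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) >= timeout)
            .map(|(id, _)| id.clone())
            .collect();
        stale.sort();
        for id in &stale {
            self.last_seen.remove(id);
        }
        stale
    }
}
```

Passing `now` explicitly keeps the logic testable and makes the archive decision depend only on the clock, so a quiet exporter cannot delay it.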
Summary
This change implements slot-based aggregation and semantic gap filling for OpenTelemetry (OTLP) metrics in the otel-plugin, addressing the mismatch between OTLP’s event-based metric model and Netdata’s fixed-interval, slot-based storage model.
Netdata requires exactly one value per dimension per update interval, while OTLP metrics may arrive multiple times within an interval or not at all. This implementation introduces a deterministic aggregation layer that groups incoming OTLP datapoints into fixed time slots, aggregates them according to metric semantics (temporality and monotonicity), and fills gaps when data is missing.
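As a rough illustration of the aggregation layer described above, the sketch below groups timestamped datapoints into fixed slots and reduces each slot to one value. It is a simplified stand-in, not the PR's code. `Aggregation` and `aggregate_into_slots` are hypothetical names, and the mapping "sum for delta counters, last value for gauges and cumulative counters" is an assumption made for the example.

```rust
use std::collections::BTreeMap;

/// How multiple datapoints in the same slot are combined.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Aggregation {
    /// Add the values (assumed here for delta counters).
    Sum,
    /// Keep the value with the newest timestamp
    /// (assumed here for gauges and cumulative counters).
    Last,
}

/// Group `(timestamp_nano, value)` points into slots of `interval_nano`
/// and reduce each slot to a single value.
fn aggregate_into_slots(
    points: &[(u64, f64)],
    interval_nano: u64,
    agg: Aggregation,
) -> BTreeMap<u64, f64> {
    let mut slots = BTreeMap::new();
    if interval_nano == 0 {
        // A zero interval cannot define slots; return nothing.
        return slots;
    }
    // Sort by timestamp so `Last` does not depend on arrival order.
    let mut sorted = points.to_vec();
    sorted.sort_by_key(|&(ts, _)| ts);
    for (ts, value) in sorted {
        slots
            .entry(ts / interval_nano)
            .and_modify(|v| match agg {
                Aggregation::Sum => *v += value,
                Aggregation::Last => *v = value,
            })
            .or_insert(value);
    }
    slots
}
```

Each key of the returned map is a slot index and each slot holds exactly one value, which matches the "one value per dimension per update interval" requirement. Slots that received no data are simply absent and would be handled by gap filling.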
Key design decisions:
Fixes #21522
Test Plan
Automated tests:
cargo test -p otel-plugin.
Manual testing:
cargo build --release.
Additional Information
Previously, OTLP metrics without data in a given interval caused dimensions to be archived, resulting in gaps in charts and unreliable alerting. This change ensures stable visualization and alerting by explicitly filling gaps, while still allowing inactive dimensions to be archived after a configurable timeout.
Backfilling finalized slots is intentionally not supported to avoid reordering complexity and performance overhead, and this behavior is enforced consistently.
For users: How does this change affect me?
Summary by cubic
Adds slot-based aggregation and semantic gap filling to OTLP metrics in the otel-plugin so Netdata emits one value per fixed interval. This stabilizes charts and alerts, and handles late/missing data predictably.
New Features
Migration
Written for commit a3a4f2a. Summary will update on new commits.