Skip to content

Comments

Prevent race condition during pluginsd array operations#21628

Draft
stelfrag wants to merge 10 commits intonetdata:masterfrom
stelfrag:improve_parser_dimension_lifecycle
Draft

Prevent race condition during pluginsd array operations#21628
stelfrag wants to merge 10 commits intonetdata:masterfrom
stelfrag:improve_parser_dimension_lifecycle

Conversation

@stelfrag
Copy link
Collaborator

@stelfrag stelfrag commented Jan 24, 2026

Summary
  • Improve parser dimension caching

Summary by cubic

Refactors pluginsd dimension caching to a refcounted PRD_ARRAY with atomic replace, lock-free collector access, and strict lifecycle separation via collector_tid to prevent race conditions and use-after-free. Improves chart slot cleanup under spinlocks and adds a lifecycle stress test.

  • Bug Fixes

    • Introduced PRD_ARRAY (refcounted) with create/acquire_locked/acquire/get_unsafe/replace/release; size kept in the header.
    • Collector path: sets collector_tid before any access, clears it after, and avoids clearing when old_st == st; lock-free reads/replaces; pos uses atomics; slot updates release old RRDDIM refs.
    • Unslot path: if another collector is active, only clear slot mapping and skip touching the array; otherwise release RRDDIM refs safely (refcount for cleanup, lock-free for collector).
    • Cleanup path: only runs when the collector is fully stopped; checks collector_tid under spinlock, atomically replaces the array with NULL, resets pos, clears chart-slot mapping under spinlock, then releases RRDDIM entries and frees the old array outside the lock.
    • rrdhost pluginsd_chart_slots free: clears per-chart collector_tid and last_slot under the host spinlock before unslot-and-cleanup.
    • Added internal safety checks for refcount underflow and lifecycle violations; memory accounting includes the PRD_ARRAY header.
  • New Features

    • Added PRD_ARRAY lifecycle stress test; run with: netdata -W prd-array-stress. Validates non-overlapping writer/cleanup phases and refcount correctness.

Written for commit 8d19d84. Summary will update on new commits.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 3 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Reader as "Parser Thread"
    participant State as "RRDSET (Shared Memory)"
    participant Writer as "Cleanup Thread"

    Note over Reader,Writer: Critical Section: Handling Array Lifecycle

    par Concurrent Read Operation
        Reader->>State: CHANGED: Read array pointer & size to LOCAL variables
        Note right of Reader: "Snapshot" state immediately.<br/>Avoids double-dereference race.

        alt NEW: Local Pointer is NULL or Size is 0
            Reader->>Reader: Log error / Return NULL
        else Local Pointer is Valid
            Reader->>Reader: Bounds check using LOCAL size
            Reader->>State: Access memory via LOCAL pointer
            Note right of Reader: Iterates safely using snapshot
        end

    and Concurrent Cleanup Operation
        Writer->>State: Lock pluginsd.spinlock
        Writer->>State: NEW: Copy address to 'old_array' local var

        Note right of Writer: 1. Invalidate Global State
        Writer->>State: NEW: Set pluginsd.prd_array = NULL
        Writer->>State: NEW: Set pluginsd.size = 0

        Writer->>State: Unlock pluginsd.spinlock

        Note right of Writer: 2. Release Memory (Safe)
        Writer->>Writer: NEW: freez(old_array)
        Note right of Writer: Frees memory only AFTER<br/>global pointer is NULL
    end
Loading

@stelfrag stelfrag marked this pull request as ready for review January 28, 2026 17:05
thiagoftsm
thiagoftsm previously approved these changes Jan 29, 2026
Copy link
Contributor

@thiagoftsm thiagoftsm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plugins are running as expected, LGTM!

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request aims to fix race conditions during pluginsd array operations by implementing a lock-free reader pattern where cleanup nullifies pointers before freeing, and readers snapshot pointers/sizes before use.

Changes:

  • Modified cleanup to save old pointer/size, NULL them under spinlock, then free after unlocking
  • Updated readers to snapshot prd_array and size into local variables before accessing
  • Changed realloc logic to use local pointer variable before updating the shared pointer

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/database/rrdset-slots.c Implements safe cleanup by NULLing prd_array and zeroing size under spinlock before freeing memory
src/plugins.d/pluginsd_internals.h Updates dimension acquisition and slot management to snapshot prd_array/size locally and check for NULL
src/plugins.d/pluginsd_parser.c Modifies chart cleanup in pluginsd_end_v2 to snapshot prd_array/size before iteration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/plugins.d/pluginsd_internals.h:132

  • Data race on st->pluginsd.dims_with_slots: This field is written non-atomically here but is read without holding the spinlock in multiple reader functions (e.g., pluginsd_end_v2 line 1100, pluginsd_acquire_dimension line 191). If the realloc path is taken, this could create a race where readers see an inconsistent state. Either protect writes to dims_with_slots with the spinlock and require readers to acquire it, or use atomic operations for this field as well.
        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/database/rrdset-slots.c:67

  • The function rrdset_pluginsd_receive_unslot modifies array elements (lines 52-54) without holding the spinlock, yet it can be called from multiple contexts including collector threads. This creates a race condition where:
  1. A collector thread could be in pluginsd_clear_scope_chart() or pluginsd_rrdset_cache_put_to_slot() calling this function
  2. Meanwhile, the cleanup thread in rrdset_pluginsd_receive_unslot_and_cleanup() could have checked collector_tid, found it to be 0, and proceeded to free the array

Even though the cleanup code now uses atomic operations to NULL the array pointer, there's still a window where a collector thread could have loaded the array pointer before it was NULLed, then the cleanup thread frees it, and then the collector thread tries to modify the freed memory.

The atomics for loading prd_array and prd_size protect against use-after-free when reading, but not when modifying the array contents. This function should either:

  1. Be called only while holding the spinlock, OR
  2. The modifications to array elements should be made atomic/synchronized, OR
  3. The collector_tid should remain set during the entire time this function is executing
void rrdset_pluginsd_receive_unslot(RRDSET *st) {
    // Use atomic loads with ACQUIRE semantics to synchronize with cleanup code
    // that uses RELEASE semantics when freeing the array
    struct pluginsd_rrddim *prd_array = __atomic_load_n(&st->pluginsd.prd_array, __ATOMIC_ACQUIRE);
    size_t prd_size = __atomic_load_n(&st->pluginsd.size, __ATOMIC_ACQUIRE);

    for(size_t i = 0; i < prd_size && prd_array; i++) {
        rrddim_acquired_release(prd_array[i].rda); // can be NULL
        prd_array[i].rda = NULL;
        prd_array[i].rd = NULL;
        prd_array[i].id = NULL;
    }

    RRDHOST *host = st->rrdhost;

    if(st->pluginsd.last_slot >= 0 &&
        (uint32_t)st->pluginsd.last_slot < host->stream.rcv.pluginsd_chart_slots.size &&
        host->stream.rcv.pluginsd_chart_slots.array[st->pluginsd.last_slot] == st) {
        host->stream.rcv.pluginsd_chart_slots.array[st->pluginsd.last_slot] = NULL;
    }

    st->pluginsd.last_slot = -1;
    st->pluginsd.dims_with_slots = false;
}

src/plugins.d/pluginsd_internals.h:140

  • The non-atomic writes to st->pluginsd.dims_with_slots (lines 136, 140) create a data race with the cleanup code that reads this field (line 1100 in pluginsd_parser.c and elsewhere). Since dims_with_slots is accessed without synchronization by both the collector and cleanup threads, this is a data race under the C memory model.

While the field itself may not cause crashes since it's a boolean, it should be accessed atomically to avoid undefined behavior. The cleanup code sets it to false (line 103 in rrdset-slots.c) without atomics, while readers check it without atomics (line 1100 in pluginsd_parser.c, line 161 in this file, line 196 in this file).

Consider using atomic operations for this field or ensuring it's only modified/read while holding the spinlock.

        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;
    }
    else {
        st->pluginsd.dims_with_slots = false;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

src/plugins.d/pluginsd_internals.h:140

  • After atomically loading prd_array and prd_size, the code still directly modifies st->pluginsd.dims_with_slots without atomic operations or lock protection. This creates inconsistency - a reader might load the old array but see the new dims_with_slots value (or vice versa), leading to undefined behavior.

The dims_with_slots flag should either be set atomically with appropriate memory ordering, or the entire pluginsd_rrddim_put_to_slot operation should be protected by st->pluginsd.spinlock.

    if(slot >= 1) {
        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;
    }
    else {
        st->pluginsd.dims_with_slots = false;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@thiagoftsm
Copy link
Contributor

@stelfrag , we have collectors running as expected, but we still have some additional suggestions for your PR. Please, take a look on them.

@stelfrag stelfrag marked this pull request as draft January 30, 2026 15:05
@stelfrag stelfrag force-pushed the improve_parser_dimension_lifecycle branch from 2688f77 to 804575c Compare January 30, 2026 15:05
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/database/rrdset-slots.c">

<violation number="1" location="src/database/rrdset-slots.c:107">
P2: The new `netdata_log_error` call removes the prior rate limiting. If cleanup is retried while the collector is still active, this will log an error every time and can flood logs. Prefer keeping the previous `nd_log_limit` throttling here.</violation>
</file>

<file name="src/database/rrdset-pluginsd-array.h">

<violation number="1" location="src/database/rrdset-pluginsd-array.h:50">
P1: Potential use-after-free race condition in `prd_array_acquire`. Between loading the array pointer and accessing `arr->refcount`, another thread could replace the array pointer and release the old array (freeing it if refcount was 1). The CAS loop cannot protect against this because the pointer itself may become stale before we read the refcount.

Consider using hazard pointers, RCU-style deferred reclamation, or ensuring callers always hold an external reference during replacement operations.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/database/rrdset-slots.c">

<violation number="1" location="src/database/rrdset-slots.c:238">
P2: prd_array_replace/prd_array_release are called without holding the required spinlock. This violates the API contract and can race with prd_array_acquire, causing use-after-free during the stress test.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@stelfrag stelfrag force-pushed the improve_parser_dimension_lifecycle branch from 34232c2 to cdbb124 Compare February 11, 2026 12:06
@stelfrag stelfrag force-pushed the improve_parser_dimension_lifecycle branch from c24db6d to 95ff15c Compare February 20, 2026 21:01
…tructure, improve memory safety, and address potential race conditions.
…collector_tid` to skip cleanup during active collection, enforce spinlock usage in `prd_array_replace`, and update stress test with skipped reader tracking.
- Add checks to detect refcount underflow and lifecycle violations.
- Improve handling of `collector_tid` for detecting active collectors.
- Ensure proper usage of spinlocks when modifying chart slots and PRD_ARRAY.
- Refine comments and internal safety checks to clarify valid usage scenarios.
@stelfrag stelfrag force-pushed the improve_parser_dimension_lifecycle branch from 1d4c3d7 to 8d19d84 Compare February 21, 2026 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants