Prevent race condition during pluginsd array operations #21628
stelfrag wants to merge 10 commits into netdata:master from
Conversation
No issues found across 3 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
```mermaid
sequenceDiagram
participant Reader as "Parser Thread"
participant State as "RRDSET (Shared Memory)"
participant Writer as "Cleanup Thread"
Note over Reader,Writer: Critical Section: Handling Array Lifecycle
par Concurrent Read Operation
Reader->>State: CHANGED: Read array pointer & size to LOCAL variables
Note right of Reader: "Snapshot" state immediately.<br/>Avoids double-dereference race.
alt NEW: Local Pointer is NULL or Size is 0
Reader->>Reader: Log error / Return NULL
else Local Pointer is Valid
Reader->>Reader: Bounds check using LOCAL size
Reader->>State: Access memory via LOCAL pointer
Note right of Reader: Iterates safely using snapshot
end
and Concurrent Cleanup Operation
Writer->>State: Lock pluginsd.spinlock
Writer->>State: NEW: Copy address to 'old_array' local var
Note right of Writer: 1. Invalidate Global State
Writer->>State: NEW: Set pluginsd.prd_array = NULL
Writer->>State: NEW: Set pluginsd.size = 0
Writer->>State: Unlock pluginsd.spinlock
Note right of Writer: 2. Release Memory (Safe)
Writer->>Writer: NEW: freez(old_array)
Note right of Writer: Frees memory only AFTER<br/>global pointer is NULL
end
```
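For reference, the same sequence written as a minimal self-contained C sketch. The names (`prd_entry`, `prd_slot_state`, `prd_slots_cleanup`) and the plain pthread spinlock are illustrative stand-ins for the netdata structures, not the code in this PR:

```c
#include <pthread.h>
#include <stdlib.h>

struct prd_entry { void *rda, *rd, *id; };

struct prd_slot_state {                  // stand-in for st->pluginsd
    pthread_spinlock_t spinlock;
    struct prd_entry *prd_array;         // shared array pointer
    size_t size;                         // shared array size
};

// Cleanup thread: invalidate the shared state first, free only after unlock.
void prd_slots_cleanup(struct prd_slot_state *s) {
    pthread_spin_lock(&s->spinlock);

    struct prd_entry *old_array = s->prd_array;   // keep a local copy
    s->prd_array = NULL;                          // 1. invalidate global state
    s->size = 0;

    pthread_spin_unlock(&s->spinlock);

    free(old_array);                              // 2. release memory only after the
                                                  //    pointer is no longer reachable
}
```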
thiagoftsm left a comment:
Plugins are running as expected, LGTM!
Pull request overview
This pull request aims to fix race conditions during pluginsd array operations by implementing a lock-free reader pattern where cleanup nullifies pointers before freeing, and readers snapshot pointers/sizes before use.
Changes:
- Modified cleanup to save old pointer/size, NULL them under spinlock, then free after unlocking
- Updated readers to snapshot prd_array and size into local variables before accessing
- Changed realloc logic to use local pointer variable before updating the shared pointer
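Putting the reader side of these changes together, a minimal self-contained sketch (the `prd_slot_state`/`prd_lookup` names are illustrative, not the PR's actual identifiers):

```c
#include <stddef.h>
#include <stdio.h>

struct prd_entry { const char *id; };

struct prd_slot_state {                  // stand-in for st->pluginsd
    struct prd_entry *prd_array;         // may be set to NULL by cleanup
    size_t size;                         // zeroed together with the pointer
};

// Reader: snapshot pointer and size once, then work only with the locals,
// so a concurrent cleanup cannot change them mid-operation.
struct prd_entry *prd_lookup(struct prd_slot_state *s, size_t slot) {
    struct prd_entry *array = s->prd_array;   // single read of the shared pointer
    size_t size = s->size;                    // single read of the shared size

    if(!array || !size || slot >= size) {
        fprintf(stderr, "pluginsd: slot %zu not available\n", slot);
        return NULL;
    }

    return &array[slot];                      // bounds-checked against the snapshot
}
```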
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/database/rrdset-slots.c | Implements safe cleanup by NULLing prd_array and zeroing size under spinlock before freeing memory |
| src/plugins.d/pluginsd_internals.h | Updates dimension acquisition and slot management to snapshot prd_array/size locally and check for NULL |
| src/plugins.d/pluginsd_parser.c | Modifies chart cleanup in pluginsd_end_v2 to snapshot prd_array/size before iteration |
Force-pushed from 7bffd32 to 541d5da (Compare)
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/plugins.d/pluginsd_internals.h:132
- Data race on st->pluginsd.dims_with_slots: This field is written non-atomically here but is read without holding the spinlock in multiple reader functions (e.g., pluginsd_end_v2 line 1100, pluginsd_acquire_dimension line 191). If the realloc path is taken, this could create a race where readers see an inconsistent state. Either protect writes to dims_with_slots with the spinlock and require readers to acquire it, or use atomic operations for this field as well.
```c
st->pluginsd.dims_with_slots = true;
wanted_size = slot;
```
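A sketch of the atomic option suggested above, with hypothetical helper names (the real field lives in `st->pluginsd`):

```c
#include <stdbool.h>

struct prd_slot_state {                  // stand-in for st->pluginsd
    bool dims_with_slots;
};

// Writer: publish the flag with RELEASE so a reader that observes it
// also observes the array updates that preceded it.
static inline void prd_set_dims_with_slots(struct prd_slot_state *s, bool v) {
    __atomic_store_n(&s->dims_with_slots, v, __ATOMIC_RELEASE);
}

// Reader: pair with ACQUIRE, matching the loads used for prd_array/size.
static inline bool prd_get_dims_with_slots(struct prd_slot_state *s) {
    return __atomic_load_n(&s->dims_with_slots, __ATOMIC_ACQUIRE);
}
```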
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/database/rrdset-slots.c:67
- The function `rrdset_pluginsd_receive_unslot` modifies array elements (lines 52-54) without holding the spinlock, yet it can be called from multiple contexts including collector threads. This creates a race condition where:
  - A collector thread could be in `pluginsd_clear_scope_chart()` or `pluginsd_rrdset_cache_put_to_slot()` calling this function
  - Meanwhile, the cleanup thread in `rrdset_pluginsd_receive_unslot_and_cleanup()` could have checked `collector_tid`, found it to be 0, and proceeded to free the array
Even though the cleanup code now uses atomic operations to NULL the array pointer, there's still a window where a collector thread could have loaded the array pointer before it was NULLed, then the cleanup thread frees it, and then the collector thread tries to modify the freed memory.
The atomics for loading prd_array and prd_size protect against use-after-free when reading, but not when modifying the array contents. This function should either:
- Be called only while holding the spinlock, OR
- The modifications to array elements should be made atomic/synchronized, OR
- The collector_tid should remain set during the entire time this function is executing (see the sketch after the snippet below)
```c
void rrdset_pluginsd_receive_unslot(RRDSET *st) {
    // Use atomic loads with ACQUIRE semantics to synchronize with cleanup code
    // that uses RELEASE semantics when freeing the array
    struct pluginsd_rrddim *prd_array = __atomic_load_n(&st->pluginsd.prd_array, __ATOMIC_ACQUIRE);
    size_t prd_size = __atomic_load_n(&st->pluginsd.size, __ATOMIC_ACQUIRE);

    for(size_t i = 0; i < prd_size && prd_array; i++) {
        rrddim_acquired_release(prd_array[i].rda); // can be NULL
        prd_array[i].rda = NULL;
        prd_array[i].rd = NULL;
        prd_array[i].id = NULL;
    }

    RRDHOST *host = st->rrdhost;

    if(st->pluginsd.last_slot >= 0 &&
       (uint32_t)st->pluginsd.last_slot < host->stream.rcv.pluginsd_chart_slots.size &&
       host->stream.rcv.pluginsd_chart_slots.array[st->pluginsd.last_slot] == st) {
        host->stream.rcv.pluginsd_chart_slots.array[st->pluginsd.last_slot] = NULL;
    }

    st->pluginsd.last_slot = -1;
    st->pluginsd.dims_with_slots = false;
}
```
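To illustrate the third option (keeping `collector_tid` set for the whole operation), a simplified sketch of the idea; `collector_unslot`, `cleanup_try` and the plain `gettid()` call are assumptions of this sketch, and the real code would also have to order this check against new collectors attaching:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>                      // gettid() on glibc >= 2.30 (assumed)

struct prd_slot_state {                  // stand-in for st->pluginsd
    pthread_spinlock_t spinlock;
    int32_t collector_tid;               // 0 when no collector is active
};

// Collector: advertise itself for the entire unslot operation, so cleanup
// never sees collector_tid == 0 while the array is still being modified.
void collector_unslot(struct prd_slot_state *s) {
    __atomic_store_n(&s->collector_tid, (int32_t)gettid(), __ATOMIC_RELEASE);
    // ... release/clear the array entries here ...
    __atomic_store_n(&s->collector_tid, 0, __ATOMIC_RELEASE);
}

// Cleanup: bail out (and retry later) while a collector owns the slots.
bool cleanup_try(struct prd_slot_state *s) {
    if(__atomic_load_n(&s->collector_tid, __ATOMIC_ACQUIRE) != 0)
        return false;                    // collector still active: do not free
    pthread_spin_lock(&s->spinlock);
    // ... NULL the shared pointer and zero the size, then unlock and free ...
    pthread_spin_unlock(&s->spinlock);
    return true;
}
```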
src/plugins.d/pluginsd_internals.h:140
- The non-atomic writes to `st->pluginsd.dims_with_slots` (lines 136, 140) create a data race with the cleanup code that reads this field (line 1100 in pluginsd_parser.c and elsewhere). Since `dims_with_slots` is accessed without synchronization by both the collector and cleanup threads, this is a data race under the C memory model.
While the field itself may not cause crashes since it's a boolean, it should be accessed atomically to avoid undefined behavior. The cleanup code sets it to false (line 103 in rrdset-slots.c) without atomics, while readers check it without atomics (line 1100 in pluginsd_parser.c, line 161 in this file, line 196 in this file).
Consider using atomic operations for this field or ensuring it's only modified/read while holding the spinlock.
```c
        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;
    }
    else {
        st->pluginsd.dims_with_slots = false;
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
src/plugins.d/pluginsd_internals.h:140
- After atomically loading prd_array and prd_size, the code still directly modifies st->pluginsd.dims_with_slots without atomic operations or lock protection. This creates inconsistency - a reader might load the old array but see the new dims_with_slots value (or vice versa), leading to undefined behavior.
The dims_with_slots flag should either be set atomically with appropriate memory ordering, or the entire pluginsd_rrddim_put_to_slot operation should be protected by st->pluginsd.spinlock.
```c
    if(slot >= 1) {
        st->pluginsd.dims_with_slots = true;
        wanted_size = slot;
    }
    else {
        st->pluginsd.dims_with_slots = false;
```
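A sketch of the second alternative suggested above, protecting the whole update with the same spinlock the cleanup path takes (simplified logic and hypothetical names):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct prd_slot_state {                  // stand-in for st->pluginsd
    pthread_spinlock_t spinlock;
    bool dims_with_slots;
    size_t wanted_size;
};

// The flag and the sizing decision change as one unit, under the same lock
// that the cleanup path holds while tearing the array down.
void prd_put_to_slot_locked(struct prd_slot_state *s, size_t slot) {
    pthread_spin_lock(&s->spinlock);

    if(slot >= 1) {
        s->dims_with_slots = true;
        s->wanted_size = slot;
    }
    else
        s->dims_with_slots = false;

    pthread_spin_unlock(&s->spinlock);
}
```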
@stelfrag, we have collectors running as expected, but we still have some additional suggestions for your PR. Please take a look at them.
Force-pushed from 2688f77 to 804575c (Compare)
2 issues found across 2 files (changes from recent commits).
<file name="src/database/rrdset-slots.c">
<violation number="1" location="src/database/rrdset-slots.c:107">
P2: The new `netdata_log_error` call removes the prior rate limiting. If cleanup is retried while the collector is still active, this will log an error every time and can flood logs. Prefer keeping the previous `nd_log_limit` throttling here.</violation>
</file>
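The first finding is about log flooding. As a generic illustration only (this is not netdata's `nd_log_limit` API, whose signature is not shown in this thread), a time-based throttle at the call site looks roughly like:

```c
#include <stdio.h>
#include <time.h>

// Log at most once every `period` seconds from this call site.
// Not thread-safe; shown only to illustrate the throttling idea.
#define LOG_EVERY_SECONDS(period, ...) do {                         \
        static time_t _last;                                        \
        time_t _now = time(NULL);                                   \
        if(_now - _last >= (period)) {                              \
            _last = _now;                                           \
            fprintf(stderr, __VA_ARGS__);                           \
        }                                                           \
    } while(0)
```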
<file name="src/database/rrdset-pluginsd-array.h">
<violation number="1" location="src/database/rrdset-pluginsd-array.h:50">
P1: Potential use-after-free race condition in `prd_array_acquire`. Between loading the array pointer and accessing `arr->refcount`, another thread could replace the array pointer and release the old array (freeing it if refcount was 1). The CAS loop cannot protect against this because the pointer itself may become stale before we read the refcount.
Consider using hazard pointers, RCU-style deferred reclamation, or ensuring callers always hold an external reference during replacement operations.</violation>
</file>
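One way to close the window described in the second finding is to make `prd_array_acquire` take the owner's spinlock, so the pointer cannot be replaced and released between the load and the refcount increment. A self-contained sketch with illustrative types (not the PR's implementation, which may differ):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct prd_array {
    int32_t refcount;                    // protected by the owner's spinlock
    size_t size;
    // ... dimension entries follow in the real structure ...
} PRD_ARRAY;

struct prd_owner {
    pthread_spinlock_t spinlock;
    PRD_ARRAY *current;                  // the owner holds one reference
};

// Acquire: pointer load and refcount increment happen under the same lock,
// so a concurrent replace/release cannot free the array in between.
PRD_ARRAY *prd_array_acquire(struct prd_owner *o) {
    pthread_spin_lock(&o->spinlock);
    PRD_ARRAY *arr = o->current;
    if(arr)
        arr->refcount++;
    pthread_spin_unlock(&o->spinlock);
    return arr;
}

// Release: the last reference frees the block.
void prd_array_release(struct prd_owner *o, PRD_ARRAY *arr) {
    if(!arr) return;
    pthread_spin_lock(&o->spinlock);
    int32_t left = --arr->refcount;
    pthread_spin_unlock(&o->spinlock);
    if(left == 0)
        free(arr);
}
```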
1 issue found across 4 files (changes from recent commits).
<file name="src/database/rrdset-slots.c">
<violation number="1" location="src/database/rrdset-slots.c:238">
P2: prd_array_replace/prd_array_release are called without holding the required spinlock. This violates the API contract and can race with prd_array_acquire, causing use-after-free during the stress test.</violation>
</file>
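Continuing the stand-in types from the sketch above, a hypothetical call pattern that honours the contract this finding refers to, with the spinlock held across the swap (the signatures here are assumptions, not the PR's actual API):

```c
// Contract (per the finding): caller must hold o->spinlock.
// Swaps the current array and returns the old one for the caller to release.
PRD_ARRAY *prd_array_replace(struct prd_owner *o, PRD_ARRAY *fresh) {
    PRD_ARRAY *old = o->current;
    o->current = fresh;
    return old;
}

// Correct call pattern from the stress-test / cleanup side.
void stress_test_swap(struct prd_owner *o, PRD_ARRAY *fresh) {
    pthread_spin_lock(&o->spinlock);
    PRD_ARRAY *old = prd_array_replace(o, fresh);   // lock held across the swap
    pthread_spin_unlock(&o->spinlock);

    prd_array_release(o, old);                      // safe once unlocked
}
```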
Force-pushed from 34232c2 to cdbb124 (Compare)
Force-pushed from c24db6d to 95ff15c (Compare)
…tructure, improve memory safety, and address potential race conditions.
More fixes
…collector_tid` to skip cleanup during active collection, enforce spinlock usage in `prd_array_replace`, and update stress test with skipped reader tracking.
Improve stress test
- Add checks to detect refcount underflow and lifecycle violations.
- Improve handling of `collector_tid` for detecting active collectors.
- Ensure proper usage of spinlocks when modifying chart slots and PRD_ARRAY.
- Refine comments and internal safety checks to clarify valid usage scenarios.
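A minimal sketch of what such a refcount-underflow check can look like (hypothetical helper, not the PR's exact code):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

// Trap a release() that was never matched by an acquire() instead of letting
// the counter silently wrap and hide a lifecycle violation.
static inline int32_t prd_refcount_dec(int32_t *refcount) {
    int32_t previous = __atomic_fetch_sub(refcount, 1, __ATOMIC_ACQ_REL);
    if(previous <= 0) {
        fprintf(stderr, "PRD_ARRAY refcount underflow (was %d)\n", previous);
        abort();                         // fail loudly in the stress test
    }
    return previous - 1;                 // references still outstanding
}
```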
Force-pushed from 1d4c3d7 to 8d19d84 (Compare)
Summary by cubic
Refactors pluginsd dimension caching to a refcounted PRD_ARRAY with atomic replace, lock-free collector access, and strict lifecycle separation via collector_tid to prevent race conditions and use-after-free. Improves chart slot cleanup under spinlocks and adds a lifecycle stress test.
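The rough shape of that design, as far as it can be inferred from this thread (field names beyond `prd_array`, `size` and `collector_tid` are guesses for illustration):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct prd_array {
    int32_t refcount;                    // freed when the last reference drops
    size_t size;                         // number of dimension slots
    // struct pluginsd_rrddim entries[]; // flexible array member in the real code
} PRD_ARRAY;

struct pluginsd_chart_state {            // stand-in for st->pluginsd
    pthread_spinlock_t spinlock;         // serializes replace against acquire
    PRD_ARRAY *prd_array;                // current generation, swapped under the lock
    int32_t collector_tid;               // nonzero while a collector owns the slots;
                                         // cleanup defers until it returns to 0
};
```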
Written for commit 8d19d84. Summary will update on new commits.