ROX-34686: Metrics for failed networkflow send#20555
Conversation
|
Skipping CI for Draft Pull Request. |
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- Consider defining constants for the
objectlabel values ("connections", "endpoints", "endpoints_with_processes") so they are reused consistently and typos are avoided if referenced elsewhere. - It might be helpful to clarify in the metric help text that
NumMessagesDroppedOnSendCounterandNumUpdatesDroppedOnSendCounterare tightly related (one per gRPC message vs. per contained objects) to make dashboards and alerting easier to interpret.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider defining constants for the `object` label values ("connections", "endpoints", "endpoints_with_processes") so they are reused consistently and typos are avoided if referenced elsewhere.
- It might be helpful to clarify in the metric help text that `NumMessagesDroppedOnSendCounter` and `NumUpdatesDroppedOnSendCounter` are tightly related (one per gRPC message vs. per contained objects) to make dashboards and alerting easier to interpret.
## Individual Comments
### Comment 1
<location path="sensor/common/networkflow/metrics/metrics.go" line_range="135-140" />
<code_context>
Name: netFlowManagerPrefix + "num_sent_to_central_current",
Help: "A gauge that tracks the current number of connections and endpoints being updated (i.e., sent to Central)",
}, []string{"object"})
+ NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
+ Namespace: metrics.PrometheusNamespace,
+ Subsystem: metrics.SensorSubsystem.String(),
+ Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total",
+ Help: "Total number of dropped network flow messages",
+ })
+ NumUpdatesDroppedOnSendCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
</code_context>
<issue_to_address>
**suggestion:** Clarify metric help to specify the drop condition (e.g., buffer-full) for better observability
The current help text is generic, but `sendToCentral` indicates this counter tracks messages dropped when the `sensorUpdates` queue is full. Please update the help to explicitly mention the buffer-full condition so dashboards and alerts remain clear, especially if other drop causes are added later.
```suggestion
NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.SensorSubsystem.String(),
Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total",
Help: "Total number of network flow messages dropped because the send buffer (sensorUpdates queue) was full",
})
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{ | ||
| Namespace: metrics.PrometheusNamespace, | ||
| Subsystem: metrics.SensorSubsystem.String(), | ||
| Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total", | ||
| Help: "Total number of dropped network flow messages", | ||
| }) |
There was a problem hiding this comment.
suggestion: Clarify metric help to specify the drop condition (e.g., buffer-full) for better observability
The current help text is generic, but sendToCentral indicates this counter tracks messages dropped when the sensorUpdates queue is full. Please update the help to explicitly mention the buffer-full condition so dashboards and alerts remain clear, especially if other drop causes are added later.
| NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{ | |
| Namespace: metrics.PrometheusNamespace, | |
| Subsystem: metrics.SensorSubsystem.String(), | |
| Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total", | |
| Help: "Total number of dropped network flow messages", | |
| }) | |
| NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{ | |
| Namespace: metrics.PrometheusNamespace, | |
| Subsystem: metrics.SensorSubsystem.String(), | |
| Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total", | |
| Help: "Total number of network flow messages dropped because the send buffer (sensorUpdates queue) was full", | |
| }) |
📝 WalkthroughWalkthroughThis PR adds observability to the network flow manager by introducing two Prometheus metrics that track dropped messages and updates when the internal send buffer is full or transmission fails. The metrics are declared in the metrics module and then incremented at three failure points in the manager's send paths. ChangesNetwork Flow Drop Metrics
🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
sensor/common/networkflow/manager/manager_impl.go (1)
406-417: ⚡ Quick winDisambiguate drop reason before incrementing dropped-update counters
Line 406 through Line 417 increments
NumUpdatesDroppedOnSendCounterfor everysend... == false, including stop-requested shutdown. If this metric is meant for send-buffer pressure, that will overcount and skew alerts/dashboards. Consider returning a failure reason fromsendToCentraland only incrementing this counter for buffer-full drops (or broaden the metric/help text to “failed to send” semantics).As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@sensor/common/networkflow/manager/manager_impl.go` around lines 406 - 417, The dropped-update counter NumUpdatesDroppedOnSendCounter is being incremented for any send failure (e.g., stop-requested shutdown) in the send* paths (sendProcesses, sendEndpoints, sendConns), which skews buffer-pressure metrics; modify the sendToCentral/send* functions to return a typed failure reason (e.g., enum/constant like SendResult{OK, BufferFull, Shutdown, Other} or an error with sentinel values), update callers (e.g., m.sendProcesses and the block that calls it, and updateComputer.OnSend* calls) to only increment NumUpdatesDroppedOnSendCounter when the returned reason equals the buffer-full case (or alternatively rename/repurpose the metric to “failed_to_send” if you prefer broader semantics), and ensure updateComputer is still notified appropriately for non-buffer-drop failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@sensor/common/networkflow/manager/manager_impl.go`:
- Around line 406-417: The dropped-update counter NumUpdatesDroppedOnSendCounter
is being incremented for any send failure (e.g., stop-requested shutdown) in the
send* paths (sendProcesses, sendEndpoints, sendConns), which skews
buffer-pressure metrics; modify the sendToCentral/send* functions to return a
typed failure reason (e.g., enum/constant like SendResult{OK, BufferFull,
Shutdown, Other} or an error with sentinel values), update callers (e.g.,
m.sendProcesses and the block that calls it, and updateComputer.OnSend* calls)
to only increment NumUpdatesDroppedOnSendCounter when the returned reason equals
the buffer-full case (or alternatively rename/repurpose the metric to
“failed_to_send” if you prefer broader semantics), and ensure updateComputer is
still notified appropriately for non-buffer-drop failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 7d8bc3ae-0cbb-4c47-a78b-001b5f0997b7
📒 Files selected for processing (2)
sensor/common/networkflow/manager/manager_impl.gosensor/common/networkflow/metrics/metrics.go
🚀 Build Images ReadyImages are ready for commit ecca444. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-942-gecca4445ee |
Description
We want to better understand how frequent it is for network flow update messages to be sent from sensor to central. This will help us to determine if batching of network flow updates should be enabled by default.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
change me!