Skip to content

ROX-34686: Metrics for failed networkflow send#20555

Open
JoukoVirtanen wants to merge 2 commits into
masterfrom
jv-metric-for-failed-networkflow-send
Open

ROX-34686: Metrics for failed networkflow send#20555
JoukoVirtanen wants to merge 2 commits into
masterfrom
jv-metric-for-failed-networkflow-send

Conversation

@JoukoVirtanen
Copy link
Copy Markdown
Contributor

@JoukoVirtanen JoukoVirtanen commented May 13, 2026

Description

We want to better understand how frequent it is for network flow update messages to be sent from sensor to central. This will help us to determine if batching of network flow updates should be enabled by default.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

change me!

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • Consider defining constants for the object label values ("connections", "endpoints", "endpoints_with_processes") so they are reused consistently and typos are avoided if referenced elsewhere.
  • It might be helpful to clarify in the metric help text that NumMessagesDroppedOnSendCounter and NumUpdatesDroppedOnSendCounter are tightly related (one per gRPC message vs. per contained objects) to make dashboards and alerting easier to interpret.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider defining constants for the `object` label values ("connections", "endpoints", "endpoints_with_processes") so they are reused consistently and typos are avoided if referenced elsewhere.
- It might be helpful to clarify in the metric help text that `NumMessagesDroppedOnSendCounter` and `NumUpdatesDroppedOnSendCounter` are tightly related (one per gRPC message vs. per contained objects) to make dashboards and alerting easier to interpret.

## Individual Comments

### Comment 1
<location path="sensor/common/networkflow/metrics/metrics.go" line_range="135-140" />
<code_context>
 		Name:      netFlowManagerPrefix + "num_sent_to_central_current",
 		Help:      "A gauge that tracks the current number of connections and endpoints being updated (i.e., sent to Central)",
 	}, []string{"object"})
+	NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
+		Namespace: metrics.PrometheusNamespace,
+		Subsystem: metrics.SensorSubsystem.String(),
+		Name:      netFlowManagerPrefix + "num_messages_dropped_on_send_total",
+		Help:      "Total number of dropped network flow messages",
+	})
+	NumUpdatesDroppedOnSendCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
</code_context>
<issue_to_address>
**suggestion:** Clarify metric help to specify the drop condition (e.g., buffer-full) for better observability

The current help text is generic, but `sendToCentral` indicates this counter tracks messages dropped when the `sensorUpdates` queue is full. Please update the help to explicitly mention the buffer-full condition so dashboards and alerts remain clear, especially if other drop causes are added later.

```suggestion
	NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: metrics.PrometheusNamespace,
		Subsystem: metrics.SensorSubsystem.String(),
		Name:      netFlowManagerPrefix + "num_messages_dropped_on_send_total",
		Help:      "Total number of network flow messages dropped because the send buffer (sensorUpdates queue) was full",
	})
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +135 to +140
NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.SensorSubsystem.String(),
Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total",
Help: "Total number of dropped network flow messages",
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Clarify metric help to specify the drop condition (e.g., buffer-full) for better observability

The current help text is generic, but sendToCentral indicates this counter tracks messages dropped when the sensorUpdates queue is full. Please update the help to explicitly mention the buffer-full condition so dashboards and alerts remain clear, especially if other drop causes are added later.

Suggested change
NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.SensorSubsystem.String(),
Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total",
Help: "Total number of dropped network flow messages",
})
NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
Namespace: metrics.PrometheusNamespace,
Subsystem: metrics.SensorSubsystem.String(),
Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total",
Help: "Total number of network flow messages dropped because the send buffer (sensorUpdates queue) was full",
})

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds observability to the network flow manager by introducing two Prometheus metrics that track dropped messages and updates when the internal send buffer is full or transmission fails. The metrics are declared in the metrics module and then incremented at three failure points in the manager's send paths.

Changes

Network Flow Drop Metrics

Layer / File(s) Summary
Metrics Declaration and Registration
sensor/common/networkflow/metrics/metrics.go
New NumMessagesDroppedOnSendCounter (single counter) and NumUpdatesDroppedOnSendCounter (labeled counter vec) are declared with appropriate metric names and help text, then registered in the metrics init() function.
Manager Instrumentation for Drop Tracking
sensor/common/networkflow/manager/manager_impl.go
Manager increments NumMessagesDroppedOnSendCounter when a Central update is dropped due to full queue in sendToCentral, and increments NumUpdatesDroppedOnSendCounter with the appropriate object label when connection/endpoint updates or process-listening updates fail to send.

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The pull request description lacks critical details about the changes and does not adequately address the PR template requirements. Complete the 'How I validated my change' section with details on testing and validation. Clarify whether the PR is production-ready and check the 'CI results are inspected' checkbox. Explain why no automated tests were added, or add them.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly and specifically describes the main change: adding metrics to track failed network flow send operations, which aligns with the actual changes in the pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jv-metric-for-failed-networkflow-send

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
sensor/common/networkflow/manager/manager_impl.go (1)

406-417: ⚡ Quick win

Disambiguate drop reason before incrementing dropped-update counters

Line 406 through Line 417 increments NumUpdatesDroppedOnSendCounter for every send... == false, including stop-requested shutdown. If this metric is meant for send-buffer pressure, that will overcount and skew alerts/dashboards. Consider returning a failure reason from sendToCentral and only incrementing this counter for buffer-full drops (or broaden the metric/help text to “failed to send” semantics).

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@sensor/common/networkflow/manager/manager_impl.go` around lines 406 - 417,
The dropped-update counter NumUpdatesDroppedOnSendCounter is being incremented
for any send failure (e.g., stop-requested shutdown) in the send* paths
(sendProcesses, sendEndpoints, sendConns), which skews buffer-pressure metrics;
modify the sendToCentral/send* functions to return a typed failure reason (e.g.,
enum/constant like SendResult{OK, BufferFull, Shutdown, Other} or an error with
sentinel values), update callers (e.g., m.sendProcesses and the block that calls
it, and updateComputer.OnSend* calls) to only increment
NumUpdatesDroppedOnSendCounter when the returned reason equals the buffer-full
case (or alternatively rename/repurpose the metric to “failed_to_send” if you
prefer broader semantics), and ensure updateComputer is still notified
appropriately for non-buffer-drop failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@sensor/common/networkflow/manager/manager_impl.go`:
- Around line 406-417: The dropped-update counter NumUpdatesDroppedOnSendCounter
is being incremented for any send failure (e.g., stop-requested shutdown) in the
send* paths (sendProcesses, sendEndpoints, sendConns), which skews
buffer-pressure metrics; modify the sendToCentral/send* functions to return a
typed failure reason (e.g., enum/constant like SendResult{OK, BufferFull,
Shutdown, Other} or an error with sentinel values), update callers (e.g.,
m.sendProcesses and the block that calls it, and updateComputer.OnSend* calls)
to only increment NumUpdatesDroppedOnSendCounter when the returned reason equals
the buffer-full case (or alternatively rename/repurpose the metric to
“failed_to_send” if you prefer broader semantics), and ensure updateComputer is
still notified appropriately for non-buffer-drop failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7d8bc3ae-0cbb-4c47-a78b-001b5f0997b7

📥 Commits

Reviewing files that changed from the base of the PR and between 9762b9c and ecca444.

📒 Files selected for processing (2)
  • sensor/common/networkflow/manager/manager_impl.go
  • sensor/common/networkflow/metrics/metrics.go

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

🚀 Build Images Ready

Images are ready for commit ecca444. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-942-gecca4445ee

@JoukoVirtanen JoukoVirtanen changed the title Jv metric for failed networkflow send ROX-34686: Metric for failed networkflow send May 13, 2026
@JoukoVirtanen JoukoVirtanen marked this pull request as ready for review May 13, 2026 17:00
@JoukoVirtanen JoukoVirtanen requested a review from a team as a code owner May 13, 2026 17:00
@JoukoVirtanen JoukoVirtanen requested a review from vikin91 May 13, 2026 17:00
@JoukoVirtanen JoukoVirtanen changed the title ROX-34686: Metric for failed networkflow send ROX-34686: Metrics for failed networkflow send May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant