ROX-34686: Metrics for failed networkflow send by JoukoVirtanen · Pull Request #20555 · stackrox/stackrox

JoukoVirtanen · 2026-05-13T15:59:56Z

Description

We want to better understand how frequent it is for network flow update messages to be sent from sensor to central. This will help us to determine if batching of network flow updates should be enabled by default.

User-facing documentation

CHANGELOG.md is updated OR update is not needed
documentation PR is created and is linked above OR is not needed

Testing and quality

the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
CI results are inspected

Automated testing

How I validated my change

change me!

openshift-ci · 2026-05-13T16:00:02Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

Consider defining constants for the object label values ("connections", "endpoints", "endpoints_with_processes") so they are reused consistently and typos are avoided if referenced elsewhere.
It might be helpful to clarify in the metric help text that NumMessagesDroppedOnSendCounter and NumUpdatesDroppedOnSendCounter are tightly related (one per gRPC message vs. per contained objects) to make dashboards and alerting easier to interpret.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Consider defining constants for the `object` label values ("connections", "endpoints", "endpoints_with_processes") so they are reused consistently and typos are avoided if referenced elsewhere.
- It might be helpful to clarify in the metric help text that `NumMessagesDroppedOnSendCounter` and `NumUpdatesDroppedOnSendCounter` are tightly related (one per gRPC message vs. per contained objects) to make dashboards and alerting easier to interpret.

## Individual Comments

### Comment 1
<location path="sensor/common/networkflow/metrics/metrics.go" line_range="135-140" />
<code_context>
 		Name:      netFlowManagerPrefix + "num_sent_to_central_current",
 		Help:      "A gauge that tracks the current number of connections and endpoints being updated (i.e., sent to Central)",
 	}, []string{"object"})
+	NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
+		Namespace: metrics.PrometheusNamespace,
+		Subsystem: metrics.SensorSubsystem.String(),
+		Name:      netFlowManagerPrefix + "num_messages_dropped_on_send_total",
+		Help:      "Total number of dropped network flow messages",
+	})
+	NumUpdatesDroppedOnSendCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
</code_context>
<issue_to_address>
**suggestion:** Clarify metric help to specify the drop condition (e.g., buffer-full) for better observability

The current help text is generic, but `sendToCentral` indicates this counter tracks messages dropped when the `sensorUpdates` queue is full. Please update the help to explicitly mention the buffer-full condition so dashboards and alerts remain clear, especially if other drop causes are added later.

```suggestion
	NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: metrics.PrometheusNamespace,
		Subsystem: metrics.SensorSubsystem.String(),
		Name:      netFlowManagerPrefix + "num_messages_dropped_on_send_total",
		Help:      "Total number of network flow messages dropped because the send buffer (sensorUpdates queue) was full",
	})
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-05-13T16:02:20Z

+	NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{
+		Namespace: metrics.PrometheusNamespace,
+		Subsystem: metrics.SensorSubsystem.String(),
+		Name:      netFlowManagerPrefix + "num_messages_dropped_on_send_total",
+		Help:      "Total number of dropped network flow messages",
+	})


suggestion: Clarify metric help to specify the drop condition (e.g., buffer-full) for better observability

The current help text is generic, but sendToCentral indicates this counter tracks messages dropped when the sensorUpdates queue is full. Please update the help to explicitly mention the buffer-full condition so dashboards and alerts remain clear, especially if other drop causes are added later.

Suggested change

NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{

Namespace: metrics.PrometheusNamespace,

Subsystem: metrics.SensorSubsystem.String(),

Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total",

Help: "Total number of dropped network flow messages",

})

NumMessagesDroppedOnSendCounter = prometheus.NewCounter(prometheus.CounterOpts{

Namespace: metrics.PrometheusNamespace,

Subsystem: metrics.SensorSubsystem.String(),

Name: netFlowManagerPrefix + "num_messages_dropped_on_send_total",

Help: "Total number of network flow messages dropped because the send buffer (sensorUpdates queue) was full",

})

coderabbitai · 2026-05-13T16:03:09Z

📝 Walkthrough

Walkthrough

This PR adds observability to the network flow manager by introducing two Prometheus metrics that track dropped messages and updates when the internal send buffer is full or transmission fails. The metrics are declared in the metrics module and then incremented at three failure points in the manager's send paths.

Changes

Network Flow Drop Metrics

Layer / File(s)	Summary
Metrics Declaration and Registration `sensor/common/networkflow/metrics/metrics.go`	New `NumMessagesDroppedOnSendCounter` (single counter) and `NumUpdatesDroppedOnSendCounter` (labeled counter vec) are declared with appropriate metric names and help text, then registered in the metrics `init()` function.
Manager Instrumentation for Drop Tracking `sensor/common/networkflow/manager/manager_impl.go`	Manager increments `NumMessagesDroppedOnSendCounter` when a Central update is dropped due to full queue in `sendToCentral`, and increments `NumUpdatesDroppedOnSendCounter` with the appropriate object label when connection/endpoint updates or process-listening updates fail to send.

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	The pull request description lacks critical details about the changes and does not adequately address the PR template requirements.	Complete the 'How I validated my change' section with details on testing and validation. Clarify whether the PR is production-ready and check the 'CI results are inspected' checkbox. Explain why no automated tests were added, or add them.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding metrics to track failed network flow send operations, which aligns with the actual changes in the pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch jv-metric-for-failed-networkflow-send

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

sensor/common/networkflow/manager/manager_impl.go (1)
406-417: ⚡ Quick win

Disambiguate drop reason before incrementing dropped-update counters

Line 406 through Line 417 increments NumUpdatesDroppedOnSendCounter for every send... == false, including stop-requested shutdown. If this metric is meant for send-buffer pressure, that will overcount and skew alerts/dashboards. Consider returning a failure reason from sendToCentral and only incrementing this counter for buffer-full drops (or broaden the metric/help text to “failed to send” semantics).

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@sensor/common/networkflow/manager/manager_impl.go` around lines 406 - 417,
The dropped-update counter NumUpdatesDroppedOnSendCounter is being incremented
for any send failure (e.g., stop-requested shutdown) in the send* paths
(sendProcesses, sendEndpoints, sendConns), which skews buffer-pressure metrics;
modify the sendToCentral/send* functions to return a typed failure reason (e.g.,
enum/constant like SendResult{OK, BufferFull, Shutdown, Other} or an error with
sentinel values), update callers (e.g., m.sendProcesses and the block that calls
it, and updateComputer.OnSend* calls) to only increment
NumUpdatesDroppedOnSendCounter when the returned reason equals the buffer-full
case (or alternatively rename/repurpose the metric to “failed_to_send” if you
prefer broader semantics), and ensure updateComputer is still notified
appropriately for non-buffer-drop failures.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@sensor/common/networkflow/manager/manager_impl.go`:
- Around line 406-417: The dropped-update counter NumUpdatesDroppedOnSendCounter
is being incremented for any send failure (e.g., stop-requested shutdown) in the
send* paths (sendProcesses, sendEndpoints, sendConns), which skews
buffer-pressure metrics; modify the sendToCentral/send* functions to return a
typed failure reason (e.g., enum/constant like SendResult{OK, BufferFull,
Shutdown, Other} or an error with sentinel values), update callers (e.g.,
m.sendProcesses and the block that calls it, and updateComputer.OnSend* calls)
to only increment NumUpdatesDroppedOnSendCounter when the returned reason equals
the buffer-full case (or alternatively rename/repurpose the metric to
“failed_to_send” if you prefer broader semantics), and ensure updateComputer is
still notified appropriately for non-buffer-drop failures.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7d8bc3ae-0cbb-4c47-a78b-001b5f0997b7

📥 Commits

Reviewing files that changed from the base of the PR and between 9762b9c and ecca444.

📒 Files selected for processing (2)

sensor/common/networkflow/manager/manager_impl.go
sensor/common/networkflow/metrics/metrics.go

github-actions · 2026-05-13T16:10:34Z

🚀 Build Images Ready

Images are ready for commit ecca444. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-942-gecca4445ee

JoukoVirtanen added 2 commits May 12, 2026 16:22

X-Smart-Branch-Parent: master

67add55

Metrics that track network flow update send failures

ecca444

openshift-ci Bot added the do-not-merge/work-in-progress label May 13, 2026

github-actions Bot added area/sensor ai-review coderabbit-review labels May 13, 2026

sourcery-ai Bot reviewed May 13, 2026

View reviewed changes

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

JoukoVirtanen changed the title ~~Jv metric for failed networkflow send~~ ROX-34686: Metric for failed networkflow send May 13, 2026

JoukoVirtanen marked this pull request as ready for review May 13, 2026 17:00

JoukoVirtanen requested a review from a team as a code owner May 13, 2026 17:00

openshift-ci Bot removed the do-not-merge/work-in-progress label May 13, 2026

JoukoVirtanen requested a review from vikin91 May 13, 2026 17:00

JoukoVirtanen changed the title ~~ROX-34686: Metric for failed networkflow send~~ ROX-34686: Metrics for failed networkflow send May 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ROX-34686: Metrics for failed networkflow send#20555

ROX-34686: Metrics for failed networkflow send#20555
JoukoVirtanen wants to merge 2 commits into
masterfrom
jv-metric-for-failed-networkflow-send

JoukoVirtanen commented May 13, 2026 •

edited

Loading

Uh oh!

openshift-ci Bot commented May 13, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot May 13, 2026

Uh oh!

coderabbitai Bot commented May 13, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented May 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JoukoVirtanen commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

User-facing documentation

Testing and quality

Automated testing

How I validated my change

Uh oh!

openshift-ci Bot commented May 13, 2026

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Build Images Ready

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JoukoVirtanen commented May 13, 2026 •

edited

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading

github-actions Bot commented May 13, 2026 •

edited

Loading