Indirect agent connection improvements #13028
Conversation
|
@blueorangutan package |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #13028 +/- ##
============================================
+ Coverage 18.09% 18.11% +0.02%
- Complexity 16717 16769 +52
============================================
Files 6037 6047 +10
Lines 542546 544074 +1528
Branches 66431 66569 +138
============================================
+ Hits 98148 98548 +400
- Misses 433378 434451 +1073
- Partials 11020 11075 +55
Flags with carried forward coverage won't be shown.
|
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17497 |
|
@blueorangutan test |
|
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian Build Failed (tid-15880) |
|
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch. |
d7a044d to
d736a4a
Compare
|
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch. |
|
@blueorangutan package |
|
@borisstoyanov a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 17752 |
…ements.
- Enhances the Host connecting logic to avoid a connection storm (where the Agent opens multiple sockets against the Management Server).
- Implements a HostConnectProcess task in which the Host, upon connection, checks whether the lock is available and traces the Host's connecting progress, status, and timeout.
- Introduces AgentConnectStatusCommand, with which the Host checks whether the lock for the Host is available (i.e. the "previous" connect process has finished).
- Implements logic to check whether the Management Server holds a lock against the Host (exposes MySQL DB lock presence via the API).
- Removes synchronization on the Host disconnect process and the double-disconnect logic in clustered Management Server environments; adds early removal from the ping map (when a ping timeout delay is combined with a synchronized disconnect process, the Agent Manager submits more disconnect requests).
- Introduces parameterized connection and status check timeouts.
- Implements a backoff algorithm abstraction: either a constant backoff timeout or an exponential backoff with jitter can be used to wait between the Host's connection attempts to the Management Server.
- Implements ServerAttache for use on the Agent side of the communication (similar to Attache on the Management Server side).
- Significantly enhances and adds logs in the Host Agent and Agent Manager logic to trace the Host connecting and disconnecting process, including ids, names, context UUIDs, and timings (how long overall initialization/deinitialization took).
- Adds logs to communication between Management Servers (PDU requests).
- Adds DB indexes to improve search performance; uses IDEMPOTENT_ADD_INDEX for safer DB schema updates.
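For readers unfamiliar with the backoff item above, here is a minimal sketch of what a "constant or exponential with jitter" abstraction could look like. It is not the PR's code: the class and method names (`BackoffSketch`, `Backoff`, `nextDelayMillis`) and the "full jitter" variant are assumptions chosen for illustration only.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical illustration only; these are not the classes added by the PR. */
public class BackoffSketch {

    /** Chooses how long to wait before the next connection attempt (attempt starts at 0). */
    public interface Backoff {
        long nextDelayMillis(int attempt);
    }

    /** Always waits the same amount of time, regardless of the attempt number. */
    public static final class ConstantBackoff implements Backoff {
        private final long delayMillis;

        public ConstantBackoff(long delayMillis) {
            this.delayMillis = delayMillis;
        }

        @Override
        public long nextDelayMillis(int attempt) {
            return delayMillis;
        }
    }

    /**
     * Doubles the delay ceiling on every attempt, caps it at maxMillis, and
     * picks a random delay in [1, ceiling] so that many agents reconnecting
     * at once do not retry in lockstep (assumed "full jitter" variant).
     */
    public static final class ExponentialJitterBackoff implements Backoff {
        private final long baseMillis;
        private final long maxMillis;

        public ExponentialJitterBackoff(long baseMillis, long maxMillis) {
            if (baseMillis <= 0 || maxMillis < baseMillis) {
                throw new IllegalArgumentException("require 0 < baseMillis <= maxMillis");
            }
            this.baseMillis = baseMillis;
            this.maxMillis = maxMillis;
        }

        @Override
        public long nextDelayMillis(int attempt) {
            int shift = Math.min(Math.max(attempt, 0), 30);
            // Compare before shifting so the left shift can never overflow.
            long ceiling = baseMillis > (maxMillis >> shift) ? maxMillis : baseMillis << shift;
            return ThreadLocalRandom.current().nextLong(ceiling) + 1;
        }
    }
}
```

A caller would sleep for `nextDelayMillis(attempt)` between failed connection attempts and reset the attempt counter after a successful connect.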
d736a4a to
e633b54
Compare
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17753 |
|
@blueorangutan test keepEnv |
|
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian Build Failed (tid-16061) |
|
@blueorangutan test |
|
@borisstoyanov a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian Build Failed (tid-16062) |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17810 |
|
@blueorangutan test |
|
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian Build Failed (tid-16065) |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17834 |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
Pull request overview
This PR overhauls the Indirect Agent connection lifecycle in CloudStack to prevent connection storms, improve observability, and harden the NIO layer. It introduces a HostConnectProcess task, an AgentConnectStatusCommand for lock-availability checks, a backoff abstraction (constant or exponential-with-jitter), a ServerAttache analog on the agent side, removes risky synchronization on disconnect, parameterizes connect/status timeouts, and adds DB indexes (via IDEMPOTENT_ADD_INDEX) plus extensive logging across both Management Server and Agent paths.
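The lock-availability check with a timeout can be pictured as a bounded polling loop. The sketch below is an assumption about the general shape, not the PR's HostConnectProcess: `ConnectWaitSketch`, `awaitLockFree`, and the `BooleanSupplier` standing in for the status query are all hypothetical names.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration only; not the PR's HostConnectProcess. */
public class ConnectWaitSketch {

    /**
     * Polls lockFree until it reports true or timeoutMillis elapses.
     * Returns true if the lock became free in time, false on timeout
     * or if the waiting thread was interrupted.
     */
    public static boolean awaitLockFree(BooleanSupplier lockFree, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (lockFree.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Sleep for the poll interval, but never past the deadline and never less than 1 ms.
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(remainingNanos) + 1;
            long sleepMillis = Math.max(1L, Math.min(pollMillis, remainingMillis));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

In a real agent the supplier would wrap a status request to the Management Server, and the timeout and poll interval would come from the parameterized settings the PR mentions.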
Changes:
- NIO layer hardening in NioConnection/NioServer (volatile flags, executor lifecycle in start()/stop(), broader exception handling in the selector loop, explicit socket shutdown when rejecting connections, richer trace logging).
- Test adjustments: ConstantTimeBackoffTest updated to the new backoff.seconds config key; NioTest minor logging/whitespace cleanup.
- (Per PR description, also includes connect-process orchestration, backoff abstraction, ServerAttache, DB indexes, and PDU-level logging; only a subset visible in the supplied diff.)
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| utils/src/main/java/com/cloud/utils/nio/NioConnection.java | Makes lifecycle flags volatile, re-initializes executors in start(), uses shutdownNow() in stop(), shuts down rejected sockets, broadens selector-loop exception handling, and improves logging/terminate signature. |
| utils/src/main/java/com/cloud/utils/nio/NioServer.java | Switches to Selector.open(), removes redundant init, drops unused null attachment on OP_ACCEPT, adds class Javadoc. |
| utils/src/test/java/com/cloud/utils/backoff/impl/ConstantTimeBackoffTest.java | Updates configure test to use the renamed backoff.seconds parameter key. |
| utils/src/test/java/com/cloud/utils/testcase/NioTest.java | Cosmetic whitespace fix and consolidated logger error call. |
Note: The supplied diff is a small slice of a much larger PR (HostConnectProcess, AgentConnectStatusCommand, backoff abstraction, ServerAttache, DB schema/index changes, AgentManagerImpl changes, etc.) that I cannot inspect from this hunk alone.
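The lifecycle pattern described for NioConnection (volatile flags, executors re-created in start(), shutdownNow() in stop()) can be illustrated with a small stand-alone class. This is a sketch under assumptions, not the actual NioConnection code: `RestartableWorker` and its methods are hypothetical, and a short sleep stands in for the selector loop.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of a restartable worker; not the PR's NioConnection. */
public class RestartableWorker {
    // volatile: written by start()/stop(), read by the worker thread's loop.
    private volatile boolean running;
    // Re-created on every start(), because a shut-down executor cannot be reused.
    private ExecutorService executor;

    public synchronized void start() {
        if (running) {
            return;
        }
        executor = Executors.newSingleThreadExecutor();
        running = true;
        executor.submit(this::loop);
    }

    public synchronized void stop() {
        if (!running) {
            return;
        }
        running = false;
        executor.shutdownNow(); // interrupts the worker instead of waiting for queued tasks
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public boolean isRunning() {
        return running;
    }

    private void loop() {
        while (running) {
            try {
                Thread.sleep(10); // stand-in for selector.select() and key handling
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // stop() interrupted us; leave the loop
            } catch (RuntimeException e) {
                // Broad catch so one bad iteration does not kill the loop.
            }
        }
    }
}
```

Because the executor is created inside start() rather than in the constructor, the same instance can be stopped and started again.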
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17860 |
|
@blueorangutan test keepEnv |
|
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
Description
This PR improves Indirect Agent connection handling and includes the following improvements.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?