Capacity discarded on UCX Port, Fails on multiple emits

Does UCX support setting of capacity > 1 on ports that cross fragment boundaries?
I tried explicitly setting the capacity.

    spec.output("out_host").connector(
        IOSpec.ConnectorType.UCX, capacity=128
    )

But at runtime, I observe that the capacity has changed to 1.

DEBUG:root:embed_node Host: ConnType: ConnectorType.UCX Capacity: 1, Size: 0, BackSize: 0

I’m running this on an x86_64 machine. The fragments are currently just different processes right now for testing purposes.

Ping,
Any updates on this?

Hi @utkarsh02t ,

Thanks for the details. Given the title, it sounds like the immediate issue is multiple emit() calls on the same UCX output port from one compute() call.

The Python API does accept:

spec.output("out_host").connector(IOSpec.ConnectorType.UCX, capacity=128)

and UCX transmitter/receiver resources do have a capacity parameter. However, for ports crossing fragment boundaries, Holoscan currently inserts internal virtual UCX operators and reconstructs the connection metadata for the worker process. In that distributed path, the user-specified connector args such as capacity are not preserved; only the UCX address/port information is propagated. The UCX connector therefore falls back to its default capacity of 1.

That means capacity > 1 is currently not supported as expected for UCX ports crossing fragment/process boundaries. In practice, if an operator calls emit() multiple times to the same cross-fragment UCX port during one compute(), the first emit may fill the UCX transmitter queue and subsequent emits can fail or be dropped/rejected depending on policy. This is a Holoscan distributed-graph limitation, not an x86_64 or UCX transport limitation.

For a workaround, the safest option is to avoid multiple emits to the same cross-fragment UCX port in a single compute() call. Instead, consider one of these patterns:

  • Emit one composite message that contains the multiple payloads, then unpack it in the receiving fragment.
  • Restructure the operator so it emits one item per compute() invocation and lets the scheduler call it repeatedly.
  • Add buffering in the application graph before/after the UCX boundary, but note that buffering after the UCX boundary will not help if the failure occurs on the transmitting UCX port itself.

Could you share whether the multiple emits are meant to represent batching, burst absorption, or multiple independent messages produced from one input? That would help us recommend the best pattern.

We should track this as a Holoscan issue, since the API accepts capacity but the distributed UCX path does not currently preserve it. The likely fix would be to propagate connector args such as capacity and policy through the distributed ConnectionItem and virtual-operator path.

Thanks.

Hey, @gigony thanks for the response!

I was trying to set up a larger capacity to absorb some bursty production. I ended up doing a PeriodicCondition + EmitQueue, which is currently working (yet to test it more robustly)

I compiled Holoscan with some patches and was able to propagate capacity as well.

--- a/src/core/executors/gxf/gxf_executor.cpp
+++ b/src/core/executors/gxf/gxf_executor.cpp
@@ -1248,12 +1248,25 @@ void GXFExecutor::connect_ucx_transmitters_to_virtual_ops(
           }
         }
 
+        HOLOSCAN_LOG_ERROR("CHECK COUNT");
         if (connection_count == 1) {
           auto out_spec =
               get_operator_port_iospec(last_transmitter_op, port_name, IOSpec::IOType::kOutput);
 
+          auto merged_args = virtual_op->arg_list();
+          auto existing_connector = out_spec->connector();
+          if(existing_connector) {
+            for(const auto& arg: existing_connector->args()) {
+              HOLOSCAN_LOG_ERROR("CHECK CAP");
+              if(arg.name() == "capacity"){
+              HOLOSCAN_LOG_ERROR("ADD CAP");
+                merged_args.add(arg);
+              }
+            }
+          }
+        HOLOSCAN_LOG_ERROR("CREATE CONN");
           // Create the connector for out_spec from the virtual_op
-          out_spec->connector(virtual_op->connector_type(), virtual_op->arg_list());
+          out_spec->connector(virtual_op->connector_type(), merged_args);
         }

Pardon the bad logging method.

This helped me set up a capacity of 128, but there was an issue with EntityWarden and reference counting. I couldn’t figure that out. So, fixing this path went on a pause.

Thanks @utkarsh02t for the update, and glad to hear the PeriodicCondition + EmitQueue workaround is helping for now!

Could you clarify what you mean by “there was an issue with EntityWarden and reference counting”? It would be useful to know the exact error messages, stack trace, or logs you saw when the patched path propagated capacity=128.

If you have a minimal example pipeline that reproduces the issue, that would also help a lot. Ideally something small that shows the cross-fragment UCX port, the larger capacity setting, and the multiple/bursty emits that trigger the EntityWarden or reference-counting problem.

Thanks again for reporting this. The team is tracking the distributed UCX capacity propagation issue and working on a proper fix.

@utkarsh02t Also, could you let us know which Holoscan SDK version you were using, and on which platform/setup? For example, was this on x86 with the Docker container, the PyPI package, or the Debian package installed on the host side?


DEBUG:root:embed_node Host: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 0
DEBUG:root:embed_node GPU: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 0
DEBUG:root:1th emitted
DEBUG:root:embed_node Host: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 1
DEBUG:root:embed_node GPU: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 1
DEBUG:root:2th emitted
DEBUG:root:embed_node Host: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 2
DEBUG:root:embed_node GPU: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 2
DEBUG:root:3th emitted
DEBUG:root:embed_node Host: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 3
DEBUG:root:embed_node GPU: ConnType: ConnectorType.UCX Capacity: 128, Size: 0, BackSize: 3
[error] [entity_warden.cpp:1172] [E138362237204864] Ref count for the entity is 0. Cannot decrement
```

This was the error log.
I will try to get a minimal example, but I have been occupied a bit lately.

This was on x86 with the Docker container for Holoscan 4.1

The logs were emitted from here:


for i, saved_ctx in enumerate(completed_ctxs):
  logging.debug(f"{i}th emitted")
  transmitter_host = self.transmitter(“out_host”)
  host_port_spec = self.spec.outputs[“out_host”]
  transmitter_gpu = self.transmitter(“out_gpu”)
  gpu_port_spec = self.spec.outputs[“out_gpu”]

  logging.debug(f"{self.node_name} Host: ConnType: {host_port_spec.connector_type} Capacity: {transmitter_host.capacity}, Size: {transmitter_host.size}, BackSize: {transmitter_host.back_size}")
  logging.debug(f"{self.node_name} GPU: ConnType: {gpu_port_spec.connector_type} Capacity: {transmitter_gpu.capacity}, Size: {transmitter_gpu.size}, BackSize: {transmitter_gpu.back_size}")
  self._emit_segregated(op_output, saved_ctx)

emit_segregated just calls emits on both the ports.

Thanks @utkarsh02t ! Please let us know if you are able to get a minimal example.

Regarding the ref count issue, it could be worth trying with environment variable GXF_ENTITY_POOL_SIZE=0 set .
Setting it disables the new GXF EntityPool performance optimization introduced in 4.1 as a possible source of the issue.

Thanks!