cublasSgemmGroupedBatched requires host-side synchronization after preceding TRSM on A5000 (device-side ordering insufficient)

Hi,

We are observing a reproducible issue on NVIDIA A5000 where device-side ordering mechanisms are not sufficient to guarantee correct execution order between a preceding TRSM and a subsequent cublasSgemmGroupedBatched call.

Summary:

A TRSM kernel is issued on a CUDA stream, followed by cublasSgemmGroupedBatched on the same stream. Under expected CUDA stream semantics, same-stream FIFO ordering should guarantee correctness. However, on A5000, this fails deterministically unless a host-side synchronization is inserted before calling the cuBLAS GEMM API.

Key observation:

Only host-observed synchronization (cudaEventSynchronize or cudaStreamSynchronize) resolves the issue. All device-side ordering mechanisms fail.

Experiment results (A5000):

All tests run 10 times on FloatFactor3DPoisson56IsDeterministic:

Synchronization strategy Result
Same-stream FIFO only (no explicit barrier) FAIL 10/10
cudaStreamWaitEvent(mainStream, trsmReadyEvent) FAIL 10/10
Cross-stream event bounce (prepareStream → mainStream wait) FAIL 10/10
__threadfence() at TRSM kernel tail FAIL 10/10
cudaEventSynchronize(trsmReadyEvent) PASS 10/10
cudaStreamSynchronize(mainStream) PASS

Interpretation:

  • This does NOT appear to be a memory visibility issue:
    __threadfence() does not resolve it.
  • This does NOT appear to be a standard stream ordering issue:
    same-stream FIFO and streamWaitEvent both fail.
  • This suggests that device-side ordering primitives alone are insufficient in this case.

Instead, the results indicate that a host-observed completion of the TRSM operation is required before entering the cublasSgemmGroupedBatched call.

Hardware comparison:

  • A5000: issue reproduces consistently
  • RTX 3090 / RTX 5090: issue does NOT reproduce

All GPUs are based on GA102-class silicon, suggesting this may be timing-sensitive (e.g., dependent on sustained clocks or memory bandwidth), exposing a race window only on A5000.

Workaround:

Replacing full stream synchronization with a minimal host-side event sync resolves the issue:

cudaEventSynchronize(trsmReadyEvent);

This avoids draining unrelated work on the stream while still guaranteeing correctness.

Question:

Is there any known limitation or requirement for cublasSgemmGroupedBatched regarding host-side synchronization when consuming results from preceding kernels?

Alternatively, could this indicate a potential ordering or API boundary issue within cuBLAS grouped batched GEMM?

Additional info:

  • Driver version: 570.133.20
  • CUDA version: 12.8
  • cuBLAS version: (please fill)
  • Reproducible in release builds
  • compute-sanitizer: (please fill if available)

Should close the ticket. Currently I am using custom vbatched GETRF kernel. And this seems the reason of the flaky. If I use cuSolver GETRF then all good.