Hi,
We are observing a reproducible issue on NVIDIA A5000 where device-side ordering mechanisms are not sufficient to guarantee correct execution order between a preceding TRSM and a subsequent cublasSgemmGroupedBatched call.
Summary:
A TRSM kernel is issued on a CUDA stream, followed by cublasSgemmGroupedBatched on the same stream. Under expected CUDA stream semantics, same-stream FIFO ordering should guarantee correctness. However, on A5000, this fails deterministically unless a host-side synchronization is inserted before calling the cuBLAS GEMM API.
Key observation:
Only host-observed synchronization (cudaEventSynchronize or cudaStreamSynchronize) resolves the issue. All device-side ordering mechanisms fail.
Experiment results (A5000):
All tests run 10 times on FloatFactor3DPoisson56IsDeterministic:
| Synchronization strategy | Result |
|---|---|
| Same-stream FIFO only (no explicit barrier) | FAIL 10/10 |
| cudaStreamWaitEvent(mainStream, trsmReadyEvent) | FAIL 10/10 |
| Cross-stream event bounce (prepareStream → mainStream wait) | FAIL 10/10 |
| __threadfence() at TRSM kernel tail | FAIL 10/10 |
| cudaEventSynchronize(trsmReadyEvent) | PASS 10/10 |
| cudaStreamSynchronize(mainStream) | PASS |
Interpretation:
- This does NOT appear to be a memory visibility issue:
__threadfence() does not resolve it. - This does NOT appear to be a standard stream ordering issue:
same-stream FIFO and streamWaitEvent both fail. - This suggests that device-side ordering primitives alone are insufficient in this case.
Instead, the results indicate that a host-observed completion of the TRSM operation is required before entering the cublasSgemmGroupedBatched call.
Hardware comparison:
- A5000: issue reproduces consistently
- RTX 3090 / RTX 5090: issue does NOT reproduce
All GPUs are based on GA102-class silicon, suggesting this may be timing-sensitive (e.g., dependent on sustained clocks or memory bandwidth), exposing a race window only on A5000.
Workaround:
Replacing full stream synchronization with a minimal host-side event sync resolves the issue:
cudaEventSynchronize(trsmReadyEvent);
This avoids draining unrelated work on the stream while still guaranteeing correctness.
Question:
Is there any known limitation or requirement for cublasSgemmGroupedBatched regarding host-side synchronization when consuming results from preceding kernels?
Alternatively, could this indicate a potential ordering or API boundary issue within cuBLAS grouped batched GEMM?
Additional info:
- Driver version: 570.133.20
- CUDA version: 12.8
- cuBLAS version: (please fill)
- Reproducible in release builds
- compute-sanitizer: (please fill if available)