cudaMemcpy doesn't work correctly in Golang

I’m using Golang to wrap matrix multiplication with CUDA.
Assume two matrices, A[m, k] and B[k, n].
After multiplying A and B in CUDA, I call this line to copy the result from the CUDA device to the host:

C.cudaMemcpy(unsafe.Pointer(&dst[0]), unsafe.Pointer(src), C.size_t(size), C.cudaMemcpyDeviceToHost)

src is a CUDA memory pointer. dst is a byte slice.

After copying from device to host, this is what I see:

  1. When I call printf inside the CUDA function, the results are always correct, whatever m/k/n are.
  2. When I print the host memory, the results depend on k:
    When m=5, n=30, k<100, the host memory holds the correct results.
    When m=5, n=30, k>100, the host memory is sometimes correct and sometimes all zeros.
    When k>1000, the host memory is always all zeros.

Any idea how to fix this?

GPU: Tesla T4, CUDA 12.0
Compile options: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -Wno-pedantic --forward-unknown-to-host-compiler -arch=native -march=native -mtune=native

Resolved, but I haven't figured out why yet.
It works when I use cudaMemcpyAsync followed by cudaDeviceSynchronize instead. Note that the async method needs a stream, while the sync method doesn't take one.
It seems the sync method didn't always copy the device memory written in one CUDA stream to host memory.