I’m using Golang to wrap matrix multiplication with CUDA.
Assume two matrices A[m, k] and B[k, n].
After multiplying A and B in CUDA, I call this line to copy the result from the CUDA device to the host:
C.cudaMemcpy(unsafe.Pointer(&dst[0]), unsafe.Pointer(src), C.size_t(size), C.cudaMemcpyDeviceToHost)
src is a CUDA memory pointer. dst is a byte slice.
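For reference, here is a minimal pure-Go sketch of how the host side of such a copy could be sized and read back. It assumes the matrix elements are float32 and that the host is little-endian; neither is stated above. The helper names resultBytes, bytesToFloat32s and float32sToBytes are hypothetical, and a plain copy stands in for the cudaMemcpy call so the example runs without a GPU. Note that the result C is m x n, so its byte size depends only on m and n, not on k.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// resultBytes returns the byte size of an m x n result matrix,
// assuming float32 elements (4 bytes each). This is an assumption.
func resultBytes(m, n int) int {
	return m * n * 4
}

// bytesToFloat32s decodes a little-endian byte slice into float32 values.
func bytesToFloat32s(buf []byte) []float32 {
	out := make([]float32, len(buf)/4)
	for i := range out {
		bits := binary.LittleEndian.Uint32(buf[i*4:])
		out[i] = math.Float32frombits(bits)
	}
	return out
}

// float32sToBytes is the inverse of bytesToFloat32s. It is used here
// only to fake a buffer that would normally come from the device.
func float32sToBytes(vals []float32) []byte {
	buf := make([]byte, len(vals)*4)
	for i, v := range vals {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(v))
	}
	return buf
}

func main() {
	m, n := 5, 30

	// dst must hold the whole m x n result before the copy is issued.
	dst := make([]byte, resultBytes(m, n))

	// Stand-in for the device-to-host copy: fill the first three elements.
	copy(dst, float32sToBytes([]float32{1.5, -2, 3.25}))

	c := bytesToFloat32s(dst)
	fmt.Println(len(dst), len(c), c[:3])
}
```

Decoding dst this way before printing makes it easier to compare the host values with what printf shows inside the CUDA function.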
After copying from device to host, the problem is:
- Calling printf in the CUDA function, the results are always right no matter what m/k/n are.
- Printing host memory, the results are quite interesting.
When m=5, n=30, k<100, the host memory holds the right results.
When m=5, n=30, k>100, the host memory is sometimes right and sometimes all zero.
When k>1000, the host memory is always all zero.
Any idea how to fix this?
GPU: Tesla T4, CUDA 12.0
Compile options: -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -Wno-pedantic --forward-unknown-to-host-compiler -arch=native -march=native -mtune=native