matrixMul from CUDA samples fails with 4096x4096 matrices


I am running the matrixMul sample from CUDA 12.1 on Windows with a 970M. If I pass -wA=4096 -hA=4096 -wB=4096 -hB=4096 (to specify 4096x4096 matrices), cudaStreamSynchronize returns a launch-timeout error instead of waiting for the kernels to finish:

[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “Maxwell” with compute capability 5.2

MatrixA(4096,4096), MatrixB(4096,4096)
Computing result using CUDA Kernel…
err 0
CUDA error at C:\work\Repos\cuda-samples\Samples\0_Introduction\matrixMul\ code=702(cudaErrorLaunchTimeout) “cudaStreamSynchronize(stream)”

(I added a print of cudaGetLastError() right after the kernel launch, and it reports success; that is the “err 0” line.)

With 2048x2048 matrices it runs correctly.

What could be the issue? Thank you!

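The cudaErrorLaunchTimeout is the issue. On Windows, a GPU running under the WDDM driver model is subject to a watchdog (TDR) that resets the GPU when a kernel runs longer than the timeout, which is about 2 seconds by default. There are numerous forum topics discussing this, how it comes about, and what you can do about it. Here is a relevant article.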
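
If you want to confirm that a run-time limit applies to a given GPU, the CUDA runtime reports it in the kernelExecTimeoutEnabled field of cudaDeviceProp. Here is a minimal sketch, not part of the matrixMul sample; it assumes the GPU in question is device 0, as in the log above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumption: the GPU of interest is device 0 (as shown in the log above).
    const int device = 0;

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // kernelExecTimeoutEnabled is nonzero when kernels on this device
    // are subject to a run-time limit (the watchdog).
    std::printf("%s: kernel execution timeout is %s\n", prop.name,
                prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return 0;
}
```

If it prints "enabled", long-running kernels on that device can be killed with code 702. The usual options are to keep each kernel launch short, to raise the TDR delay in Windows, or to run the compute work on a GPU that is not subject to the watchdog.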

Thank you so much, it works now.