So I’ve written a demo of three matrix multiplication kernels: (1) straight reads from global memory, (2) shared memory, and (3) shared memory plus coalesced reads (see attached).
Unfortunately:
(a) When I run it on my C2050 (whether or not it is compiled with the arch=sm_20 flag), it crashes after the second cudaThreadSynchronize() call with the failure “ERROR: Sync2: unspecified launch failure”.
(b) When I run it on my GTX285, it runs fine, but I get the wrong results! Multiplying two 400x400 matrices takes 0.003 s with kernel (1), 0.035 s with kernel (2), and 0.101 s with kernel (3).
This is all running on an up-to-date Ubuntu release with the latest CUDA drivers, compilers, and libraries.