I ran this example on an NVIDIA Quadro FX 1700 with the following characteristics:
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.92 GHz
Concurrent copy and execution: Yes
I want to use my graphics card to its maximum for this matrixMul example. Can you help me determine the corresponding BLOCKSIZE and grid dimensions, please?
The huge number of uncoalesced loads and stores is most likely because of the block size you selected. From memory, in that code the block size should always be 16, and the matrix dimensions an even multiple of the block size. That implies you need to zero-pad your input matrices for sizes which aren't even multiples of 16.
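As a rough sketch (not the SDK code verbatim), the launch configuration for a 16x16 block with the matrix dimensions rounded up to a multiple of 16 could look like the following; the kernel name and argument list mimic the SDK's matrixMul example but are assumptions here:

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Assumed SDK-style tiled kernel; the real signature may differ.
__global__ void matrixMul(float *C, const float *A, const float *B, int wA, int wB);

// Round n up to the next multiple of BLOCK_SIZE (the zero-padded dimension).
static unsigned int padTo(unsigned int n)
{
    return ((n + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
}

void launchMatrixMul(float *dC, const float *dA, const float *dB,
                     unsigned int hA, unsigned int wA, unsigned int wB)
{
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);        // 16 x 16 = 256 threads per block
    dim3 grid(padTo(wB) / BLOCK_SIZE,          // one block per 16x16 tile of C
              padTo(hA) / BLOCK_SIZE);

    // The padded dimensions must match the zero-padded device allocations.
    matrixMul<<<grid, block>>>(dC, dA, dB, padTo(wA), padTo(wB));
}
```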
If you are only interested in doing fast matrix multiplication, then just use CUBLAS. If you are interested in studying the anatomy of a high-performance matrix multiplication routine, then you will want to look at Vasily Volkov's SGEMM kernel.
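For illustration, a minimal sketch of single-precision matrix multiplication with the cuBLAS v2 API might look like this; the device pointers dA, dB, dC and the square size N are assumed to be allocated and filled already:

```cuda
#include <cublas_v2.h>

// Compute C = A * B for N x N single-precision matrices already resident on
// the device (column-major storage, as cuBLAS expects).
void gemmWithCublas(float *dC, const float *dA, const float *dB, int N)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha, dA, N, dB, N,
                &beta,  dC, N);

    cublasDestroy(handle);
}
```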
I want to learn how to calculate the block size and grid dimensions that get the most out of a graphics card;
I will use a lot of different graphics cards to gather some statistics. I also have a question about that: in my cuda_profile.log I have gpu_time = [ 69648.800 ]. What is the time unit? I thought it was ms, but it seems not…
The gpu_time value is in microseconds; it comes from hardware timers on the GPU, and NVIDIA quotes their accuracy at 0.5 microseconds. Your own timing is probably wrong - the thing that usually catches beginners is that CUDA kernel launches are non-blocking. It is quite possible that your timing is only measuring the time to queue the kernel launch with the driver, not the actual time for the kernel to run to completion. Here is an example of what can go wrong with host-side timing and how to go about fixing it.
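Something along these lines (a sketch only; the matrixMul kernel and its arguments are placeholders standing in for whatever you are launching):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matrixMul(float *C, const float *A, const float *B, int wA, int wB);

// Reading a host timer immediately after the <<<>>> launch only measures how
// long it took to queue the kernel, because the launch returns asynchronously.
// Synchronizing on CUDA events (or calling cudaDeviceSynchronize) before the
// stop timestamp gives the true kernel execution time.
void timeKernel(dim3 grid, dim3 block, float *dC, const float *dA,
                const float *dB, int wA, int wB)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    matrixMul<<<grid, block>>>(dC, dA, dB, wA, wB);   // asynchronous launch
    cudaEventRecord(stop, 0);

    // Without this synchronization, a host timer read here would be meaningless.
    cudaEventSynchronize(stop);

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);    // reported in milliseconds
    printf("kernel time: %.3f ms\n", elapsedMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```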