MatrixMul & CUDA profiler

Hi everybody,

I am trying NVIDIA's matrixMul example. To run it, I chose:

#define BLOCKSIZE 10

#define N 1000

But when I read cuda_profile.log after the execution, I see this:

method=[ _Z15matrixMulKernelPfS_S_ii ] gputime=[ 543612.352 ] cputime=[ 543727.000 ] occupancy=[ 0.667 ] gld_coherent=[ 60000 ] gld_incoherent=[ 99760000 ] gst_coherent=[ 0 ] gst_incoherent=[ 1000000 ]

Why is gst_incoherent equal to 1000000?

I run this example on an NVIDIA Quadro FX 1700 with the following characteristics:

Major revision number:                         1
Minor revision number:                         1
Total amount of global memory:                 536150016 bytes
Number of multiprocessors:                     4
Number of cores:                               32
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       16384 bytes
Total number of registers available per block: 8192
Warp size:                                     32
Maximum number of threads per block:           512
Maximum sizes of each dimension of a block:    512 x 512 x 64
Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
Maximum memory pitch:                          262144 bytes
Texture alignment:                             256 bytes
Clock rate:                                    0.92 GHz
Concurrent copy and execution:                 Yes

I want to use my graphics card to its maximum for this matrixMul example, so can you help me determine the corresponding BLOCKSIZE and grid dimensions, please?

Thanks for your help!

The huge number of uncoalesced loads and stores is most likely because of the block size you selected. From memory, in that code the block size should always be 16, and the matrix dimensions an exact multiple of the block size. That implies you need to zero-pad your input matrices for sizes which aren't multiples of 16; a sketch of the resulting launch configuration is below.
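As an illustration only (the pointer names d_A/d_B/d_C, the kernel argument order, and the padding arithmetic are my assumptions, not the actual SDK code), the launch configuration could look something like this:

// Sketch: round the matrix size up to a multiple of the 16x16 block size
// and launch a square grid of thread blocks. The padded rows and columns
// of the input matrices must be filled with zeros so they do not change
// the product.
#define BLOCK_SIZE 16

int N = 1000;
int paddedN = ((N + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;   // 1008

dim3 block(BLOCK_SIZE, BLOCK_SIZE);                     // 256 threads per block
dim3 grid(paddedN / BLOCK_SIZE, paddedN / BLOCK_SIZE);  // 63 x 63 blocks

matrixMulKernel<<<grid, block>>>(d_C, d_A, d_B, paddedN, paddedN);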

If you are only interested in doing fast matrix multiplication, then just use CUBLAS. If you are interested in studying the anatomy of a high-performance matrix multiplication routine, then you will want to look at Vasily Volkov's SGEMM kernel.
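For example, here is a minimal CUBLAS sketch using the legacy API, assuming column-major N x N single-precision matrices already resident in device memory (d_A, d_B, d_C are placeholder pointers, not names from the SDK sample):

#include <cublas.h>

// Sketch: computes C = A * B with CUBLAS SGEMM.
void gemm_example(const float *d_A, const float *d_B, float *d_C, int N)
{
    cublasInit();                    // initialise the library (once per application)
    cublasSgemm('n', 'n',            // no transposes
                N, N, N,             // m, n, k
                1.0f, d_A, N,        // alpha, A and its leading dimension
                      d_B, N,        // B and its leading dimension
                0.0f, d_C, N);       // beta, C and its leading dimension
    cublasShutdown();                // release library resources
}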

Thanks a lot! It works fine now! :D

I ask this for two reasons:

  1. I want to learn how to calculate the block size and grid dimensions to get the most out of a graphics card;

  2. I will be using a lot of different graphics cards to gather some statistics. And I have a question about that: in my cuda_profile.log, I have gputime=[ 69648.800 ]. What is the time unit? I thought it was ms, but it seems not…

I think the timings in the profiler output are in microseconds.
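If that is right, then the gputime=[ 543612.352 ] in the first log works out to 543612.352 / 1,000,000 ≈ 0.54 s, and your gputime=[ 69648.800 ] to roughly 70 ms.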

So how is it calculated? Because when I look at my clock, it doesn't match the time spent on the execution.

It comes from hardware timers on the GPU, and NVIDIA quote the accuracy as 0.5 microseconds. Your own timing is probably wrong - the thing that usually catches beginners is that CUDA kernel launches are non-blocking. It is quite possible that your timing is only measuring the time to queue the kernel launch with the driver, not the actual time for the kernel to run to completion. Here is an example of what can go wrong with host-side timing and how to go about fixing it.
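For instance, a minimal sketch of the pitfall and one way to fix it with CUDA events (the kernel and its launch configuration are placeholders, not the matrixMul code):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void dummyKernel(float *data) { data[threadIdx.x] += 1.0f; }

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));

    // Wrong: the launch returns immediately, so a host timer placed around
    // this line only measures how long it takes to queue the launch.
    dummyKernel<<<1, 256>>>(d_data);

    // Better: bracket the launch with CUDA events, which are timestamped on
    // the GPU, and synchronise before reading the elapsed time.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummyKernel<<<1, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}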