slow kernel

Hi all,

I am new to CUDA, so bear with me. I am using a Quadro NVS 290 card to do a simple matrix multiplication. I am timing my kernel execution and an equivalent serial algorithm running on the CPU with CUDA event timers. The problem is that my serial CPU code is actually running faster than my kernel code. Any help would be greatly appreciated. The following is my kernel code and my serial CPU code:

//kernel invoke:

dim3 dimBlock(500);
dim3 dimGrid(2);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);

for(int $ = 0; $ < 1000; $++)
matrix_Mult<<<dimGrid,dimBlock>>>(d_Matrix, d_Multiplier, $);

cudaMemcpy(h_Matrix, d_Matrix, matrix_Size, cudaMemcpyDeviceToHost);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

//kernel code:

/*********************************************
*Kernel to run Matrix Multiply
*
*matrix is a 1000 X 1000 array
*******************************************/
__global__ void matrix_Mult(float* matrix, float* multiplier, int x)
{
    int i = threadIdx.x;
    int y = blockIdx.x;
    //int x = ((blockIdx.x * 500 + i) % 1000);
    float z = matrix[i + (y * 500) + (x * 1000)];
    float v = multiplier[i + (y * 500)];
    float b = z * v;

    matrix[i + (y * 500) + (x * 1000)] = b;
}

//serial cpu code:

void matrixMultiply(float* matrix, float* multiplier)
{
    for(int i = 0; i < 1000; i++)
    {
        for(int j = 0; j < 1000; j++)
        {
            matrix[i * 1000 + j] = matrix[i * 1000 + j] * multiplier[j];
        }
    }
}

Your computation is linear in the number of elements of the matrix. Copying it to the device and back will always take about as long as accessing each element once and doing some simple computation with it on the CPU. There is little point accelerating linear-complexity algorithms like this unless you plan to do more calculation per element.

About your code: you should exclude the memory copy from the timing.
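For example, if you record the stop event right after the kernel loop and before the cudaMemcpy, you time only the kernels. Roughly like this, using your variable names and an ordinary loop counter in place of $:

cudaEventRecord(start, 0);

for (int i = 0; i < 1000; i++)
    matrix_Mult<<<dimGrid, dimBlock>>>(d_Matrix, d_Multiplier, i);

// record and wait for the stop event before the copy,
// so only the kernel launches fall inside the timed region
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float kernelTime;
cudaEventElapsedTime(&kernelTime, start, stop);

// the device-to-host copy is now outside the timing
cudaMemcpy(h_Matrix, d_Matrix, matrix_Size, cudaMemcpyDeviceToHost);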

And the kernel is not good:

In the configuration <<<2,500>>> the number of parallel threads is too small to load the GPU fully. You should create tens of thousands of parallel threads; only then can the GPU always find something to execute while most of the warps are just waiting for their memory requests to complete.

You should re-express the host loop as parallel computation: make x = blockIdx.y and run <<<dim3(2,1000), 500>>>, for example like the sketch below.
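Something along these lines (an untested sketch of what I mean – the host loop disappears and x comes from blockIdx.y):

__global__ void matrix_Mult(float* matrix, float* multiplier)
{
    int i = threadIdx.x;   // position within a half-row (0..499)
    int y = blockIdx.x;    // which half of the row (0 or 1)
    int x = blockIdx.y;    // row index, replaces the host loop counter
    int idx = i + (y * 500) + (x * 1000);

    matrix[idx] = matrix[idx] * multiplier[i + (y * 500)];
}

// one launch with 2 * 1000 blocks of 500 threads instead of 1000 launches:
dim3 dimBlock(500);
dim3 dimGrid(2, 1000);
matrix_Mult<<<dimGrid, dimBlock>>>(d_Matrix, d_Multiplier);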

Then, occupancy. Look at the performance guidelines in the Programming Guide and at the occupancy calculator.

There are nice lectures with examples about matrix multiplication here:

http://www.nvidia.com/object/cuda_education.html

In short, it is quite easy to make things reasonably fast, but squeezing out the maximum performance can be tricky.

I also wonder about the $ sign in that loop of yours – does that actually compile as a variable name?

Thanks! I changed the kernel invocation like you said and I did see some improvement… but the serial algorithm is still running faster. Are you saying that, the way my algorithm is written, it is impossible for it to execute faster than the code run by the CPU?

You should probably also take a look at the CUBLAS library.

N.

It is possible, since a GPU can nowadays access its own memory around 80 times faster than a CPU can, so you can be up to that much faster. But if you copy from ordinary memory to graphics memory, compute, and copy back, then it cannot be faster for simple computations of linear complexity like this one.

Use the Profiler – it is very handy. If your graphics card's compute capability is less than 1.2, you will get non-coalesced memory accesses with that block size of 500 and matrix width of 1000, which is very, very bad. (You can see it in the Profiler by turning on Session Settings -> Profiler Counters.) The coalescing rules on older cards are quite complicated; I'm not sure they are worth the effort in general.

For a quick test, make the matrix 1024 x 1024 and the block size 512.

Then it will be slightly faster to run with somewhat smaller blocks, <<<dim3(8,1024), 128>>>, for better occupancy and some other reasons – see the sketch below.
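With the 1024 x 1024 quick-test layout above, that configuration would look roughly like this (kernel name is just for illustration):

__global__ void matrix_Mult1024(float* matrix, float* multiplier)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // 8 blocks * 128 threads = 1024 columns
    int row = blockIdx.y;                             // 1024 rows
    int idx = row * 1024 + col;

    matrix[idx] = matrix[idx] * multiplier[col];
}

// launch:
matrix_Mult1024<<<dim3(8, 1024), 128>>>(d_Matrix, d_Multiplier);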

Also, you are reading from the "multiplier" array anew each time. Bind it to a texture: the 8 KB texture cache will fit it all, and it is shared among all blocks within a multiprocessor.
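With the texture reference API it would look something like this (untested sketch, the names are mine):

// 1D texture reference bound to the multiplier vector in linear memory
texture<float, 1, cudaReadModeElementType> multTex;

__global__ void matrix_Mult_tex(float* matrix, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;

    // the read goes through the 8 KB texture cache shared by all blocks
    // on a multiprocessor
    float v = tex1Dfetch(multTex, col);
    matrix[row * width + col] *= v;
}

// on the host, before launching:
cudaBindTexture(NULL, multTex, d_Multiplier, 1024 * sizeof(float));
matrix_Mult_tex<<<dim3(8, 1024), 128>>>(d_Matrix, 1024);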

Post your new code if you try this out.