slow kernel

CUDA_Novice · June 23, 2009, 9:16pm

Hi all,

I am new to CUDA, so bear with me. I am using a Quadro NVS 290 card to do a simple matrix multiplication. I am timing both my kernel execution and an equivalent serial algorithm running on the CPU with CUDA event timers. The problem is that my serial CPU code is actually running faster than my kernel code. Any help would be greatly appreciated. The following is my kernel code and my serial CPU code:

//kernel invoke:

dim3 dimBlock(500);
dim3 dimGrid(2);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);

for(int $ = 0; $ < 1000; $++)
matrix_Mult<<<dimGrid,dimBlock>>>(d_Matrix, d_Multiplier, $);

cudaMemcpy(h_Matrix, d_Matrix, matrix_Size, cudaMemcpyDeviceToHost);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

//kernel code:

/*********************************************
*Kernel to run Matrix Multiply
*
*matrix is a 1000 X 1000 array
*******************************************/
__global__
void matrix_Mult(float* matrix, float* multiplier, int x)
{
int i = threadIdx.x;
int y = blockIdx.x;
//int x = ((blockIdx.x * 500 + i) % 1000);
float z = matrix[i + (y * 500) + (x * 1000)];
float v = multiplier[i + (y * 500)];
float b = z * v;

matrix[i + (y * 500) + (x * 1000)] = b;

}

//serial cpu code:

void matrixMultiply( float* matrix, float* multiplier)
{
for(int i = 0; i < 1000 ; i++)
{
for(int j = 0; j < 1000; j++)
{
matrix[i * 1000 + j] = matrix[i * 1000 + j] * multiplier[j];
}

}
}

Knedlik · June 23, 2009, 10:24pm

Your computation is linear in the number of matrix elements. Copying the matrix to the device and back will always take about as long as accessing each element once and doing some simple computation with it on the CPU. I think there is no point in accelerating linear-complexity algorithms like this, unless you plan to do more calculations.

About your code. You should exclude memory copying from the timing.
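A minimal sketch of timing only the kernel, with the copies moved outside the event pair. The kernel `scale_columns` and the helper `time_kernel_only` are hypothetical names used for illustration, not the original poster's code, and the block size of 256 is an arbitrary assumption:

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: scales column j of a row-major n x n matrix
// by multiplier[j], one thread per element.
__global__ void scale_columns(float* matrix, const float* multiplier, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (col < n && row < n)
        matrix[row * n + col] *= multiplier[col];
}

// Returns the kernel-only time in milliseconds, or -1.0f on a CUDA error.
float time_kernel_only(int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> h_matrix((size_t)n * n, 1.0f), h_mult(n, 2.0f);

    float *d_matrix = nullptr, *d_mult = nullptr;
    if (cudaMalloc(&d_matrix, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_mult, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_matrix);
        return -1.0f;
    }

    // Host-to-device copies happen before the start event, so they are not timed.
    cudaMemcpy(d_matrix, h_matrix.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_mult, h_mult.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(256);                            // assumed block size
    dim3 grid((n + block.x - 1) / block.x, n);  // one grid row per matrix row

    cudaEventRecord(start, 0);
    scale_columns<<<grid, block>>>(d_matrix, d_mult, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until the kernel has finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    // The device-to-host copy comes after the stop event, so it is not timed either.
    if (cudaMemcpy(h_matrix.data(), d_matrix, bytes, cudaMemcpyDeviceToHost) != cudaSuccess)
        ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_matrix);
    cudaFree(d_mult);
    return ms;
}
```

Calling `time_kernel_only(1000)` from `main` gives the kernel time alone. The first launch in a process may also include one-time startup cost, so a warm-up launch before the timed one is probably worth adding for real measurements.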

And the kernel is not good:

In the <<<2,500>>> configuration there are too few parallel threads to load the GPU fully. You should create tens of thousands of parallel threads. Only then can the GPU always find something to execute while most of the thread warps are just waiting for their memory requests to complete.

You should re-express the loop as a parallel computation. Make x = blockIdx.y and run <<<dim3(2,1000),500>>>.
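The suggestion above could look roughly like this. The kernel name `matrix_Mult_2d` and the host helper `run_matrix_Mult_2d` are hypothetical, and the sketch writes the product back, as the serial CPU code does:

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per element: blockIdx.y selects the row, so the host-side
// loop over x is no longer needed.
__global__ void matrix_Mult_2d(float* matrix, const float* multiplier)
{
    int col = blockIdx.x * 500 + threadIdx.x;  // 2 blocks of 500 threads cover 1000 columns
    int row = blockIdx.y;                      // 1000 grid rows cover 1000 matrix rows
    matrix[row * 1000 + col] *= multiplier[col];
}

// Hypothetical host helper: runs the kernel once and compares against the
// serial formula. Returns the number of mismatches, or -1 on a CUDA error.
int run_matrix_Mult_2d()
{
    const int n = 1000;
    std::vector<float> h_matrix(n * n), h_mult(n);
    for (int j = 0; j < n; j++) h_mult[j] = 1.0f + (j % 7);
    for (int k = 0; k < n * n; k++) h_matrix[k] = (float)(k % 13);

    size_t bytes = h_matrix.size() * sizeof(float);
    float *d_matrix = nullptr, *d_mult = nullptr;
    if (cudaMalloc(&d_matrix, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_mult, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_matrix);
        return -1;
    }
    cudaMemcpy(d_matrix, h_matrix.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_mult, h_mult.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // A single launch replaces the 1000 launches of the original loop.
    matrix_Mult_2d<<<dim3(2, 1000), 500>>>(d_matrix, d_mult);

    cudaError_t err = cudaMemcpy(h_matrix.data(), d_matrix, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_matrix);
    cudaFree(d_mult);
    if (err != cudaSuccess) return -1;

    // The test values are small integers, so the float products are exact.
    int mismatches = 0;
    for (int k = 0; k < n * n; k++)
        if (h_matrix[k] != (float)(k % 13) * (1.0f + ((k % n) % 7))) mismatches++;
    return mismatches;
}
```

This launches 1,000,000 threads at once instead of 1000 threads per launch, which is the point of the advice.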

Then, occupancy. Look at the performance guidelines in the Programming Guide and at the occupancy calculator.

There are nice lectures with examples about matrix multiplication here:

http://www.nvidia.com/object/cuda_education.html

In short, it is quite easy to make things go fast, but squeezing out the maximum performance can be tricky.

I wonder about the $ sign in this code of yours: for(int $ = 0; $ < 1000; $++)

CUDA_Novice · June 23, 2009, 11:22pm

Thanks! I changed the kernel invocation like you said and I did see some improvement… but the serial algorithm is still running faster. Are you saying that, the way I have my algorithm written, it is impossible to have it execute faster than the code run by the CPU?

Nico · June 24, 2009, 7:16am

You should probably also take a look at the CUBLAS library.

N.

Knedlik · June 25, 2009, 3:13pm

It is possible, since a GPU can nowadays access its own memory around 80 times faster than a CPU can. So you can be that much faster. But if you copy from ordinary host memory to graphics memory, compute, and copy back, then it cannot be faster for this kind of simple, linear-complexity computation.
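A rough back-of-envelope version of this argument. The symbols $B_{\text{bus}}$ (host-to-device transfer bandwidth) and $B_{\text{cpu}}$ (CPU memory bandwidth) are assumptions introduced here, neither figure is given in the thread, and the estimate assumes the CPU loop is limited by memory traffic rather than arithmetic:

```latex
\begin{aligned}
N &= 1000 \times 1000 = 10^{6} \text{ floats}, \qquad S = 4N = 4 \times 10^{6} \text{ bytes} \\
T_{\text{copy}} &\approx \frac{2S}{B_{\text{bus}}} \quad \text{(host to device, then device to host)} \\
T_{\text{cpu}} &\approx \frac{2S}{B_{\text{cpu}}} \quad \text{(one read and one write per element)} \\
\frac{T_{\text{copy}}}{T_{\text{cpu}}} &\approx \frac{B_{\text{cpu}}}{B_{\text{bus}}}
\end{aligned}
```

Under these assumptions, the two copies alone take at least as long as the whole CPU computation whenever the bus is no faster than CPU memory, regardless of how fast the kernel itself runs.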

Use the Profiler; it is very handy. If your graphics card's compute capability is less than 1.2, you will have non-coalesced memory access with a block size of 500 and a matrix size of 1000, which is very, very bad: neither 500 nor 1000 is a multiple of 16, so many half-warps start at misaligned addresses. (You can see it in the Profiler by turning on Session Settings -> Profiler Counters.) Coalescing on older cards is quite complicated; I'm not sure it is worth the effort in general.

For a quick test, make the matrix 1024 x 1024 and the block size 512.

Then, it will be slightly faster to run with somewhat smaller blocks, <<<dim3(8,1024), 128>>>, for reasons of better occupancy and some others.
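A sketch of the padded variant with the smaller blocks. The names `matrix_Mult_padded` and `run_matrix_Mult_padded` are hypothetical, and the `width` parameter is an addition for illustration:

```cuda
#include <vector>
#include <cuda_runtime.h>

// With width = 1024 and 128-thread blocks, every block starts at a column
// index that is a multiple of 128, and every row starts at a multiple of 1024.
__global__ void matrix_Mult_padded(float* matrix, const float* multiplier, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // 8 blocks * 128 threads = 1024 columns
    int row = blockIdx.y;                             // 1024 grid rows
    matrix[row * width + col] *= multiplier[col];
}

// Hypothetical host helper: returns the number of mismatches against the
// serial formula, or -1 on a CUDA error.
int run_matrix_Mult_padded()
{
    const int n = 1024;
    std::vector<float> h_matrix(n * n), h_mult(n);
    for (int j = 0; j < n; j++) h_mult[j] = 1.0f + (j % 7);
    for (int k = 0; k < n * n; k++) h_matrix[k] = (float)(k % 13);

    size_t bytes = h_matrix.size() * sizeof(float);
    float *d_matrix = nullptr, *d_mult = nullptr;
    if (cudaMalloc(&d_matrix, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_mult, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_matrix);
        return -1;
    }
    cudaMemcpy(d_matrix, h_matrix.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_mult, h_mult.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    matrix_Mult_padded<<<dim3(8, 1024), 128>>>(d_matrix, d_mult, n);

    cudaError_t err = cudaMemcpy(h_matrix.data(), d_matrix, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_matrix);
    cudaFree(d_mult);
    if (err != cudaSuccess) return -1;

    // The test values are small integers, so the float products are exact.
    int mismatches = 0;
    for (int k = 0; k < n * n; k++)
        if (h_matrix[k] != (float)(k % 13) * (1.0f + ((k % n) % 7))) mismatches++;
    return mismatches;
}
```

If the real data is 1000 x 1000, the extra 24 columns and rows would simply be padding that the host ignores when reading results back.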

Then, you are reading from the “multiplier” each time anew. Bind it to a texture. Texture cache of 8k will fit it all and it will be shared among all blocks within a multiprocessor.
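A sketch of reading the multiplier through the texture path. Note the swap: this uses the texture object API (`cudaCreateTextureObject`, `tex1Dfetch<float>`), which needs a much newer GPU and toolkit than the 2009-era texture references the post would have meant. The names `matrix_Mult_tex`, `make_linear_texture` and `run_matrix_Mult_tex` are hypothetical:

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// The multiplier is fetched through the texture cache instead of a plain global load.
__global__ void matrix_Mult_tex(float* matrix, cudaTextureObject_t multTex, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (col < width)
        matrix[row * width + col] *= tex1Dfetch<float>(multTex, col);
}

// Wraps a linear device buffer of floats in a texture object.
cudaTextureObject_t make_linear_texture(float* d_data, size_t count)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;  // return the stored floats unchanged

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

// Hypothetical host helper: returns the number of mismatches, or -1 on a CUDA error.
int run_matrix_Mult_tex()
{
    const int n = 1024;
    std::vector<float> h_matrix(n * n), h_mult(n);
    for (int j = 0; j < n; j++) h_mult[j] = 1.0f + (j % 7);
    for (int k = 0; k < n * n; k++) h_matrix[k] = (float)(k % 13);

    size_t bytes = h_matrix.size() * sizeof(float);
    float *d_matrix = nullptr, *d_mult = nullptr;
    if (cudaMalloc(&d_matrix, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_mult, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_matrix);
        return -1;
    }
    cudaMemcpy(d_matrix, h_matrix.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_mult, h_mult.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaTextureObject_t tex = make_linear_texture(d_mult, n);
    matrix_Mult_tex<<<dim3(8, 1024), 128>>>(d_matrix, tex, n);

    cudaError_t err = cudaMemcpy(h_matrix.data(), d_matrix, bytes, cudaMemcpyDeviceToHost);
    cudaDestroyTextureObject(tex);
    cudaFree(d_matrix);
    cudaFree(d_mult);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int k = 0; k < n * n; k++)
        if (h_matrix[k] != (float)(k % 13) * (1.0f + ((k % n) % 7))) mismatches++;
    return mismatches;
}
```

The multiplier here is 1024 floats (4 KB), so it fits within the 8 KB texture cache mentioned above.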

Post your new code if you try this out.
