Checking Performance learning how to optimize CUDA codes

Hello Everybody,

I am starting to learn the way to improve our cuda programs and for this purpose, I think the best way is following the advices from University of Illinois. They are helping us really well with their papers about optimization.

I am using Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA guide and I have tried to reproduce their experiments. I have found several problems with the matrix Multiplication problem. My code is working ok, but i can’t achive the performance that they say there.
For instance, In the first version of his program where they develop a matrix multiplication code without using shared memory, they say that the peformance is 10.54 GFLOPs. For me it’s just 2GFLOPS. :( , I am using the same amount of registers than they do per thread and also the same numbers of threads (256 per block).

I think my problem is the coalesced access to the device memory but i don’t know if can get this coalesced access with this kernel:

global void matrixMul( float* C, float* A, float* B, int wA, int wB)
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;

// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

int indexA, indexB;

indexA = by*BLOCK_SIZE*wA+ty*wA;
indexB = bx*BLOCK_SIZE + tx;	
float Csub;
Csub = 0.0;

for (int i = 0; i < wA; i++){
	Csub += A[indexA] * B [indexB];

int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;


wA = 4096, wB = 4096, Every size is 4K
and my execution time on Tesla c870 is around 38 seconds.

Can anybody help me, please? It is quite hard try to improve our codes if it not possible reproduce a toy example.


This looks it will break coalescing.


globalArrayIndex % 16 must be equal to threadIdx %16

It starts out good, but as soon as you increment the indexes by anything other than multiples of 16, you’re breaking coalescing.

Have you seen the CUDA Visual Profiler? It can tell you if you’ve got non-coalesced accesses.

Thank you so much for your help. I am gonna try to do the access coalesced with your notes. I wlil let you know how it is going and whether i get the 10GFLOPS or not.

I didn´t know this tool i am gonna try it as well, it could be really helpfull for me.


Hi again, and the best of luck in your CUDA work.

Since we’re on the subject, here are a couple of good threads:

Here’s an occupancy calculator thread, this calculator is very useful to build an understanding of what is going on:

Here’s the visual profiler thread:

… but I suggest you download it, and its manual, from the NVIDIA CUDA download page.

Hello again,

I am keep on going with the matrix multiplication problem and i am trying to solve the bank conflicts in the Shared Memory.

Actually, I think i have bank conflicts whenever i access to the A matrix because all the threads inside the Warp take the same row. So, In order to solve this problem i think in this code. However, Cuda is where amazing happens and I only get the right result if i have put synthreads. before to load the data from the shared memory.

I know that i shouldn’t think in sequecial mode but this is quite strage.

Anyone can tell me why it is happening that.

for (int k = 0; k < BLOCK_SIZE; k++){
        int  pos = threadIdx.x+k & 15;

// __syncthreads();
Csub += As[ty][pos] * Bs[pos][tx];

Thank you so much for your help.