Thread overhead: is it relevant?

I have attempted to optimise my algorithm by changing my ‘base transformation’ from 1D thread granularity to 2D thread granularity (I am working on a matrix).

Doing this has changed the timing of my kernel from 3 ms to 16 ms (on a 2000x1000 matrix). I can't understand why. My co-worker has told me that it is due to thread overhead, but I have tried to tell him that thread overhead is negligible (am I correct?).

My original function went like this (3ms version):


if(eq!=pivot && tid<(neq+ncons+1) )


where tid is my thread index and ‘matrix’, ‘rowFactorColumn’ and ‘pr’ are in global memory.

The 2D equivalent that I wrote (works correctly) takes 16 ms. I don't know why:

int idx = bx*bdx + x;

int idy = by*bdy + y;

if(idx<(neq+ncons+1) && idy<neq && idy!=pivot)
        ((float*)((char*)matrix+idy*pitch))[idx] = ((float*)((char*)matrix+idy*pitch))[idx] - (rowFactorColumn[idy]/pivotFactor)*pr[idx];

Both versions do their operations correctly; the only difference is that the 2D version takes 13 ms extra.

Can anyone tell me why?

Run it through the CUDA profiler. I would guess that the reads/writes are not coalesced in the 2D equivalent.
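To illustrate what coalescing means here, this is a minimal sketch of a 2D version in which threadIdx.x walks along a row, so neighbouring threads touch neighbouring floats. The names rowUpdate2D, gridFor, ncols and nrows are made up for the example and are not from the original code; it also assumes the matrix was allocated with cudaMallocPitch.

```cuda
#include <cuda_runtime.h>

// Hypothetical 2D row-update kernel (names are assumptions, not the poster's).
// threadIdx.x is the fastest-varying index within a warp, so mapping it to the
// column means adjacent threads read and write adjacent floats of one row.
__global__ void rowUpdate2D(float* matrix, size_t pitch,
                            const float* rowFactorColumn, const float* pr,
                            float pivotFactor, int pivot, int ncols, int nrows)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // along a row
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // which row

    if (col < ncols && row < nrows && row != pivot) {
        // Rows are 'pitch' bytes apart, as with cudaMallocPitch.
        float* rowPtr = (float*)((char*)matrix + row * pitch);
        rowPtr[col] -= (rowFactorColumn[row] / pivotFactor) * pr[col];
    }
}

// Host-side helper: round the grid up so every element is covered.
inline dim3 gridFor(int ncols, int nrows, dim3 block)
{
    return dim3((ncols + block.x - 1) / block.x,
                (nrows + block.y - 1) / block.y);
}
```

A launch would look something like dim3 block(16, 16); rowUpdate2D<<<gridFor(ncols, nrows, block), block>>>(...). If threadIdx.x were mapped to the row instead, adjacent threads would sit a whole pitch apart in memory and the accesses could not be coalesced.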

I have run it through the profiler. It's the first time I've used the profiler, so the only useful information I was able to get out of it was that the kernel I am trying to fix takes up 80% of the GPU time.

I don't understand. What do you mean?

On further examination of the problem, reading each of the variables from global memory is not what is causing the timing to go up; it happens when I try to write to global memory, i.e.:

if(idx<(neq+ncons+1) && idy<neq && idy!=pivot){

        // read the three operands from global memory
        float A = ((float*)((char*)matrix+idy*pitch))[idx];
        float B = rfc[idy];
        float C = pr[idx];

        float equate = A - (B/pivotFactor)*C;

        // write the result back to global memory
        ((float*)((char*)matrix+idy*pitch))[idx] = equate;
}


Can anyone see any reason why the write to global memory in this case takes 15ms?

You can't just comment out the write and say that this line is taking 15 ms: when you comment it out, the compiler notices that your kernel isn't doing any work and optimizes it all away.
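One common way around this, shown here only as a sketch, is to keep the store but guard it with a condition on the computed value that the compiler cannot evaluate at compile time. The kernel name rowUpdateTimingProbe, the helper updatedValue and the sentinel, ncols and nrows parameters are assumptions made for this example.

```cuda
#include <cuda_runtime.h>

// Same arithmetic as in the post, factored out so it can be checked on the host.
__host__ __device__ inline float updatedValue(float a, float b, float c,
                                              float pivotFactor)
{
    return a - (b / pivotFactor) * c;
}

// Hypothetical timing probe: loads and arithmetic must still run, because the
// store depends on the result, but the store itself almost never happens if
// 'sentinel' is a value the computation is not expected to produce.
__global__ void rowUpdateTimingProbe(float* matrix, size_t pitch,
                                     const float* rfc, const float* pr,
                                     float pivotFactor, int pivot,
                                     int ncols, int nrows, float sentinel)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    if (idx < ncols && idy < nrows && idy != pivot) {
        float* rowPtr = (float*)((char*)matrix + idy * pitch);
        float equate = updatedValue(rowPtr[idx], rfc[idy], pr[idx], pivotFactor);

        // 'sentinel' is a runtime argument, so the compiler cannot prove this
        // branch dead and has to keep the reads and the arithmetic above.
        if (equate == sentinel)
            rowPtr[idx] = equate;
    }
}
```

Timing this probe against the real kernel gives a rough idea of how much of the time goes to the store, though the split is only approximate because reads and writes overlap on the hardware.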

Run the profiler with all counters enabled (profile session settings, configuration, and check mark those two).
Then run the app, and you'll see gld coalesced and gld uncoalesced.

Those are the columns I'm referring to.

There is no overhead for swapping threads in CUDA. The hardware is capable of running more than 10,000 threads concurrently.

Your kernel is most certainly (with 100% confidence) limited by the available memory bandwidth. If you aren’t getting 70 GiB/s (on 8800GTX / Tesla 800) or 100 GiB/s (on GTX 280), then your memory reads/writes are not coalesced, as has already been mentioned.
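To compare a kernel against those figures, the effective bandwidth can be estimated as bytes moved divided by elapsed time. The helpers below are a rough sketch; effectiveGiBPerSec and msBetween are names made up for this example. Assuming each float of a 2000x1000 matrix is read once and written once (and ignoring the reads of pr and the row factors), 3 ms works out to roughly 5 GiB/s and 16 ms to just under 1 GiB/s.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Effective bandwidth in GiB/s (1 GiB = 2^30 bytes) for a kernel that moved
// 'bytesMoved' bytes to and from global memory in 'elapsedMs' milliseconds.
inline double effectiveGiBPerSec(size_t bytesMoved, float elapsedMs)
{
    double seconds = elapsedMs / 1000.0;
    return (double)bytesMoved / seconds / (1024.0 * 1024.0 * 1024.0);
}

// Milliseconds between two events that were recorded with cudaEventRecord
// before and after the kernel launch.
inline float msBetween(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventSynchronize(stop);              // wait until 'stop' has happened
    cudaEventElapsedTime(&ms, start, stop);  // result is in milliseconds
    return ms;
}
```

In use, record one event before the launch and one after, then pass msBetween(start, stop) together with the byte count, for example (size_t)2 * 2000 * 1000 * sizeof(float), to effectiveGiBPerSec.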