thread overhead is it relevant?

estra · September 14, 2008, 11:47pm

I have attempted to optimise my algorithm by changing my 'base transformation' from 1D thread granularity to 2D thread granularity (I am working on a matrix).

Doing this has changed the timing of my kernel from 3 ms to 16 ms (on a 2000x1000 matrix). I can't understand why. My co-worker has told me that it is due to thread overhead, but I have tried to tell him that thread overhead is negligible (am I correct?).

My original function went like this (3ms version):

for (eq = 0; eq < neq; eq++)
    if (eq != pivot && tid < (neq + ncons + 1))
        ((float*)((char*)matrix + eq*pitch))[tid] = ((float*)((char*)matrix + eq*pitch))[tid] - (rowFactorColumn[eq]/pivotFactor)*pr[tid];

where tid is my thread index and 'matrix', 'rowFactorColumn' and 'pr' are in global memory.

The 2D equivalent that I wrote (works correctly) takes 16 ms. I don't know why:

int idx = bx*bdx + x;
int idy = by*bdy + y;

if (idx < (neq + ncons + 1) && idy < neq && idy != pivot)
    ((float*)((char*)matrix + idy*pitch))[idx] = ((float*)((char*)matrix + idy*pitch))[idx] - (rowFactorColumn[idy]/pivotFactor)*pr[idx];

Both versions do their operations correctly; the only difference is that the 2D version takes 13 ms longer.

Can anyone tell me why?

Ailleur · September 15, 2008, 2:05am

Run it through the CUDA profiler. I would guess that the reads/writes are not coalesced in the 2D equivalent.
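To illustrate what "not coalesced" can mean for a 2D block, here is a minimal host-side sketch in plain C (not CUDA). It assumes the rule of thumb for compute capability 1.x hardware: the 16 threads of a half-warp should touch consecutive 4-byte words inside one aligned 64-byte segment, and threads within a block are numbered x-fastest. The block width `bdx`, the pitch value and the function name `segments_touched` are hypothetical; the thread does not state the launch configuration. Counting segments is a simplification of the real hardware rules.

```c
#include <stddef.h>

#define HALF_WARP 16      /* threads serviced together on compute 1.x */
#define SEGMENT_BYTES 64  /* aligned segment size for 4-byte accesses */

/* Count the distinct 64-byte segments touched by the first half-warp of
   block (0,0) when thread (x, y) accesses row y, column x of a pitched
   float matrix. bdx is the (assumed) block width in threads. */
int segments_touched(int bdx, size_t pitch_bytes)
{
    size_t seen[HALF_WARP];
    int count = 0;

    for (int t = 0; t < HALF_WARP; ++t) {
        /* Linear thread id t maps to (x, y) with x varying fastest. */
        int x = t % bdx;
        int y = t / bdx;
        size_t addr = (size_t)y * pitch_bytes + (size_t)x * sizeof(float);
        size_t seg = addr / SEGMENT_BYTES;

        /* Record the segment if this half-warp has not touched it yet. */
        int found = 0;
        for (int i = 0; i < count; ++i)
            if (seen[i] == seg)
                found = 1;
        if (!found)
            seen[count++] = seg;
    }
    return count;
}
```

With a pitch of 8192 bytes, a block width of 16 or more keeps the half-warp in one segment, while a width of 4 spreads it over 4 rows (4 segments) and a width of 1 over 16. If the 2D launch used a narrow block in x, that alone could explain many more memory transactions than the 1D version.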

estra · September 15, 2008, 3:22am

I have run it through the profiler. It's the first time I've used the profiler, so the only useful information I was able to get out of it was that the kernel I am trying to fix takes up 80% of the GPU time.

I don't understand. What do you mean?

On further examination of the problem, reading each of the variables from global memory is not what is causing the timing to go up; it happens when I try to write to global memory, i.e.:

if (idx < (neq + ncons + 1) && idy < neq && idy != pivot) {
    float A = ((float*)((char*)matrix + idy*pitch))[idx];
    float B = rfc[idy];
    float C = pr[idx];
    float equate = A - (B/pivotFactor)*C;

    // IT IS THE 'WRITE' BELOW THAT TAKES 15ms
    ((float*)((char*)matrix + idy*pitch))[idx] = equate;
}

Can anyone see any reason why the write to global memory in this case takes 15 ms?

Ailleur · September 15, 2008, 4:01am

You can't just comment out the write and say that this line is taking 15 ms. When you comment it out, the compiler notices that your kernel isn't doing any work and optimizes it all away.

Run the profiler with all counters enabled (profile-session settings - configuration and check mark those two).
Then run the app, and you'll see gld coalesced and gld uncoalesced.

Those are the columns I'm referring to.

MisterAnderson42 · September 15, 2008, 12:23pm

There is no overhead for swapping threads in CUDA. The hardware is capable of running more than 10,000 threads concurrently.

Your kernel is most certainly (with 100% confidence) limited by the available memory bandwidth. If you aren't getting 70 GiB/s (on 8800GTX / Tesla 800) or 100 GiB/s (on GTX 280), then your memory reads/writes are not coalesced, as has already been mentioned.
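As a rough way to make that comparison, the sketch below estimates effective bandwidth from the figures in the first post. It is plain host-side C, and it rests on assumptions: the whole 2000x1000 matrix is updated, each element costs one 4-byte read and one 4-byte write (traffic for `rowFactorColumn` and `pr` is ignored), and the 3 ms and 16 ms timings cover only the kernel. The function name `effective_gib_per_s` is hypothetical.

```c
/* Effective bandwidth in GiB/s for a kernel that moves
   bytes_per_element bytes for every element of a rows x cols matrix
   in kernel_ms milliseconds. */
double effective_gib_per_s(double rows, double cols,
                           double bytes_per_element, double kernel_ms)
{
    double bytes = rows * cols * bytes_per_element;
    double seconds = kernel_ms / 1000.0;
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0);
}
```

Calling `effective_gib_per_s(1000, 2000, 8, 3)` gives roughly 5 GiB/s and `effective_gib_per_s(1000, 2000, 8, 16)` roughly 0.9 GiB/s. Under these assumptions both versions sit well below the 70 GiB/s figure, which points at memory access patterns and not thread overhead.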
