No, it’s uncoalesced, because all threads in a warp (or half-warp) would be reading the same value. You need to transpose your load so that sequential threads access sequential values. The simplest way to do this is to index based on threadIdx.x.
So instead of looping over the data in every thread, first just do a parallel load of the data, and then loop to do the computation. Something like this pseudocode (following on from your original pseudocode).
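A minimal sketch of that idea (the block size, names, and the placeholder summation are mine, just for illustration):

__global__ void kernel(const float *g_data, float *g_out, int n)
{
    __shared__ float s_data[256];  // one element per thread, assumes blockDim.x == 256

    // Coalesced parallel load: thread k reads element k of this block's chunk,
    // so sequential threads access sequential addresses.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = (i < n) ? g_data[i] : 0.0f;
    __syncthreads();  // wait until the whole buffer is loaded

    // Then loop over the buffered data in shared mem, not global mem.
    float result = 0.0f;
    for (int j = 0; j < 256; ++j)
        result += s_data[j];  // placeholder computation

    if (i < n)
        g_out[i] = result;
}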
Hi, I implemented Mark’s shared mem buffering and it speeds things up by 15-16x.
I used only 1 KB of shared mem for this, so there is enough space left for other things I will need in shared mem. BTW: buffer sizes of 2/4/8 KB didn't improve things.
But there is still a memory bandwidth overuse by a factor of the multiprocessor count (in my case 12), as all multiprocessors read the same values into their own shared mem.
So I'm still curious whether texture caching could improve the bandwidth usage.
G80 is a real number cruncher. It looks like CUDA on an 8800 GTS does things about 12 times faster than the best known CPU implementations (on a Xeon 3.2 GHz). And there isn't very much optimized in my code yet.
Now I've tried texture caching instead of the shared mem buffering. It's 3x slower (1D, 1-component float texture in linear space; perhaps array mem textures will do better).
But loading the shared mem buffer from this linear texture did give a slight improvement of 20%.
The texture cache is optimized for 2D locality. A large 1D texture will likely incur more cache misses than a 2D texture would, provided the accesses across threads can be made to exhibit 2D locality.
If this is the case and you had, say, a 1024-element 1D texture, it might be better to store it as a 32x32 2D texture and change your indexing from:
texture(index)
to
texture(index & 31, index >> 5);
(Note I used & and >> instead of % and / to ensure fast performance, since integer modulo and division require several cycles.)
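As a concrete sketch of that indexing (using the legacy texture reference API of that CUDA generation; the names tex, kernel, h_data, and the host-side setup are my assumptions, not from this thread):

texture<float, 2, cudaReadModeElementType> tex;

__global__ void kernel(float *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        // index & 31 == index % 32 (column), index >> 5 == index / 32 (row);
        // +0.5f addresses the texel center with unnormalized coordinates.
        out[index] = tex2D(tex, (index & 31) + 0.5f, (index >> 5) + 0.5f);
    }
}

// Host side: copy the 1024 floats into a 32x32 cudaArray and bind it.
// h_data is a hypothetical host array of 1024 floats.
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *arr;
cudaMallocArray(&arr, &desc, 32, 32);
cudaMemcpyToArray(arr, 0, 0, h_data, 1024 * sizeof(float), cudaMemcpyHostToDevice);
cudaBindTextureToArray(tex, arr, desc);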