Warp Serialisation and Constant Memory Performance Surprise

I have a kernel whose job is to form an element-by-element product of a lot of matrices. In outline:

[codebox]#include <cuComplex.h>   // for cuFloatComplex and cuCmulf()

__global__ void MakeFG( const cuFloatComplex *Gs, const cuFloatComplex *F, cuFloatComplex *FGs ) {
	// Kernel to form the FG products
	// Assumes that one block is launched for each of the nGs arrays in G
	// Gs is actually G[nGs][ARRAY_SIZE], flattened out for the GPU
	// Also assumes that the number of threads per block divides ARRAY_SIZE
	const unsigned int tx = threadIdx.x;
	const unsigned int bx = blockIdx.x;
	const unsigned int by = blockIdx.y;

	// Location of first element of the array worked on by this block
	const unsigned int iCompare = ARRAY_SIZE*(bx+(by*BLOCKS_PER_ROW));

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		// cuFloatComplex has no operator*, so use cuCmulf() for the complex product
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
	}
}

// Launch configuration
threads.x = kFGProductKernelSize;
threads.y = threads.z = 1;
grid.x = BLOCKS_PER_ROW;
grid.y = nGs / BLOCKS_PER_ROW;
grid.z = 1;

MakeFG<<<grid,threads>>>( d_Gs, d_F, d_FGs );
[/codebox]

where [font=“Courier New”]ARRAY_SIZE[/font] is 1024, [font=“Courier New”]kFGProductKernelSize[/font] is 256, [font=“Courier New”]BLOCKS_PER_ROW[/font] is 1024 and [font=“Courier New”]nGs[/font] is 131072. Note that with this setup, [font=“Courier New”]ARRAY_SIZE/blockDim.x[/font] will always be an integer. This kernel runs in about 40 ms.

Noting that the same [font=“Courier New”]F[ARRAY_SIZE][/font] is used by all the blocks, I thought I’d put this into constant memory on the device, [font=“Courier New”]__constant__ cuFloatComplex dc_F[ARRAY_SIZE][/font]. The kernel then became

[codebox]__global__ void MakeFGWithConstant( const cuFloatComplex *Gs, cuFloatComplex *FGs ) {
	// As MakeFG, but reads F from the __constant__ array dc_F
	// (filled from the host with cudaMemcpyToSymbol before the launch)
	const unsigned int tx = threadIdx.x;
	const unsigned int bx = blockIdx.x;
	const unsigned int by = blockIdx.y;

	const unsigned int iCompare = ARRAY_SIZE*(bx+(by*BLOCKS_PER_ROW));

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( dc_F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
	}
}[/codebox]

To my surprise, this kernel takes about 80ms to run, and according to [font=“Courier New”]cudaprof[/font], a lot of warps are getting serialised.

What’s going on here? Is access to constant memory always serialised? I had really expected using constant memory to make things faster, since the entire [font=“Courier New”]dc_F[/font] array should have been cached on each multiprocessor after the completion of the first block on that multiprocessor.

Constant memory is cached, but it is designed for all threads in a warp to be reading the same element. If different threads access different elements, it takes an extra clock for each new address, even if the values are in cache.
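To make that concrete, here is a sketch of the two access patterns (the table name and kernel names are made up for illustration). The first read is broadcast to the whole warp in one transaction; the second costs one transaction per distinct address, even on a cache hit:

[codebox]__constant__ float c_table[256];	// hypothetical look-up table

// Broadcast-friendly: every thread in the warp reads the SAME element,
// so the constant cache serves the whole warp in one go
__global__ void UniformRead( float *out, int i ) {
	out[threadIdx.x] = c_table[i];	// index is uniform across the warp
}

// Serialized: each thread reads a DIFFERENT element, so the hardware
// issues a separate constant-cache transaction for each new address
__global__ void PerThreadRead( float *out ) {
	out[threadIdx.x] = c_table[threadIdx.x];	// index varies per thread
}[/codebox]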

Texture reads are more versatile for this kind of access. But those are serialized too if your queries are not local. Your accesses traverse right down the array so it should be fine.

Shared memory can also be used (a kind of manual cache). That does have bank conflicts, but those are usually very minor since you have 16 banks.

So three flavors of lookup memory, each designed for different access patterns. It’s not hard to try all three.

That would do it :(

But the shared memory only has the lifetime of a block, and (I’m told) the texture cache only has the lifetime of a warp. Since each block only uses each entry in F once, I don’t see what value those would have here. The reads and writes are already fully coalesced. I’d been hoping the constant cache would at least have the lifetime of a kernel.

Shared memory does indeed have block-long lifetime. So you have the overhead (and coding annoyance) of having to initialize the array at the start of each block. However, shared memory is FAST so the payoff might be worth it. Certainly worth trying, it’s only a few lines of code.

This is likely the fastest table method, even with the setup overhead.
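For reference, a minimal sketch of that shared-memory variant (the kernel name is just illustrative; 1024 cuFloatComplex values are 8 kB, so the staging array fits comfortably in the 16 kB of shared memory per multiprocessor):

[codebox]__global__ void MakeFGWithShared( const cuFloatComplex *Gs, const cuFloatComplex *F, cuFloatComplex *FGs ) {
	// Manually cache F in shared memory, once per block
	__shared__ cuFloatComplex s_F[ARRAY_SIZE];

	const unsigned int tx = threadIdx.x;
	const unsigned int iCompare = ARRAY_SIZE*(blockIdx.x+(blockIdx.y*BLOCKS_PER_ROW));

	// Cooperative, coalesced load of F
	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		s_F[tx+(i*blockDim.x)] = F[tx+(i*blockDim.x)];
	}
	__syncthreads();	// s_F must be fully populated before anyone reads it

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( s_F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
	}
}[/codebox]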

Texture memory persists beyond warps and indeed beyond blocks. It does have kernel lifetime. It’s a cache (actually a set of caches…), very similar to a CPU L1 cache.

Unlike shared memory, there will be latency when reading it. This will likely be hidden as long as you have enough threads running. The texture caches eliminate most of the bandwidth issues. Will textures be faster than constants? It’s hard to tell but it’s trivial to try.
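If you do try textures, the (G80/G200-era) texture reference route would look something like this sketch; texF and the kernel name are invented, and d_F is the existing device copy of F:

[codebox]// cuFloatComplex is a float2 under the hood, so bind as float2
texture<float2, 1, cudaReadModeElementType> texF;

__global__ void MakeFGWithTexture( const cuFloatComplex *Gs, cuFloatComplex *FGs ) {
	const unsigned int tx = threadIdx.x;
	const unsigned int iCompare = ARRAY_SIZE*(blockIdx.x+(blockIdx.y*BLOCKS_PER_ROW));

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		// Read F through the texture cache rather than plain global memory
		const float2 f = tex1Dfetch( texF, tx+(i*blockDim.x) );
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( make_cuFloatComplex(f.x, f.y), Gs[tx+(i*blockDim.x)+iCompare] );
	}
}

// Host side, before the launch:
cudaBindTexture( 0, texF, d_F, ARRAY_SIZE*sizeof(cuFloatComplex) );[/codebox]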

Your thought that “texture reads only last one warp” isn’t true… but perhaps you’re thinking of coalescing, which on G200 is sort of like a one-warp, one-clock cache since G200 allows shuffling of the read values.

That particular lightbulb went off just as I got home yesterday :">

Is that in the manual? At NVISION, I was told that the texture cache only had a warp lifetime - so if you were simply using it as a look up table, then it would offer no advantages on a G200 (on a G80, you would still get some coalescing benefit, if your reads were out of order - which they aren’t here).

I’ll be trying the various methods today, and I’ll see what the results are.

Often, I say that the cache has only a warp lifetime, but it is technically only true in cases where you are reading from a very large array and all warps running on an MP are likely to access different values. You can get this result by assuming each warp is likely to access an independent set of float4’s: 1024 threads * 16 bytes = 16kiB, which is larger than the 8kiB cache per MP. Thus if all warps on the MP are reading varied values then the cache is likely flushed before the scheduler gets back to the first warp. However, if all the warps on the MP have nice local accesses, those values will stay in the cache.

And the statement that the textures offer no advantages over the improved coalescing on G200 is completely false: I’ve run benchmarks to answer this before in HOOMD. On G80, textures are 5x faster than straight semi-random global memory reads. On G200 the gap is narrowed, but textures are still “only” 3x faster than semi-random global memory reads.

OK - thanks for the clarification. Hopefully by the end of today, I’ll have numbers for the case where the read is fully coalesced.

FWIW, I’ve already established that using shared memory to cache F, and applying it to several G arrays makes things about 30% faster.
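In case it’s useful to anyone else, the shape of that kernel is roughly the following (GS_PER_BLOCK and the kernel name are placeholders, and the grid then needs nGs/GS_PER_BLOCK blocks rather than nGs):

[codebox]#define GS_PER_BLOCK 4	// hypothetical: G arrays handled per block

__global__ void MakeFGSharedMulti( const cuFloatComplex *Gs, const cuFloatComplex *F, cuFloatComplex *FGs ) {
	__shared__ cuFloatComplex s_F[ARRAY_SIZE];
	const unsigned int tx = threadIdx.x;

	// Pay for the shared-memory load of F once...
	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		s_F[tx+(i*blockDim.x)] = F[tx+(i*blockDim.x)];
	}
	__syncthreads();

	// ...and amortise it over GS_PER_BLOCK consecutive G arrays
	for( unsigned int g=0; g<GS_PER_BLOCK; g++ ) {
		const unsigned int iCompare = ARRAY_SIZE*( g + GS_PER_BLOCK*(blockIdx.x+(blockIdx.y*BLOCKS_PER_ROW)) );
		for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
			FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( s_F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
		}
	}
}[/codebox]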

I’ve now written an extra kernel, which uses texture memory to access the array F. With 16384 G arrays (each of length 1024), I find the following times:
Simple : 4.98969 ms
Shared : 3.40253 ms
Shared2: 3.51175 ms
Texture: 3.23071 ms
Where ‘Simple’ is the kernel shown above (without constant memory), ‘Shared’ and ‘Shared2’ use shared memory in slightly different ways to cache F for multiple G arrays in a single block, and ‘Texture’ puts F into texture memory. So, using texture memory does offer some advantage for a look up table, even when all the accesses are strictly coalesced.

Thanks for the help :)