Warp Serialisation and Constant Memory Performance Surprise

I have a kernel whose job is to form an element-by-element product of a lot of matrices. In outline:

[codebox]#include <cuComplex.h>   // for cuFloatComplex and cuCmulf()

__global__ void MakeFG( const cuFloatComplex *Gs, const cuFloatComplex *F, cuFloatComplex *FGs ) {
	// Kernel to form the FG products
	// Assumes that one block is launched for each of the nGs arrays in G
	// Gs is actually G[nGs][ARRAY_SIZE], flattened out for the GPU
	// Also assumes that the number of threads per block divides ARRAY_SIZE
	const unsigned int tx = threadIdx.x;
	const unsigned int bx = blockIdx.x;
	const unsigned int by = blockIdx.y;

	// Location of first element of the array worked on by this block
	const unsigned int iCompare = ARRAY_SIZE*(bx+(by*BLOCKS_PER_ROW));

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		// cuFloatComplex has no operator*, so use cuCmulf() for the complex product
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
	}
}

// Launch configuration
threads.x = kFGProductKernelSize;
threads.y = threads.z = 1;
grid.x = BLOCKS_PER_ROW;
grid.y = nGs / BLOCKS_PER_ROW;
grid.z = 1;

MakeFG<<<grid,threads>>>( d_Gs, d_F, d_FGs );
[/codebox]

where [font=“Courier New”]ARRAY_SIZE[/font] is 1024, [font=“Courier New”]kFGProductKernelSize[/font] is 256, [font=“Courier New”]BLOCKS_PER_ROW[/font] is 1024 and [font=“Courier New”]nGs[/font] is 131072. Note that with this setup, [font=“Courier New”]ARRAY_SIZE/blockDim.x[/font] will always be an integer. This kernel runs in about 40 ms.

Noting that the same [font=“Courier New”]F[ARRAY_SIZE][/font] is used by all the blocks, I thought I’d put this into constant memory on the device, [font=“Courier New”]__constant__ cuFloatComplex dc_F[ARRAY_SIZE][/font]. The kernel then became

[codebox]__global__ void MakeFGWithConstant( const cuFloatComplex *Gs, cuFloatComplex *FGs ) {
	// As MakeFG, but reads F from the __constant__ array dc_F
	// (filled from the host with cudaMemcpyToSymbol before the launch)
	const unsigned int tx = threadIdx.x;
	const unsigned int bx = blockIdx.x;
	const unsigned int by = blockIdx.y;

	const unsigned int iCompare = ARRAY_SIZE*(bx+(by*BLOCKS_PER_ROW));

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( dc_F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
	}
}[/codebox]

To my surprise, this kernel takes about 80ms to run, and according to [font=“Courier New”]cudaprof[/font], a lot of warps are getting serialised.

What’s going on here? Is access to constant memory always serialised? I had really expected using constant memory to make things faster, since the entire [font=“Courier New”]dc_F[/font] array should have been cached on each multiprocessor after the completion of the first block on that multiprocessor.

Constant memory is cached, but it is designed for all threads in a warp to be reading the same element. If different threads access different elements, it takes an extra clock for each new address, even if the values are in cache.
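To make that concrete, here is a sketch of the two access patterns (the table name and kernel names are made up for illustration). The first read is broadcast to the whole warp in one transaction; the second costs one transaction per distinct address, even on a cache hit:

[codebox]__constant__ float c_table[256];	// hypothetical look-up table

// Broadcast-friendly: every thread in the warp reads the SAME element,
// so the constant cache serves the whole warp in one go
__global__ void UniformRead( float *out, int i ) {
	out[threadIdx.x] = c_table[i];	// index is uniform across the warp
}

// Serialized: each thread reads a DIFFERENT element, so the hardware
// issues a separate constant-cache transaction for each new address
__global__ void PerThreadRead( float *out ) {
	out[threadIdx.x] = c_table[threadIdx.x];	// index varies per thread
}[/codebox]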

Texture reads are more versatile for this kind of access. But those are serialized too if your queries are not local. Your accesses traverse right down the array so it should be fine.

Shared memory can also be used (a kind of manual cache). That does have bank conflicts, but those are usually very minor since you have 16 banks.

So three flavors of lookup memory, each designed for different access patterns. It’s not hard to try all three.

That would do it :(

But the shared memory only has the lifetime of a block, and (I’m told) the texture cache only has the lifetime of a warp. Since each block only uses each entry in F once, I don’t see what value those would have here. The reads and writes are already fully coalesced. I’d been hoping the constant cache would at least have the lifetime of a kernel.

Shared memory does indeed have block-long lifetime. So you have the overhead (and coding annoyance) of having to initialize the array at the start of each block. However, shared memory is FAST so the payoff might be worth it. Certainly worth trying, it’s only a few lines of code.

This is likely the fastest table method, even with the setup overhead.
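For reference, a minimal sketch of that shared-memory variant (the kernel name is just illustrative; 1024 cuFloatComplex values are 8 kB, so the staging array fits comfortably in the 16 kB of shared memory per multiprocessor):

[codebox]__global__ void MakeFGWithShared( const cuFloatComplex *Gs, const cuFloatComplex *F, cuFloatComplex *FGs ) {
	// Manually cache F in shared memory, once per block
	__shared__ cuFloatComplex s_F[ARRAY_SIZE];

	const unsigned int tx = threadIdx.x;
	const unsigned int iCompare = ARRAY_SIZE*(blockIdx.x+(blockIdx.y*BLOCKS_PER_ROW));

	// Cooperative, coalesced load of F
	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		s_F[tx+(i*blockDim.x)] = F[tx+(i*blockDim.x)];
	}
	__syncthreads();	// s_F must be fully populated before anyone reads it

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( s_F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
	}
}[/codebox]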

Texture memory persists beyond warps and indeed beyond blocks. It does have kernel lifetime. It’s a cache (actually a set of caches…), very similar to a CPU L1 cache.

Unlike shared memory, there will be latency when reading it. This will likely be hidden as long as you have enough threads running. The texture caches eliminate most of the bandwidth issues. Will textures be faster than constants? It’s hard to tell but it’s trivial to try.
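If you do try textures, the (G80/G200-era) texture reference route would look something like this sketch; texF and the kernel name are invented, and d_F is the existing device copy of F:

[codebox]// cuFloatComplex is a float2 under the hood, so bind as float2
texture<float2, 1, cudaReadModeElementType> texF;

__global__ void MakeFGWithTexture( const cuFloatComplex *Gs, cuFloatComplex *FGs ) {
	const unsigned int tx = threadIdx.x;
	const unsigned int iCompare = ARRAY_SIZE*(blockIdx.x+(blockIdx.y*BLOCKS_PER_ROW));

	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		// Read F through the texture cache rather than plain global memory
		const float2 f = tex1Dfetch( texF, tx+(i*blockDim.x) );
		FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( make_cuFloatComplex(f.x, f.y), Gs[tx+(i*blockDim.x)+iCompare] );
	}
}

// Host side, before the launch:
cudaBindTexture( 0, texF, d_F, ARRAY_SIZE*sizeof(cuFloatComplex) );[/codebox]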

Your thought that “texture reads only last one warp” isn’t true… but perhaps you’re thinking of coalescing, which on G200 is sort of like a one-warp, one-clock cache since G200 allows shuffling of the read values.

That particular lightbulb went off just as I got home yesterday :">

Is that in the manual? At NVISION, I was told that the texture cache only had a warp lifetime - so if you were simply using it as a look up table, then it would offer no advantages on a G200 (on a G80, you would still get some coalescing benefit, if your reads were out of order - which they aren’t here).

I’ll be trying the various methods today, and I’ll see what the results are.

Often, I say that the cache has only a warp lifetime, but it is technically only true in cases where you are reading from a very large array and all warps running on an MP are likely to access different values. You can get this result by assuming each warp is likely to access an independent set of float4’s: 1024 threads * 16 bytes = 16kiB, which is larger than the 8kiB cache per MP. Thus if all warps on the MP are reading varied values then the cache is likely flushed before the scheduler gets back to the first warp. However, if all the warps on the MP have nice local accesses, those values will stay in the cache.

And the statement that the textures offer no advantages over the improved coalescing on G200 is completely false: I’ve run benchmarks to answer this before in HOOMD. On G80, textures are 5x faster than straight semi-random global memory reads. On G200 the gap is narrowed, but textures are still “only” 3x faster than semi-random global memory reads.

OK - thanks for the clarification. Hopefully by the end of today, I’ll have numbers for the case where the read is fully coalesced.

FWIW, I’ve already established that using shared memory to cache F, and applying it to several G arrays makes things about 30% faster.
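In case it’s useful to anyone else, the shape of that kernel is roughly the following (GS_PER_BLOCK and the kernel name are placeholders, and the grid then needs nGs/GS_PER_BLOCK blocks rather than nGs):

[codebox]#define GS_PER_BLOCK 4	// hypothetical: G arrays handled per block

__global__ void MakeFGSharedMulti( const cuFloatComplex *Gs, const cuFloatComplex *F, cuFloatComplex *FGs ) {
	__shared__ cuFloatComplex s_F[ARRAY_SIZE];
	const unsigned int tx = threadIdx.x;

	// Pay for the shared-memory load of F once...
	for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
		s_F[tx+(i*blockDim.x)] = F[tx+(i*blockDim.x)];
	}
	__syncthreads();

	// ...and amortise it over GS_PER_BLOCK consecutive G arrays
	for( unsigned int g=0; g<GS_PER_BLOCK; g++ ) {
		const unsigned int iCompare = ARRAY_SIZE*( g + GS_PER_BLOCK*(blockIdx.x+(blockIdx.y*BLOCKS_PER_ROW)) );
		for( unsigned int i=0; i<(ARRAY_SIZE/blockDim.x); i++ ) {
			FGs[tx+(i*blockDim.x)+iCompare] = cuCmulf( s_F[tx+(i*blockDim.x)], Gs[tx+(i*blockDim.x)+iCompare] );
		}
	}
}[/codebox]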

I’ve now written an extra kernel, which uses texture memory to access the array F. With 16384 G arrays (each of length 1024), I find the following times:
Simple : 4.98969 ms
Shared : 3.40253 ms
Shared2: 3.51175 ms
Texture: 3.23071 ms
Where ‘Simple’ is the kernel shown above (without constant memory), ‘Shared’ and ‘Shared2’ use shared memory in slightly different ways to cache F for multiple G arrays in a single block, and ‘Texture’ puts F into texture memory. So, using texture memory does offer some advantage for a look up table, even when all the accesses are strictly coalesced.

Thanks for the help :)