I have a few basic questions about memory accesses, given the following situation:
Suppose I start on the host with a struct, that according to sizeof(), is 600 bytes big. I cudaMemcpy() it to the device, and want all threads to run an algorithm while treating the struct as a data source (now in shared memory). Basic math tells me that I can have 27 copies of the struct per block, since there’s 16384 bytes of shared memory per block.
Is my analysis correct? If so, is it possible to prevent threads in a block from fighting over access to the single struct? How can I correctly make copies of it, so the threads have alternate structs to efficiently read from?
The thing about shared memory is, that it is shared. So you shouldn’t need more than one copy of your struct in one MP.
If you can make sure, that all threads will be reading the same data at the same cycle, you can use the broadcasting mechanism of the shared memory so all the threads will get their data simultaneously.
Other option is to find an appropriate access pattern to hte shared memory to avoid bank conflicts. Lots of examples to that topic are discussed in the programming guide or more explanative here in the forum.
If you do however have some kind of “random” access pattern, even copying the struct won’t avoid bank conflicts. One nasty option would be to rearrange 16 copies of your struct so that every bank holds one full copy. Now you only have to calculate addresses so that every thread in a half warp accesses the same bank every data fetch. But this only works with 32-bit data types and is therefore not really recommended.
Bank conflicts should be one of the last things you optimize for, if you ever optimize for them at all. You should be much more concerned about carefully loading that struct into shared memory with coalesced reads and making sure that whatever output you write is also coalesced.
Right. You stop threads from “fighting over access” by keeping mind of bank conflicts. The banks aren’t layed out that bank1 is 0-127 bytes, bank 2 is 128-255. Instead, banks are layed out so bank 1 is bytes 0, 16, 32, 48, etc, and bank 2 is 1, 17, 33, 49.
VrahoK’s suggestion is really interesting. If you access the blocks randomly and the threads really do interfere with each other (make sure this is the case), arranging 16 copies of your structure, each in its own bank, would be a great solution. To simplify doing the interlacing, use the offsetof() macro. Then you can set up another macro and automate the whole process. Something like:
Thanks for the replies guys… these are great ideas. Sorry I haven’t been able to see this until now; I was out of town on business.
True, I may be getting a bit ahead of myself by worrying about banks. I’ve read through the manual and am trying to understand how coalescing works. This is a stripped down version of the code:
typedef struct {
// .... filled with stuff that makes it 600 bytes big
} World;
__global__ void myKernel( World * inWorld , float3 * result ) {
__shared__ World myWorld = *inWorld;
// run algorithm by reading "myWorld", and store results in "result"
}
Am I putting the struct into shared memory correctly? If so, how do I go about “carefully loading that struct into shared memory with coalesced reads” ?
Well, you need to follow all the normal rules that apply to coalescing: getting the correct warp base address multiples and each thread indexing 4,8, or 16 byte elements contiguously.
Here is how I load a matrix of values efficiently into shared memory in one of my kernels:
// read in the coefficients
extern __shared__ float2 s_coeffs[];
for (int cur_offset = 0; cur_offset < coeff_width*coeff_width; cur_offset += blockDim.x)
{
if (cur_offset + threadIdx.x < coeff_width*coeff_width)
s_coeffs[cur_offset + threadIdx.x] = d_coeffs[cur_offset + threadIdx.x];
}
__syncthreads();
You can load your World struct in a similar way, with some pointer casting ugliness to get each thread loading either 4,8, or 16 bytes of it at a time.
Right, you need to cast your struct as an array of ints,* and then use coalescing techniques shown in the Guide to fill that array cooperatively using all your threads.
*(Is there any chance that casting random data as floats will not work? I’m thinking that data can happen to appear as NaNs/denormals/etc and get altered during a copy. I can’t remember though if that will happen for certain.)
I would not say that you shouldn’t optimize for bank conflicts or that coalescing is undoubtedly more important. Probably it’s true that on average coalescing is more important, but it’s trivial to imagine a shared mem array/struct that gets loaded once and then reused constantly. In that case, whether that initial load is coalesced would hardly matter, but bank conflicts would. Bank conflicts can kill performance many fold, just like poor coalescing.
Another thing to consider benchmarking is using constant memory for this struct. While stored in the slow global memory, constant memory is cached on the multiprocessors and is optimized specifically for broadcasting one value to many threads. The cache means that you don’t have to do pointer tricks to load it efficiently.
Note that if you aren’t going to have every thread in the warp read the same value, then the shared memory will be better.