Quick memory access question. Threads fighting over a data source?

koraxen · October 14, 2008, 3:04am

Hi guys,

I have a few basic questions about memory accesses, given the following situation:

Suppose I start on the host with a struct, that according to sizeof(), is 600 bytes big. I cudaMemcpy() it to the device, and want all threads to run an algorithm while treating the struct as a data source (now in shared memory). Basic math tells me that I can have 27 copies of the struct per block, since there’s 16384 bytes of shared memory per block.

Is my analysis correct? If so, is it possible to prevent threads in a block from fighting over access to the single struct? How can I correctly make copies of it, so the threads have alternate structs to efficiently read from?

All the best,

Kor

VrahoK · October 14, 2008, 9:58am

The thing about shared memory is, that it is shared. So you shouldn’t need more than one copy of your struct in one MP.

If you can make sure, that all threads will be reading the same data at the same cycle, you can use the broadcasting mechanism of the shared memory so all the threads will get their data simultaneously.

Other option is to find an appropriate access pattern to hte shared memory to avoid bank conflicts. Lots of examples to that topic are discussed in the programming guide or more explanative here in the forum.

If you do however have some kind of “random” access pattern, even copying the struct won’t avoid bank conflicts. One nasty option would be to rearrange 16 copies of your struct so that every bank holds one full copy. Now you only have to calculate addresses so that every thread in a half warp accesses the same bank every data fetch. But this only works with 32-bit data types and is therefore not really recommended.

MisterAnderson42 · October 14, 2008, 11:41am

Bank conflicts should be one of the last things you optimize for, if you ever optimize for them at all. You should be much more concerned about carefully loading that struct into shared memory with coalesced reads and making sure that whatever output you write is also coalesced.

alex_dubinsky · October 14, 2008, 4:58pm

Right. You stop threads from “fighting over access” by keeping mind of bank conflicts. The banks aren’t layed out that bank1 is 0-127 bytes, bank 2 is 128-255. Instead, banks are layed out so bank 1 is bytes 0, 16, 32, 48, etc, and bank 2 is 1, 17, 33, 49.

VrahoK’s suggestion is really interesting. If you access the blocks randomly and the threads really do interfere with each other (make sure this is the case), arranging 16 copies of your structure, each in its own bank, would be a great solution. To simplify doing the interlacing, use the offsetof() macro. Then you can set up another macro and automate the whole process. Something like:

struct MY_STRUCTURE {x,y,z};

define ACCESS_MY_STRUCTURE(member) smem[offsetof(struct MY_STRUCTURE, member)*16 + threadIdx.x%16]

int a = ACCESS_MY_STRUCTURE(z);

(Although… I wonder if you’d save more cycles on bank conflicts than you’d lose on doing the indexing)

cbuchner1 · October 15, 2008, 9:27am

NOT bytes, it’s 32 bit ints !

alex_dubinsky · October 15, 2008, 3:06pm

oops

koraxen · October 19, 2008, 6:09pm

Thanks for the replies guys… these are great ideas. Sorry I haven’t been able to see this until now; I was out of town on business.

True, I may be getting a bit ahead of myself by worrying about banks. I’ve read through the manual and am trying to understand how coalescing works. This is a stripped down version of the code:

typedef struct {

    // .... filled with stuff that makes it 600 bytes big

} World;

__global__ void myKernel( World * inWorld , float3 * result ) {

  

    __shared__ World myWorld = *inWorld;

   // run algorithm by reading "myWorld", and store results in "result"

}

Am I putting the struct into shared memory correctly? If so, how do I go about “carefully loading that struct into shared memory with coalesced reads” ?

MisterAnderson42 · October 19, 2008, 7:05pm

Well, you need to follow all the normal rules that apply to coalescing: getting the correct warp base address multiples and each thread indexing 4,8, or 16 byte elements contiguously.

Here is how I load a matrix of values efficiently into shared memory in one of my kernels:

// read in the coefficients

	extern __shared__ float2 s_coeffs[];

	for (int cur_offset = 0; cur_offset < coeff_width*coeff_width; cur_offset += blockDim.x)

  {

  if (cur_offset + threadIdx.x < coeff_width*coeff_width)

  	s_coeffs[cur_offset + threadIdx.x] = d_coeffs[cur_offset + threadIdx.x];

  }

	__syncthreads();

You can load your World struct in a similar way, with some pointer casting ugliness to get each thread loading either 4,8, or 16 bytes of it at a time.

alex_dubinsky · October 20, 2008, 1:45am

Right, you need to cast your struct as an array of ints,* and then use coalescing techniques shown in the Guide to fill that array cooperatively using all your threads.

*(Is there any chance that casting random data as floats will not work? I’m thinking that data can happen to appear as NaNs/denormals/etc and get altered during a copy. I can’t remember though if that will happen for certain.)

I would not say that you shouldn’t optimize for bank conflicts or that coalescing is undoubtedly more important. Probably it’s true that on average coalescing is more important, but it’s trivial to imagine a shared mem array/struct that gets loaded once and then reused constantly. In that case, whether that initial load is coalesced would hardly matter, but bank conflicts would. Bank conflicts can kill performance many fold, just like poor coalescing.

seibert · October 20, 2008, 3:52am

Another thing to consider benchmarking is using constant memory for this struct. While stored in the slow global memory, constant memory is cached on the multiprocessors and is optimized specifically for broadcasting one value to many threads. The cache means that you don’t have to do pointer tricks to load it efficiently.

Note that if you aren’t going to have every thread in the warp read the same value, then the shared memory will be better.

Topic		Replies	Views
Shared memory question CUDA Programming and Performance	27	7322	June 23, 2008
Another question about coalesced reads/writes CUDA Programming and Performance	10	2130	August 18, 2009
questions about coalescing access coalescing access CUDA Programming and Performance	8	1971	November 13, 2009
Using Shared Memory in CUDA C/C++ Technical Blog	36	1991	October 8, 2020
char global memory access optimization CUDA Programming and Performance	17	11872	May 31, 2010
Practical rules for coalesced memory access ? CUDA Programming and Performance	4	5543	September 13, 2008
Global memory optimisation CUDA Programming and Performance	6	871	February 4, 2015
Bytes in shared memory CUDA Programming and Performance	8	3035	April 19, 2017
Registers and Shared Memory question CUDA Programming and Performance	7	5452	September 10, 2007
Memory coalescing in one thread CUDA Programming and Performance	17	16599	March 31, 2011

Quick memory access question. Threads fighting over a data source?

Related topics