sdata[threadIdx.x] = globalmem[some_offset + threadIdx.x];
// after a __syncthreads(), any thread can use any element of sdata...
Of course, depending on the size of the block of data you want to read versus your thread block size, you may need more than one load per thread. If you need a smaller block of memory read in, just guard the read with an if (threadIdx.x < memory_size).
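For instance, a guarded load could be sketched like this (the kernel and variable names here are placeholders of my own, assuming memory_size is at most the block size):

```cuda
// Hypothetical sketch of a guarded shared-memory load: only the first
// memory_size threads perform a read, so nothing past the end of the
// data is touched.
__global__ void guarded_load(const float *globalmem, int memory_size,
                             int some_offset)
{
    __shared__ float sdata[256];  // assumed: blockDim.x <= 256

    if (threadIdx.x < memory_size)
        sdata[threadIdx.x] = globalmem[some_offset + threadIdx.x];

    // Every thread, loading or not, must reach the barrier.
    __syncthreads();

    // ... use sdata[0..memory_size-1] in all threads ...
}
```

Note that the __syncthreads() sits outside the if(), since all threads in the block have to reach the barrier.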
So that means the first thread has to copy the chunk for itself and the rest of the threads in the block? And since only one function is run (just on many threads), that means you somehow have to keep track of whether you made a copy or not, and need to use the atomic functions for basic locking?
It makes sense; it just sounds kind of complicated for what is a simple task that must be done very often.
No, as MisterAnderson showed you, each thread can copy a different element into shared memory. After the __syncthreads(), all the threads have completed their copies and you can safely use the shared memory array.
Nothing really prevents you from having only one thread perform the reads to fill all of the shared memory. It will not be a very efficient way to do it, however.
The idea behind having all threads in the block participate in the read is to keep every thread busy doing its own small part of the work. The GPU gets its insane performance by being able to run 10 thousand threads all at once in an interleaved fashion. If you serialize this in any way, even down to one thread per block, you prevent that interleaving from happening and essentially run 1 thread on a relatively slow processor.
That and all threads should participate to get the best benefit from coalesced memory reads.
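Putting those pieces together, here is a minimal sketch of the whole pattern (the kernel name, TILE, and the trivial computation are my own placeholders, not anything from the original example):

```cuda
#define TILE 256  // assumed thread block size

// Hypothetical kernel: each thread stages one element of global memory
// into shared memory (a coalesced read, since consecutive threads touch
// consecutive addresses), then the whole block works on the tile.
__global__ void scale_tile(const float *globalmem, float *out, float factor)
{
    __shared__ float sdata[TILE];

    int some_offset = blockIdx.x * blockDim.x;

    // Cooperative load: thread i copies element i.
    sdata[threadIdx.x] = globalmem[some_offset + threadIdx.x];

    // Wait until every thread in the block has finished its copy.
    __syncthreads();

    // Now any thread may safely read any element of sdata; as a trivial
    // demonstration, each thread reads an element loaded by a different
    // thread.
    out[some_offset + threadIdx.x] = sdata[TILE - 1 - threadIdx.x] * factor;
}
```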
Let's say globalmem is huge, much bigger than any block's shared memory. If you set a local pointer equal to it, will the __syncthreads call be smart enough to only sync/copy over the max shared size? Will I be safe as long as I don't index past the limit myself in any code that follows?
Sorry, those parts in the book aren't too clear to me. I'm also new to parallel algorithms in general.
Well, you have full control over exactly how many elements are copied. In the really simple example I gave, I loaded a number of elements equal to the block size. You do have to limit yourself to not index past the end of your shared/global arrays, so if you are loading an odd size (not a multiple of the block size) you will need an if() to prevent loading past the end.
You have complete control over how many reads are done and in what pattern. For instance, if you need to load 2x the block size in elements, you would add another memory copy, something like: sdata[threadIdx.x + blockDim.x] = globalmem[some_offset + threadIdx.x + blockDim.x]; Or, if you are loading a really large block of global mem into shared, you could use a for loop.
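The for-loop version could be sketched like this (NELEMENTS and the names are placeholders I'm assuming for illustration; the guard in the loop also handles tile sizes that aren't a multiple of the block size):

```cuda
#define NELEMENTS 1000  // assumed shared tile size; not a multiple of blockDim.x

// Hypothetical sketch: load a shared tile larger than the thread block
// using a strided for loop. Each thread loads elements blockDim.x apart,
// and the loop condition keeps every thread from indexing past the end.
__global__ void load_large_tile(const float *globalmem, float *out)
{
    __shared__ float sdata[NELEMENTS];

    int some_offset = blockIdx.x * NELEMENTS;

    for (int i = threadIdx.x; i < NELEMENTS; i += blockDim.x)
        sdata[i] = globalmem[some_offset + i];

    __syncthreads();  // all loads complete before anyone uses sdata

    // ... process the tile; trivial copy-out shown here ...
    for (int i = threadIdx.x; i < NELEMENTS; i += blockDim.x)
        out[some_offset + i] = sdata[i];
}
```

Each pass of the loop is still a coalesced read, since consecutive threads touch consecutive addresses.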
__syncthreads doesn't perform any of the loading. __syncthreads just says to all threads in the block: "wait until all threads get here". You need the __syncthreads to prevent some threads from trying to process data that hasn't yet been loaded by another thread.
These are all really simple 1D examples I’m giving here. Have a look at the matrix mul sample in the programming guide for a really good example of a more complicated shared memory usage pattern. Or, if you are interested: I could post a full kernel as an example of the simple usage I’ve demonstrated here.