copying to shared block mem


Is it possible to copy a chunk of data from global memory once and have all of a block's threads access it, or does each thread have to act independently and grab its own copy?

I'm not sure, but it looks like you only get to copy data to the device and then hand it the function that will be threaded, leaving no room for bulk per-block transfers.

Anyone mind explaining this please?


Yep. You can do something like:

sdata[threadIdx.x] = globalmem[some_offset + threadIdx.x];
__syncthreads();

// any thread can use any sdata now...

Of course, depending on the size of the block of data you want to read versus your thread block size, you may need more than one load per thread. If you need to read in a block of memory smaller than the thread block, just perform the read inside an if (threadIdx.x < memory_size) guard.
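To make that concrete, here is a minimal kernel sketch of the guarded load. The names TILE_SIZE, in, and guarded_load are my own placeholders, not from the original posts:

```cuda
// Hypothetical example: the shared tile (TILE_SIZE elements) is smaller
// than the thread block, so only the first TILE_SIZE threads load.
#define TILE_SIZE 64

__global__ void guarded_load(const float *in, float *out)
{
    __shared__ float sdata[TILE_SIZE];

    // Only threads 0..TILE_SIZE-1 participate in the copy.
    if (threadIdx.x < TILE_SIZE)
        sdata[threadIdx.x] = in[blockIdx.x * TILE_SIZE + threadIdx.x];

    __syncthreads();  // wait until the whole tile is in shared memory

    // ... every thread in the block may now read any element of sdata ...
}
```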

Wait, wouldn't that copy the chunks over per thread and not per block? I'd like to do it per block.

Shared memory is per block.

Ok, sorry for dragging this topic out so long…

So that means the first thread has to copy the chunk for itself and for the rest of the threads in the block? And since only one function is run (just on many threads), that means you somehow have to keep track of whether you made the copy or not, and need to use the atomic functions for basic locking?

It makes sense; it just sounds kind of complicated for what is a simple task that must be done very often.

Thanks for all the quick replies too!

No, as MisterAnderson showed you, each thread copies a different element into shared memory. After the __syncthreads, all the threads have completed their copies and you can safely use the shared memory array.

Sorry. I knew there was a way, and what he showed; I just wasn't sure if it was the only way.


Nothing really prevents you from having only one thread perform the reads to fill all of the shared memory. It just won't be a very efficient way to do it, however.
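For contrast, here is a sketch (with hypothetical names TILE and single_thread_load) of that single-thread approach. It works, but all the reads are serialized onto thread 0 while the rest of the block sits idle at the barrier:

```cuda
#define TILE 256

__global__ void single_thread_load(const float *globalmem)
{
    __shared__ float sdata[TILE];

    if (threadIdx.x == 0)                       // only thread 0 does the copying
        for (int i = 0; i < TILE; ++i)
            sdata[i] = globalmem[blockIdx.x * TILE + i];

    __syncthreads();  // every other thread just waits here

    // ... all threads can now read sdata, but the load took roughly
    //     blockDim.x times longer than a cooperative one ...
}
```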

The idea behind having all threads in the block participate in the read is to keep every thread busy doing its own small part of the work. The GPU gets its insane performance by being able to run ten thousand threads all at once in an interleaved fashion. If you serialize this in any way, even one thread per block, you are preventing this interleaving from happening and essentially running one thread on a relatively slow processor.

That and all threads should participate to get the best benefit from coalesced memory reads.

Isn't the transfer between global memory and shared memory very slow? Like a few hundred clock cycles (almost like a normal CPU going to RAM)?

Wouldn’t that make it worth it alone if you can have one quick burst of the full 16k and then massive threading?


Also, one last thing here.

Let's say globalmem is huge, much bigger than any block's shared memory. If you set a local pointer equal to it, will the __syncthreads call be smart enough to only sync/copy over the max shared size? Will I be safe as long as I don't index past the limit myself in any code that follows?

Sorry, those parts in the book aren't too clear to me. I'm also new to parallel algorithms in general.

Thanks a lot.

Martman, I would advise you to check the examples in the SDK on how to use shared memory.

The speed with which you can access global memory is 86 GB/s if your reads are coalesced. That is quite a bit more than the speed with which a CPU can reach its own memory.

Well, you have full control over exactly how many elements are copied. In the really simple example I gave, I loaded a number of elements equal to the block size.

1) You have to limit yourself to not index past the end of your shared/global arrays. So if you are loading an odd size (not a multiple of the block size), you will need an if() to prevent loading past the end.

2) You have complete control over how many reads are done and in what pattern. For instance, if you need to load 2x the block size in elements, you would add another memory copy, something like sdata[threadIdx.x + blockDim.x] = globalmem[some_offset + threadIdx.x + blockDim.x];. Or, if you are loading a really large block of global mem into shared, you could use a for loop.
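The for-loop version mentioned above could look something like this sketch (CHUNK and loop_load are illustrative names; globalmem/sdata/some_offset follow the earlier snippet):

```cuda
// The block cooperatively copies CHUNK elements (more than one per thread)
// from global to shared memory, each thread striding by blockDim.x.
#define CHUNK 1024

__global__ void loop_load(const float *globalmem)
{
    __shared__ float sdata[CHUNK];
    int some_offset = blockIdx.x * CHUNK;

    // Thread i loads elements i, i + blockDim.x, i + 2*blockDim.x, ...
    // Consecutive threads touch consecutive addresses, so the reads coalesce.
    for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
        sdata[i] = globalmem[some_offset + i];

    __syncthreads();
    // ... process sdata ...
}
```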

__syncthreads doesn't perform any of the loading. __syncthreads just says to all threads in the block: "wait here until every thread arrives". You need the __syncthreads to prevent some threads from trying to use data that hasn't yet been loaded by another thread.

These are all really simple 1D examples I'm giving here. Have a look at the matrix multiplication sample in the programming guide for a really good example of a more complicated shared memory usage pattern. Or, if you are interested, I could post a full kernel as an example of the simple usage I've demonstrated here.
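In the spirit of that simple 1D pattern, here is a complete toy kernel of my own (not MisterAnderson's code): each block loads blockDim.x elements into shared memory, then writes them back reversed. The cross-thread read after the barrier is exactly the kind of access that makes __syncthreads necessary:

```cuda
__global__ void block_reverse(const float *in, float *out)
{
    extern __shared__ float sdata[];  // sized at launch: blockDim.x * sizeof(float)

    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = in[gidx];    // cooperative, coalesced load
    __syncthreads();                  // tile is fully loaded past this point

    // Each thread reads a *different* thread's element; this is only
    // safe because of the barrier above.
    out[gidx] = sdata[blockDim.x - 1 - threadIdx.x];
}

// Host-side launch, passing the dynamic shared memory size:
// block_reverse<<<nBlocks, nThreads, nThreads * sizeof(float)>>>(d_in, d_out);
```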