copying to shared block mem


Is it possible to copy a chunk of data from global memory once and have all of a block's threads access it, or does each thread have to act independently and grab its own copy?

I'm not sure, but it looks like you only get to copy data to the device and then hand it the function that will be threaded, leaving no room for bulk per-block transfers.

Anyone mind explaining this please?


Yep. You can do something like:

sdata[threadIdx.x] = globalmem[some_offset + threadIdx.x];
__syncthreads();

// any thread can use any sdata now...

Of course, depending on the size of the block of data you want to read versus your thread block size, you may need more than one load per thread. If you need to read in a block of memory smaller than the thread block, just perform the read inside an if (threadIdx.x < memory_size) guard.
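To make that concrete, here is a minimal kernel sketch of the guarded load. The names TILE_SIZE, in, and guarded_load are my own placeholders, not from the original posts:

```cuda
// Hypothetical example: the shared tile (TILE_SIZE elements) is smaller
// than the thread block, so only the first TILE_SIZE threads load.
#define TILE_SIZE 64

__global__ void guarded_load(const float *in, float *out)
{
    __shared__ float sdata[TILE_SIZE];

    // Only threads 0..TILE_SIZE-1 participate in the copy.
    if (threadIdx.x < TILE_SIZE)
        sdata[threadIdx.x] = in[blockIdx.x * TILE_SIZE + threadIdx.x];

    __syncthreads();  // wait until the whole tile is in shared memory

    // ... every thread in the block may now read any element of sdata ...
}
```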

Wait, wouldn't that copy the chunks over per thread and not per block? I'd like to do it per block.

Shared memory is per block.

Ok, sorry for dragging this topic out so long…

So that means the first thread has to copy the chunk for itself and for the rest of the threads in the block? And since only one function is run (just on many threads), that means you somehow have to keep track of whether you made the copy or not, and need to use the atomic functions for basic locking?

It makes sense; it just sounds kind of complicated for what is a simple task that must be done very often.

Thanks for all the quick replies too!

No, as MisterAnderson showed you, each thread copies a different element into shared memory. After the __syncthreads, all the threads have completed their copies and you can safely use the shared memory array.

Sorry. I knew there was a way, and what he showed; I just wasn't sure if it was the only way.


Nothing really prevents you from having only one thread perform the reads to fill all of the shared memory. It just won't be a very efficient way to do it, however.
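For contrast, here is a sketch (with hypothetical names TILE and single_thread_load) of that single-thread approach. It works, but all the reads are serialized onto thread 0 while the rest of the block sits idle at the barrier:

```cuda
#define TILE 256

__global__ void single_thread_load(const float *globalmem)
{
    __shared__ float sdata[TILE];

    if (threadIdx.x == 0)                       // only thread 0 does the copying
        for (int i = 0; i < TILE; ++i)
            sdata[i] = globalmem[blockIdx.x * TILE + i];

    __syncthreads();  // every other thread just waits here

    // ... all threads can now read sdata, but the load took roughly
    //     blockDim.x times longer than a cooperative one ...
}
```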

The idea behind having all threads in the block participate in the read is to keep every thread busy doing its own small part of the work. The GPU gets its insane performance by being able to run ten thousand threads all at once in an interleaved fashion. If you serialize this in any way, even one thread per block, you are preventing this interleaving from happening and essentially running one thread on a relatively slow processor.

That and all threads should participate to get the best benefit from coalesced memory reads.

Isn't the transfer between global memory and shared memory very slow? Like a few hundred clock cycles (almost like a normal CPU going to RAM)?

Wouldn’t that make it worth it alone if you can have one quick burst of the full 16k and then massive threading?


Also, one last thing here.

Let's say globalmem is huge, much bigger than any block's shared memory. If you set a local pointer equal to it, will the __syncthreads call be smart enough to only sync/copy over the max shared size? Will I be safe as long as I don't index past the limit myself in any code that follows?

Sorry, those parts in the book aren't too clear to me. I'm also new to parallel algorithms in general.

Thanks a lot.

Martman, I would advise you to check the examples in the SDK on how to use shared memory.

The speed with which you can access global memory is 86 GB/s if your reads are coalesced. That is quite a bit more than the speed with which a CPU can reach its own memory.

Well, you have full control over exactly how many elements are copied. In the really simple example I gave, I loaded a number of elements equal to the block size.

1) You have to limit yourself to not index past the end of your shared/global arrays. So if you are loading an odd size (not a multiple of the block size), you will need an if() to prevent loading past the end.

2) You have complete control over how many reads are done and in what pattern. For instance, if you need to load 2x the block size in elements, you would add another memory copy, something like sdata[threadIdx.x + blockDim.x] = globalmem[some_offset + threadIdx.x + blockDim.x];. Or, if you are loading a really large block of global mem into shared, you could use a for loop.
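The for-loop version mentioned above could look something like this sketch (CHUNK and loop_load are illustrative names; globalmem/sdata/some_offset follow the earlier snippet):

```cuda
// The block cooperatively copies CHUNK elements (more than one per thread)
// from global to shared memory, each thread striding by blockDim.x.
#define CHUNK 1024

__global__ void loop_load(const float *globalmem)
{
    __shared__ float sdata[CHUNK];
    int some_offset = blockIdx.x * CHUNK;

    // Thread i loads elements i, i + blockDim.x, i + 2*blockDim.x, ...
    // Consecutive threads touch consecutive addresses, so the reads coalesce.
    for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
        sdata[i] = globalmem[some_offset + i];

    __syncthreads();
    // ... process sdata ...
}
```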

__syncthreads doesn't perform any of the loading. __syncthreads just says to all threads in the block: "wait here until every thread arrives". You need the __syncthreads to prevent some threads from trying to use data that hasn't yet been loaded by another thread.

These are all really simple 1D examples I'm giving here. Have a look at the matrix multiplication sample in the programming guide for a really good example of a more complicated shared memory usage pattern. Or, if you are interested, I could post a full kernel as an example of the simple usage I've demonstrated here.
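In the spirit of that simple 1D pattern, here is a complete toy kernel of my own (not MisterAnderson's code): each block loads blockDim.x elements into shared memory, then writes them back reversed. The cross-thread read after the barrier is exactly the kind of access that makes __syncthreads necessary:

```cuda
__global__ void block_reverse(const float *in, float *out)
{
    extern __shared__ float sdata[];  // sized at launch: blockDim.x * sizeof(float)

    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = in[gidx];    // cooperative, coalesced load
    __syncthreads();                  // tile is fully loaded past this point

    // Each thread reads a *different* thread's element; this is only
    // safe because of the barrier above.
    out[gidx] = sdata[blockDim.x - 1 - threadIdx.x];
}

// Host-side launch, passing the dynamic shared memory size:
// block_reverse<<<nBlocks, nThreads, nThreads * sizeof(float)>>>(d_in, d_out);
```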