I thought it wouldn't work because I thought thread execution is also parallel, but since the above works I was probably mistaken and it's sequential inside a block.
Now I'm working with shared memory in the computational mask for my image filter, but my results show that

1. copying blocks of the data from global to shared memory
2. copying from shared memory into thread-local memory
3. performing the calculation on the thread-local data
4. writing the result back to global memory

is in fact slower than

1. reading the global data into a thread-local array
2. performing the calculation on the thread-local array
3. writing the result back

I thought that reading bigger blocks of data into shared memory would be better, since each thread then only needs to get its data from shared memory and doesn't have to read from global memory.
It seems that there is a big mistake somewhere; I thought shared memory would do the trick!
Is there something I might have forgotten about?
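For reference, the second (direct) approach could look roughly like this, a minimal sketch only: the kernel name, the 3x3 mask size, and the argument layout are my assumptions, not your actual code.

```cuda
#include <cuda_runtime.h>

#define MASK 3   // assumed 3x3 filter mask (radius 1)

// Direct variant: each thread reads its neighborhood straight from
// global memory and accumulates into a register, no shared memory.
__global__ void filterDirect(const float *in, const float *mask,
                             float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    const int R = MASK / 2;
    float sum = 0.0f;
    for (int my = 0; my < MASK; ++my)
        for (int mx = 0; mx < MASK; ++mx) {
            int gx = x + mx - R, gy = y + my - R;
            if (gx >= 0 && gx < w && gy >= 0 && gy < h)  // skip out-of-bounds pixels
                sum += mask[my * MASK + mx] * in[gy * w + gx];
        }
    out[y * w + x] = sum;
}
```

Note that in this variant neighboring threads read overlapping neighborhoods from global memory, which the cache can often absorb on newer hardware.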
Threads are still parallel, but the if-statement will ensure that the other threads will skip the branch and continue executing (or at least go idle for a bit).
Shared memory will help if you need multiple threads to use the same data, or you need to read the data from global memory in a different order than the threads will use it. (For example, shared memory can be used to help you read memory in a coalesced way even if you don’t intend to operate on it in that order.) If this is not the case for your code, then the overhead of reading data into shared memory, and then running __syncthreads() could be why the first situation is slower for you.
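An image filter is actually a case where threads do reuse each other's data (adjacent output pixels read overlapping neighborhoods), so a tiled version can pay off if the load itself is done cooperatively by all threads. A sketch of that pattern, with tile size, radius, and names being my assumptions:

```cuda
#define TILE 16
#define R    1   // assumed filter radius (3x3 mask)

__global__ void filterShared(const float *in, const float *mask,
                             float *out, int w, int h)
{
    __shared__ float tile[TILE + 2*R][TILE + 2*R];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int x  = blockIdx.x * TILE + tx;
    const int y  = blockIdx.y * TILE + ty;

    // Cooperative load: all TILE*TILE threads stripe across the
    // (TILE+2R)^2 tile (center plus halo), so the global reads
    // happen in parallel and largely coalesced.
    const int tileW = TILE + 2*R;
    for (int i = ty * TILE + tx; i < tileW * tileW; i += TILE * TILE) {
        int lx = i % tileW, ly = i / tileW;
        int gx = blockIdx.x * TILE + lx - R;
        int gy = blockIdx.y * TILE + ly - R;
        tile[ly][lx] = (gx >= 0 && gx < w && gy >= 0 && gy < h)
                           ? in[gy * w + gx] : 0.0f;  // zero-pad borders
    }
    __syncthreads();   // single barrier: the whole tile is now valid

    if (x < w && y < h) {
        float sum = 0.0f;
        for (int my = 0; my < 2*R + 1; ++my)
            for (int mx = 0; mx < 2*R + 1; ++mx)
                sum += mask[my * (2*R + 1) + mx] * tile[ty + my][tx + mx];
        out[y * w + x] = sum;
    }
}
```

The key design point is that no single thread fills the tile on its own; every thread contributes loads, and one barrier after the loop makes the tile visible to the whole block.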
Given that my shared memory is used during the kernel and is just being refilled inside the statement if (threadIdx.x == 0):
thread 1 skips that statement and wants to read data from shared memory, and at the same time all the other threads except thread 0 do the same.
So I guess thread 0 might still be filling in the data while the other threads have already begun executing the code after that statement.
So even if I put __syncthreads(); after reading my data from global memory, there might be faster threads requesting data that is just about to be filled.
I fill my shared memory in a for loop after the if (threadIdx.x == 0) statement, and after each data transfer from global to shared I put a __syncthreads().
Do I need it after each data transfer, or only once after the for loop?
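Once after the loop is enough, as long as every thread in the block reaches that barrier; syncing after every single transfer just adds barrier overhead. Also, be aware that calling __syncthreads() inside a branch that only some threads take (such as your if (threadIdx.x == 0) block) is undefined behavior, and having one thread do all the loads serializes the global reads, which could itself explain the slowdown. A sketch of the loading loop (TILE_ELEMS, sTile, and gData are placeholders, not your actual names):

```cuda
// All threads participate in the load; each copies a stripe of
// elements, so no per-element synchronization is needed.
for (int i = threadIdx.x; i < TILE_ELEMS; i += blockDim.x)
    sTile[i] = gData[blockOffset + i];

__syncthreads();   // one barrier, outside any divergent branch:
                   // after this the whole tile is valid for every
                   // thread in the block
```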
So you mean that reading from global memory into shared memory, performing the operations, and writing back can actually take longer for me than reading directly into thread-local memory and writing back into global memory?
The CUDA Programming Guide always stated that working with global memory should be avoided as often as possible.