General Shared Memory Question


I have to use shared memory in such a way that the data is read from global into shared memory when the thread block starts, so that all threads can access the data from there.

My question is simply: how do I do that?

Conditions like (threadIdx.x == 0) will naturally not work…

Thank you!

Best regards

Why not? Maybe it is slow, but it should work? I think I'm missing something?

Just remember to put __syncthreads() after you load the data so that it is visible to all threads.
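To make the pattern concrete, here is a minimal sketch of the approach being discussed (the kernel, array names, and tile size are hypothetical, not from the original posts): one thread fills the shared array, and a single __syncthreads() barrier afterwards makes the data visible to the whole block.

```cuda
#define TILE_SIZE 256  // assumed block size for this sketch

__global__ void kernel(const float *g_in, float *g_out, int n)
{
    __shared__ float s_data[TILE_SIZE];

    if (threadIdx.x == 0) {
        // A single thread copies this block's tile from global memory.
        for (int i = 0; i < TILE_SIZE; ++i)
            s_data[i] = g_in[blockIdx.x * TILE_SIZE + i];
    }

    __syncthreads();  // every thread waits here until the load is complete

    int idx = blockIdx.x * TILE_SIZE + threadIdx.x;
    if (idx < n)
        g_out[idx] = s_data[threadIdx.x] * 2.0f;  // safe: the load is done
}
```

Note that a cooperative load, where each thread copies its own element (`s_data[threadIdx.x] = g_in[idx];` followed by `__syncthreads();`), is usually much faster than serializing the copy through thread 0.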

Yeah, small divergent branches to set up shared memory like that are common in CUDA kernels. I use that trick all the time.

I thought it wouldn't work because I thought thread execution is also parallel, but since the above works I was probably mistaken and it's sequential inside the block.

Now I'm working with shared memory in the computational mask for my image filter, but my results show that

1. copying blocks of the data from global to shared memory
2. writing from shared memory into thread-specific memory
3. performing the calculation on the thread-specific data
4. writing the result back to global memory

is in fact slower than

1. reading the global data into a thread-specific array
2. performing the calculation on that array
3. writing it back

I thought that reading bigger blocks of data into shared memory would be better, since each thread then only needs to get the data from shared memory and doesn't have to read from global memory.

It seems there is a big mistake somewhere; I thought shared memory would do the trick!
Is there something I might have forgotten?



Threads are still parallel, but the if-statement will ensure that the other threads will skip the branch and continue executing (or at least go idle for a bit).

Shared memory will help if you need multiple threads to use the same data, or you need to read the data from global memory in a different order than the threads will use it. (For example, shared memory can be used to help you read memory in a coalesced way even if you don’t intend to operate on it in that order.) If this is not the case for your code, then the overhead of reading data into shared memory, and then running __syncthreads() could be why the first situation is slower for you.
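A sketch of the "data reuse" case described above, where shared memory actually pays off: a 3-point averaging stencil in which every input element is read by three different threads, so a shared tile turns three global reads per element into one. (The kernel name, block size, and halo handling are illustrative assumptions, not taken from the poster's filter code.)

```cuda
#define BLOCK 256  // assumed threads per block

__global__ void stencil3(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[BLOCK + 2];           // +1 halo element on each side
    int gidx = blockIdx.x * BLOCK + threadIdx.x;

    // Cooperative load: each thread fetches one element into the tile,
    // and the two edge threads also fetch the halo elements.
    tile[threadIdx.x + 1] = (gidx < n) ? g_in[gidx] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gidx > 0) ? g_in[gidx - 1] : 0.0f;
    if (threadIdx.x == BLOCK - 1)
        tile[BLOCK + 1] = (gidx + 1 < n) ? g_in[gidx + 1] : 0.0f;

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Each output reads three tile entries; without shared memory these
    // would be three separate global-memory reads per thread.
    if (gidx < n)
        g_out[gidx] = (tile[threadIdx.x] + tile[threadIdx.x + 1]
                       + tile[threadIdx.x + 2]) / 3.0f;
}
```

If each thread instead only ever touches its own element (no neighbor access, no reordering), the tile buys nothing and the extra copy plus the barrier just add overhead, which matches the slowdown reported above.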

I think I don't understand this.

Given the execution series of the threadIdx.x values, all launched at the same time,

and given that my shared memory is used during the kernel and is only refilled inside the if statement (if threadIdx.x == 0),

then threadIdx.x == 1 just goes around that statement and wants to read data from shared memory.

At the same time, all the others except threadIdx.x == 0 do the same.

So I guess that threadIdx.x == 0 might still be filling the data while the other threads have already begun executing the code after that statement.

So even if I put __syncthreads() after reading my data from global memory, there might be faster threads requesting data that is just about to be filled.

I fill my shared memory in a for loop after the if (threadIdx.x == 0) statement, and after each data exchange from global to shared memory I put a __syncthreads().

Do I need it after each data swap, or only once after the for loop?
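A sketch of the two placements being asked about, under the assumption that the loop runs inside the if-branch as described (names are hypothetical): a single barrier after the whole transfer is what you want.

```cuda
// Correct: one __syncthreads() AFTER the loading loop. Per-iteration
// barriers would add overhead, and worse: __syncthreads() placed inside
// the if (threadIdx.x == 0) branch is a bug, because the other threads
// never reach that barrier (divergent __syncthreads() is undefined
// behavior and typically hangs the kernel).
if (threadIdx.x == 0) {
    for (int i = 0; i < TILE_SIZE; ++i)
        s_data[i] = g_in[base + i];   // base: this block's global offset
}
__syncthreads();  // one barrier, outside the branch, after the whole transfer
```

So: only once, after the for loop, and outside the if-branch so that every thread in the block executes the same barrier.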

So you mean that reading from global into shared memory, performing the operations, and writing back actually takes longer for me than reading directly into local thread memory and writing back into global memory?

The CUDA programming guide always stated that working with global memory should be avoided as often as possible.

I'm a little confused.

But anyway, thanks for your answer! :)