I thought it wouldn't work because I thought thread execution is also parallel, but since the above works I was probably mistaken and it's sequential inside a block.
Now I'm working with shared memory in the computational mask for my image filter, but my results show that

1. copying blocks of the data from global to shared memory
2. copying from shared memory into thread-local memory
3. performing the calculation on the thread-local data
4. writing the result back to global memory

is in fact slower than

1. reading the global data into a thread-local array
2. performing the calculation on the thread-local array
3. writing the result back

I thought that reading bigger blocks of data into shared memory would be better, since each thread then only needs to get its data from shared memory and doesn't have to read from global memory.
It seems that there is a big mistake somewhere; I thought shared memory would do the trick!
Is there something I might have forgotten about?
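For reference, the second (direct) approach could look roughly like this, a minimal sketch only: the kernel name, the 3x3 mask size, and the argument layout are my assumptions, not your actual code.

```cuda
#include <cuda_runtime.h>

#define MASK 3   // assumed 3x3 filter mask (radius 1)

// Direct variant: each thread reads its neighborhood straight from
// global memory and accumulates into a register, no shared memory.
__global__ void filterDirect(const float *in, const float *mask,
                             float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    const int R = MASK / 2;
    float sum = 0.0f;
    for (int my = 0; my < MASK; ++my)
        for (int mx = 0; mx < MASK; ++mx) {
            int gx = x + mx - R, gy = y + my - R;
            if (gx >= 0 && gx < w && gy >= 0 && gy < h)  // skip out-of-bounds pixels
                sum += mask[my * MASK + mx] * in[gy * w + gx];
        }
    out[y * w + x] = sum;
}
```

Note that in this variant neighboring threads read overlapping neighborhoods from global memory, which the cache can often absorb on newer hardware.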
Threads are still parallel, but the if-statement will ensure that the other threads will skip the branch and continue executing (or at least go idle for a bit).
Shared memory will help if you need multiple threads to use the same data, or you need to read the data from global memory in a different order than the threads will use it. (For example, shared memory can be used to help you read memory in a coalesced way even if you don’t intend to operate on it in that order.) If this is not the case for your code, then the overhead of reading data into shared memory, and then running __syncthreads() could be why the first situation is slower for you.
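An image filter is actually a case where threads do reuse each other's data (adjacent output pixels read overlapping neighborhoods), so a tiled version can pay off if the load itself is done cooperatively by all threads. A sketch of that pattern, with tile size, radius, and names being my assumptions:

```cuda
#define TILE 16
#define R    1   // assumed filter radius (3x3 mask)

__global__ void filterShared(const float *in, const float *mask,
                             float *out, int w, int h)
{
    __shared__ float tile[TILE + 2*R][TILE + 2*R];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int x  = blockIdx.x * TILE + tx;
    const int y  = blockIdx.y * TILE + ty;

    // Cooperative load: all TILE*TILE threads stripe across the
    // (TILE+2R)^2 tile (center plus halo), so the global reads
    // happen in parallel and largely coalesced.
    const int tileW = TILE + 2*R;
    for (int i = ty * TILE + tx; i < tileW * tileW; i += TILE * TILE) {
        int lx = i % tileW, ly = i / tileW;
        int gx = blockIdx.x * TILE + lx - R;
        int gy = blockIdx.y * TILE + ly - R;
        tile[ly][lx] = (gx >= 0 && gx < w && gy >= 0 && gy < h)
                           ? in[gy * w + gx] : 0.0f;  // zero-pad borders
    }
    __syncthreads();   // single barrier: the whole tile is now valid

    if (x < w && y < h) {
        float sum = 0.0f;
        for (int my = 0; my < 2*R + 1; ++my)
            for (int mx = 0; mx < 2*R + 1; ++mx)
                sum += mask[my * (2*R + 1) + mx] * tile[ty + my][tx + mx];
        out[y * w + x] = sum;
    }
}
```

The key design point is that no single thread fills the tile on its own; every thread contributes loads, and one barrier after the loop makes the tile visible to the whole block.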
Given that my shared memory is used during the kernel and is just being refilled inside the statement if (threadIdx.x == 0):
thread 1 skips that statement and wants to read data from shared memory, and at the same time all the other threads except thread 0 do the same.
So I guess thread 0 might still be filling in the data while the other threads have already begun executing the code after that statement.
So even if I put __syncthreads(); after reading my data from global memory, there might be faster threads requesting data that is just about to be filled.
I fill my shared memory in a for loop after the if (threadIdx.x == 0) statement, and after each data transfer from global to shared I put a __syncthreads().
Do I need it after each data transfer, or only once after the for loop?
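Once after the loop is enough, as long as every thread in the block reaches that barrier; syncing after every single transfer just adds barrier overhead. Also, be aware that calling __syncthreads() inside a branch that only some threads take (such as your if (threadIdx.x == 0) block) is undefined behavior, and having one thread do all the loads serializes the global reads, which could itself explain the slowdown. A sketch of the loading loop (TILE_ELEMS, sTile, and gData are placeholders, not your actual names):

```cuda
// All threads participate in the load; each copies a stripe of
// elements, so no per-element synchronization is needed.
for (int i = threadIdx.x; i < TILE_ELEMS; i += blockDim.x)
    sTile[i] = gData[blockOffset + i];

__syncthreads();   // one barrier, outside any divergent branch:
                   // after this the whole tile is valid for every
                   // thread in the block
```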
So you mean that reading from global memory into shared memory, performing the operations, and writing back can actually take longer for me than reading directly into thread-local memory and writing back into global memory?
The CUDA Programming Guide always stated that working with global memory should be avoided as often as possible.