Seems like it won’t work like that… If I only calculate convolutionResultR and put convolutionResultG and convolutionResultB in comments, I get a red image, which is logical. But whenever I uncomment convolutionResultG and convolutionResultB I get a white screen.
I suppose something is wrong with the offsets I use.
Now my question is: how can I get the values of R, G, B and A from my pixel?
I’d like to be able to split my float into 4 parts… I tried to cast from float to float4, but that won’t work… Any ideas?
Jamie K has the right idea, but it all depends on how your RGBA values are stored. His #define methods will only work if you store your data as a single 32-bit value (e.g. a uint), in which case the first byte is the R value, the second byte is the G value, and so on. If you’re calculating the data in some other code and uploading it in a different format, you need to adjust the accessors accordingly. Even so, you should probably convert to the packed format anyway, and use some bitmasks/bitshifts to extract whichever channel you want (just #define a small accessor for each of R, G, B and A): packed pixels fit more data into memory at any one time, which almost always works out faster in CUDA.
s_data_Input is indeed an array in shared memory with pixel data.
I have a working version without shared memory already, it’s just the implementation in shared memory where I’m stuck…
Well, the point is that I do indeed need offsets in my array, like where you add 0, 1 and 2 in your data array. The thing is: whenever I execute my code I get a white screen… After some adjustments I have the following code for my row convolution filter:
Previously I think your threads were stepping on each other, because red for thread 1 would store to the same location as green for thread 0. But that’s fixed now that you’re using float4.
It looks like your calculation with sI will do arithmetic on the pixel value pointed to by surfaceInput, and not do addressing into the image. Make sI a pointer:
Another problem which is a little more involved is that with BLOCKDIM threads, only BLOCKDIM pixels will be copied from global memory to shared memory, but you need BLOCKDIM + 2*KERNEL_RADIUS pixels. So at least some threads will need to load more than one pixel. You could try something like this:
for (int i = threadIdx.x; i < BLOCKDIM + 2*KERNEL_RADIUS; i += BLOCKDIM) {
    int gx = blockIdx.x * blockDim.x - KERNEL_RADIUS + i;
    s_data_Input[i].x = sI[4*gx];     // R
    s_data_Input[i].y = sI[4*gx + 1]; // G
    s_data_Input[i].z = sI[4*gx + 2]; // B
}
__syncthreads();
Then for the convolution, use the loop:
for (int i = 0; i < 1 + 2*KERNEL_RADIUS; i++) {
    ...
}
(The kernel has 2*KERNEL_RADIUS + 1 taps, so the loop runs exactly that many times.)
These changes might not get you 100% there, and I may have made some mistakes, but they will get you much closer.
That’s something that I’ve been struggling with… I have to use shared memory as much as possible, but this way don’t I still point to my global memory? I tried to write something like:
float sI = something
but I don’t really know how…
That would’ve been the next step in my optimization, thanks ;)