I'm trying to listen to a class at the same time, so maybe I'm not reading your code right, but there seem to be many things wrong with it.
Firstly, every thread in the block reads its own ((sI + y*pI) + 4*(x+i)), but they all put it in the same element of the shared memory array. You don't call __syncthreads() afterwards, so I'm not even sure what's going to happen there. What you probably want is to store it in [threadIdx.x], or, if you're treating one pixel at a time in your loop, then you only want threadIdx.x==0 to load the data into shared memory. I'm not sure which one you're trying to achieve.
Also, you're reading s_data_pixelInput1 and  but they have not been loaded, as far as I can tell, and there is definitely no __syncthreads() before the reads.
convolutionResultR,G,B are shared… I'm not sure why. You most likely want each thread (one thread per pixel?) to perform its own convolution? If so, those variables should be private to the thread. Same for s_data_R,G,B.
Well, that’s a typo, it should be [tid] indeed ;) Don’t know how that got in there.
Well, what I want to achieve is that every pixel is read by one thread, which then performs a convolution on each part of that pixel (R, G, B). The reason that convolutionResultR,G,B are shared is that I have to make sure my program is as fast as possible, so I avoid reads from global memory ;)
That’s why I use the dynamically allocated shared memory. If I use the method like in the first example, I still read from my global memory with this pointer:
You need a __syncthreads() after the load in s_data_Input[tid] = ((sI + y*pI) + 4*(x+i)); since after that, all threads of the block will need the elements in shared memory.
convolutionResultR,G,B cannot be shared, as they are … private(!) to a given pixel, and therefore to a given thread. If you declare them without the __shared__ qualifier they will be put into registers, which is just as fast. You just have way too many __shared__ qualifiers in there, and you seem to have missed the fact that there is also a register file. It's not just global mem or shared mem; it's global mem, shared mem and registers.
So what you want to do is load all the pixels of a block in the shared array (before your loop) and then do all the computations for one pixel using registers AND the values that you have loaded in the shared mem. That way, every pixel is only read once from global memory, which is what you want to achieve. Well, somewhat, more on that later!
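To make that concrete, here is a minimal sketch of the pattern, not your actual kernel. It assumes a single-channel row that fits in one block, clamp-to-edge borders, and a KERNELRADIUS of 2. The names rowConvolveOneBlock, runRowConvolveOneBlock, BLOCK and weights are made up for the illustration. Each thread loads one pixel into shared memory, the block synchronizes, and the running sum lives in a register.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define KERNELRADIUS 2   // assumed value; the thread only says it is a #define
#define BLOCK 64         // hypothetical block size; the row must fit in one block

// One thread per pixel. Shared memory holds the row, registers hold the sum.
__global__ void rowConvolveOneBlock(const float* in, float* out,
                                    const float* weights, int width)
{
    __shared__ float s_row[BLOCK];          // statically sized shared array
    int tid = threadIdx.x;

    if (tid < width) s_row[tid] = in[tid];  // each pixel is read once from global memory
    __syncthreads();                        // every thread reaches this before any read

    if (tid >= width) return;

    float sum = 0.0f;                       // private to the thread, kept in a register
    for (int k = -KERNELRADIUS; k <= KERNELRADIUS; ++k) {
        int idx = min(max(tid + k, 0), width - 1);   // clamp at the row edges
        sum += weights[k + KERNELRADIUS] * s_row[idx];
    }
    out[tid] = sum;
}

// Hypothetical host helper: expects in.size() <= BLOCK and
// w.size() == 2 * KERNELRADIUS + 1.
std::vector<float> runRowConvolveOneBlock(const std::vector<float>& in,
                                          const std::vector<float>& w)
{
    int width = static_cast<int>(in.size());
    float *dIn, *dOut, *dW;
    cudaMalloc(&dIn, width * sizeof(float));
    cudaMalloc(&dOut, width * sizeof(float));
    cudaMalloc(&dW, w.size() * sizeof(float));
    cudaMemcpy(dIn, in.data(), width * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, w.data(), w.size() * sizeof(float), cudaMemcpyHostToDevice);

    rowConvolveOneBlock<<<1, BLOCK>>>(dIn, dOut, dW, width);

    std::vector<float> out(width);
    cudaMemcpy(out.data(), dOut, width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut); cudaFree(dW);
    return out;
}
```

Note that the __syncthreads() sits before the early return, so no thread skips the barrier.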
Also, since KERNELRADIUS seems to be a #define, you don't even need to use dynamically allocated shared memory. You can use statically allocated shared memory.
A simple __shared__ float sharedPixels[KERNELRADIUS]; would do, as bigmac has already stated.
Now, another problem with your approach is that you cannot load only KERNELRADIUS values into shared memory, since those values are what is needed by one pixel, but the pixel next to it will also need a column of pixels you have not loaded into shared memory. You could load (KERNELRADIUS*2) values into shared memory; that way you could treat KERNELRADIUS pixels using that shared array.
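A common variation on that idea, sketched here under my own assumptions rather than taken from your code, is to have each block load its tile of pixels plus KERNELRADIUS extra pixels on each side. It assumes a single-channel row, clamp-to-edge borders and a KERNELRADIUS of 2. The names rowConvolveTiled, runRowConvolveTiled, TILE and weights are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define KERNELRADIUS 2   // assumed value; the thread only says it is a #define
#define TILE 32          // hypothetical number of output pixels per block

__global__ void rowConvolveTiled(const float* in, float* out,
                                 const float* weights, int width)
{
    // The tile plus KERNELRADIUS neighbours on each side, statically allocated.
    __shared__ float s_tile[TILE + 2 * KERNELRADIUS];

    int tileStart = blockIdx.x * TILE;

    // Cooperative load: the threads stride over the tile and its halo.
    for (int s = threadIdx.x; s < TILE + 2 * KERNELRADIUS; s += blockDim.x) {
        int g = tileStart + s - KERNELRADIUS;   // global index of shared slot s
        g = min(max(g, 0), width - 1);          // clamp at the row edges
        s_tile[s] = in[g];
    }
    __syncthreads();

    int x = tileStart + threadIdx.x;
    if (x >= width) return;

    float sum = 0.0f;                           // per-thread register
    for (int k = -KERNELRADIUS; k <= KERNELRADIUS; ++k)
        sum += weights[k + KERNELRADIUS] * s_tile[threadIdx.x + KERNELRADIUS + k];
    out[x] = sum;
}

// Hypothetical host helper: expects a non-empty row and
// w.size() == 2 * KERNELRADIUS + 1.
std::vector<float> runRowConvolveTiled(const std::vector<float>& in,
                                       const std::vector<float>& w)
{
    int width = static_cast<int>(in.size());
    float *dIn, *dOut, *dW;
    cudaMalloc(&dIn, width * sizeof(float));
    cudaMalloc(&dOut, width * sizeof(float));
    cudaMalloc(&dW, w.size() * sizeof(float));
    cudaMemcpy(dIn, in.data(), width * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, w.data(), w.size() * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (width + TILE - 1) / TILE;
    rowConvolveTiled<<<blocks, TILE>>>(dIn, dOut, dW, width);

    std::vector<float> out(width);
    cudaMemcpy(out.data(), dOut, width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut); cudaFree(dW);
    return out;
}
```

With this layout a pixel at the edge of a tile finds its neighbours in shared memory, at the cost of the halo pixels being read by two adjacent blocks.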
The only thing I still need to know is how to get the separate R, G, B and A float values out of my array.
Now the full pixel is loaded into s_data_Input[tid] doing:
s_data_Input[tid] = ((sI + y*pI) + 4*(x+i));
Knowing that 1 pixel contains 4 floats representing R, G, B and A, I need the float that corresponds with R, the float that corresponds with G, the float that corresponds with B and finally the float that corresponds with A.
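One possible way to do this, sketched under the assumption that the image is a flat float array with the four channels of each pixel stored next to each other in R, G, B, A order, is to read the pixel as a float4 and use its .x, .y, .z and .w members. The names splitChannels and runSplitChannels are made up for the example, and the planar output is only there to make the result visible.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per pixel. Input is interleaved RGBA; output is planar:
// all R values, then all G, then all B, then all A.
__global__ void splitChannels(const float* rgba, float* planar, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    // Pixel i occupies rgba[4*i] .. rgba[4*i + 3]. Reading it as a float4
    // needs 16-byte alignment, which a pointer from cudaMalloc provides.
    float4 p = reinterpret_cast<const float4*>(rgba)[i];

    planar[0 * numPixels + i] = p.x;   // R, same as rgba[4*i + 0]
    planar[1 * numPixels + i] = p.y;   // G, same as rgba[4*i + 1]
    planar[2 * numPixels + i] = p.z;   // B, same as rgba[4*i + 2]
    planar[3 * numPixels + i] = p.w;   // A, same as rgba[4*i + 3]
}

// Hypothetical host helper: rgba.size() must be a multiple of 4.
std::vector<float> runSplitChannels(const std::vector<float>& rgba)
{
    int numPixels = static_cast<int>(rgba.size() / 4);
    size_t bytes = rgba.size() * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, rgba.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (numPixels + threads - 1) / threads;
    splitChannels<<<blocks, threads>>>(dIn, dOut, numPixels);

    std::vector<float> planar(rgba.size());
    cudaMemcpy(planar.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    return planar;
}
```

If the row pointer is not 16-byte aligned (for example with an odd pitch), plain indexing with rgba[4*i + c] gives the same values without the alignment requirement.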