I’m trying to convert some OpenGL shaders to CUDA. My algorithms need multiple passes/shaders/kernels, as the output of one pass is (most of the time) the input of the next pass.
Some of the algorithms access only one pixel and generate one output pixel, other algorithms use multiple neighbouring input pixels to generate one output pixel.
The question is whether there’s any golden rule as to when the expense of copying from global memory to texture memory pays off (so that the texture cache is used).
Is there any point in copying the output of one kernel from global memory to texture memory if the next kernel:
- accesses only one pixel
- accesses 3 neighbouring pixels
- accesses 9 pixels (I guess here texturing pays off)
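
To make the three cases concrete, here is a minimal sketch of the access patterns I mean, reading straight from global memory with no texture copy. It assumes single-channel float images stored row-major; the kernel names, the `clampi` border handling, the 64x64 image size and the arithmetic in each kernel are made up for illustration and are not my real shaders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical border handling: clamp a coordinate to [0, n-1].
__device__ int clampi(int v, int n) { return v < 0 ? 0 : (v >= n ? n - 1 : v); }

// Case 1: one input pixel per output pixel.
__global__ void onePixel(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    out[y * w + x] = 2.0f * in[y * w + x];
}

// Case 2: three horizontally neighbouring input pixels per output pixel.
__global__ void threePixels(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dx = -1; dx <= 1; ++dx)
        sum += in[y * w + clampi(x + dx, w)];
    out[y * w + x] = sum / 3.0f;
}

// Case 3: a 3x3 neighbourhood (nine input pixels) per output pixel.
__global__ void ninePixels(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += in[clampi(y + dy, h) * w + clampi(x + dx, w)];
    out[y * w + x] = sum / 9.0f;
}

int main() {
    const int w = 64, h = 64;                 // assumed image size
    const size_t bytes = w * h * sizeof(float);
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMemset(a, 0, bytes);

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    // Multi-pass chain: the output buffer of one kernel is the input of the next.
    onePixel<<<grid, block>>>(a, b, w, h);
    threePixels<<<grid, block>>>(b, a, w, h);
    ninePixels<<<grid, block>>>(a, b, w, h);

    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(a);
    cudaFree(b);
    return err == cudaSuccess ? 0 : 1;
}
```

The question is then whether, for each of these three kernels, adding a copy into texture memory between the passes would be faster than reading `in` from global memory as above.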