I used global memory instead of texture memory to store the input image data. With texture memory, each thread uses 34 registers and no local memory, and occupancy is 13%. With global memory, each thread uses only 17 registers, but a considerable amount of local memory is used. Occupancy increases to 26%, yet overall performance goes down, which I guess is due to the local memory usage.
Is there a way to set some constraints in the compiler to optimize for local memory instead of registers? I am working in the Visual Studio environment.