While running my program I found slow kernels (8-9 ms each), and I want to bring them down to around 1-2 ms. Profiling the kernel in NVVP showed memory latency caused by texture stalls.
After I removed a device function from the kernel, the texture stalls disappeared. However, I have not created or declared any texture memory in my code. Does texture memory get created automatically?
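A likely connection between the removed function and the stalls: `smoothedSum` makes four scattered reads from global memory (`dev_integral_image`) per call to form a 9x9 box sum. Such loads can show up as texture activity without any texture object. On GPUs of compute capability 3.5 and later, read-only global loads (for example through a `const int* __restrict__` pointer or `__ldg()`) may be served by the read-only data cache, which shares hardware with the texture path. This is a plausible explanation, not something the profile above confirms. To check the indexing itself, here is a minimal host-side C++ sketch of the same four-corner arithmetic. It assumes an "exclusive" integral image with one extra row and column of zeros, which is what the +5/-4 offsets imply for a 9x9 window. It also replaces the hard-coded 640 with a row stride of width + 1, which the real stride must match, and adds the bounds check that the "have to check the extreme points" comment refers to. `buildIntegral`, `boxSum9x9`, `bruteForce9x9` and `countMismatches` are hypothetical helpers, not part of the original code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: builds an "exclusive" integral image of size
// (h + 1) x (w + 1). Entry (y, x) holds the sum of img rows 0..y-1 and
// columns 0..x-1, so row 0 and column 0 are all zeros.
std::vector<int> buildIntegral(const std::vector<int>& img, int h, int w) {
    const int stride = w + 1;
    std::vector<int> ii((h + 1) * stride, 0);
    for (int y = 0; y < h; ++y) {
        int rowSum = 0;  // running sum of img row y, columns 0..x
        for (int x = 0; x < w; ++x) {
            rowSum += img[y * w + x];
            ii[(y + 1) * stride + (x + 1)] = ii[y * stride + (x + 1)] + rowSum;
        }
    }
    return ii;
}

// Same four-corner lookup as smoothedSum (offsets +5 and -4 for a 9x9
// window centered on (cy, cx)), plus a bounds check. Returns false and
// leaves *out untouched when the window would extend past the image.
bool boxSum9x9(const std::vector<int>& ii, int h, int w, int cy, int cx, int* out) {
    assert(static_cast<int>(ii.size()) == (h + 1) * (w + 1));
    const int half = 4;  // int(9 / 2)
    if (cy - half < 0 || cx - half < 0 || cy + half >= h || cx + half >= w) {
        return false;
    }
    const int stride = w + 1;  // stands in for the hard-coded 640
    const int location1 = stride * (cy + 5) + (cx + 5);
    const int location2 = stride * (cy + 5) + (cx - 4);
    const int location3 = stride * (cy - 4) + (cx + 5);
    const int location4 = stride * (cy - 4) + (cx - 4);
    *out = ii[location1] + ii[location4] - ii[location2] - ii[location3];
    return true;
}

// Reference: sums the 9x9 window directly from the original image.
int bruteForce9x9(const std::vector<int>& img, int w, int cy, int cx) {
    int s = 0;
    for (int y = cy - 4; y <= cy + 4; ++y)
        for (int x = cx - 4; x <= cx + 4; ++x)
            s += img[y * w + x];
    return s;
}

// Compares both methods at every center where the 9x9 window fits and
// returns how many centers disagree.
int countMismatches(const std::vector<int>& img, int h, int w) {
    const std::vector<int> ii = buildIntegral(img, h, w);
    int mismatches = 0;
    for (int cy = 0; cy < h; ++cy)
        for (int cx = 0; cx < w; ++cx) {
            int s = 0;
            if (boxSum9x9(ii, h, w, cy, cx, &s) && s != bruteForce9x9(img, w, cy, cx))
                ++mismatches;
        }
    return mismatches;
}
```

Running `countMismatches` on a small test image is a quick way to confirm the offsets before porting them back into the `__device__` function. On the GPU, the bounds check costs a few comparisons per thread. The alternative is padding the integral image so that border keypoints never read outside the buffer.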
My code:
__device__ inline int smoothedSum(int *dev_integral_image, int keypt_y,
                                  int keypt_x, int y, int x)
{
    // 4 == half kernel = int(9/2)
    const int img_y = keypt_y + y;
    const int img_x = keypt_x + x;
    int location1 = (640 * (img_y + 5)) + (img_x + 5);
    int location2 = (640 * (img_y + 5)) + (img_x - 4);
    int location3 = (640 * (img_y - 4)) + (img_x + 5);
    int location4 = (640 * (img_y - 4)) + (img_x - 4);
    return dev_integral_image[location1] + dev_integral_image[location4] -
           dev_integral_image[location2] - dev_integral_image[location3];
    // have to check the extreme points