Texture Memory being created Automatically?

While executing my program I encountered slow kernels (8-9 ms); I want to achieve around 1-2 ms. While profiling the kernel in NVVP I noticed that there were memory latencies caused by texture stalls.

After removing a device function from the kernel, I noticed the texture stall disappeared. I have not created or declared any texture memory in my code. Does texture memory get created automatically?

My code

__device__ inline int smoothedSum(int *dev_integral_image, int keypt_y,
                                  int keypt_x, int y, int x)
{

  // 4 == half kernel = int(9/2)
  const int img_y = keypt_y + y;
  const int img_x = keypt_x + x;
  int location1 = (640 * (img_y + 5)) + (img_x + 5);
  int location2 = (640 * (img_y + 5)) + (img_x - 4);
  int location3 = (640 * (img_y - 4)) + (img_x + 5);
  int location4 = (640 * (img_y - 4)) + (img_x - 4);

  return dev_integral_image[location1] + dev_integral_image[location4] -
      dev_integral_image[location2] - dev_integral_image[location3];

  // have to check the extreme points
}

Depending on the GPU, the L1 and Texture caches are unified. Some metrics may be reported for texture even though there is no explicit usage of texture in your code.
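
For example, on a device with a unified cache an ordinary global read such as the one below is serviced through the L1/texture cache, so the profiler can report texture activity even though no texture object exists anywhere. (This kernel is just an illustration, not taken from your code.)

__global__ void copyKernel(const int * __restrict__ in, int *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // A plain global read: on unified-cache architectures it is serviced by
    // the L1/texture cache even though no texture object is declared.
    // __ldg() makes the read-only (texture) data path explicit.
    out[i] = __ldg(&in[i]);
  }
}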

Thanks. That makes sense.

When I run my kernel, does each thread create a copy of the device function and store the variables into L1/Texture memory?

Is this why there is a texture stall when I run my code in the visual profiler?

Is it possible to create a single instance of the device function if this is the case?

I have no idea what your kernel does.

My kernel, launched with <<<2400,128>>>, calls the custom device function

getDescriptors(integralImage, Iy, Ix, descriptors + descTid);

which does this

__device__ inline int smoothedSum(int *dev_integral_image, int keypt_y,
                                  int keypt_x, int y, int x)
{

  // 4 == half kernel = int(9/2)
  const int img_y = keypt_y + y;
  const int img_x = keypt_x + x;
  int location1 = (640 * (img_y + 5)) + (img_x + 5);
  int location2 = (640 * (img_y + 5)) + (img_x - 4);
  int location3 = (640 * (img_y - 4)) + (img_x + 5);
  int location4 = (640 * (img_y - 4)) + (img_x - 4);

  return dev_integral_image[location1] + dev_integral_image[location4] -
      dev_integral_image[location2] - dev_integral_image[location3];

  // have to check the extreme points
}

__device__ inline void getDescriptors(int *dev_integral_image, int Iy, int Ix,
                                      byte *desc)
{

#define SMOOTHED(y, x) smoothedSum(dev_integral_image, Iy, Ix, y, x)

  desc[0] = (byte)(((SMOOTHED(-2, -1) < SMOOTHED(7, -1)) << 7) +
                   ((SMOOTHED(-14, -1) < SMOOTHED(-3, 3)) << 6) +
                   ((SMOOTHED(1, -2) < SMOOTHED(11, 2)) << 5) +
                   ((SMOOTHED(1, 6) < SMOOTHED(-10, -7)) << 4) +
                   ((SMOOTHED(13, 2) < SMOOTHED(-1, 0)) << 3) +
                   ((SMOOTHED(-14, 5) < SMOOTHED(5, -3)) << 2) +
                   ((SMOOTHED(-2, 8) < SMOOTHED(2, 4)) << 1) +
                   ((SMOOTHED(-11, 8) < SMOOTHED(-15, 5)) << 0));

  desc[1] = (byte)(((SMOOTHED(-6, -23) < SMOOTHED(8, -9)) << 7) +
                   ((SMOOTHED(-12, 6) < SMOOTHED(-10, 8)) << 6) +
                   ((SMOOTHED(-3, -1) < SMOOTHED(8, 1)) << 5) +
                   ((SMOOTHED(3, 6) < SMOOTHED(5, 6)) << 4) +
                   ((SMOOTHED(-7, -6) < SMOOTHED(5, -5)) << 3) +
                   ((SMOOTHED(22, -2) < SMOOTHED(-11, -8)) << 2) +
                   ((SMOOTHED(14, 7) < SMOOTHED(8, 5)) << 1) +
                   ((SMOOTHED(-1, 14) < SMOOTHED(-5, -14)) << 0));

... }

I am running the code on a Jetson TX2.

I continue to have no idea what your kernel does. Perhaps someone else will be able to help you.

Hi @srinath2468,

As Bob was saying, providing a complete code base makes helping you much easier. There could be much more going on than what's in your device function.

Based on the little code you’ve given, the most likely issue is with

return dev_integral_image[location1] + dev_integral_image[location4] - dev_integral_image[location2] - dev_integral_image[location3];

I’m betting the accesses at location1, location2, location3, and location4 are not coalesced within a warp. Uncoalesced global loads are one of the worst things you can do on a GPU, and you’re doing it four times in one call.

Please read https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/ for more details.
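
To illustrate the difference, here is a generic sketch (not your code): consecutive threads reading consecutive elements coalesce into a few transactions per warp, while strided or scattered reads, like your four corner lookups, do not.

// Coalesced: thread i reads element i, so the 32 threads of a warp touch
// one contiguous segment per load instruction.
__global__ void coalescedRead(const int *in, int *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];
}

// Uncoalesced: thread i reads element i * stride, so a warp's loads scatter
// across many segments -- similar to the four corner lookups into the
// integral image, whose addresses differ per thread by large row offsets.
__global__ void stridedRead(const int *in, int *out, int n, int stride)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && i * stride < n)
    out[i] = in[i * stride];
}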

Hey mnicely!

Thanks for the info. I did figure out that it was uncoalesced reads from the “integral image”, as the profiler reported the same issue too.

Since the algorithm is based on randomness, I feel that the kernel execution can't be optimized much further, as attempts to coalesce the reads would add more overhead than they save.

I did try shifting parts of the integral image into shared memory but it worsened the behaviour.
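
The shape of what I tried was roughly like the sketch below (illustrative only: the tile size, indexing, and border handling are simplified, not my exact code):

#define PATCH 64  // assumed tile width for illustration

__global__ void descriptorsSharedSketch(const int *dev_integral_image,
                                        const int *keyptY, const int *keyptX)
{
  __shared__ int patch[PATCH * PATCH];  // 16 KB of shared memory per block

  // One keypoint per block; border handling omitted for brevity.
  const int ky = keyptY[blockIdx.x];
  const int kx = keyptX[blockIdx.x];

  // All threads of the block cooperatively copy the patch around the keypoint.
  for (int idx = threadIdx.x; idx < PATCH * PATCH; idx += blockDim.x) {
    const int py = idx / PATCH - PATCH / 2;
    const int px = idx % PATCH - PATCH / 2;
    patch[idx] = dev_integral_image[640 * (ky + py) + (kx + px)];
  }
  __syncthreads();

  // ... the smoothed sums would then read from patch[] instead of global memory ...
}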

Thanks!

Okay, I’m glad you figured it out.

Side note: there are many stochastic algorithms that have made their way to GPUs efficiently. One that comes to mind is https://link.springer.com/article/10.1007/s11265-017-1254-6. The modified algorithm is still stochastic, but in a way that adheres better to the CUDA programming model.

That seems like an interesting read, as we do have random sampling too. Thanks!