Resampling a 16-bit indexed color image: what's killing my performance?

I’m trying to use CUDA to resample a 16-bit indexed color image and I just can’t seem to get good performance. I’m guessing the problem has to do with memory latency, because this isn’t a very computationally intensive problem. I’ve been able to do the same thing in Direct3D about 3 times faster.

By my estimation I’m getting about 1.5GB/sec memory read performance.

Can anyone suggest what I’m doing wrong please?

texture<ushort1, 2, cudaReadModeElementType> texImage;
texture<uchar4, 1, cudaReadModeElementType> texImageLUT;

__global__ void render_kernel(unsigned int *pDst, int nWindowSizeX, int nWindowSizeY, float fPanX, float fPanY, float fZoom)
{
   int ix = blockDim.x * blockIdx.x + threadIdx.x;
   int iy = blockDim.y * blockIdx.y + threadIdx.y;

   if(ix < nWindowSizeX && iy < nWindowSizeY)
   {
      // Map the output pixel back into source image coordinates
      float x = ((float)ix + 0.5f) / fZoom - fPanX;
      float y = ((float)iy + 0.5f) / fZoom - fPanY;

      // Centers of the four surrounding source texels
      float x0 = floorf(x - 0.5f) + 0.5f;
      float y0 = floorf(y - 0.5f) + 0.5f;
      float x1 = x0 + 1.0f;
      float y1 = y0 + 1.0f;

      // Fetch the four 16-bit indices
      unsigned short lx0y0 = tex2D(texImage, x0, y0).x;
      unsigned short lx1y0 = tex2D(texImage, x1, y0).x;
      unsigned short lx0y1 = tex2D(texImage, x0, y1).x;
      unsigned short lx1y1 = tex2D(texImage, x1, y1).x;
      //__syncthreads();

      // Look each index up in the color table
      uchar4 rgbx0y0 = tex1Dfetch(texImageLUT, lx0y0);
      uchar4 rgbx1y0 = tex1Dfetch(texImageLUT, lx1y0);
      uchar4 rgbx0y1 = tex1Dfetch(texImageLUT, lx0y1);
      uchar4 rgbx1y1 = tex1Dfetch(texImageLUT, lx1y1);

      // Bilinear weights
      float wx1 = x - x0;
      float wx0 = 1.0f - wx1;
      float wy1 = y - y0;
      float wy0 = 1.0f - wy1;

      // Interpolate each color channel
      unsigned char b = (rgbx0y0.x * wx0 + rgbx1y0.x * wx1) * wy0 + (rgbx0y1.x * wx0 + rgbx1y1.x * wx1) * wy1;
      unsigned char g = (rgbx0y0.y * wx0 + rgbx1y0.y * wx1) * wy0 + (rgbx0y1.y * wx0 + rgbx1y1.y * wx1) * wy1;
      unsigned char r = (rgbx0y0.z * wx0 + rgbx1y0.z * wx1) * wy0 + (rgbx0y1.z * wx0 + rgbx1y1.z * wx1) * wy1;

      //__syncthreads();

      // Write the output image flipped vertically
      pDst[nWindowSizeX * (nWindowSizeY - iy - 1) + ix] = (b << 16) + (g << 8) + r;
   }
}

extern "C" void render(unsigned int *pDst, cudaArray *pImage, unsigned int *pImageLUT, int nWindowSizeX, int nWindowSizeY, float fPanX, float fPanY, float fZoom)
{
   dim3 threads(8, 8);
   dim3 grid((nWindowSizeX + 7) / 8, (nWindowSizeY + 7) / 8);

   cudaBindTextureToArray(texImage, pImage);
   cudaBindTexture(0, texImageLUT, pImageLUT, 65536 * 4);

   render_kernel<<<grid, threads>>>(pDst, nWindowSizeX, nWindowSizeY, fPanX, fPanY, fZoom);

   cudaUnbindTexture(texImageLUT);
   cudaUnbindTexture(texImage);
}

You might get better performance by doing this without textures. Resampling is already a very “local” operation (at least when magnifying), so you should be able to do better than the texture cache if you read part of the image into shared memory, resample there, then write back the result.
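
A rough sketch of what I mean (untested, just an outline: it assumes magnification, i.e. fZoom >= 1, so the input footprint of a 16 x 16 output block plus a one-texel apron fits in an 18 x 18 tile; it interpolates the raw index instead of the LUT colors to keep it short; and pSrc, nSrcPitch, nSrcW, nSrcH are made-up parameters for the source image in plain device memory):

__global__ void resample_shared(const unsigned short *pSrc, int nSrcPitch,   // pitch in elements
                                int nSrcW, int nSrcH,
                                unsigned int *pDst, int nDstW, int nDstH,
                                float fPanX, float fPanY, float fInvZoom)
{
   // Input footprint of one 16 x 16 output block plus a one-texel apron
   __shared__ unsigned short tile[18][18];

   // Source coordinate of this block's first output pixel
   float fBlockX = (blockIdx.x * 16 + 0.5f) * fInvZoom + fPanX;
   float fBlockY = (blockIdx.y * 16 + 0.5f) * fInvZoom + fPanY;
   int nTileX = (int)floorf(fBlockX) - 1;
   int nTileY = (int)floorf(fBlockY) - 1;

   // Cooperative load of the 18 x 18 footprint (each thread loads at most two texels)
   for (int i = threadIdx.y * 16 + threadIdx.x; i < 18 * 18; i += 256)
   {
      int sx = min(max(nTileX + (i % 18), 0), nSrcW - 1);   // clamp at the borders
      int sy = min(max(nTileY + (i / 18), 0), nSrcH - 1);
      tile[i / 18][i % 18] = pSrc[sy * nSrcPitch + sx];
   }
   __syncthreads();

   int ix = blockIdx.x * 16 + threadIdx.x;
   int iy = blockIdx.y * 16 + threadIdx.y;
   if (ix >= nDstW || iy >= nDstH)
      return;

   // Same bilinear setup as in the texture version, but relative to the tile origin
   float x = ((float)ix + 0.5f) * fInvZoom + fPanX - nTileX;
   float y = ((float)iy + 0.5f) * fInvZoom + fPanY - nTileY;
   int x0 = (int)floorf(x - 0.5f);
   int y0 = (int)floorf(y - 0.5f);
   float wx1 = x - 0.5f - x0;
   float wy1 = y - 0.5f - y0;

   // Interpolate the raw index, just to show the addressing; the LUT lookups
   // would slot in here exactly as in the texture version.
   float v = (tile[y0][x0]     * (1.0f - wx1) + tile[y0][x0 + 1]     * wx1) * (1.0f - wy1)
           + (tile[y0 + 1][x0] * (1.0f - wx1) + tile[y0 + 1][x0 + 1] * wx1) * wy1;

   unsigned char g = (unsigned char)((unsigned int)v >> 8);   // scale the 16-bit index to 8 bits for display
   pDst[iy * nDstW + ix] = (g << 16) | (g << 8) | g;
}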

Also, you can split up the (separable) resample into two passes; this might help, I don’t know.

BTW do you really need the huge lookup table? Normally I’d suggest storing the LUT in constant memory but this one is way too big for that.

I’m more concerned about the performance when demagnifying because the size of my output window will be limited to the screen resolution (e.g. 1600x1200) whereas my input image could be very large indeed (e.g. 12k x 12k). I think this means I have to do everything on a per-output-pixel basis and nothing (not even a copy from device to shared memory) on a per-input-pixel basis.

  1. Your output shape is bad. You need at least 16 threads writing contiguously for coalescing, whereas you only have 8.
  2. Your block is too small. The random color fetch has high latency, so it would be worth hiding that latency with more threads (see the launch sketch after this list).
  3. It may pay off for you to PREVENT CUDA from launching multiple blocks on a single multiprocessor: blocks launched in random order screw up the cache. Increasing the block size would also help with this.
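
Just to illustrate points 1 and 2 (these numbers are only an example, not taken from your code):

   // 32 threads along x so each half-warp writes a contiguous, coalescable run,
   // and 256 threads per block to help hide the latency of the LUT fetches.
   dim3 threads(32, 8);
   dim3 grid((nWindowSizeX + 31) / 32, (nWindowSizeY + 7) / 8);
   render_kernel<<<grid, threads>>>(pDst, nWindowSizeX, nWindowSizeY, fPanX, fPanY, fZoom);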

I increased my block size to 32 x 8 and it has actually made it slightly slower. At the same time my occupancy has dropped from 0.5 to 0.33. Does the low occupancy indicate that I am using too many registers?

How do I stop CUDA launching multiple blocks per multi-processor?

I wouldn’t worry about preventing multiple blocks from running on a multi-processor. My testing with a random memory read pattern shows no cache advantage with a single block on a multi-proc vs multiple blocks.

I meant the “blocky” pattern there, sorry for not being clear.

You do indeed seem to be using too many registers. You could try using an int and bit operations instead of uchar4.

Also, I just noticed you used blockDim. It’s better to hardcode it: a multiply is more expensive than a shift.
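
For example, with a hard-coded 16 x 16 block:

      // index computed with shifts instead of a multiply by blockDim;
      // only valid because the block size is fixed at 16 x 16
      int ix = (blockIdx.x << 4) + threadIdx.x;
      int iy = (blockIdx.y << 4) + threadIdx.y;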

By the way, what’s your nWindowSizeX? Coalescing also requires 128-byte alignment.

I’ve tried using int and bit shifts instead of uchar4. No difference.

I’ve tried using bit shifts instead of multiplying by blockDim. No difference.

My nWindowSizeX was 1583 but I tried changing it to 1576. No difference.

I said 128-byte alignment.
1576 is still a poor width; it’s still not a multiple of 32. Try 1280 or 1536.
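
If you can’t pick the window width freely, another option (just a sketch, not from your code) is to let the runtime pad the rows for you:

   // Let cudaMallocPitch pick a padded row stride so every row starts on a
   // properly aligned boundary, then index with the pitch instead of nWindowSizeX.
   // (Sketch only; pDst and nPitch would have to be threaded through to the kernel.)
   unsigned int *pDst;
   size_t pitchBytes;
   cudaMallocPitch((void **)&pDst, &pitchBytes, nWindowSizeX * sizeof(unsigned int), nWindowSizeY);
   int nPitch = (int)(pitchBytes / sizeof(unsigned int));   // row stride in elements
   // in the kernel: pDst[nPitch * (nWindowSizeY - iy - 1) + ix] = ...;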

Why not check the cubin to see exactly how many registers you are using? That would settle that issue.

Getting the output fully coalesced will certainly help some, though I don’t think it will solve your problems.

To get started, one thing to try would be to comment out the LUT reading lines and replace them with some kind of simple formula. Based on the difference in the timings, you can estimate how much of the time is spent reading the LUT. It may be significant.
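
Something along these lines, just as an illustration; the point is to keep the interpolation arithmetic identical while dropping the tex1Dfetch calls:

      // Stand-in colors derived directly from the 16-bit index, so the four
      // LUT fetches disappear but everything downstream stays the same.
      uchar4 rgbx0y0 = make_uchar4(lx0y0 & 0xFF, (lx0y0 >> 8) & 0xFF, lx0y0 & 0xFF, 0);
      uchar4 rgbx1y0 = make_uchar4(lx1y0 & 0xFF, (lx1y0 >> 8) & 0xFF, lx1y0 & 0xFF, 0);
      uchar4 rgbx0y1 = make_uchar4(lx0y1 & 0xFF, (lx0y1 >> 8) & 0xFF, lx0y1 & 0xFF, 0);
      uchar4 rgbx1y1 = make_uchar4(lx1y1 & 0xFF, (lx1y1 >> 8) & 0xFF, lx1y1 & 0xFF, 0);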

I’ve tried commenting out the reading of the LUT. If I read 4 pixels, look up each one and do bilinear interpolation in three color channels, it takes about 18ms. If I read 4 pixels and do bilinear interpolation of the index (with no lookups), replicating it across all three color channels, it takes about 4ms. If I read 2 pixels, look up both and do linear interpolation (in x but not y) in three color channels, it takes about 9ms.

Sorry, I read it as 128-bit by mistake. I’ve tried 1568 instead (I’m using 32-bit ints) and it has made a small improvement. My 18ms is now about 15ms. I need to be around 7 or 8ms.

The interpolation could be made into MADs via precomputation.

Do you really need maximum precision? You can try some “magical” fixed-point interpolation.

Aligning the big texture’s width may help.

Also, continue to try to increase occupancy (e.g. by stuffing things into shared memory).
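
Roughly what I have in mind for the MAD and fixed-point suggestions, as a sketch only (untested, using the uchar4 names from your first listing):

      // Precompute the four corner weights as 8.8 fixed point so each channel
      // becomes a handful of integer multiply-adds and one shift.
      unsigned int wx = (unsigned int)(wx1 * 256.0f);
      unsigned int wy = (unsigned int)(wy1 * 256.0f);
      unsigned int w00 = (256 - wx) * (256 - wy);   // the four weights sum to 65536
      unsigned int w10 = wx * (256 - wy);
      unsigned int w01 = (256 - wx) * wy;
      unsigned int w11 = wx * wy;
      unsigned char b = ((rgbx0y0.x * w00 + rgbx1y0.x * w10 +
                          rgbx0y1.x * w01 + rgbx1y1.x * w11) >> 16);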

There doesn’t seem to be any benefit to using fixed point math for the interpolation. I have gone back to using ints and bit shifts though. It has reduced my register usage to 16 and increased my occupancy to 0.667 but has made no difference to the speed.

I have made the source texture 6400 x 6400 pixels but that has also made no difference (because it’s being accessed through the texture cache, I guess). Here’s my current version of the code:

texture<ushort1, 2, cudaReadModeElementType> texImage;
texture<uint1, 1, cudaReadModeElementType> texImageLUT;

__global__ void render_kernel(unsigned int *pDst, int nWindowSizeX, int nWindowSizeY, float fPanX, float fPanY, float fInvZoom)
{
   // Block size is hard-coded to 16 x 16, so the index can use shifts
   int ix = (blockIdx.x << 4) + threadIdx.x;
   int iy = (blockIdx.y << 4) + threadIdx.y;

   if(ix < nWindowSizeX && iy < nWindowSizeY)
   {
      float x = ((float)ix + 0.5f) * fInvZoom + fPanX;
      float y = ((float)iy + 0.5f) * fInvZoom + fPanY;

      float x0 = floorf(x - 0.5f) + 0.5f;
      float y0 = floorf(y - 0.5f) + 0.5f;
      float x1 = x0 + 1.0f;
      float y1 = y0 + 1.0f;

      unsigned short lx0y0 = tex2D(texImage, x0, y0).x;
      unsigned short lx1y0 = tex2D(texImage, x1, y0).x;
      unsigned short lx0y1 = tex2D(texImage, x0, y1).x;
      unsigned short lx1y1 = tex2D(texImage, x1, y1).x;

      unsigned int rgbx0y0 = tex1Dfetch(texImageLUT, lx0y0).x;
      unsigned int rgbx1y0 = tex1Dfetch(texImageLUT, lx1y0).x;
      unsigned int rgbx0y1 = tex1Dfetch(texImageLUT, lx0y1).x;
      unsigned int rgbx1y1 = tex1Dfetch(texImageLUT, lx1y1).x;

      float wx0 = x1 - x;
      float wy0 = y1 - y;
      float wx1 = x - x0;
      float wy1 = y - y0;

      // Precomputed corner weights so each channel is a series of MADs
      float wx0y0 = wx0 * wy0;
      float wx1y0 = wx1 * wy0;
      float wx0y1 = wx0 * wy1;
      float wx1y1 = wx1 * wy1;

      unsigned char b = (rgbx0y0 & 0xFF) * wx0y0 + (rgbx1y0 & 0xFF) * wx1y0 + (rgbx0y1 & 0xFF) * wx0y1 + (rgbx1y1 & 0xFF) * wx1y1;
      unsigned char g = ((rgbx0y0 >> 8) & 0xFF) * wx0y0 + ((rgbx1y0 >> 8) & 0xFF) * wx1y0 + ((rgbx0y1 >> 8) & 0xFF) * wx0y1 + ((rgbx1y1 >> 8) & 0xFF) * wx1y1;
      unsigned char r = ((rgbx0y0 >> 16) & 0xFF) * wx0y0 + ((rgbx1y0 >> 16) & 0xFF) * wx1y0 + ((rgbx0y1 >> 16) & 0xFF) * wx0y1 + ((rgbx1y1 >> 16) & 0xFF) * wx1y1;

      // Note: __syncthreads() inside an if() is only safe if the whole block takes the same branch
      __syncthreads();

      pDst[nWindowSizeX * (nWindowSizeY - iy - 1) + ix] = (b << 16) + (g << 8) + r;
   }
}

extern "C" void render(unsigned int *pDst, cudaArray *pImage, unsigned int *pImageLUT, int nWindowSizeX, int nWindowSizeY, float fPanX, float fPanY, float fZoom)
{
   dim3 threads(16, 16);
   dim3 grid((nWindowSizeX + 15) / 16, (nWindowSizeY + 15) / 16);

   cudaBindTextureToArray(texImage, pImage);
   cudaBindTexture(0, texImageLUT, pImageLUT, 65536 * 4);

   render_kernel<<<grid, threads>>>(pDst, nWindowSizeX, nWindowSizeY, -fPanX, -fPanY, 1.0f / fZoom);

   cudaUnbindTexture(texImageLUT);
   cudaUnbindTexture(texImage);
}

I’ve had a bit of a breakthrough. I changed my LUT from 65536 ints in linear memory to a 2D CUDA array of 256 x 256 ints. This has taken me from 15ms down to 10ms. That’s about the same as my DirectX time.
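
Roughly like this; a sketch of the idea rather than my exact code (I’m assuming the obvious low-byte/high-byte split for the 2D index, and texImageLUT2D / pLUTArray are just illustrative names):

texture<uint1, 2, cudaReadModeElementType> texImageLUT2D;

   // host side: copy the existing 65536-entry device LUT into a 256 x 256 cudaArray
   cudaChannelFormatDesc desc = cudaCreateChannelDesc<uint1>();
   cudaArray *pLUTArray;
   cudaMallocArray(&pLUTArray, &desc, 256, 256);
   cudaMemcpyToArray(pLUTArray, 0, 0, pImageLUT, 65536 * sizeof(unsigned int), cudaMemcpyDeviceToDevice);
   cudaBindTextureToArray(texImageLUT2D, pLUTArray);

   // kernel side: low byte of the index selects the column, high byte the row
   unsigned int rgbx0y0 = tex2D(texImageLUT2D, (lx0y0 & 0xFF) + 0.5f, (lx0y0 >> 8) + 0.5f).x;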

I still have a few problems though.

Firstly, in DirectX my result goes straight into a swap chain and I can then just call Present() which doesn’t seem to have a performance cost. In CUDA I have to call glTexSubImage2D(), draw a quad and then call SwapBuffers(). This is costing me about 3ms. Is there any way to speed this up?

Secondly, in DirectX I can call Sleep() while I’m waiting for the GPU to finish and thus my CPU usage stays under 5%. How can I do this in CUDA? Do I have to wait for the next release?

Thirdly, does anyone know why CUDA only supports the clamp addressing mode and not wrap, border or mirror? In particular the wrap mode would be very useful for me (for some rather obscure reasons).

Many CUDA calls are asynchronous, so they return control to the CPU. However, the next synchronous CUDA call you make seems to spin-lock the CPU waiting for the GPU to finish what it is doing. One can only hope that the next release implements an idle CPU wait.

CUDA does support the wrap mode. Section 4.3.4.2 in the guide mentions it: “For normalized coordinates, the ‘wrap’ addressing mode also may be specified. Wrap addressing is usually used when the texture contains a periodic signal.”
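
In code that would look something like this (my illustration; wrap mode only takes effect with normalized coordinates, and texImage is the texture reference from the kernels above):

   // Enable wrap addressing on the texture reference
   texImage.normalized = true;
   texImage.addressMode[0] = cudaAddressModeWrap;
   texImage.addressMode[1] = cudaAddressModeWrap;
   // Coordinates must then be passed as x / textureWidth, y / textureHeight
   // instead of texel units, so the sampling code above would need adjusting.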