Low performance and high CPU usage

I’m trying to use CUDA and OpenGL to write a viewer for large (say 12k x 12k) images. I’m storing the images in video memory as an array of 16-bit indices and a palette of 32-bit RGB values. I’m then displaying (part of) this image in a display window with pan and zoom controls. I do this by processing the data into an OpenGL PBO which I then copy into an OpenGL texture and use to draw a quad into the backbuffer of my window.

I have a couple of problems. Firstly, it doesn’t seem to be very fast. For a 1600x1200 window the whole process takes around 30ms on a GeForce 8600GT. Secondly, it seems to be using 100% of the host CPU. Can anyone help?

My kernel looks like this:

__global__ void render_kernel(unsigned int *pDst, int nWindowSizeX, int nWindowSizeY, float fPanX, float fPanY, float fZoom)


   int ix = blockDim.x * blockIdx.x + threadIdx.x;

   int iy = blockDim.y * blockIdx.y + threadIdx.y;

  if(ix < nWindowSizeX && iy < nWindowSizeY)


      float x = ((float)ix + fPanX + 0.5f) / fZoom;

      float y = ((float)iy + fPanY + 0.5f) / fZoom;


      float x0 = floorf(x - 0.5f) + 0.5f;

      float y0 = floorf(y - 0.5f) + 0.5f;

      float x1 = x0 + 1.0f;

      float y1 = y0 + 1.0f;


      unsigned short lx0y0 = tex2D(texImage, x0, y0).x;

      unsigned short lx1y0 = tex2D(texImage, x1, y0).x;

      unsigned short lx0y1 = tex2D(texImage, x0, y1).x;

      unsigned short lx1y1 = tex2D(texImage, x1, y1).x;


     uchar4 rgbx0y0 = tex1Dfetch(texImageLUT, lx0y0);

      uchar4 rgbx1y0 = tex1Dfetch(texImageLUT, lx1y0);

      uchar4 rgbx0y1 = tex1Dfetch(texImageLUT, lx0y1);

      uchar4 rgbx1y1 = tex1Dfetch(texImageLUT, lx1y1);


     float wx1 = x - x0;

      float wx0 = 1.0f - wx1;

      float wy1 = y - y0;

      float wy0 = 1.0f - wy1;

     unsigned char b = (rgbx0y0.x * wx0 + rgbx1y0.x * wx1) * wy0 + (rgbx0y1.x * wx0 + rgbx1y1.x * wx1) * wy1;

      unsigned char g = (rgbx0y0.y * wx0 + rgbx1y0.y * wx1) * wy0 + (rgbx0y1.y * wx0 + rgbx1y1.y * wx1) * wy1;

      unsigned char r = (rgbx0y0.z * wx0 + rgbx1y0.z * wx1) * wy0 + (rgbx0y1.z * wx0 + rgbx1y1.z * wx1) * wy1;


      pDst[nWindowSizeX * (nWindowSizeY - iy - 1) + ix] = (b << 16) + (g << 8) + r;



why you use each thread only once, rather than a loop like “while(ix < nWindowSizeX && iy < nWindowSizeY)”? what’s your blockNum and threadNum?

you could use fewer bn and tn, but with “while” loop. I always scan a 1d array like below, for coalesced read. and for the very first parameter before tuning performance, bn = tn =256 is a safe start.


void scan(T* d_Dst, T* d_Src, const int nR)


	const int gridLen = gridDim.x * blockDim.x;

	T tmpR;

	int offset = blockIdx.x * blockDim.x + threadIdx.x;

	while(offset < nR)


  tmpR = d_Src[offset];

                                //do some thing

  d_Dst[offset] = tmpR;

  offset = offset + gridLen;



for cpu, since pbo is alreay in gpu, there’s no cpu transfer overhead, so 100% is weird. perhaps your UI state machine is problematic. you can download a trial amd “CodeAnalyst” to check the bottleneck.

I’ve now written the same thing using plain DirectX 9.0 and a custom pixel shader and its much faster (about 10ms instead of 30ms) and has almost 0% host CPU usage. Unfortunately there are a couple of major drawbacks. Firstly, I’m limited to 8k x 8k images. Secondly, I have to use an A8L8 texture for my image data. This is a big problem because A8L8 isn’t supported as a render target in DirectX so my image has to come directly from the host and can’t be pre-processed on the GPU.

Does anyone have any ideas why my CUDA version is so slow? Is there any chance of 16k or 32k textures in a future version of the DirectX driver?

I started from the image denoising sample program. I’m still a little bit vague on exactly how threads, grids and blocks fit together. Are there any complete samples that are closer to what I should be doing?

As far as the 100% host CPU usage goes I think at least half the problem is the OpenGL interop. I think that some of those calls are blocking if the GPU is busy. I’ve tried putting a few Sleep() calls in and I can reclaim some time but with no way to find out if a function is going to block its pretty difficult to do much.

Is your image stored as GL_RGB? If so, it will need to be unpacked by the cpu to a four component format before CUDA gets ahold of it. The fast path is GL_RGBA.

Are you using glDrawPixels?
CUDA driver seems to implement it using a CPU memcpy. It’s better to use glTexSubImage like said by the NV guy.

have u installed newest sdk for asyncronized kernel call?

dx’s supporting for >8k*8k is not very near i think.

for the A8L8, dx have supports of this kind of char2 i think, you can check your card’s d3dfmt to workaround it. or you can pad it into char4, in which case you waste half of the memory.

Yes, I’ve installed the newest SDK.

Does anyone know exactly why DirectX is limited to 8k x 8k when CUDA isn’t? Is there some part of the hardware that CUDA isn’t using that imposes this limit or is it just a software/driver thing?

The only 16-bit per pixel texture that is supported under DirectX as a render target is R16F which I can’t use because of the loss of precision and the hassle of converting between float and half-float. I can’t really afford the memory for a 32-bit per pixel texture.

I’m already using glTexSubImage

I don’t really understand this question. My source image is stored as a CUDA array of ushort1 for the indices plus a CUDA linear block uchar4 for the palette. The destination is a PBO which gets copied into a GL_RGBA texture using glTexSubImage2D.

As I understand it, my use of tex2D means that I am accessing texture memory space and that these accesses will be cached. Would I get better performance using global memory space?

Okay, I’ve tried using global memory instead of texture memory for my image data. I’ve tried using a loop within my kernel. I’ve tried lots of different grid/block sizes and nothing is helping. What am I doing wrong? How do I write it so that the hardware is used in as similar way as possible to that under Direct3D?

I see you’re using 2 syncthreads in your code. Can you remove one of them or both and report what happens? Since there is no write dependency in your code, everything should still stay functionally correct, but it’d be interesting to see performance differences.


I’ve just retested it and actually they don’t seem to make any difference at all. I can take out one or both and my times are the same. Typical lines from the profiler log look like this:

method=[ render_kernel ] gputime=[ 24010.561 ] cputime=[ 31219.609 ] occupancy=[ 0.583 ]

method=[ render_kernel ] gputime=[ 23795.010 ] cputime=[ 30925.439 ] occupancy=[ 0.583 ]

My own timings (which also include the time to display the results in OpenGL) are around 35ms.