CUDA Runs 1/10th the speed of openGL

OK so I’ve got an algorithm that I’m working on, and I’ve previously implemented it using openGL/GLSL. The basic idea is this.

I have two off screen buffers, one is read only and one is write-only. I take in 5 floats as an input form the user, and, along with a (constant) input texture, I do some calculations involving a distance calculation, some sin and cos looks up, etc, I then sum the result with the old result from my input buffer (a texture). Then I return the result from the fragment shader, which is attached to the write buffer and so that’s where it gets stored. Then I swap the input/output buffers and calculate a new frame.

Doing it this way in openGL is very very quick, and I was pleasantly surprised at the results. However, now I’m doing the same algorithm in CUDA and I’m much chagrined to find that it’s running at < 1/10th the speed of the openGL implementation, and I’d like to figure out why. I’ve been through the performance guidlines and the only thing I can think of that would be responsible for the slow down is the fact that my (now single) buffer is in global memory in CUDA, and in texture memory in openGL. I did this because, as far as I could tell, there was no way to write back to texture memory in CUDA, which I need to be able to do. Does anyone have any suggestions on how I might speed this up more?

Memory read performance takes a factor of ~20 or more hit if you don’t have coalesced global memory reads (check the programming guide). Are you accessing the memory in such a way?

If you can’t modify the algo to get coalesced reads, then you can still use a texture. Just create a global mem array for the output and a texture for the input. After every kernel call, do a device->device memcpy to copy the output back to the input. Note that in the current version, device-device memcpys are somewhat slow (10G/s according to the bandwidth test on my machine). It has been said that this will be fixed in a later release.

Let’s say I need to take every element in a buffer (representing a 2D image, 512x512 we’ll say), and do some iterative transformation on it (ie take each element, add something to it, and write it back). Every element is totally independent (that is is doesn’t have any dependencies on neighboring pixels such as a median filter would), how would you structure that to get memory coalescing?

Well I modified my code that so now, I’m operating on a 512x512 image, with a 4x64 grid of 32x8 blocks.

This way, each row of the image has four warps, and each warp calculates 128 pixels (or 4 pixels/thread).

I take the warp# (which is just the blockIdx.x value for a given row), and use that to calculate a base address for the warp (128*warp#), then I have each thread process baseAddr+threadIdx.x, I increment the baseAddr by 32 and do it again.

This should yield all the threads in the warp accessing contigous spots in global memory, the block width is a multiple of half the warp size and the width of my image is a multiple of 16. Everything I’ve seen in the performance guide indicates that this should cause memory coalescing and give a performance boost. Yet I’m seeing exactly the same performance as before.

Hmm, strange. It seems from your block setup that you should get full memory coalescing now. You could check that by commenting out all of the calculation and just copying the input to the output. On a GTX, this should get between 40 and 70GB/s (I get 70GB/s coalesced write only and 40GB/s coalesced read only on mine, haven’t tried a copy).

I’m out of ideas as to your performance hit though. You mention calling sin and cos. Are you calling sinf and cosf as the guide recommends to avoid the type conversion? That’s all I can think of. Maybe someone else has a better idea.

Can you post your code, or at least a simplified version of it?

The performance degradation is probably related to the data transfer between CUDA and OpenGL. Did you test speed of it?

This problem is related to my question:
Is the PBO usage the only option to exchange pixel data between CUDA and OpenGL?
What is the expected speed of this operation?

Simon,
Do you plan to introduce more direct methods of data exchange between OpenGL and CUDA ?
Would it be ever possible to lock and access FBO form the CUDA API?

Jan

I didn’t see anything in the original post about transferring data between CUDA and OpenGL.

But yes, using buffer objects is currently the only efficient way to exchange pixel data between CUDA and OpenGL (other than reading the data back to system memory).

Using buffer objects is essentially free - i.e. reading or writing to a pointer obtained by mapping a buffer object should be the same speed as writing to any other global memory.

There is a cost to transferring data from OpenGL to the buffer object (i.e. glReadPixels), but these copies should happen entirely in video memory at close to memory bandwidth limits.

We are investigating direct OpenGL texture access for future CUDA releases.

Any news on the investigation?

Hi Simon, do you have any benchmark, test program that i can use to measure the bandwidth between OpenGl and Cuda. I try to measure it base on the postProcessGL, and i can not say it is fast, it is only 450 Mbs, that is much lower than memory bandwidth between CPU and GPU even with pageable memory

Do you mean the bandwidth between OpenGL pixel buffer and CUDA global memory is approximate the device-to-device bandwidth that is 62Gbs in my machine.

If i have much lower performance than that what should i do.

(my machine is Opteron Quadcore AMD 275, 6G memory, Quadro 5600, Ubuntu 7.10, driver 177.67)