CUDA Runs 1/10th the speed of openGL

gct · May 17, 2007, 8:39pm

OK so I’ve got an algorithm that I’m working on, and I’ve previously implemented it using openGL/GLSL. The basic idea is this.

I have two off screen buffers, one is read only and one is write-only. I take in 5 floats as an input form the user, and, along with a (constant) input texture, I do some calculations involving a distance calculation, some sin and cos looks up, etc, I then sum the result with the old result from my input buffer (a texture). Then I return the result from the fragment shader, which is attached to the write buffer and so that’s where it gets stored. Then I swap the input/output buffers and calculate a new frame.

Doing it this way in openGL is very very quick, and I was pleasantly surprised at the results. However, now I’m doing the same algorithm in CUDA and I’m much chagrined to find that it’s running at < 1/10th the speed of the openGL implementation, and I’d like to figure out why. I’ve been through the performance guidlines and the only thing I can think of that would be responsible for the slow down is the fact that my (now single) buffer is in global memory in CUDA, and in texture memory in openGL. I did this because, as far as I could tell, there was no way to write back to texture memory in CUDA, which I need to be able to do. Does anyone have any suggestions on how I might speed this up more?

MisterAnderson42 · May 18, 2007, 12:35pm

Memory read performance takes a factor of ~20 or more hit if you don’t have coalesced global memory reads (check the programming guide). Are you accessing the memory in such a way?

If you can’t modify the algo to get coalesced reads, then you can still use a texture. Just create a global mem array for the output and a texture for the input. After every kernel call, do a device->device memcpy to copy the output back to the input. Note that in the current version, device-device memcpys are somewhat slow (10G/s according to the bandwidth test on my machine). It has been said that this will be fixed in a later release.

gct · May 18, 2007, 5:10pm

Let’s say I need to take every element in a buffer (representing a 2D image, 512x512 we’ll say), and do some iterative transformation on it (ie take each element, add something to it, and write it back). Every element is totally independent (that is is doesn’t have any dependencies on neighboring pixels such as a median filter would), how would you structure that to get memory coalescing?

gct · May 18, 2007, 6:51pm

Well I modified my code that so now, I’m operating on a 512x512 image, with a 4x64 grid of 32x8 blocks.

This way, each row of the image has four warps, and each warp calculates 128 pixels (or 4 pixels/thread).

I take the warp# (which is just the blockIdx.x value for a given row), and use that to calculate a base address for the warp (128*warp#), then I have each thread process baseAddr+threadIdx.x, I increment the baseAddr by 32 and do it again.

This should yield all the threads in the warp accessing contigous spots in global memory, the block width is a multiple of half the warp size and the width of my image is a multiple of 16. Everything I’ve seen in the performance guide indicates that this should cause memory coalescing and give a performance boost. Yet I’m seeing exactly the same performance as before.

MisterAnderson42 · May 19, 2007, 3:10am

Well I modified my code that so now, I’m operating on a 512x512 image, with a 4x64 grid of 32x8 blocks.

This way, each row of the image has four warps, and each warp calculates 128 pixels (or 4 pixels/thread).

I take the warp# (which is just the blockIdx.x value for a given row), and use that to calculate a base address for the warp (128*warp#), then I have each thread process baseAddr+threadIdx.x, I increment the baseAddr by 32 and do it again.

This should yield all the threads in the warp accessing contigous spots in global memory, the block width is a multiple of half the warp size and the width of my image is a multiple of 16. Everything I’ve seen in the performance guide indicates that this should cause memory coalescing and give a performance boost. Yet I’m seeing exactly the same performance as before.

[snapback]198665[/snapback]

Hmm, strange. It seems from your block setup that you should get full memory coalescing now. You could check that by commenting out all of the calculation and just copying the input to the output. On a GTX, this should get between 40 and 70GB/s (I get 70GB/s coalesced write only and 40GB/s coalesced read only on mine, haven’t tried a copy).

I’m out of ideas as to your performance hit though. You mention calling sin and cos. Are you calling sinf and cosf as the guide recommends to avoid the type conversion? That’s all I can think of. Maybe someone else has a better idea.

Simon_Green · May 21, 2007, 2:37pm

Can you post your code, or at least a simplified version of it?

janzmca · May 23, 2007, 2:15am

The performance degradation is probably related to the data transfer between CUDA and OpenGL. Did you test speed of it?

This problem is related to my question:
Is the PBO usage the only option to exchange pixel data between CUDA and OpenGL?
What is the expected speed of this operation?

Simon,
Do you plan to introduce more direct methods of data exchange between OpenGL and CUDA ?
Would it be ever possible to lock and access FBO form the CUDA API?

Jan

Simon_Green · May 23, 2007, 10:07am

I didn’t see anything in the original post about transferring data between CUDA and OpenGL.

But yes, using buffer objects is currently the only efficient way to exchange pixel data between CUDA and OpenGL (other than reading the data back to system memory).

Using buffer objects is essentially free - i.e. reading or writing to a pointer obtained by mapping a buffer object should be the same speed as writing to any other global memory.

There is a cost to transferring data from OpenGL to the buffer object (i.e. glReadPixels), but these copies should happen entirely in video memory at close to memory bandwidth limits.

We are investigating direct OpenGL texture access for future CUDA releases.

foo · September 3, 2008, 8:03am

Any news on the investigation?

Linh_Ha · September 3, 2008, 4:07pm

Hi Simon, do you have any benchmark, test program that i can use to measure the bandwidth between OpenGl and Cuda. I try to measure it base on the postProcessGL, and i can not say it is fast, it is only 450 Mbs, that is much lower than memory bandwidth between CPU and GPU even with pageable memory

Do you mean the bandwidth between OpenGL pixel buffer and CUDA global memory is approximate the device-to-device bandwidth that is 62Gbs in my machine.

If i have much lower performance than that what should i do.

(my machine is Opteron Quadcore AMD 275, 6G memory, Quadro 5600, Ubuntu 7.10, driver 177.67)

Topic		Replies	Views
How would you do this? CUDA Programming and Performance	12	4464	August 5, 2008
CUDA vs. OpenGL textures read-only vs. read-write CUDA Programming and Performance	5	10926	March 30, 2007
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3300	January 10, 2010
Shared memory as slow as global memory CUDA Programming and Performance	8	4281	September 5, 2016
CUDA performance How to improve my code? CUDA Programming and Performance	16	13571	June 29, 2010
Resample a 16-bit indexed color image What's killing my performance? CUDA Programming and Performance	15	14254	August 23, 2007
About coalescing CUDA Programming and Performance	6	2605	April 16, 2010
Local memory performance Using more than 4kb kills it.. why? CUDA Programming and Performance	24	5064	September 6, 2008
global memory latency CUDA Programming and Performance	12	16156	December 13, 2007
Texture Memory vs. Global Memory and float4 CUDA Programming and Performance	5	1826	November 1, 2010

CUDA Runs 1/10th the speed of openGL

Related topics