device->host->device copy vs. cudaGLMapBufferObject: 6 ms vs. 9 ms, shouldn't mapping be way faster?

Hi,

I have a texture in OpenGL (640x480, GL_RGBA32F_ARB, so one float per channel rather than one byte) and I need to get this data over to CUDA.

Using glGetTexImage for OpenGL->CPU and then cudaMemcpy for CPU->CUDA takes 6 ms in total. I assume that is about 3 ms per operation, but I have not measured them separately.

The strange thing is that using CUDA’s ability to map buffer objects takes 9 ms. I was hoping it would be a little faster, not slower.

Does anyone have timings of their own, or can anyone confirm my measurements? I expected the mapped version to be at least as fast as the first, naive one.

The first method (6 ms) is straightforward:

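// Read the texture back to host memory as 3 floats per pixel.
glBindTexture(GL_TEXTURE_2D, tex);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGB, GL_FLOAT, pixels);
// Upload the host copy to the CUDA device buffer.
CUDA_SAFE_CALL( cudaMemcpy(d_input, pixels, sizeof(float3) * 640 * 480, cudaMemcpyHostToDevice) );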
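To put the timings in perspective, here is a small sketch of the arithmetic behind them. It assumes the data crosses the bus as 3 floats per pixel (GL_RGB / GL_FLOAT, as in the calls above), that the first method moves the frame twice (device->host, then host->device), and that the second method moves it once. The helper names frame_bytes and throughput_mb_per_s are made up for this illustration.

```cpp
#include <cstddef>

// Bytes in one frame when it is read back with `channels` floats per pixel.
constexpr std::size_t frame_bytes(std::size_t width, std::size_t height,
                                  std::size_t channels) {
    return width * height * channels * sizeof(float);
}

// Effective throughput in MB/s (1 MB = 1e6 bytes) for `bytes` moved in `ms`.
constexpr double throughput_mb_per_s(std::size_t bytes, double ms) {
    return (static_cast<double>(bytes) / 1e6) / (ms / 1e3);
}

// Assumption: float is 4 bytes, so a 640x480 RGB float frame is 3,686,400 bytes.
static_assert(frame_bytes(640, 480, 3) == 3686400, "unexpected float size");
```

With these assumptions, throughput_mb_per_s(2 * frame_bytes(640, 480, 3), 6.0) comes out at roughly 1229 MB/s for the first method, and throughput_mb_per_s(frame_bytes(640, 480, 3), 9.0) at roughly 410 MB/s for the second.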

I implemented the second method (9ms) this way (shortened):

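// Create the buffer and register it with CUDA (done once).
glBindBuffer(GL_ARRAY_BUFFER, buffer);
glBufferData(GL_ARRAY_BUFFER, s, fakeNonNullData, GL_DYNAMIC_DRAW);
CUDA_SAFE_CALL( cudaGLRegisterBufferObject(buffer) );
// Bind the buffer as the pixel pack target so that glGetTexImage writes into it
// (the final 0 is an offset into the buffer, not a host pointer).
glBindBuffer(GL_PIXEL_PACK_BUFFER, buffer);
glBindTexture(GL_TEXTURE_2D, tex);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGB, GL_FLOAT, 0);
// Map the buffer to obtain a device pointer usable from CUDA.
CUDA_SAFE_CALL( cudaGLMapBufferObject((void**)&d_input, buffer) );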

Of course I create the buffer only once for the whole application lifecycle, and I also register the buffer object only once (cudaGLRegisterBufferObject). The only calls I repeat are glGetTexImage and cudaGLMapBufferObject.

In the second method, glGetTexImage takes 6.5 ms and cudaGLMapBufferObject takes 2.5 ms.
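Since GL and CUDA calls may return before the GPU has finished the work, per-call numbers like these depend on where the waiting happens. Below is a minimal sketch of a timing helper that brackets a call with an explicit synchronization step. It is not necessarily how the numbers above were taken; time_ms and its two callbacks are hypothetical names for this illustration.

```cpp
#include <chrono>
#include <functional>

// Runs `work`, then `sync`, and returns the elapsed wall-clock time in ms.
// `sync` is expected to block until the GPU has finished all queued work.
inline double time_ms(const std::function<void()>& work,
                      const std::function<void()>& sync) {
    sync();  // drain earlier work so it is not charged to this measurement
    const auto start = std::chrono::steady_clock::now();
    work();
    sync();  // wait until the timed work has actually completed
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For the GL side, the sync callback could wrap glFinish(); for the CUDA side, cudaThreadSynchronize() should play the same role in this CUDA version. Timing glGetTexImage and cudaGLMapBufferObject separately this way would show whether the map call is really doing 2.5 ms of its own work or just waiting for the read-back to finish.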

Did you spot any obvious mistakes?

I read the other threads about buffer objects and I also checked the SDK examples.

thx

LastBoyScout

edit: Ah sorry, I am using CUDA 1.0.