NVIDIA Developer Forums

memcpyDtoA takes significant GPU time

Akg October 4, 2010, 8:49am 1

Hi All,

The code below is used to create a 3D texture:

channelDesc = cudaCreateChannelDesc<float>();
textureSize = make_cudaExtent( 912, 32, 907 );
copyParams.srcPtr = make_cudaPitchedPtr(
    ( void* ) ( pFilteredData_i + start_angle_index * 912 * 32 ),
    textureSize.width * sizeof( float ),
    textureSize.width,
    textureSize.height );  // approx 105MB
copyParams.dstArray = m_pArrayIPProjectionsGpu;
copyParams.extent = textureSize;
copyParams.kind = cudaMemcpyDeviceToDevice;
cudaMemcpy3D( &copyParams );

texData.filterMode = cudaFilterModeLinear;
texData.addressMode[0] = cudaAddressModeClamp;
texData.addressMode[1] = cudaAddressModeClamp;
texData.addressMode[2] = cudaAddressModeClamp;
cudaBindTextureToArray( texData, m_pArrayIPProjectionsGpu, channelDesc );

// Kernel call which uses the 3D texture

The above code creates a 3D array from memory that is already allocated on the GPU (it is the output of a kernel). The size of the data is ~105 MB. The profiler shows that, on average, the memcpyDtoA call takes 23 ms to complete. This corresponds to about 4.6 GB/s of bandwidth in GPU memory (105 MB / 23 ms), which we think is very low.

Is there any clue as to what we should look for? [Our card is a C1060.]

Thanks in advance.
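The bandwidth estimate in the post above can be reproduced with a few lines of host-side code. This is only a sketch of the arithmetic: the helper names `volumeBytes` and `effectiveGBps` are made up for illustration, 1 GB is taken as 1e9 bytes, and only the payload is counted (a device-to-device copy both reads and writes every byte, so the actual memory traffic is roughly twice the payload).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: payload size in bytes of a width x height x depth
// volume of single-precision floats.
constexpr std::size_t volumeBytes(std::size_t width, std::size_t height,
                                  std::size_t depth)
{
    return width * height * depth * sizeof(float);
}

// Hypothetical helper: effective copy bandwidth in GB/s (1 GB = 1e9 bytes)
// for a payload of `bytes` copied in `milliseconds`.
inline double effectiveGBps(std::size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return static_cast<double>(bytes) / (milliseconds * 1e-3) / 1e9;
}
```

Calling `effectiveGBps(volumeBytes(912, 32, 907), 23.0)` with the figures from the post gives a payload of 105,879,552 bytes and roughly 4.6 GB/s.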

Akg October 7, 2010, 3:35am 3

OK, we have reformatted the data and loaded it as a 2D array. The effective bandwidth is now around 40 GB/s.

Thanks for reading this.
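The post does not say how the data was reformatted, so the following is only one possible reading: the 907 slices of 912 x 32 floats are stacked vertically into a single 912 x 29024 2D CUDA array and copied with `cudaMemcpy2DToArray`. The helper name `copyVolumeAs2D` and the buffer `dSrc` are assumptions for illustration, not the poster's code, and the sketch makes no claim about the speed it would reach.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper (not from the post): stacks `depth` slices of
// width x height floats into one 2D CUDA array of width x (height * depth).
// Assumes the source rows are tightly packed and that height * depth fits
// within the device's 2D array size limit.
cudaError_t copyVolumeAs2D(const float* dSrc, size_t width, size_t height,
                           size_t depth, cudaArray_t* outArray)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaError_t err = cudaMallocArray(outArray, &desc, width, height * depth);
    if (err != cudaSuccess) return err;

    const size_t rowBytes = width * sizeof(float);
    // Width is given in bytes, height in rows; source pitch equals rowBytes.
    err = cudaMemcpy2DToArray(*outArray, 0, 0, dSrc, rowBytes, rowBytes,
                              height * depth, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) {
        cudaFreeArray(*outArray);
        *outArray = nullptr;
    }
    return err;
}

int main()
{
    const size_t width = 912, height = 32, depth = 907;  // sizes from the post
    const size_t bytes = width * height * depth * sizeof(float);

    float* dSrc = nullptr;  // stands in for the kernel output buffer
    if (cudaMalloc((void**)&dSrc, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dSrc, 0, bytes);

    cudaArray_t dArray = nullptr;
    cudaError_t err = copyVolumeAs2D(dSrc, width, height, depth, &dArray);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
        cudaFree(dSrc);
        return 1;
    }
    std::printf("copied %zu bytes into a %zu x %zu array\n",
                bytes, width, height * depth);

    cudaFreeArray(dArray);
    cudaFree(dSrc);
    return 0;
}
```

With this layout, the texel that was at (x, y, z) in the 3D array sits at (x, z * 32 + y) in the 2D array. A 2D texture only filters in two dimensions, so a kernel that relied on hardware interpolation along the third axis would have to fetch two slices and blend them itself, and linear filtering along y would mix neighbouring slices at slice boundaries.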
