memcpyDtoA takes significant GPU time

Hi All,

We use the code below to create a 3D texture:

channelDesc = cudaCreateChannelDesc<float>();
textureSize = make_cudaExtent( 912, 32, 907 );

// Source: linear device memory, 912 x 32 x 907 floats (approx 105MB)
copyParams.srcPtr = make_cudaPitchedPtr(
    ( void* ) ( pFilteredData_i + start_angle_index * 912 * 32 ),
    textureSize.width * sizeof( float ),
    textureSize.width,
    textureSize.height );
copyParams.dstArray = m_pArrayIPProjectionsGpu;
copyParams.extent = textureSize;
copyParams.kind = cudaMemcpyDeviceToDevice;
cudaMemcpy3D( &copyParams );

texData.filterMode = cudaFilterModeLinear;
texData.addressMode[0] = cudaAddressModeClamp;
texData.addressMode[1] = cudaAddressModeClamp;
texData.addressMode[2] = cudaAddressModeClamp;
cudaBindTextureToArray( texData, m_pArrayIPProjectionsGpu, channelDesc );

// Kernel Call which uses the 3D texture

The code above copies a buffer that already resides in GPU memory (the output of a kernel) into a 3D array. The buffer is about 105 MB. The profiler shows that the D-to-A copy takes 23 ms on average, which corresponds to roughly 4.6 GB/s of GPU memory bandwidth (105 MB / 23 ms). We think that is very low.
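For anyone who wants to check the arithmetic, here is a minimal host-only sketch of the bandwidth estimate. The function names (`volumeBytes`, `effectiveGBps`, `printCopyBandwidth`) are hypothetical helpers made up for illustration, and 1 GB is taken to be 1e9 bytes. Note that a device-to-device copy both reads and writes every byte, so the actual memory traffic is about twice the figure computed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Bytes in a width x height x depth volume of floats (hypothetical helper).
std::size_t volumeBytes(std::size_t width, std::size_t height, std::size_t depth)
{
    return width * height * depth * sizeof(float);
}

// Effective copy bandwidth in GB/s, assuming 1 GB = 1e9 bytes.
double effectiveGBps(std::size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);  // a zero or negative time makes no sense here
    return (static_cast<double>(bytes) / 1e9) / (milliseconds / 1e3);
}

// Prints the estimate for the 912 x 32 x 907 volume and the 23 ms
// copy time reported by the profiler.
void printCopyBandwidth()
{
    const std::size_t bytes = volumeBytes(912, 32, 907);
    std::printf("%zu bytes, %.2f GB/s\n", bytes, effectiveGBps(bytes, 23.0));
}
```

Calling `printCopyBandwidth()` from `main` reports the payload size and the payload bandwidth; doubling the result gives a rough read-plus-write figure to compare against the card's peak memory bandwidth.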

Are there any clues as to what we should look for? (Our card is a C1060.)

Thanks in advance.

OK, we have reformatted the data and loaded it as 2D. The effective bandwidth is now around 40 GB/s.

Thanks for reading this.
