C2050 bidirectional async transfer slower than unidirectional

I wrote the following code to test the memory transfer bandwidth of our machine, which has 6 C2050 devices. Since the C2050 supports bi-directional async memcpy, I used 8 streams to transfer memory H2D and D2H at the same time. But I found that the aggregate bandwidth of the bi-directional memcpy is even lower than H2D only.

Here is the result:

[root@A124 test]# ./a.out 0
testing uni-directional bandwidth:
time elapsed:0.355392
bandwidth at block size 67108864: 5.62759GB/s

[root@A124 test]# ./a.out 1
testing bi-directional bandwidth:
time elapsed:0.422472
bandwidth at block size 67108864: 4.73404GB/s
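(For reference on the arithmetic: each run moves 64 MiB x 8 streams x 4 iterations = 2 GiB, and the code divides by 1024*1024*1024, so 2 / 0.355392 s gives about 5.63 and 2 / 0.422472 s gives about 4.73; the printed "GB/s" are really GiB/s.)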

Any ideas? Is it possible that the bi-directional transfer is somehow turned off on our machine?

OS: RHEL 5.3 x86_64

driver: 195.36.20

NVCC: release 3.0, V0.2.1221

deviceQuery Result:

Device 0: "Tesla C2050"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 2817982464 bytes
Number of multiprocessors: 14
Number of cores: 448
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
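One thing deviceQuery from the 3.0 SDK doesn't show is how many copy engines the card exposes; "Concurrent copy and execution: Yes" only reflects the deviceOverlap flag. Here is a minimal, hedged sketch for checking this directly (the asyncEngineCount field only exists in toolkits newer than the 3.0 install above, so that part is guarded); a C2050 should report 2 copy engines:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // 1 = copies can overlap kernel execution at all
        printf("deviceOverlap:    %d\n", prop.deviceOverlap);
#if CUDART_VERSION >= 4000
        // Only available in toolkits newer than CUDA 3.0:
        // 1 = one copy engine, 2 = simultaneous H2D + D2H possible
        printf("asyncEngineCount: %d\n", prop.asyncEngineCount);
#endif
        return 0;
}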

Here is the code I used:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <sys/time.h>      // gettimeofday
#include <sched.h>         // sched_setaffinity, CPU_SET
#include <cuda_runtime.h>

using namespace std;

#define CUDA_SAFE_CALL_NO_SYNC( call) {                                        \
        cudaError err = call;                                                  \
        if( cudaSuccess != err) {                                              \
                fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",  \
                                __FILE__, __LINE__, cudaGetErrorString( err) );\
                exit(EXIT_FAILURE);                                            \
        } }

#define CUDA_SAFE_CALL( call) CUDA_SAFE_CALL_NO_SYNC(call);

const int STREAMS=8;

// wall-clock time in seconds, after draining all outstanding GPU work
double get_time(){
        cudaThreadSynchronize();
        timeval t;
        gettimeofday(&t,0);
        return (double)t.tv_sec+(double)t.tv_usec/1000000;
}

void bw_test(int bidirect){
        CUDA_SAFE_CALL( cudaSetDevice(0) );

        // pin the host thread to CPU 0
        cpu_set_t cpu_set;
        CPU_ZERO(&cpu_set);
        CPU_SET(0, &cpu_set);
        sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set);

        cudaStream_t streams[STREAMS];
        for(int i=0;i<STREAMS;i++)
                cudaStreamCreate(&streams[i]);

        void * h_mem[STREAMS];
        void * d_mem[STREAMS];
        unsigned block_size=64*1024*1024;
        int iter=4;

        // allocate pinned host memory and device memory for each stream
        for(int i=0;i<STREAMS;i++){
                CUDA_SAFE_CALL( cudaMallocHost(&h_mem[i], block_size) );
                memset(h_mem[i], 1, block_size);
                CUDA_SAFE_CALL( cudaMalloc((void**)&d_mem[i], block_size) );
        }

        // test bandwidth: even streams always copy H2D;
        // odd streams copy D2H when bidirect is set, H2D otherwise
        double start=get_time();
        for(int i=0;i<iter;i++){
                for(int j=0;j<STREAMS;j++){
                        if(j%2==0){
                                CUDA_SAFE_CALL( cudaMemcpyAsync(d_mem[j], h_mem[j], block_size, cudaMemcpyHostToDevice, streams[j]) );
                        }
                        else if(bidirect){
                                CUDA_SAFE_CALL( cudaMemcpyAsync(h_mem[j], d_mem[j], block_size, cudaMemcpyDeviceToHost, streams[j]) );
                        }
                        else{
                                CUDA_SAFE_CALL( cudaMemcpyAsync(d_mem[j], h_mem[j], block_size, cudaMemcpyHostToDevice, streams[j]) );
                        }
                }
        }
        for(int i=0;i<STREAMS;i++){
                CUDA_SAFE_CALL( cudaStreamSynchronize(streams[i]) );
        }
        double end=get_time();

        // output (dividing by 1024^3, so the "GB/s" figure is really GiB/s)
        double bw=(double)block_size*STREAMS*iter/(1024*1024*1024)/(end-start);
        cout<<"time elapsed:"<<end-start<<endl;
        cout<<"bandwidth at block size "<<block_size<<": "<<bw<<"GB/s"<<endl;
}

int main(int argc, char * argv[]){
        if(argc!=2){
                cout<<"usage: "<<argv[0]<<" 1/0"<<endl;
                return 1;
        }
        int bi_directional=atoi(argv[1]);
        if(bi_directional){
                cout<<"testing bi-directional bandwidth:"<<endl;
        }
        else{
                cout<<"testing uni-directional bandwidth:"<<endl;
        }
        bw_test(bi_directional);
        return 0;
}

Pic of your 6-C2050 rig?

I’ve spent a fair amount of time looking at bidirectional perf, and most likely this is an artifact of your system’s BIOS. What platform is it, and are you running the latest BIOS?

Also, is it a dual-socket dual-chipset Nehalem? Those things are often flaky in terms of bandwidth.

CPU: Dual Xeon 5520

Mainboard: Tyan FT72-B7015

The Xeon 5520 CPUs are Nehalem architecture, and yes, our motherboard is dual-socket, dual-chipset. I believe our BIOS is already the latest version; I will double-check with the manufacturer.

Any suggestions?

That 7015 board is an odd beast - dual Tylersburg IOH and then 4 PEX8647 PCI-e switches to provide 8 x16 PCI-e slots. I always wondered how well it would work in practice (never tried one myself). I would be asking Tyan about the BIOS, because every other dual-Tylersburg mobo I have seen had weird bandwidth issues.
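On a dual-IOH box like that, it may also be worth checking whether the socket the host thread (and its pinned buffers) is bound to makes a difference. The test code above already pins to CPU 0; a drop-in variation is to pin to a core on the other socket and rerun (the core number below is an assumption and depends on how the BIOS enumerates the Xeons):

cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(4, &cpu_set);                        // hypothetical: a core on the second socket
sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set);
// ...then cudaMallocHost() the buffers and run the same memcpy loop as above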

A related question: the following convolution loop does not seem to benefit from bi-directional memcpy. I was expecting the async copies to overlap bi-directionally, but loop timing indicates they don’t. Why? Is my card incapable (Quadro FX 4600)? Is it a code problem? Any good references to read up on this?

(mod: the host-side I/O mallocs below were pinned, writeCombined)

//1) Upload a new frame.
//2) FFT to spectral domain.
//3) Perform 'm' spectral multiplications with filter kernels (convolutions).
//4) Perform 'm' IFFTs to spatial domain.
//5) Sum the 'm' spatial domain outputs.
//6) Download result to host, hopefully while simultaneously uploading the next frame.
//7) Goto 1).

cudaStream_t st1,st2;
cudaStreamCreate(&st1);
cudaStreamCreate(&st2);
int n=0;
while (n<NFRAMES){ //one frame is 640x480.
	//UPLOAD NEW FRAME (as uchar8 to reduce bandwidth):
	cutilSafeCallNoSync( cudaMemcpyAsync(d_Input_uchar, h_Data, dataH * dataW * sizeof(unsigned char), cudaMemcpyHostToDevice, st1) );
	//convert to float for convolution:
	my_cuda_uchar_to_float( dataW,dataH,  d_Input_uchar,dataW, d_Input,dataW);
	//pad for FFT:
	cutilSafeCallNoSync( cudaMemset(d_PaddedInput, 0, fftH * fftW * sizeof(float)) );
	padDataClampToBorder(d_PaddedInput,d_Input,fftH,fftW,dataH,dataW,dataH,dataW,kernelY,kernelX);
	//FFT:
	cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedInput, (cufftComplex *)d_DataSpectrum) );
	//set result accumulator to zero:
	cutilSafeCallNoSync( cudaMemset(d_Result, 0, dataH * dataW * sizeof(float)) );
	//DO 'NCONVS' FFT-BASED CONVOLUTIONS // the meat.
	for(int m=0;m<NCONVS;m++){
		//MULTIPLY same image spectrum by 'm'th pre-defined filter:
		modulateAndNormalize(d_MulSpectrum, d_DataSpectrum, d_KernelSpectrum[m], fftH, fftW);
		//IFFT:
		cufftSafeCall( cufftExecC2R(fftPlanInv, (cufftComplex *)d_MulSpectrum, d_ConvRes) );
		//combine ConvRes outputs into single d_Result (just add them all):
		cuda_add_images_I(dataW,dataH,d_Result,dataW,d_ConvRes,fftW);
	}
	//convert to uint8 to reduce data transfer:
	my_cuda_float_to_uchar( dataW,dataH,  d_Result,dataW, d_Result_uchar,dataW);
	//DOWNLOAD RESULT TO CPU: //hopefully, this should happen simultaneously with upload.
	cutilSafeCallNoSync( cudaMemcpyAsync(h_Result, d_Result_uchar, dataH * dataW * sizeof(unsigned char), cudaMemcpyDeviceToHost,st2) );
	n++;
}

By the way, my IPP/CPU implementation is twice as fast on a ‘nothing special’ PC. Interestingly, the memcpys do not seem to be the bottleneck… each FFT/IFFT seems to overload the GPU such that every kernel runs serially anyway. Perhaps the code could be further optimised or is wrong (e.g. sync tips/issues)? Any relevant reading/tutorials?
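For what it's worth, one way to check whether the two copies actually overlap, independent of the kernels in the loop, is to time just the pair of async memcpys with events. This is only a sketch: d_in/h_in/d_out/h_out/nbytes are placeholders (the host buffers must be pinned), and with the CUDA 3.x default stream the stop event recorded in stream 0 completes only after both st1 and st2 have drained:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cudaMemcpyAsync(d_in,  h_in,  nbytes, cudaMemcpyHostToDevice, st1);
cudaMemcpyAsync(h_out, d_out, nbytes, cudaMemcpyDeviceToHost, st2);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// If the copies overlap, ms should be close to the time of a single copy;
// if they serialize, it will be roughly the sum of both.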

Quadro 4600 is G80-based, which isn’t capable of any asynchronous memcpys whatsoever.
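(On such a card, cudaGetDeviceProperties reports deviceOverlap = 0, and the copies cannot overlap each other or kernel execution.)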

Thanks. I imagine multiple large-ish (640x480) IFFTs would saturate any GPU and so wouldn’t run well in parallel, meaning the H2D/D2H transfers are not the main bottleneck, and upgrading to hardware capable of async/bi-directional memcpy would be pointless. True?