C2050 bidirectional async transfer slower than unidirectional

I wrote the following code to test the memory transfer bandwidth of our machine, which has 6 C2050 devices. Since the C2050 supports bi-directional async memcpy, I used 8 streams to transfer memory H2D and D2H at the same time. But I found that the aggregate bandwidth of the bi-directional memcpy is even lower than H2D only.

Here is the result:

[root@A124 test]# ./a.out 0
testing uni-directional bandwidth:
time elapsed:0.355392
bandwidth at block size 67108864: 5.62759GB/s

[root@A124 test]# ./a.out 1
testing bi-directional bandwidth:
time elapsed:0.422472
bandwidth at block size 67108864: 4.73404GB/s
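(For reference on the arithmetic: each run moves 64 MiB x 8 streams x 4 iterations = 2 GiB, and the code divides by 1024*1024*1024, so 2 / 0.355392 s gives about 5.63 and 2 / 0.422472 s gives about 4.73; the printed "GB/s" are really GiB/s.)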

Any ideas? Is it possible that the bi-directional transfer is somehow turned off on our machine?

OS: RHEL 5.3 x86_64

driver: 195.36.20

NVCC: release 3.0, V0.2.1221

deviceQuery Result:

Device 0: "Tesla C2050"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 2817982464 bytes
Number of multiprocessors: 14
Number of cores: 448
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
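One thing deviceQuery from the 3.0 SDK doesn't show is how many copy engines the card exposes; "Concurrent copy and execution: Yes" only reflects the deviceOverlap flag. Here is a minimal, hedged sketch for checking this directly (the asyncEngineCount field only exists in toolkits newer than the 3.0 install above, so that part is guarded); a C2050 should report 2 copy engines:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // 1 = copies can overlap kernel execution at all
        printf("deviceOverlap:    %d\n", prop.deviceOverlap);
#if CUDART_VERSION >= 4000
        // Only available in toolkits newer than CUDA 3.0:
        // 1 = one copy engine, 2 = simultaneous H2D + D2H possible
        printf("asyncEngineCount: %d\n", prop.asyncEngineCount);
#endif
        return 0;
}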

Here is the code I used:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <sys/time.h>      // gettimeofday
#include <sched.h>         // sched_setaffinity, CPU_SET
#include <cuda_runtime.h>

using namespace std;

#define CUDA_SAFE_CALL_NO_SYNC( call) {                                        \
        cudaError err = call;                                                  \
        if( cudaSuccess != err) {                                              \
                fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",  \
                                __FILE__, __LINE__, cudaGetErrorString( err) );\
                exit(EXIT_FAILURE);                                            \
        } }

#define CUDA_SAFE_CALL( call) CUDA_SAFE_CALL_NO_SYNC(call);

const int STREAMS=8;

// wall-clock time in seconds, after draining all outstanding GPU work
double get_time(){
        cudaThreadSynchronize();
        timeval t;
        gettimeofday(&t,0);
        return (double)t.tv_sec+(double)t.tv_usec/1000000;
}

void bw_test(int bidirect){
        CUDA_SAFE_CALL( cudaSetDevice(0) );

        // pin the host thread to CPU 0
        cpu_set_t cpu_set;
        CPU_ZERO(&cpu_set);
        CPU_SET(0, &cpu_set);
        sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set);

        cudaStream_t streams[STREAMS];
        for(int i=0;i<STREAMS;i++)
                cudaStreamCreate(&streams[i]);

        void * h_mem[STREAMS];
        void * d_mem[STREAMS];
        unsigned block_size=64*1024*1024;
        int iter=4;

        // allocate pinned host memory and device memory for each stream
        for(int i=0;i<STREAMS;i++){
                CUDA_SAFE_CALL( cudaMallocHost(&h_mem[i], block_size) );
                memset(h_mem[i], 1, block_size);
                CUDA_SAFE_CALL( cudaMalloc((void**)&d_mem[i], block_size) );
        }

        // test bandwidth: even streams always copy H2D;
        // odd streams copy D2H when bidirect is set, H2D otherwise
        double start=get_time();
        for(int i=0;i<iter;i++){
                for(int j=0;j<STREAMS;j++){
                        if(j%2==0){
                                CUDA_SAFE_CALL( cudaMemcpyAsync(d_mem[j], h_mem[j], block_size, cudaMemcpyHostToDevice, streams[j]) );
                        }
                        else if(bidirect){
                                CUDA_SAFE_CALL( cudaMemcpyAsync(h_mem[j], d_mem[j], block_size, cudaMemcpyDeviceToHost, streams[j]) );
                        }
                        else{
                                CUDA_SAFE_CALL( cudaMemcpyAsync(d_mem[j], h_mem[j], block_size, cudaMemcpyHostToDevice, streams[j]) );
                        }
                }
        }
        for(int i=0;i<STREAMS;i++){
                CUDA_SAFE_CALL( cudaStreamSynchronize(streams[i]) );
        }
        double end=get_time();

        // output (dividing by 1024^3, so the "GB/s" figure is really GiB/s)
        double bw=(double)block_size*STREAMS*iter/(1024*1024*1024)/(end-start);
        cout<<"time elapsed:"<<end-start<<endl;
        cout<<"bandwidth at block size "<<block_size<<": "<<bw<<"GB/s"<<endl;
}

int main(int argc, char * argv[]){
        if(argc!=2){
                cout<<"usage: "<<argv[0]<<" 1/0"<<endl;
                return 1;
        }
        int bi_directional=atoi(argv[1]);
        if(bi_directional){
                cout<<"testing bi-directional bandwidth:"<<endl;
        }
        else{
                cout<<"testing uni-directional bandwidth:"<<endl;
        }
        bw_test(bi_directional);
        return 0;
}

Pic of your 6-C2050 rig?

I’ve spent a fair amount of time looking at bidirectional perf, and most likely this is an artifact of your system’s BIOS. What platform is it, and are you running the latest BIOS?

Also, is it a dual-socket dual-chipset Nehalem? Those things are often flaky in terms of bandwidth.

CPU: Dual Xeon 5520

Mainboard: Tyan FT72-B7015

The Xeon 5520 CPUs are Nehalem architecture, and yes, our motherboard is dual-socket, dual-chipset. I believe our BIOS is already the latest version; I will double-check with the manufacturer.

Any suggestions?

That 7015 board is an odd beast - dual Tylersburg IOH and then 4 PEX8647 PCI-e switches to provide 8 x16 PCI-e slots. I always wondered how well it would work in practice (never tried one myself). I would be asking Tyan about the BIOS, because every other dual-Tylersburg mobo I have seen had weird bandwidth issues.
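On a dual-IOH box like that, it may also be worth checking whether the socket the host thread (and its pinned buffers) is bound to makes a difference. The test code above already pins to CPU 0; a drop-in variation is to pin to a core on the other socket and rerun (the core number below is an assumption and depends on how the BIOS enumerates the Xeons):

cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(4, &cpu_set);                        // hypothetical: a core on the second socket
sched_setaffinity(0, sizeof(cpu_set_t), &cpu_set);
// ...then cudaMallocHost() the buffers and run the same memcpy loop as above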

A related question: the following convolution loop does not seem to benefit from bi-directional memcpy. I was expecting the async copies to overlap bi-directionally, but loop timing indicates they don’t. Why? Is my card incapable (Quadro FX 4600)? Is it a code problem? Any good references to read up on this?

(mod: the host-side I/O mallocs below were pinned, writeCombined)

//1) Upload a new frame.
//2) FFT to spectral domain.
//3) Perform 'm' spectral multiplications with filter kernels (convolutions).
//4) Perform 'm' IFFTs to spatial domain.
//5) Sum the 'm' spatial domain outputs.
//6) Download result to host, hopefully while simultaneously uploading the next frame.
//7) Goto 1).

cudaStream_t st1,st2;
cudaStreamCreate(&st1);
cudaStreamCreate(&st2);
int n=0;
while (n<NFRAMES){ //one frame is 640x480.
	//UPLOAD NEW FRAME (as uchar8 to reduce bandwidth):
	cutilSafeCallNoSync( cudaMemcpyAsync(d_Input_uchar, h_Data, dataH * dataW * sizeof(unsigned char), cudaMemcpyHostToDevice, st1) );
	//convert to float for convolution:
	my_cuda_uchar_to_float( dataW,dataH,  d_Input_uchar,dataW, d_Input,dataW);
	//pad for FFT:
	cutilSafeCallNoSync( cudaMemset(d_PaddedInput, 0, fftH * fftW * sizeof(float)) );
	padDataClampToBorder(d_PaddedInput,d_Input,fftH,fftW,dataH,dataW,dataH,dataW,kernelY,kernelX);
	//FFT:
	cufftSafeCall( cufftExecR2C(fftPlanFwd, d_PaddedInput, (cufftComplex *)d_DataSpectrum) );
	//set result accumulator to zero:
	cutilSafeCallNoSync( cudaMemset(d_Result, 0, dataH * dataW * sizeof(float)) );
	//DO 'NCONVS' FFT-BASED CONVOLUTIONS // the meat.
	for(int m=0;m<NCONVS;m++){
		//MULTIPLY same image spectrum by 'm'th pre-defined filter:
		modulateAndNormalize(d_MulSpectrum, d_DataSpectrum, d_KernelSpectrum[m], fftH, fftW);
		//IFFT:
		cufftSafeCall( cufftExecC2R(fftPlanInv, (cufftComplex *)d_MulSpectrum, d_ConvRes) );
		//combine ConvRes outputs into single d_Result (just add them all):
		cuda_add_images_I(dataW,dataH,d_Result,dataW,d_ConvRes,fftW);
	}
	//convert to uint8 to reduce data transfer:
	my_cuda_float_to_uchar( dataW,dataH,  d_Result,dataW, d_Result_uchar,dataW);
	//DOWNLOAD RESULT TO CPU: //hopefully, this should happen simultaneously with upload.
	cutilSafeCallNoSync( cudaMemcpyAsync(h_Result, d_Result_uchar, dataH * dataW * sizeof(unsigned char), cudaMemcpyDeviceToHost,st2) );
	n++;
}

By the way, my IPP/CPU implementation is twice as fast on a ‘nothing special’ PC. Interestingly, the memcpys do not seem to be the bottleneck… each FFT/IFFT seems to overload the GPU such that every kernel runs serially anyway. Perhaps the code could be further optimised or is wrong (e.g. sync tips/issues)? Any relevant reading/tutorials?
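For what it's worth, one way to check whether the two copies actually overlap, independent of the kernels in the loop, is to time just the pair of async memcpys with events. This is only a sketch: d_in/h_in/d_out/h_out/nbytes are placeholders (the host buffers must be pinned), and with the CUDA 3.x default stream the stop event recorded in stream 0 completes only after both st1 and st2 have drained:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cudaMemcpyAsync(d_in,  h_in,  nbytes, cudaMemcpyHostToDevice, st1);
cudaMemcpyAsync(h_out, d_out, nbytes, cudaMemcpyDeviceToHost, st2);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// If the copies overlap, ms should be close to the time of a single copy;
// if they serialize, it will be roughly the sum of both.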

Quadro 4600 is G80-based, which isn’t capable of any asynchronous memcpys whatsoever.
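(On such a card, cudaGetDeviceProperties reports deviceOverlap = 0, and the copies cannot overlap each other or kernel execution.)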

Thanks. I imagine multiple large-ish (640x480) IFFTs would saturate any GPU and so wouldn’t run well in parallel, meaning the H2D/D2H transfers are not the main bottleneck, and upgrading to hardware capable of async/bi-directional memcpy would be pointless. True?