Multi-GPU performance degrades when allocated memory increases

I tested the following on a GTX 690 GPU with 4 GB of RAM under Windows 7 x64, built with Visual C++ 10:

I’ve written a function that takes 2 vectors and adds them into a 3rd vector. The task is split across the 2 GPU devices. I gradually increased the vector size to benchmark GPU performance. The required time increases linearly with vector size up to a certain point and then abruptly jumps. When I disable either one of the two GPUs, the required time stays linear up to the limit of available memory. I’ve enclosed a diagram showing required time versus allocated memory.

Can you tell me what is wrong?

Best,
Ramin

This is my code:

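// Forward declaration: the kernel is defined further down but launched in BenchMark().
__global__ void	AddKernel( const Npp32u * __restrict__ , const Npp32u * __restrict__ , Npp32u * __restrict__ , unsigned ) ;

unsigned	BenchMark( unsigned VectorSize )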
{
	unsigned *		D[ 2 ][ 3 ] ;

	for ( int i = 0 ; i < 2 ; i++ )
	{
		cudaSetDevice( i ) ;

		for ( int j = 0 ; j < 3 ; j++ )
			cudaMalloc( & D[ i ][ j ] , VectorSize * sizeof( unsigned ) ) ;
	}

	unsigned	uStartTime = clock() ;

	// TEST
	for ( int i = 0 ; i < 2 ; i++ )
	{
		cudaSetDevice( i ) ;

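		// Round the grid size up so the tail is covered when VectorSize is not a multiple of 256.
		AddKernel<<<( VectorSize + 255 ) / 256 , 256>>>(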
			D[ i ][ 0 ] ,
			D[ i ][ 1 ] ,
			D[ i ][ 2 ] ,
				VectorSize ) ;
	}

	// Wait for device 1 (still the current device after the loop), then for device 0.
	cudaDeviceSynchronize() ;
	cudaSetDevice( 0 ) ;
	cudaDeviceSynchronize() ;

	unsigned	uEndTime = clock() ;

	for ( int i = 0 ; i < 2 ; i++ )
	{
		cudaSetDevice( i ) ;

		for ( int j = 0 ; j < 3 ; j++ )
			cudaFree( D[ i ][ j ] ) ;
	}

	return uEndTime - uStartTime ;
}

__global__ void	AddKernel(
					const	Npp32u *	__restrict__	pSource1 ,
					const	Npp32u *	__restrict__	pSource2 ,
						Npp32u *	__restrict__	pDestination ,
						unsigned			uLength )
{
	unsigned	x = blockIdx.x * blockDim.x + threadIdx.x ;

	if ( x < uLength )
		pDestination[ x ] = pSource1[ x ] + pSource2[ x ] ;	
}

This is the diagram:

https://sites.google.com/site/raminhalavati/Diagrams.png

I would advise using CUDA events to measure the performance instead of the classical host-side ‘clock()’. Maybe that is the cause of what you observed…
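
Here is a minimal sketch of event-based timing, assuming two visible devices and a fixed vector size. The CHECK macro, the constant names and the chosen size are illustrative assumptions, not part of the original code. Each device gets its own start/stop event pair, since an event belongs to the device that was current when it was created:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same element-wise add as in the original post, written with plain unsigned.
__global__ void AddKernel(const unsigned* __restrict__ a,
                          const unsigned* __restrict__ b,
                          unsigned* __restrict__ out,
                          unsigned n)
{
    unsigned x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n)
        out[x] = a[x] + b[x];
}

// Hypothetical helper (not from the original post): abort on any CUDA error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

int main()
{
    const int kDevices = 2;        // assumption: two visible CUDA devices
    const unsigned n = 1u << 24;   // assumption: 16M elements per device
    unsigned* d[kDevices][3];
    cudaEvent_t start[kDevices], stop[kDevices];

    // Allocate buffers and create one event pair per device.
    for (int i = 0; i < kDevices; ++i) {
        CHECK(cudaSetDevice(i));
        for (int j = 0; j < 3; ++j) {
            CHECK(cudaMalloc(&d[i][j], n * sizeof(unsigned)));
            CHECK(cudaMemset(d[i][j], 0, n * sizeof(unsigned)));
        }
        CHECK(cudaEventCreate(&start[i]));
        CHECK(cudaEventCreate(&stop[i]));
    }

    // Launch on every device; the events bracket the kernel on that device.
    for (int i = 0; i < kDevices; ++i) {
        CHECK(cudaSetDevice(i));
        CHECK(cudaEventRecord(start[i]));
        AddKernel<<<(n + 255) / 256, 256>>>(d[i][0], d[i][1], d[i][2], n);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(stop[i]));
    }

    // Wait for each device and report the GPU-side elapsed time.
    for (int i = 0; i < kDevices; ++i) {
        CHECK(cudaSetDevice(i));
        CHECK(cudaEventSynchronize(stop[i]));
        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start[i], stop[i]));
        std::printf("device %d: %.3f ms\n", i, ms);
        for (int j = 0; j < 3; ++j)
            CHECK(cudaFree(d[i][j]));
        CHECK(cudaEventDestroy(start[i]));
        CHECK(cudaEventDestroy(stop[i]));
    }
    return 0;
}
```

The per-device times exclude host-side launch overhead. If the jump still shows up in these numbers, the slowdown is happening on the GPU side rather than in the timer.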

MK

Dear MK,

I tested that and it made no difference. I also suspected that running both GPUs simultaneously might generate enough heat to force a lower clock rate and reduce speed. But when I monitored the temperature, I did not see anything abnormal (it always stayed under 60 degrees).
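
For anyone who wants to make the same check programmatically, here is a rough sketch using NVML. It is only an illustration and not necessarily the tool used for the monitoring above. It assumes the NVML header and library are installed (link with -lnvml) and that the board supports these queries. Note that NVML device indices may not match the CUDA device order.

```cuda
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        std::fprintf(stderr, "NVML initialisation failed\n");
        return 1;
    }

    unsigned count = 0;
    if (nvmlDeviceGetCount(&count) != NVML_SUCCESS)
        count = 0;

    // Print the core temperature and current SM clock of every GPU NVML can see.
    for (unsigned i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;

        unsigned tempC = 0, smMHz = 0;
        nvmlReturn_t rt = nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
        nvmlReturn_t rc = nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smMHz);

        if (rt == NVML_SUCCESS && rc == NVML_SUCCESS)
            std::printf("GPU %u: %u C, SM clock %u MHz\n", i, tempC, smMHz);
        else
            std::printf("GPU %u: query not supported or failed\n", i);
    }

    nvmlShutdown();
    return 0;
}
```

Running this in a loop while the benchmark executes would show whether the SM clock drops at the point where the timing jumps.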

Problem solved. It was caused by SLI being active. I disabled it, and now everything runs smoothly.