DMA transfers in parallel 2-way SLI with 2 GTX 280

Hi all,

My program transfers large amounts of data (256 MB in a single cudaMemcpy() call, for example) in a loop from host to device memory using DMA. I use two worker threads, and each of them calls cudaSetDevice() to select the GPU it executes on.
All seems to work fine, but I'm confused by the timing results of the cudaMemcpy() calls! Sometimes these transfers run twice as fast as the same transfers in a single worker thread. And sometimes my program shows that a transfer completed in 0 ms! How is that possible with the synchronous cudaMemcpy() function? And what happens when DMA transfers overlap (two worker threads start a DMA transfer at the same time)?

I should also add that I run into this only on Windows (Windows XP, specifically); on Linux the timing results of the same program are fine (but the transfers are much slower than on Windows).

All this runs on:
Windows XP SP2
CUDA 2.1
2-way SLI with 2 GTX 280

Thanks for your help
Denis

When you say 0 ms, does that literally mean zero, or does the call still take a couple of microseconds to execute? I'm not sure I understand the part about twice as fast either. Do you have any code you could post? What version of PCI Express are you running this on, and what speeds do you see under Linux?

Thanks for your answer!

When I said 0 ms, I meant that this code in my program

GET_TIME(st);
//cst = clock();
if (is_single)
	_sp->retrieve(dev_F, segment_index);
else
	_sp->retrieve(dev_Fd, segment_index);
GET_TIME(et);
if (TIMING) printf("\tthread index %d: segment %d: iteration %d: memory copy time: %.3fs\n", dev_index, segment_index, k, et - st);
//cet = clock();
//if (TIMING) printf("\tthread index %d: segment %d: iteration %d: memory copy time(time.h): %d\n", dev_index, segment_index, k-1, cet - cst);

prints something like this

thread index 0: segment 0: iteration 1: memory copy time: 0.000s

Here is the retrieve() function

template<typename T>
void MemoryStorage<T>::retrieve(cudaArray* ptr)
{
	CUDA_FUNCTION_CALL(cudaMemcpy2DToArray(ptr, 0, 0, this->_ptr, sizeof(T) * this->_N, sizeof(T) * this->_N, this->_segment_size / this->_N, cudaMemcpyHostToDevice));
}

I defined the GET_TIME() macro as

#define GET_TIME(t) if (true) { \
	SYSTEMTIME time_t; \
	GetSystemTime(&time_t); \
	t = time_t.wMilliseconds / 1000.0 + time_t.wSecond + time_t.wMinute * 60.0 + time_t.wHour * 60.0 * 60.0; \
}

I also tried to measure the time using the clock() function from time.h (the commented-out lines) and got the same results. So "0 ms" means the measured execution time of cudaMemcpy2DToArray() was under 1 ms, while the data size is 256 MB.

About the "twice as fast" part:

the whole loop where I perform these data transfers looks like this

while (++k != _K + 1 && !_stop) {
	sync_cpu_threads();
	_stop = true;
	CUDA_FUNCTION_CALL(cudaMemcpy(dev_x_M, _host_x_M, sizeof(T) * _M, cudaMemcpyHostToDevice));

	// each worker thread handles every _dev_count-th segment
	for (int segment_index = dev_index; segment_index < _segments_count; segment_index += this->_dev_count) {
		__int64 x_segment_size = _sp->size(segment_index) / _M;
		__int64 start_index = x_segment_size * segment_index;

		// timed: the 256 MB host-to-device segment copy
		GET_TIME(st);
		//cst = clock();
		if (is_single)
			_sp->retrieve(dev_F, segment_index);
		else
			_sp->retrieve(dev_Fd, segment_index);
		GET_TIME(et);
		if (TIMING) printf("\tthread index %d: segment %d: iteration %d: memory copy time: %.3fs\n", dev_index, segment_index, k, et - st);
		//cet = clock();
		//if (TIMING) printf("\tthread index %d: segment %d: iteration %d: memory copy time(time.h): %d\n", dev_index, segment_index, k-1, cet - cst);

		if (is_single)
			CUDA_FUNCTION_CALL(cudaBindTextureToArray(texRefA, dev_F));
		/*else
			CUDA_FUNCTION_CALL(cudaBindTextureToArray(texRefAd, dev_F));*/

		CUDA_FUNCTION_CALL(cudaMemcpy(dev_x, _host_x + start_index, sizeof(T) * x_segment_size, cudaMemcpyHostToDevice));
		if (TRACE) printf("\tthread index %d: segment %d: iteration %d\n", dev_index, segment_index, k);

		host_stop = 1;
		CUDA_FUNCTION_CALL(cudaMemcpy(dev_stop, &host_stop, sizeof(int) * 1, cudaMemcpyHostToDevice));
		//will use memset instead of memcpy

		// timed: the kernel for this segment
		GET_TIME(st);
		//cst = clock();
		if (is_single)
			iteration_v1<<< blocks, threads >>>((float*)dev_x, (float*)dev_x_M, (float*)dev_g, _M, start_index, (float)_epsilon, dev_stop);
		else
			iteration_v1d<<< blocks, threads >>>((double*)dev_x, (double*)dev_x_M, (double*)dev_g, (double*)dev_Fd, _M, start_index, (double)_epsilon, dev_stop);

		e = cudaThreadSynchronize();
		GET_TIME(et);
		if (TIMING) printf("\tthread index %d: segment %d: iteration %d: kernel work time: %.3fs\n", dev_index, segment_index, k, et - st);
		//cet = clock();
		//if (TIMING) printf("\tthread index %d: segment %d: iteration %d: kernel work time(time.h): %d\n", dev_index, segment_index, k-1, cet - cst);
		//e = cudaGetLastError();
		if (e != cudaSuccess) {
			fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n", __FILE__, __LINE__, cudaGetErrorString(e));
			exit(e);
		}

		CUDA_FUNCTION_CALL(cudaMemcpy(&host_stop, dev_stop, sizeof(int) * 1, cudaMemcpyDeviceToHost));
		if (dev_F) {
			if (is_single)
				cudaUnbindTexture(texRefA);
			/*else
				cudaUnbindTexture(texRefAd);*/
			//cudaFreeArray(dev_F);
			//dev_F = NULL;
		}
		CUDA_FUNCTION_CALL(cudaMemcpy(_host_x + start_index, dev_x, sizeof(T) * x_segment_size, cudaMemcpyDeviceToHost));

		// combine the per-thread stop flags under a mutex
		if (this->_dev_count > 1) LOCK_MUTEX(_g_mutex);
		_stop &= (bool)host_stop;
		if (this->_dev_count > 1) UNLOCK_MUTEX(_g_mutex);
	}

	if (TRACE && this->_dev_count > 1) printf("\tthread index %d: iteration done. synchronizing...\n", dev_index);
	sync_cpu_threads();
	if (TRACE && k < _K) printf("thread index %d: iteration %d complete (error_code %d).\n", dev_index, k, cudaGetLastError());
}

The _sp object contains data divided into segments of 256 MB each.

I can run this code on one of my GPUs (one worker thread) or on both (two worker threads). One GPU processes the segments of _sp with even indexes, the other those with odd indexes. When I use one GPU, each memory transfer takes nearly 46 ms. But when I use both, my program prints out varying memory transfer times, ranging from 46 ms down to 15 ms and then 0 ms.

Tomorrow, when I get access to my remote computer, I'll post the exact speeds on Linux.

The first thing I would advise is moving to QueryPerformanceCounter to get much more accurate timings.

Thanks for your advice, I'll try it!

But my current timings show quite reasonable results when there is only one worker thread.

So I'll try to explain it once more:

first scenario:

main thread
	-> worker thread 1
		-> iteration
			-> DMA memory transfer (256 MB)  // each of these transfers shows 46 ms

second scenario:

main thread
	-> worker thread 1
		-> iteration
			-> DMA memory transfer (256 MB)  // timings vary from 46 ms down to 15 ms and then 0 ms
	-> worker thread 2
		-> iteration
			-> DMA memory transfer (256 MB)  // timings vary from 46 ms down to 15 ms and then 0 ms

The worker threads in the second scenario run on the CPU in parallel, and each of them uses a separate GPU for its work.
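
In case it helps, here is a minimal standalone sketch of that second scenario (hypothetical buffer names and plain Win32 threading, not my actual code): one host thread per GPU, each selecting its device with cudaSetDevice() before issuing its own synchronous 256 MB copy.

[codebox]
// Minimal sketch of the two-thread / two-GPU scenario (hypothetical names).
#include <cstdio>
#include <cstdlib>
#include <windows.h>
#include <process.h>
#include <cuda_runtime.h>

static const size_t SEGMENT_BYTES = 256u << 20;  // 256 MB per transfer

unsigned __stdcall worker(void* arg)
{
	int dev = (int)(size_t)arg;
	cudaSetDevice(dev);  // must precede any other CUDA call in this thread

	void* host_buf = malloc(SEGMENT_BYTES);  // pageable host memory
	void* dev_buf = 0;
	cudaMalloc(&dev_buf, SEGMENT_BYTES);

	// Synchronous DMA transfer, one per worker thread.
	cudaMemcpy(dev_buf, host_buf, SEGMENT_BYTES, cudaMemcpyHostToDevice);

	cudaFree(dev_buf);
	free(host_buf);
	return 0;
}

int main()
{
	HANDLE threads[2];
	for (int i = 0; i < 2; ++i)
		threads[i] = (HANDLE)_beginthreadex(0, 0, worker, (void*)(size_t)i, 0, 0);
	WaitForMultipleObjects(2, threads, TRUE, INFINITE);
	for (int i = 0; i < 2; ++i)
		CloseHandle(threads[i]);
	return 0;
}
[/codebox]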

The number of interrupts the system clock generates per second is configurable at the hardware level, but under Windows it is typically fixed at one per 15.625 ms. That explains why you only ever see multiples of roughly 15 ms (0, 15, 46…) =) I would also say give QueryPerformanceCounter/QueryPerformanceFrequency a try and let us know how that works.
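
Something along these lines (a minimal sketch of a QueryPerformanceCounter-based GET_TIME replacement, untested against your code) would give you sub-microsecond resolution:

[codebox]
#include <windows.h>

// Returns the current time in seconds as a double.
inline double get_time_qpc()
{
	LARGE_INTEGER freq, counter;
	QueryPerformanceFrequency(&freq);   // counter ticks per second
	QueryPerformanceCounter(&counter);  // current tick count
	return (double)counter.QuadPart / (double)freq.QuadPart;
}

#define GET_TIME(t) { (t) = get_time_qpc(); }
[/codebox]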

As others have said, most timing methods have a limited resolution. For Linux, use gettimeofday() for high resolution timing. That and QueryPerformanceTimer() (Windows) are the only functions that will give you sub-milisecond resolution.

Still, under 15 ms for a 256 MB transfer… I'm amazed.
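
For scale: 256 MB in 15 ms would be roughly 17 GB/s, about twice the ~8 GB/s theoretical peak of a PCIe 2.0 x16 link, so a genuine 256 MB transfer can't complete in under ~32 ms even in theory; readings far below that have to be timing artifacts.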

This would be a good start:

[codebox]
#include <sys/time.h>

#define TIMER_FREQUENCY 1E6  // gettimeofday() ticks in microseconds

// Current time in microseconds since the epoch.
inline void QueryHPCTimer(__int64 *time)
{
	timeval linTimer;
	gettimeofday(&linTimer, 0);
	*time = (__int64)(linTimer.tv_sec * TIMER_FREQUENCY) + linTimer.tv_usec;
}

inline void QueryHPCFrequency(__int64 *freq)
{
	*freq = (__int64)TIMER_FREQUENCY;
}

#undef TIMER_FREQUENCY
[/codebox]
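
Hypothetical usage (dev_ptr and host_ptr stand in for your own buffers; on Linux, __int64 is assumed to be typedef'd to something 64-bit, e.g. long long):

[codebox]
// Timing one 256 MB host-to-device copy with the functions above.
__int64 start, end, freq;
QueryHPCFrequency(&freq);
QueryHPCTimer(&start);
cudaMemcpy(dev_ptr, host_ptr, 256 << 20, cudaMemcpyHostToDevice);
QueryHPCTimer(&end);
printf("copy took %.3f ms\n", (end - start) * 1000.0 / (double)freq);
[/codebox]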

Or you could use the file I attached as a cross-platform performance timer.
HPC_timing.h (906 Bytes)

Thanks for your answer!
Under Linux I'm already using gettimeofday(). I want to try QueryPerformanceCounter() under Windows, but unfortunately I can't get access to my remote computer where I run all this (my internet provider is experiencing some technical problems). As soon as they fix it I'll post the results.

I've tried QueryPerformanceCounter() under Windows and it helped. Now my timings show about 90 ms for each 512 MB transfer. Since I get about 5500 MB/s of DMA transfer speed, 90 ms is a very reasonable result. So I assume that invalid timing caused all the problems. The only thing I still can't understand is why the timing functions I used before QueryPerformanceCounter() worked fine while there was only one worker thread but yielded wrong results when there were two.
THANKS FOR YOUR HELP!