Could someone compile a simple example for me on a mobile card?

Dear all,
Right now I’m working on an algorithm that consists of two parts. The first part is sequential (nothing can be done simultaneously) and has to be executed on the CPU (because sequential code on the GPU is much slower than on the CPU). But the second part can be accelerated a lot using the GPU. There is only one problem: to be able to run the second part on the GPU I have to transfer data from host to device. These transfers take a lot of time, and because of them the whole runtime doesn’t benefit much from using the GPU… I’m working on a desktop machine with a non-integrated (discrete) card (GTX 295).

But as far as I know, with integrated cards (mobile chipset GPUs for laptops) I don’t have to do data transfers (using zero-copy). In that case my algorithm could win a lot. I don’t actually have a mobile card, which is why I want to ask whether someone could compile one simple example and report the runtime to me. I’m interested in the fastest mobile cards: GTX 280M, 260M, GTS 260M, Quadro FX 3700M, Quadro FX 3600M. Could anybody do it for me? The example is very short and simple (assignment of one array to another one on the GPU). I can upload the source code here. I would appreciate your help very much!
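For reference, the kernel in my example is roughly this trivial (a simplified sketch with illustrative names, not the exact code I would upload):

[codebox]
// Minimal sketch: assign one array to another on the GPU.
__global__ void copyArray(const float *src, float *dst, int n)
{
	int idx = blockIdx.x*blockDim.x + threadIdx.x;
	if (idx < n)
		dst[idx] = src[idx];
}
[/codebox]

What I really want to measure is how long the host/device transfers around such a kernel take on an integrated card.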

Thanks a lot in advance!

I don’t believe any of the mobile parts you are asking about can do zero copy. Amongst the integrated GPUs, only MCP79a-based mobile parts (basically Ion plus various Apple products) can do zero copy. But your GTX 295 (like every other GT200 part) can also do zero copy, although it is slightly more complex than the implementation on the MCP79a. This post discusses zero copy in more detail.

Thanks for your answer. Yes, I have already read that post. But which cards are the integrated cards that support zero-copy? I mean cards that see CPU memory (cards that don’t need memory transfers).

Like I said, MCP79 aka Ion is the integrated part that can do zero copy. So any of the current NVIDIA chipset Macbooks, Mac minis or iMacs, and anything based on the Ion chipset, which is a couple of mini-ITX motherboards (at least Zotac and Asus IIRC), a couple of mini PCs (at least Acer, Asus and Asrock), and a couple of netbooks (at least Lenovo and HP). All based on the same basic GPU with 16 cores.

EDIT: I now remember that there is also a 8 core “entry level” MCP79 SKU announced, but I am not sure whether that has ever actually seen the light of day in a commercial product or not.

Thanks a lot for the very useful information! So zero-copy only makes sense for the mentioned NVIDIA chipsets with an integrated card, right? But what about the Tesla S1070 and Tesla C1060? Maybe they support zero-copy but don’t do it physically, since they are connected via PCIe and can’t access CPU memory directly?

And what about the GPUs on NVIDIA chipsets, are they 3.1? When I run code on the discrete GTX 295 the total runtime is T1 + T2 + T3, where T1 is the copies from host to device, T2 is the kernel runtime and T3 is the copies from device to host. If I run it on the mentioned platforms, will the whole runtime be just T2, or will there be some other time consumption?
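Just to make the T1/T2/T3 split concrete, here is a sketch of what I mean (a simplified example with illustrative names, timed with CUDA events for brevity):

[codebox]
cudaEvent_t e0, e1, e2, e3;
cudaEventCreate(&e0); cudaEventCreate(&e1);
cudaEventCreate(&e2); cudaEventCreate(&e3);

cudaEventRecord(e0, 0);
cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);    // T1: host -> device
cudaEventRecord(e1, 0);
myKernel<<<nBlocks, blockSize>>>(d_in, N);               // T2: kernel
cudaEventRecord(e2, 0);
cudaMemcpy(h_out, d_in, size, cudaMemcpyDeviceToHost);   // T3: device -> host
cudaEventRecord(e3, 0);
cudaEventSynchronize(e3);

float t1, t2, t3;                                        // results in milliseconds
cudaEventElapsedTime(&t1, e0, e1);
cudaEventElapsedTime(&t2, e1, e2);
cudaEventElapsedTime(&t3, e2, e3);
[/codebox]

So the question is whether, on an integrated card, T1 and T3 effectively disappear.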

Thanks for your help!

If by “makes sense” you mean work, then yes. They are the only mobile GPUs which support zero copy.

The C1060/S1070, along with every other GT200 card, can supposedly do zero copy. I am guessing that the GT200 memory controller has some additional functionality which allows it to independently DMA from a defined block of host memory over the PCI-e bus. The big difference between it and the integrated GPUs is probably the PCI-e bus latency, which could cause some coherency problems.

I have no idea what “3.1” is, so I can’t answer that.

It is your code and you have a zero copy capable GPU. Why don’t you benchmark it and see for yourself?
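If it helps, a quick (untested) sketch like this tells you whether a given device can map host memory at all, and whether it is an integrated part sharing the host’s memory - on a GTX 295 you should see canMapHostMemory=1 and integrated=0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	printf("%s: canMapHostMemory=%d integrated=%d\n",
	       prop.name, prop.canMapHostMemory, prop.integrated);
	return 0;
}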

I wrote that wrongly. I meant its compute capability, so it’s 1.3 I suppose?

The GTX 295 is zero-copy capable, but it’s not an integrated card. So data still has to be physically copied from host to device and back after the kernel execution, and it doesn’t make much sense to use it for my experiment, since I gain nothing from using zero-copy on this card. But for integrated cards that can see CPU memory, where physical data transfers are not needed, my program should be extremely efficient. That’s why the question about runtime arose…

All of the zero copy capable integrated parts I am aware of are compute capability 1.1. There are no compute 1.3 mobile parts, although there might be some 1.2 parts in the pipeline. But nothing with double precision, if that is what you were hoping for.

It doesn’t. Both zero copy implementations seem to effectively be DMA operations. The only difference is what bus the DMA is happening across. In the embedded GPU it is probably just running between different ports in the same memory controller. In the discrete cards it happens across the PCI-e bus, which adds latency, but probably at greater total bandwidth than the embedded implementation. But the principle is the same in both cases.

You keep saying that, but as best as I can tell you haven’t tested it, so how can you know?

That was the point of this thread, wasn’t it? He wants somebody to test his code on an integrated graphics card so that he can see if it is better.

And my point is: better than what? He already has two of the most powerful zero copy capable GPUs NVIDIA makes at his disposal, but he hasn’t tried them.

As I understand it, the only card he has available is a GTX 295, for which “zero copy” transactions still require data to move over the PCI-e bus. For an integrated card (again, as I understand it) there is literally no memory copy required for a zero copy operation (hence the name “zero copy”), so if the vast majority of a program’s execution is copying data to and from the GPU, it may be a lot faster to simply use an integrated card. The OP has this sort of problem, and is asking if somebody could test to see whether these cards are better for his purposes. It may be that even if the integrated card is 30x less powerful than the GTX 295, the algorithm as a whole goes faster - as he explained.

Hi, thanks a lot for your reply! That’s exactly what I meant. I don’t have an integrated card, only a discrete one where physical memory transfers are needed, and I want to try my code on an integrated card. Even if that card has less computational power, the whole algorithm can win a lot if data is not being transferred between host and device.

Could anyone run my code on the integrated card?

Thanks in advance!

I don’t have an integrated card. That’s the point. Using zero-copy on the GTX 295 is even slower than a run with memory transfers… Hopefully that clarifies things…

There is copying (really DMA) in both cases, and in both cases you must sync the host thread after launching a kernel before you can access the data. In one case you are DMAing data from a 100+ GB/s memory system across the PCI-e bus to >10 GB/s multichannel host memory; in the other you are DMAing data within the same 8 GB/s single-channel memory system. Everything I have tried with cudaHostAlloc() memory shows the former is at least twice as fast as the latter, although the latency of the latter can be better.

I have found the opposite. Zero copy is considerably faster than using normal pageable memory copies.

This code is an unashamed adaptation of an example on zero copy that appeared in Dr. Dobb’s a while ago. All that happens is that 128000 floats get copied to the device, incremented, and copied back. The code contains two versions, one using zero copy memory and the other using pageable memory with cudaMemcpy to transfer the results to and from the device. It will report an assert error if the device results don’t match a host-side version of the same computation, and it reports how long each operation takes in both cases - the time to do the setup memcpys, the kernel execution time, and the total time.

Some results:

On a GTX 275 sitting in a PCI-e v2 Phenom II system (CUDA 2.3 with 190.x drivers):

david@cuda:~$ LD_LIBRARY_PATH=/opt/cuda/lib64 ./zerocopy

Zero copy initialization time=0.000124

Zero copy kernel execution time=0.000242

Zero copy test total time=0.000366

Pageable memory initialization time=0.000593

Pageable memory execution time=0.000029

Pageable memory total time including copies=0.000967

and on an MCP79 mini-ITX board with a dual core Celeron (CUDA 2.2 with 185.x drivers):

telkku@telkku:~$ LD_LIBRARY_PATH=/opt/cuda/lib ./zerocopy

Zero copy initialization time=0.000402

Zero copy kernel execution time=0.000359

Zero copy test total time=0.000761

Pageable memory initialization time=0.001102

Pageable memory execution time=0.000176

Pageable memory total time including copies=0.001766

What does it show? For one, in both cases the zero copy run time is several times longer than the bare kernel execution time, because there are DMA transfers going on (i.e. there is copying even in the case of the integrated part). Secondly, the zero copy version is faster than the pageable version in both cases. Lastly, the GTX 275 is faster than the MCP79, although the ratio of zero copy total time to bare kernel execution time is higher in the GTX 275 case, which is probably down to the PCI-e bus.

I probably should attach this rather than post it, but anyway…

#include <sys/time.h>

#include <stdio.h>

#include <assert.h>

#include <cuda.h>

#define NUMBER_OF_ARRAY_ELEMENTS 128000

#define N_THREADS_PER_BLOCK 256

double	systimer(void)

{

	struct		timeval tp;

	register double	result=0.0;

	if ( gettimeofday(&tp, NULL) != -1 )

	{

		result = (double)(tp.tv_sec);

		result += ((double)(tp.tv_usec))*1.0e-6;

	}

	return result;

}

void incrementArrayOnHost(float *a, int N)

{ 

	int i;

	for (i=0; i < N; i++) 

		a[i] = a[i]+1.f;

}

__global__ void incrementArrayOnDevice(float *a, int N)

{

	int idx = blockIdx.x*blockDim.x + threadIdx.x; 

	if (idx < N) 

		a[idx] = a[idx]+1.f; 

}

void checkCUDAError(const char *msg) 

{ 

	cudaError_t err = cudaGetLastError(); 

	if( cudaSuccess != err) {

		fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err) );

		exit(EXIT_FAILURE);

	}

}

int main(void)

{

	float *a_m;

	float *a_d;

	float *check_h; 

	

	double time0,time1,time2,time3;

	int i, N = NUMBER_OF_ARRAY_ELEMENTS; 

	size_t size = N*sizeof(float); 

	int blockSize = N_THREADS_PER_BLOCK; 

	int nBlocks = N/blockSize + (N%blockSize > 0?1:0);

	check_h = (float *)malloc(size); 

#if CUDART_VERSION < 2020

	#error "This CUDART version does not support mapped memory!\n" 

#else

	//

	// Zero copy version

	//

	cudaDeviceProp deviceProp;

	cudaGetDeviceProperties(&deviceProp, 0);

	checkCUDAError("cudaGetDeviceProperties");

	if(!deviceProp.canMapHostMemory) {

		fprintf(stderr, "Device %d cannot map host memory!\n", 0);

		exit(EXIT_FAILURE);

	}

	cudaSetDeviceFlags(cudaDeviceMapHost);

	checkCUDAError("cudaSetDeviceFlags");

	cudaHostAlloc((void **)&a_m, size, cudaHostAllocMapped);

	checkCUDAError("cudaHostAllocMapped");

	cudaHostGetDevicePointer((void **)&a_d, (void *)a_m, 0);

	checkCUDAError("cudaHostGetDevicePointer");

	for (i=0; i<N; i++)

		check_h[i] = (float)i; 

	time0 = systimer();

	(void)memcpy(a_m, check_h, size);

	cudaThreadSynchronize(); 

	time1 = systimer();

	incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

	cudaThreadSynchronize(); 

	time3 = systimer();

	checkCUDAError("incrementArrayOnDevice");

	incrementArrayOnHost(check_h, N);

	for (i=0; i<N; i++) 

		assert(check_h[i] == a_m[i]);

	cudaFreeHost(a_m); 

	printf("Zero copy initialization time=%f\n", (float)(time1-time0));

	printf("Zero copy kernel execution time=%f\n", (float)(time3-time1));

	printf("Zero copy test total time=%f\n", (float)(time3-time0));

#endif

	//

	// Pageable memory version

	//

	cudaMalloc((void **)&a_d, size);

	checkCUDAError("cudaMalloc");

	

	a_m = (float *)malloc(size); 

	for (i=0; i<N; i++)

		check_h[i] = (float)i; 

	time0 = systimer();

	memcpy(a_m, check_h, size);

	cudaMemcpy(a_d, a_m, size, cudaMemcpyHostToDevice);

	time1 = systimer();

	incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

	cudaThreadSynchronize(); 

	time2 = systimer();

	cudaMemcpy(a_m, a_d, size, cudaMemcpyDeviceToHost);

	time3 = systimer();

	checkCUDAError("incrementArrayOnDevice");

	incrementArrayOnHost(check_h, N);

	for (i=0; i<N; i++) 

		assert(check_h[i] == a_m[i]);

	free(a_m); 

	cudaFree(a_d);

	printf("Pageable memory initialization time=%f\n", (float)(time1-time0));

	printf("Pageable memory execution time=%f\n", (float)(time2-time1));

	printf("Pageable memory total time including copies=%f\n", (float)(time3-time0));

	free(check_h); 

	return 0;

}
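If anyone wants to reproduce it: the code needs a CUDA 2.2 or newer toolkit for the mapped memory calls, and a plain nvcc invocation is enough to build it - something like this (adjust paths for your own install):

nvcc -o zerocopy zerocopy.cu
LD_LIBRARY_PATH=/opt/cuda/lib64 ./zerocopy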

Draw whatever conclusions you want from this.

Edited twice for some slightly revised timing code and results.

Thanks a lot for this very useful example! I compiled it on my machine (GTX 295, AMD Phenom™ 9550 Quad-Core Processor, 2200.149 MHz) and for the first two cases I get very similar results:

[codebox]

ZERO-COPY test

Zero copy initialization time = 0.166 ms

Zero copy kernel execution time = 0.384 ms

PAGEABLE memory test

Pageable memory initialization time = 0.859 ms

Pageable memory execution time = 0.036 ms

[/codebox]

But in the second case, where memory is transferred with cudaMemcpy, we should use pinned memory, which accelerates copies between host and device:

[codebox]

// Pageable memory version

std::cout << "PAGEABLE memory test" << std::endl;

cudaMalloc((void **)&a_d, size);

checkCUDAError("cudaMalloc");

a_m = (float *)malloc(size);

for(i = 0; i < N; i++)

check_h[i] = (float)i;

timerGPU = 0;

CE( cutCreateTimer(&timerGPU) );

CUT_SAFE_CALL(cutResetTimer(timerGPU));

CUT_SAFE_CALL(cutStartTimer(timerGPU));

memcpy(a_m, check_h, size);

cudaMemcpy(a_d, a_m, size, cudaMemcpyHostToDevice);

CUT_SAFE_CALL(cutStopTimer(timerGPU));

time = cutGetTimerValue(timerGPU);

std::cout << "Pageable memory initialization time = " << time << " ms" << std::endl;

CUT_SAFE_CALL(cutResetTimer(timerGPU));

CUT_SAFE_CALL(cutStartTimer(timerGPU));

incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

cudaThreadSynchronize();

CUT_SAFE_CALL(cutStopTimer(timerGPU));

time = cutGetTimerValue(timerGPU);

std::cout << "Pageable memory execution time = " << time << " ms" << std::endl;

cudaMemcpy(a_m, a_d, size, cudaMemcpyDeviceToHost);

checkCUDAError("incrementArrayOnDevice");

incrementArrayOnHost(check_h, N);

for(i = 0; i < N; i++)

assert(check_h[i] == a_m[i]);

free(a_m);

cudaFree(a_d);

//

// Page-locked (pinned) memory version

//

std::cout << "PAGEABLE (pinned) memory test" << std::endl;

cudaMalloc((void **)&a_d, size);

checkCUDAError("cudaMalloc");

cudaMallocHost((void **)&a_m, size);

checkCUDAError("cudaMallocHost");

for(i = 0; i < N; i++)

check_h[i] = (float)i;

timerGPU = 0;

CE( cutCreateTimer(&timerGPU) );

CUT_SAFE_CALL(cutResetTimer(timerGPU));

CUT_SAFE_CALL(cutStartTimer(timerGPU));

memcpy(a_m, check_h, size);

cudaMemcpy(a_d, a_m, size, cudaMemcpyHostToDevice);

CUT_SAFE_CALL(cutStopTimer(timerGPU));

time = cutGetTimerValue(timerGPU);

std::cout << "Pageable memory initialization time = " << time << " ms" << std::endl;

CUT_SAFE_CALL(cutResetTimer(timerGPU));

CUT_SAFE_CALL(cutStartTimer(timerGPU));

incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

cudaThreadSynchronize();

CUT_SAFE_CALL(cutStopTimer(timerGPU));

time = cutGetTimerValue(timerGPU);

std::cout << "Pageable memory execution time = " << time << " ms" << std::endl;

cudaMemcpy(a_m, a_d, size, cudaMemcpyDeviceToHost);

checkCUDAError("incrementArrayOnDevice");

incrementArrayOnHost(check_h, N);

for(i = 0; i < N; i++)

assert(check_h[i] == a_m[i]);

cudaFreeHost(a_m); // a_m was allocated with cudaMallocHost, so it must be freed with cudaFreeHost, not cudaFree

cudaFree(a_d);

[/codebox]

Then we have the following runtimes:

[codebox]

ZERO-COPY test

Zero copy initialization time = 0.166 ms

Zero copy kernel execution time = 0.384 ms

PAGEABLE memory test

Pageable memory initialization time = 0.859 ms

Pageable memory execution time = 0.036 ms

PAGEABLE (pinned) memory test

Pageable memory initialization time = 0.369 ms

Pageable memory execution time = 0.036 ms

[/codebox]

As you can see, using pinned memory with explicit memory transfers is faster than zero-copy. With zero-copy the kernel runtime slows down significantly. But which integrated card is the best/fastest one nowadays? Maybe it’s possible to get a better zero-copy runtime on it…

I have a pinned version of that code I posted as well. I still find that, overall, zero copy memory is faster on both platforms. But I hope you get the point that zero copy on the MCP79 is basically the same as on the GT200 cards. Even though everything is notionally sitting in the same memory space, there is still DMA activity that slows things down considerably (by about a factor of 3 or 4 over the bare kernel running time).

On the GTX 275:

david@cuda:~$ LD_LIBRARY_PATH=/opt/cuda/lib64 ./zerocopy

Zero copy initialization time=0.000127

Zero copy kernel execution time=0.000242

Zero copy test total time=0.000369

Pageable memory initialization time=0.000591

Pageable memory execution time=0.000030

Pageable memory total time including copies=0.000898

Pinned memory initialization time=0.000253

Pinned memory execution time=0.000030

Pinned memory total time including copies=0.000380

On the MCP79:

telkku@telkku:~$ LD_LIBRARY_PATH=/opt/cuda/lib ./zerocopy

Zero copy initialization time=0.000396

Zero copy kernel execution time=0.000353

Zero copy test total time=0.000749

Pageable memory initialization time=0.001075

Pageable memory execution time=0.000151

Pageable memory total time including copies=0.001734

Pinned memory initialization time=0.000564

Pinned memory execution time=0.000152

Pinned memory total time including copies=0.000878

As for which MCP79 is better, it seems that there are only three versions: one with 8 cores (I think this is what appears as the “Ion LE” in some places) and two with 16 cores. The latter are only differentiated by clock speed - there is the Ion/9300M and the 9400M, which might be Apple-only at the moment. As you can see, even with zero copy, they are slow. The one improvement is in latency.

BTW this is the MCP79 I am using:

telkku@telkku:~$ LD_LIBRARY_PATH=/opt/cuda/lib:$HOME/NVIDIA_CUDA_SDK/lib NVIDIA_CUDA_SDK/bin/linux/release/deviceQuery 

CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce 9300 / nForce 730i"

  CUDA Capability Major revision number:		 1

  CUDA Capability Minor revision number:		 1

  Total amount of global memory:				 265617408 bytes

  Number of multiprocessors:					 2

  Number of cores:							   16

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									1.20 GHz

  Concurrent copy and execution:				 No

  Run time limit on kernels:					 Yes

  Integrated:									Yes

  Support host page-locked memory mapping:	   Yes

  Compute mode:								  Default (multiple host threads can use this device simultaneously)

There seems to be some confusion here - there is no memory copying in either case. That’s why it’s called zero-copy. The GPU is accessing CPU memory directly across the PCI-E bus.

I would recommend reading the document included in the “simpleZeroCopy” sample in the SDK.