Inconsistent Memory Copy Speed

On my Mac Pro (2009), I’ve been experiencing inconsistent memory copy performance between host and device. Copying a large pinned chunk of memory (8MB) to the device sometimes gives me a 5GB/s transfer rate and sometimes closer to 2GB/s. Any thoughts on what might explain this?

My main graphics card is an ATI card, so the NVIDIA card only runs my CUDA program.

/Chris

I reduced the program to a simple memcpy test. On my Mac Pro (2009), I’m getting a maximum throughput of about 3.2GB/s.

Here is the output of the test program:

10000000 Values
Copy Host to Device: 12.499ms Transfer rate 3.20026 GB/s
Copy Device to Host: 13.241ms Transfer rate 3.02092 GB/s

Any way to speed up memory transfers?

Here is the memcpy test program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cutil.h>

///////////////////////////////////////////////////////////////////////////////
// Main program
///////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    float *h_data;
    float *d_data;
    int num_values;
    int data_sz;

    double rate;
    float copy_time1;
    float copy_time2;
    unsigned int hTimer;

    if (argc != 2)
    {
        printf("Usage: memcpy number\n");
        exit(0);
    }

    num_values = atoi(argv[1]);
    data_sz = num_values * (sizeof (float));

    //CUT_DEVICE_INIT(argc, argv);
    CUT_DEVICE_INIT(1, argv);
    CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

    // Allocate the data to copy
    CUDA_SAFE_CALL( cudaMallocHost((void **)&h_data, data_sz) );

    // Allocate GPU memory
    CUDA_SAFE_CALL( cudaMalloc((void **)&d_data, data_sz) );

    // Reset the timer
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );

    // Copy the data to the device
    CUDA_SAFE_CALL( cudaMemcpy(d_data, h_data, data_sz, cudaMemcpyHostToDevice) );

    CUT_SAFE_CALL( cutStopTimer(hTimer) );
    copy_time1 = cutGetTimerValue(hTimer);

    CUDA_SAFE_CALL( cudaThreadSynchronize() );

    // Reset the timer
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );

    // Copy memory from the device to the host
    CUDA_SAFE_CALL( cudaMemcpy(h_data, d_data, data_sz, cudaMemcpyDeviceToHost) );

    CUT_SAFE_CALL( cutStopTimer(hTimer) );
    copy_time2 = cutGetTimerValue(hTimer);

    // Print results
    printf("\n%d Values\n", num_values);

    rate = (double) data_sz / (double) copy_time1;
    rate /= 1000000.0;

    printf("Copy Host to Device: %gms\tTransfer rate %g GB/s\n", copy_time1, rate);

    rate = (double) data_sz / (double) copy_time2;
    rate /= 1000000.0;

    printf("Copy Device to Host: %gms\tTransfer rate %g GB/s\n", copy_time2, rate);

    CUDA_SAFE_CALL( cudaFreeHost(h_data) );
    CUDA_SAFE_CALL( cudaFree(d_data) );

    CUT_EXIT(1, argv);
}
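
As an aside, the cutil timers above measure wall-clock time on the host. A sketch of the same host-to-device timing using CUDA events instead (variable names and the 10M-float size are illustrative; events are recorded into the GPU command stream, so the interval reflects when the copy actually ran on the device):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t data_sz = 10000000 * sizeof(float);
    float *h_data, *d_data;
    cudaEvent_t start, stop;
    float elapsed_ms;

    cudaMallocHost((void **)&h_data, data_sz);   // pinned host buffer
    cudaMalloc((void **)&d_data, data_sz);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_data, h_data, data_sz, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                  // wait until the copy has finished

    // Elapsed time is in milliseconds; bytes/ms divided by 1e6 gives GB/s
    cudaEventElapsedTime(&elapsed_ms, start, stop);
    printf("H2D: %g ms\tTransfer rate %g GB/s\n",
           elapsed_ms, (double)data_sz / elapsed_ms / 1e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```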

Continuing to see inconsistent performance.

The above program either returns 5.89GB/s or 3.2GB/s. It seems like it is fast for 5 or 6 tries and then slow for 10 or more.

I also noticed that my kernel execution exhibits similar variation, sometimes taking 147ms and other times 61ms for the same task. When the kernel is fast, the corresponding memcpy is as well.

Anyone else noticing performance issues on a Mac Pro (2009) / 8 core?

/Chris

Hi Chris,

what version of CUDA are you using?
And OSX?

CUDA version 2.2. On the latest Mac OS X 10.5.7.

Thanks,
/Chris

The same program sometimes takes a lot longer (twice as long). Here is the result of running a program that sums a large (16M) array of numbers.

$ ./avg 16000000
16000000 Values
Using device 0: GeForce GT 120
host_data address 0x2702000
device data address 0x2720000

Sum of Numbers:
CPU Sum: -9892.17
GPU Sum: -9892.74

Copy data time: 20.000999 msecs.
CPU time: 25.879000 msecs.
GPU time: 153.072006 msecs.
NoOp time: 0.105000 msecs.
NoOp time: 0.030000 msecs.
Total time: 173.072998 msecs.
Transfer rate 3.19984 MB/s

$ ./avg 16000000
16000000 Values
Using device 0: GeForce GT 120
host_data address 0x2702000
device data address 0x2720000

Sum of Numbers:
CPU Sum: -9892.17
GPU Sum: -9892.74

Copy data time: 10.883000 msecs.
CPU time: 25.850000 msecs.
GPU time: 63.616001 msecs.
NoOp time: 0.096000 msecs.
NoOp time: 0.024000 msecs.
Total time: 74.499001 msecs.
Transfer rate 5.88073 MB/s

Made some optimizations, so now, on my new Mac Pro, the program runs in 11ms when slow and 4ms when fast (roughly 1 out of 10 executions).

It consistently runs at 9ms on my 2 year old Macbook Pro.

So most of the time, the GT 120 is slower than the 8600M GT in my laptop. On the laptop, the NVIDIA card also drives the display, but on the Mac Pro it does not.

This speed difference issue only appears to be a problem on the Mac Pro.

/Chris

Hi Chris,

I’ve been following your posts. I am currently on vacation, but in a week I’ll be able to run your program on a 2008 Mac Pro with an NVIDIA 8800. I’ll report back if the timings also vary on that machine, though I doubt they will; I have never seen such timing variations.

Have you tried making the GT120 the main GPU - so also running the display through it?

The CUDA release notes say something about problems when running CUDA on a GPU not driving the main display. Maybe that’s what you are experiencing?

Mark

Mark,

Thank you for responding.

I’ll try hooking up a display tonight and see if that makes a difference.

Thanks!
/Chris

OK, as it turns out, hooking up the display does fix the problem.

So that sounds like an NVIDIA driver issue.

Thanks,
/Chris

Hi Chris,
nice to hear that worked. My CUDA app does a lot of uploads and downloads, so I was pretty impressed to hear you are getting >5 GB/s transfers.

On the 2008 Mac Pro with a GeForce 8800 I get about 3.2 GB/s. Looks like the new Mac Pro has really improved transfer speeds. Good.

Have you tried using write combined pinned memory for the transfers to the GPU? Does it make any difference?

Mark

Mark,

Not sure what “write-combined” means, but I’ll look into it. This is already using pinned memory.

Unfortunately, the new Zero Copy feature doesn’t work with the GT 120 card.
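
(For reference, whether a device supports zero copy can be queried at runtime. A minimal sketch using the CUDA 2.2 runtime API, assuming device index 0 is the card of interest:

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // canMapHostMemory is nonzero if the device can map pinned host
    // memory into its address space, the prerequisite for zero copy.
    printf("%s canMapHostMemory: %d\n", prop.name, prop.canMapHostMemory);
    return 0;
}
```

On a GT 120 this would presumably report 0, matching the observation above.)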

Thanks!
/Chris

Take a look at the CUDA 2.2 reference manual:
3.30.1.3 #define CU_MEMHOSTALLOC_WRITECOMBINED
and
3.25.2.30 CUresult cuMemHostAlloc

I’m not sure whether the GPU has to support this. AFAIK it’s a host memory allocation thing (it disables CPU caching of that memory), so it should work with all GPUs.
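
The runtime API exposes the same flag through cudaHostAlloc (available as of CUDA 2.2). A minimal sketch, with an arbitrary 8MB buffer size:

```c
#include <cuda_runtime.h>

int main(void)
{
    float *h_data;
    size_t data_sz = 8 * 1024 * 1024;

    // Pinned + write-combined host memory: fast for CPU writes and for
    // GPU reads over PCIe, but CPU reads of it bypass the cache and are
    // very slow, so only use it for buffers the CPU writes sequentially.
    cudaHostAlloc((void **)&h_data, data_sz, cudaHostAllocWriteCombined);

    /* ... fill h_data, then cudaMemcpy it to the device ... */

    cudaFreeHost(h_data);
    return 0;
}
```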

I hate being on vacation, unable to test these things myself :-)

mark

I tried turning on write-combined memory: the CPU algorithm went from 23ms to 1460ms, the memory copy went from 28.8ms to 24.5ms, and creation of the test data went from 247.8ms to 597ms. Seems like there is no benefit to using write-combined memory here.

This was tested on my MacBook Pro. I’ll try it on my faster Mac Pro tonight.

Thanks,
/Chris

Similar results on my Mac Pro. CPU access to the write-combined memory is much slower.

/Chris