I have run into an issue on our new K20x cards when transferring data from host to device that I have not seen on my K1000m on my laptop.
The problem
I am seeing poor performance when copying ~500MB from a non-pinned host buffer to a pinned host buffer: the copy takes ~300ms to complete.
I have been told this strategy is called "staging": you set up a pinned buffer, copy your data into the pinned buffer, then copy the data from the pinned buffer down to the device (and vice versa).
Copying from pinned memory to the device is very quick, completing in ~80ms for ~500MB.
Copying from device to pinned memory exhibits the same performance as copying pinned to device, ~80ms for ~500MB.
Copying from pinned memory to a non-pinned buffer takes ~200ms (a bit faster than the other direction) to complete for ~500MB.
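For reference, here is a minimal sketch of the staging path and how I am timing it (hypothetical sizes and names, error checking omitted; the host-side memcpy is timed with a host clock and the device copy with CUDA events):

```cuda
// Sketch of the "staging" host->device path. Build with: nvcc staging.cu -o staging
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 500u * 1024u * 1024u;   // ~500MB, as in the timings above

    char *pageable = (char*)malloc(bytes);       // ordinary (non-pinned) host buffer
    char *pinned   = NULL;
    char *dev      = NULL;
    cudaMallocHost((void**)&pinned, bytes);      // page-locked staging buffer
    cudaMalloc((void**)&dev, bytes);

    // Step 1: pageable -> pinned staging copy (the ~300ms step for me)
    auto t0 = std::chrono::steady_clock::now();
    memcpy(pinned, pageable, bytes);
    auto t1 = std::chrono::steady_clock::now();
    printf("pageable -> pinned: %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    // Step 2: pinned -> device DMA copy (~80ms for me)
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned -> device:   %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```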
Things that didn’t work
- I enabled persistence mode for the CUDA driver: sudo nvidia-smi -pm 1. No change.
- I have 3x K20x cards; I tried using cudaSetDevice() and running the program on each of the three cards. No change.
- I tried using both cudaMemcpy(..., cudaMemcpyHostToHost) and a regular C memcpy(). No change.
- I tried registering the unpinned host buffers.
Odd solution
- I tried registering the unpinned host buffers as pinned memory (cudaHostRegister) and then unregistering them (cudaHostUnregister). This takes about 15ms to register each buffer and ~20ms to unregister each buffer. The memcpy then ran just as fast as the pinned memcpy above, ~80ms. This reduced the total time of the memcpy operation significantly without using the "staging" buffers.
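For clarity, this is roughly what the register/unregister approach looks like (hypothetical names, error checking omitted; cudaHostRegister pins the pages backing an existing allocation in place, so no second buffer or extra copy is needed):

```cuda
// Sketch of the register/unregister path. Build with: nvcc hostreg.cu -o hostreg
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 500u * 1024u * 1024u;   // ~500MB
    char *host = (char*)malloc(bytes);           // ordinary pageable allocation
    char *dev  = NULL;
    cudaMalloc((void**)&dev, bytes);

    // Pin the existing allocation in place (~15ms per buffer for me).
    // Registration operates on whole pages covering the given range.
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);

    // This now runs at pinned speed (~80ms for ~500MB here),
    // with no intermediate staging buffer.
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    // Unpin when done (~20ms per buffer for me).
    cudaHostUnregister(host);

    cudaFree(dev);
    free(host);
    return 0;
}
```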
Question(s)
- I was told that register/unregister was slow and should not be used, and that "staging" buffers are the preferred mechanism for memcpy between host and device. Why is this the case?
- Why am I seeing better performance by registering/unregistering? And is this normal?
- Should I be doing this a different way?
System Specs
$ sudo nvidia-smi
Mon Mar 31 23:47:25 2014
+------------------------------------------------------+
| NVIDIA-SMI 331.38 Driver Version: 331.38 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20Xm On | 0000:03:00.0 Off | 0 |
| N/A 28C P8 17W / 235W | 13MiB / 5759MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20Xm On | 0000:04:00.0 Off | 0 |
| N/A 29C P8 17W / 235W | 13MiB / 5759MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20Xm On | 0000:84:00.0 Off | 0 |
| N/A 25C P8 18W / 235W | 13MiB / 5759MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
stepping : 7
cpu MHz : 1200.000
cache size : 20480 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5400.27
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management: