Hi,
I don’t know whether this is an artifact of the system I’m using, or is otherwise to be expected, but I thought I’d post some findings regarding device->host transfers.
After profiling my code (CPU-side) and learning that cuMemcpyDtoH was responsible for about 97% of host execution time, I decided to take a look at just how long the transfers were taking.
I timed the device-to-host transfers by instrumenting my code as follows:
[codebox]
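#include <sys/time.h>  /* gettimeofday */
#include <cuda.h>      /* cuMemcpyDtoH, CUdeviceptr */

/* Wall-clock time in seconds. */
double currTime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;  /* microseconds to seconds */
}

/* Copy sz bytes from device to host and log how long the call took.
   failOnCUDAErr and log are my own helpers (not shown). */
void copyDeviceToHost(void *tgt, CUdeviceptr src, size_t sz) {
    double t1, t2;
    t1 = currTime();
    failOnCUDAErr(cuMemcpyDtoH(tgt, src, sz));
    t2 = currTime();
    log("d->h copy: %zu bytes, %f sec\n", sz, t2 - t1);
}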
[/codebox]
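For anyone wanting to reproduce the timing, here is a minimal sketch of a helper that computes the elapsed time directly from two gettimeofday() samples. The name elapsedSec is hypothetical and not part of my original code; the idea is only to subtract the seconds and microseconds fields separately and scale the microseconds by 1e-6.

```c
#include <sys/time.h>

/* Seconds elapsed between two gettimeofday() samples.
   The fields are subtracted separately before converting to double;
   1e-6 is the microsecond-to-second factor. (Hypothetical helper.) */
double elapsedSec(const struct timeval *start, const struct timeval *end) {
    double sec  = (double)(end->tv_sec  - start->tv_sec);
    double usec = (double)(end->tv_usec - start->tv_usec);
    return sec + usec * 1e-6;
}
```

To use it, call gettimeofday() into one struct timeval before the copy and into another after it, then pass both to elapsedSec.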
I was expecting small transfers to be somewhat inefficient, but the results were actually kind of surprising. In the attached graph, the blue points are transfers larger than 4 bytes and the red points are 4-byte transfers.
The 4-byte transfers have a median execution time of 0.22918 sec (!), whereas the median execution time of all transfers larger than 4 bytes is 0.00013 sec.
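In case it helps anyone compare numbers, this is a rough sketch of how a median can be computed from the logged timings. It assumes the timings for one size class have already been collected into an array of doubles; medianSec and cmpDouble are hypothetical names, not part of my original code.

```c
#include <stdlib.h>

/* qsort comparator for doubles, ascending order. */
static int cmpDouble(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n timings in seconds. Sorts the array in place.
   Returns 0.0 for an empty array. (Hypothetical helper.) */
double medianSec(double *t, size_t n) {
    if (n == 0) return 0.0;
    qsort(t, n, sizeof t[0], cmpDouble);
    /* Odd count: middle element. Even count: mean of the two middle elements. */
    return (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
}
```

Calling it once on the 4-byte timings and once on everything else gives the two medians separately.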
These timings were taken on a MacBook Pro using the built-in GeForce 9400M:
[codebox]
Model Name: MacBook Pro
Model Identifier: MacBookPro5,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2.8 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 6 MB
Memory: 4 GB
Bus Speed: 1.07 GHz
Boot ROM Version: MBP51.0074.B01
SMC Version: 1.33f8
NVIDIA GeForce 9400M:
Chipset Model: NVIDIA GeForce 9400M
Type: Display
Bus: PCI
VRAM (Total): 256 MB
Vendor: NVIDIA (0x10de)
Device ID: 0x0863
Revision ID: 0x00b1
ROM Revision: 3343
gMux Version: 1.7.3
[/codebox]