I ran the bandwidth test in shmoo mode on a GTX 480:
$ oclBandwidthTest --memory=pinned --access=mapped --mode=shmoo
The Device to Host numbers are clearly out of range, with the maximum reaching over 17 GB/s, far beyond what a PCIe 2.0 x16 link (8 GB/s theoretical) can deliver:
Device to Host Bandwidth, 0 Device(s), Pinned memory, mapped access
Transfer Size (Bytes) Bandwidth(MB/s)
1024 2079.2
2048 3919.6
3072 5535.1
4096 7249.6
5120 8258.1
6144 9949.8
7168 10860.6
8192 11619.9
9216 13356.5
10240 14371.9
11264 14919.2
12288 16062.7
13312 16744.7
14336 17536.4
15360 17860.5
16384 18357.4
17408 13390.8
18432 11484.1
19456 12217.3
20480 12356.0
22528 12480.9
24576 12800.0
26624 12893.0
28672 13212.9
30720 13198.7
32768 13253.0
34816 13125.7
36864 13678.7
38912 13689.4
40960 11393.6
43008 11027.7
45056 13992.5
47104 13998.2
49152 14073.6
51200 14065.9
61440 14431.0
71680 13672.9
81920 14075.6
92160 12903.0
102400 14153.4
204800 12943.6
307200 11708.4
409600 11401.5
512000 11642.3
614400 6555.7
716800 6678.3
819200 6738.4
Now if I change the access to direct, all numbers become reasonable (5-6 GB/s). This points to this piece of code (line 631 in oclBandwidthTest.c):
// MAPPED: mapped pointers to device buffer for conventional pointer access
void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
oclCheckError(ciErrNum, CL_SUCCESS);
for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    memcpy(h_data, dm_idata, memSize);
}
ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
oclCheckError(ciErrNum, CL_SUCCESS);
where MEMCOPY_ITERATIONS is defined as 100. This code copies the data from the mapped pointer to host memory 100 times, but there is no guarantee that the data actually travels over PCIe 100 times: once the buffer is mapped, the repeated memcpy calls can be served entirely from host-side memory. One way to fix it would be to put the map/unmap calls inside the for loop.
Can NVIDIA confirm this and maybe fix it in the next release? Thanks
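Concretely, the fix I have in mind would look something like the sketch below. It reuses the variable names from the snippet above (cqCommandQueue, cmDevData, h_data, memSize, MEMCOPY_ITERATIONS); I have not tested it against the SDK, and I also swapped CL_MAP_WRITE for CL_MAP_READ, since this path reads from the device buffer.

```c
/* Sketch of the proposed fix: map and unmap inside the loop so each
 * iteration forces a fresh transfer across PCIe instead of re-reading
 * an already-mapped host-side copy. Untested; names are from
 * oclBandwidthTest.c. */
for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
{
    /* CL_MAP_READ (not CL_MAP_WRITE): we only read from the device
     * buffer in the Device to Host direction. */
    void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData,
                                        CL_TRUE, CL_MAP_READ, 0, memSize,
                                        0, NULL, NULL, &ciErrNum);
    oclCheckError(ciErrNum, CL_SUCCESS);

    memcpy(h_data, dm_idata, memSize);

    ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData,
                                       dm_idata, 0, NULL, NULL);
    oclCheckError(ciErrNum, CL_SUCCESS);
}
/* Drain the queue so all unmaps finish before the timer stops. */
clFinish(cqCommandQueue);
```

The blocking map (CL_TRUE) plus the unmap each iteration should make the driver move the data every pass, at the cost of per-iteration map/unmap overhead showing up in the measured bandwidth.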