17 GB/s device-to-host bandwidth from oclBandwidthTest: bug in the code?

I ran the bandwidth test in shmoo mode on a GTX 480:

$ oclBandwidthTest --memory=pinned --access=mapped --mode=shmoo

The device-to-host number is clearly implausible: it peaks at 18357.4 MB/s (over 17 GB/s) for 16384-byte transfers.

............................................................

.....................

 Device to Host Bandwidth, 0 Device(s), Pinned memory, mapped access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   1024                      2079.2
   2048                      3919.6
   3072                      5535.1
   4096                      7249.6
   5120                      8258.1
   6144                      9949.8
   7168                     10860.6
   8192                     11619.9
   9216                     13356.5
   10240                    14371.9
   11264                    14919.2
   12288                    16062.7
   13312                    16744.7
   14336                    17536.4
   15360                    17860.5
   16384                    18357.4
   17408                    13390.8
   18432                    11484.1
   19456                    12217.3
   20480                    12356.0
   22528                    12480.9
   24576                    12800.0
   26624                    12893.0
   28672                    13212.9
   30720                    13198.7
   32768                    13253.0
   34816                    13125.7
   36864                    13678.7
   38912                    13689.4
   40960                    11393.6
   43008                    11027.7
   45056                    13992.5
   47104                    13998.2
   49152                    14073.6
   51200                    14065.9
   61440                    14431.0
   71680                    13672.9
   81920                    14075.6
   92160                    12903.0
   102400                   14153.4
   204800                   12943.6
   307200                   11708.4
   409600                   11401.5
   512000                   11642.3
   614400                    6555.7
   716800                    6678.3
   819200                    6738.4
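To put the numbers above in perspective, here is a rough sanity check against the theoretical link rate. It assumes the card sits in a PCIe 2.0 x16 slot (5 GT/s per lane with 8b/10b encoding, so about 500 MB/s of payload per lane per direction). The helper names are made up for illustration, and the small difference between 10^6-byte and 2^20-byte megabytes is ignored.

```c
#include <stdio.h>

/* Hypothetical helper: theoretical one-direction payload rate of a
 * PCIe 2.0 link in MB/s. 5 GT/s per lane with 8b/10b encoding gives
 * 4 Gbit/s = 500 MB/s per lane. */
double pcie2_peak_mb_per_s(int lanes)
{
    return lanes * 500.0;
}

/* Hypothetical helper: returns 1 if a measured bandwidth is higher than
 * the link could physically deliver, 0 otherwise. */
int exceeds_link_peak(double measured_mb_per_s, int lanes)
{
    return measured_mb_per_s > pcie2_peak_mb_per_s(lanes);
}

/* Prints the verdict for one measurement, assuming an x16 link. */
void report(double measured_mb_per_s)
{
    printf("%.1f MB/s: %s\n", measured_mb_per_s,
           exceeds_link_peak(measured_mb_per_s, 16)
               ? "above the x16 link peak"
               : "within the x16 link peak");
}
```

Calling report(18357.4) flags the 16384-byte result as above the link peak, while report(6555.7) for the 614400-byte result does not. Under that assumption, most of the table cannot be measuring PCIe traffic.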

If I change the access to direct, all the numbers are reasonable (5-6 GB/s). That points to this piece of code (line 631 in oclBandwidthTest.c):

		// MAPPED: mapped pointers to device buffer for conventional pointer access
		void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
		oclCheckError(ciErrNum, CL_SUCCESS);
		for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
		{
			memcpy(h_data, dm_idata, memSize);
		}
		ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
		oclCheckError(ciErrNum, CL_SUCCESS);

Here MEMCOPY_ITERATIONS is defined as 100. The buffer is mapped only once, so the loop appears to copy the data from host memory to host memory 100 times, with no guarantee that the data crosses PCIe 100 times. One possible fix is to move the map/unmap calls inside the for loop.

Can NVIDIA confirm this and maybe fix it in the next release? Thanks.