CUDA Profiler [memcopy] weird result


I wanted to check my upload and download speed to the GPU with the CUDA Profiler 1.0.
I’m uploading and downloading exactly 2073600 bytes.

The results are as follows:

method=[ memcopy ] gputime=[ 387.664 ]
method=[ memcopy ] gputime=[ 1188.992 ]

The upload is done by cudaMemcpyToArray and the download is done by cudaMemcpy.
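For context, a minimal sketch of what that transfer pair might look like (buffer names are illustrative, and the 1920x1080 single-channel size is assumed from the image described further down; the CUDA array would typically back a texture read by the kernel):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t width  = 1920;           /* assumed from the 1080p image */
    const size_t height = 1080;
    const size_t bytes  = width * height; /* 2073600 bytes, 1 channel, 8 bit */

    unsigned char *h_src = (unsigned char *)malloc(bytes); /* pageable host memory */
    unsigned char *h_dst = (unsigned char *)malloc(bytes);

    /* Upload target: a CUDA array (8-bit unsigned, single channel) */
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray *d_array;
    cudaMallocArray(&d_array, &desc, width, height);

    /* Download source: linear device memory written by the kernel */
    unsigned char *d_result;
    cudaMalloc((void **)&d_result, bytes);

    /* Upload: host -> array (the first memcopy line in the profiler) */
    cudaMemcpyToArray(d_array, 0, 0, h_src, bytes, cudaMemcpyHostToDevice);

    /* ... kernel launch would go here ... */

    /* Download: device -> host (the second memcopy line) */
    cudaMemcpy(h_dst, d_result, bytes, cudaMemcpyDeviceToHost);

    cudaFreeArray(d_array);
    cudaFree(d_result);
    free(h_src);
    free(h_dst);
    return 0;
}
```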

The download speed seems reasonable,
but the upload speed is a little weird. If it takes 387 microseconds to push
2073600 bytes to the device, then the throughput must be around 4.9 GB/s,
which is way above the maximum throughput of my tiny PCI-Express 1.1 bus :).
Maybe I'm just calculating the throughput wrong, but
could anybody give me some info on how the profiler gets those timings?

If I simply time the host code that issues the memcpy, I get around 1.5 GB/s.

Thank you!

I also got some very funny timings in the beginning… are you running this in a loop? If not, try running it a few thousand times, take the average, and see what you get :)

Yes, upload, download, and the kernel are launched in a loop in the host code.
The loop runs 1000 times and I already took the average of all the values;
they are all pretty close to each other.

All around 380 to 390 microseconds per upload.
I could believe these timings if they weren’t physically impossible on PCI-Express 1.1 :D

I just ran a benchmark tool in my application and set it to copy the same number of bytes you have. Note that I am using pinned memory. The values shown are representative of the average.

Upload to device: method=[ memcopy ] gputime=[ 601.504 ]
Download to CPU: method=[ memcopy ] gputime=[ 541.536 ]

Have you double-checked the parameters you are passing to cudaMemcpy? Maybe the number of bytes you are telling it to transfer isn't what you think it is? But you do say that adding your own timing leads to values that make sense… Perhaps it is a driver issue? What is your architecture? My tests were run on AMD64 Linux.

I’m using an Intel Core2Duo E6600 on Intel i975X Platform and a GeForce 8600 GTS as GPU.

I even triple-checked the data ;)
It's basically the data of a 1-channel grayscale IplImage of size 1920x1080 (1080p).
The image data of a distorted image is copied to the device, gets undistorted, and is copied back to the host, where it's used for other purposes.

Since the undistortion works perfectly, I assume all the data gets written to the GPU, because every pixel value of the source image is transformed into the correct pixel value.

I checked the timing with QueryPerformanceCounter and also with cutCreateTimer.
Their values are almost identical:
1500 microseconds up and 2000 microseconds down.

I will try installing a new driver or try on a different hardware setup.

Thanks for your help though!

I forgot to mention, my hardware is 8800 GTX, though I’m not sure how that could possibly be the cause of the difference. The output from the profiler is coming from software.

Since you mention QueryPerformanceCounter, I guess you are working on Windows. I just ran my same benchmark on Windows and got similar results. It seems puzzling that you get such strange, and yet reproducible, results.

The thing is, I just need a reliable timer to compare results on different hardware setups with different GPUs.
I think I'll just have to stick with cutTimer or QueryPerformanceCounter on Windows.

If I use pinned memory instead, I get results that are more plausible:

method=[ memcopy ] gputime=[ 1192.544 ]
method=[ undistortionKernel_GRAY ] gputime=[ 4129.792 ] cputime=[ 4174.901 ] occupancy=[ 1.000 ]
method=[ memcopy ] gputime=[ 1192.640 ]
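One commonly given explanation for the discrepancy, which would fit the numbers above: with pageable host memory the driver stages the transfer through an internal page-locked buffer, so the profiler's gputime can reflect only the DMA portion of that path rather than the full end-to-end copy. Allocating the host buffer with cudaMallocHost() avoids the staging step, which is consistent with the pinned-memory numbers matching the host-side timers. A minimal sketch (buffer names are made up):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 2073600; /* 1920 x 1080, 1 byte per pixel */

    /* Pinned (page-locked) host buffer instead of malloc() */
    unsigned char *h_img;
    cudaMallocHost((void **)&h_img, bytes);

    unsigned char *d_img;
    cudaMalloc((void **)&d_img, bytes);

    /* With pinned memory the copy is a single DMA transfer, so the
       profiler's gputime should line up with host-side timings. */
    cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_img);
    cudaFreeHost(h_img);
    return 0;
}
```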