Hello, I’m a postgraduate student, and I have been studying CUDA programming for the last two months. I have run my project on three different GPUs:
i) 8600M GT
I have observed a different startup penalty on each GPU with the same code and data. For example, a single cudaMalloc() for one integer (with no kernels and no cudaMemcpy) takes 70 ms on (i), 510 ms on (ii), and 1550 ms on (iii). These times seem large, especially for (ii) and (iii), so the C application would have to run for a long time to amortize this overhead.
Is such a penalty expected on every GPU? Why do the penalties differ so much?
Thank you for any help!!
(1) Are you timing the very first cudaMalloc() in the application, by any chance? If so, what you are actually timing is context creation. Try inserting a cudaFree(0) before the cudaMalloc(), so that context creation happens during the cudaFree() call.
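A minimal sketch of this ordering (the variable names are illustrative; the cudaFree(0) exists solely to force context creation before the timed cudaMalloc()):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    // Force CUDA context creation up front; otherwise the first
    // runtime API call (here, cudaMalloc) would absorb that cost.
    cudaFree(0);

    // Now any timing around this allocation measures only the
    // allocation itself, not context initialization.
    int *d_value = NULL;
    cudaError_t err = cudaMalloc((void **)&d_value, sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_value);
    return 0;
}
```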
(2) Are you performing a controlled experiment where the change of GPU is the only variable that changes, and the rest of the system is exactly the same?
(3) If any of your measurements are done on Windows Vista or Windows 7, you may see significant performance artifacts due to the WDDM driver model.
(4) If your measurements are done under Linux, make sure to put the driver into persistence mode with nvidia-smi -pm 1 (or run X if this is a local machine); otherwise the kernel module may get unloaded, and context creation will then include the time to reload the kernel module, which causes a sizable additional delay.
My understanding is that context creation time has some dependency on the amount of memory on the card and the number of supported features, so I would expect a card with larger memory or higher compute capability to have slightly longer context initialization time. I have not measured the context initialization time recently, so I do not have any guidance as to what the expected ranges are. I am a bit surprised at the 1.5 second measurement on the GTX280; I suspect item (3) or (4) above comes into play.
Here is one data point for the CUDA context creation time. I created a tiny CUDA app that does nothing except make a single call to cudaFree(0), followed by a minimal cudaMalloc(), followed by a call to an empty CUDA kernel. I timed the duration of each call. I ran this application after placing the kernel module into persistence mode with nvidia-smi -pm 1, and I used the high-resolution timer provided via the gettimeofday() function.
On a 64-bit RHEL 5.2 Linux system with a Xeon X5272 CPU and a C2050 GPU, I see times around 67 msec for the context creation triggered by the call to cudaFree(0). Allocating a single 4-byte integer on the device with cudaMalloc() takes 118 usec (= 0.118 msec), and launching an empty kernel takes 2.99 msec.
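A sketch of such a timing app, assuming a Linux host (the kernel name empty_kernel and the elapsed_ms helper are illustrative, not from the original post, and cudaDeviceSynchronize() is used here to wait for the launch to finish):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void empty_kernel(void) { }

// Illustrative helper: elapsed milliseconds between two gettimeofday samples.
static double elapsed_ms(struct timeval start, struct timeval stop)
{
    return (stop.tv_sec - start.tv_sec) * 1000.0 +
           (stop.tv_usec - start.tv_usec) / 1000.0;
}

int main(void)
{
    struct timeval t0, t1;
    int *d_value = NULL;

    gettimeofday(&t0, NULL);
    cudaFree(0);                       // triggers context creation
    gettimeofday(&t1, NULL);
    printf("context creation: %.3f ms\n", elapsed_ms(t0, t1));

    gettimeofday(&t0, NULL);
    cudaMalloc((void **)&d_value, sizeof(int));
    gettimeofday(&t1, NULL);
    printf("cudaMalloc:       %.3f ms\n", elapsed_ms(t0, t1));

    gettimeofday(&t0, NULL);
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();           // wait for the launch to complete
    gettimeofday(&t1, NULL);
    printf("empty kernel:     %.3f ms\n", elapsed_ms(t0, t1));

    cudaFree(d_value);
    return 0;
}
```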
There are likely many factors that influence the exact latency of such operations, including CPU speed, GPU speed, operating system, and the number and type of GPUs in the system. The latency of cudaMalloc() calls probably varies significantly, as such systems typically implement layered allocators, so that allocating a single integer will simply invoke the top-most sub-allocator (which is also the fastest) the vast majority of the time. Most CUDA programmers will probably never have to measure the latency of these operations (at least I have never had the need in over seven years of working with CUDA), but you can certainly time CUDA operations under relevant scenarios with fairly simple test apps if need be.
Hello, thanks for the quick answer. I put cudaFree(0) first, but the same penalties now show up in cudaFree(0) instead. How can I place the kernel module into persistence mode with nvidia-smi -pm 1? My measurements are done under Linux.
Thanks a lot..
Finally, I found what you meant, but when I try to put the driver into persistence mode with nvidia-smi -pm 1, I get the message below:
Unable to set persistence mode for GPU 0000:01:00.0: Insufficient Permissions
Some functions of nvidia-smi require root privileges, this includes switching the kernel module into persistence mode.
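So, assuming sudo is configured on the machine, running the command as root should work, and the second command (a sketch, filtering the full query output) can confirm the mode took effect:

```shell
# Persistence mode requires root privileges.
sudo nvidia-smi -pm 1

# Verify: should report "Persistence Mode : Enabled" for each GPU.
nvidia-smi -q | grep -i persistence
```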