Hi all,
Is there a reliable utility (presumably developed by NVIDIA) that reports GPU utilization statistics for Tesla cards?
Here’s the reason for the question: we have a couple of Tesla S1070 units hooked up to a host system for high-speed simulations. In the future, we might scale up to more S1070/S2070 units running in parallel over a high-speed interconnect. Without knowing GPU utilization, there is no direct way to identify the bottlenecks in such a setup (is the interconnect fast enough to keep the GPUs fully occupied?).
Without a direct measure of GPU load, any kind of scale-up in a cluster environment is pretty much guesswork… Hoping there is a positive answer somewhere…
Thanks for any tips.
No, there is no such tool. Guys at NVIDIA (Tim) have said more than once that they would have provided one already if it were easy…
Here’s hoping that Fermi makes it easy and we will get one for that architecture.
For your particular use case, I would think that application-level benchmarks would be much more informative than simply monitoring GPU load, anyways.
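To make that concrete (just a generic sketch, nothing to do with your actual simulation code; the kernel, the buffer size, and the use of the default stream are all placeholders), timing the host-to-device copy and the kernel separately with CUDA events already tells you whether you are PCIe-bound or compute-bound:

```cpp
// Minimal benchmark sketch: compare time spent moving data over PCIe with
// time spent computing on the GPU. Kernel and sizes are placeholders.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // stand-in for real work
}

int main()
{
    const int n = 1 << 24;                  // ~16M floats, placeholder size
    size_t bytes = n * sizeof(float);

    float *h = (float*)malloc(bytes);       // contents don't matter here
    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);    // PCIe transfer
    cudaEventRecord(t1);
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);        // compute
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msCopy = 0.0f, msKernel = 0.0f;
    cudaEventElapsedTime(&msCopy, t0, t1);
    cudaEventElapsedTime(&msKernel, t1, t2);
    printf("H2D copy: %.2f ms, kernel: %.2f ms\n", msCopy, msKernel);

    cudaFree(d); free(h);
    return 0;
}
```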
“fully occupied” is a funny thing when you’re talking about GPUs, and everybody has a different meaning. I don’t know that there’s a magic bullet there.
However, as far as additional tools go, I am well aware that it’s hard to tell what you’re trying to figure out, and I am working to improve that.
Thank you for the replies, guys.
I guess we’ll have to resort to benchmarking when the time comes. It’s a bit of a catch-22, though: we need to know a reasonable GPU/host/interconnect combination before investing in it, but we can’t measure performance until we have the hardware… So we are trying to figure out a good way to get an estimate on a smaller scale…
In any event, some kind of load-measurement utility would be very helpful, since there is really no way to find such estimates online (they depend heavily on the particular parallel algorithm)… And I’m sure there will be more questions like this as GPUs move further into the HPC area.
I’m not sure if this would be of any use to you, but a colleague of mine wrote a full-system simulator for CUDA applications that records the amount of time your application spends doing particular operations by intercepting CUDA calls as your program makes them. For example, you can tell how much time you spend in kernels, copying memory, allocating memory, running host code, etc. You can also change different system parameters to see how they affect your application. For example, you can increase the PCIe bandwidth/latency, the GPU clock frequency, or the malloc latency, or make calls synchronous/asynchronous. There are no pretty GUIs or anything like the Visual Profiler; you would be using a trace-driven architecture simulator on the command line. Let me know if you would be interested and I could possibly send you the code.
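To give a flavor of the kind of information such traces capture (this is not the tool described above, just a hand-rolled illustration of the general idea of timing individual CUDA calls from the application side; the helper names are made up):

```cpp
// Illustration only: wrap the CUDA calls you care about with wall-clock
// timers and accumulate how much time goes to copies vs. kernels.
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

static double now()                       // seconds, wall clock
{
    timeval tv; gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static double g_copyTime = 0.0, g_kernelTime = 0.0;

// Timed drop-in for cudaMemcpy (hypothetical helper name).
cudaError_t timedMemcpy(void *dst, const void *src, size_t n, cudaMemcpyKind kind)
{
    double t = now();
    cudaError_t err = cudaMemcpy(dst, src, n, kind);
    g_copyTime += now() - t;
    return err;
}

// Call immediately after a kernel launch: the time until the synchronize
// returns approximates how long that kernel ran.
void timeLastKernel()
{
    double t = now();
    cudaThreadSynchronize();              // cudaDeviceSynchronize() nowadays
    g_kernelTime += now() - t;
}

void report()
{
    printf("memcpy: %.3f s, kernels: %.3f s\n", g_copyTime, g_kernelTime);
}
```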
It would be crude and inaccurate and a terrible hack, but you could imagine making a small program that launched a no-op kernel and timed the span between the launch and cudaThreadSynchronize() returning. If it was less than 25 us, no other kernel was running. Otherwise, the delay gives a rough measure of how long another (foreign) kernel kept the GPU busy. Repeat every second or so and you could use some running averages to figure out how often the GPU is busy with other kernels, by treating each no-op launch as a point sample of the load on the GPU.
Yes, it’s ugly; yes, it gives poor time resolution; yes, it has lots of flaws. But it’d be some feedback, anyway.
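Something along these lines (a minimal sketch of the idea above; the 25 us threshold and the one-second interval are just the ballpark numbers from the post, and cudaThreadSynchronize() is the era-appropriate call):

```cpp
// Crude GPU load poller: launch an empty kernel, time how long it takes to
// come back, and treat anything well above bare launch overhead as "some
// other kernel was occupying the GPU". Each iteration is one point sample.
#include <cstdio>
#include <unistd.h>
#include <sys/time.h>
#include <cuda_runtime.h>

__global__ void noopKernel() {}

static double usecNow()
{
    timeval tv; gettimeofday(&tv, 0);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main()
{
    int busySamples = 0, totalSamples = 0;
    for (;;) {
        double t = usecNow();
        noopKernel<<<1, 1>>>();
        cudaThreadSynchronize();          // cudaDeviceSynchronize() nowadays
        double dt = usecNow() - t;

        ++totalSamples;
        if (dt > 25.0) ++busySamples;     // took longer than a bare launch

        printf("sample: %.1f us, busy %d/%d (%.0f%%)\n",
               dt, busySamples, totalSamples,
               100.0 * busySamples / totalSamples);
        sleep(1);                         // one point sample per second
    }
    return 0;
}
```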
The code consists of two portions. One is an add-on module to Ocelot ( http://code.google.com/p/gpuocelot/ ) that creates an annotated trace of all of the CUDA calls that a program makes. The second part is a trace analysis tool that tries to determine the total execution time of the program that generated the trace using simple timing models.
The first part works stand-alone from the second part, and is actually distributed with Ocelot. You will need to check out the current version from Subversion, though, as this was recently added and we don’t have an official release that supports it yet. Basically, you want to compile your program with nvcc and then link it against Ocelot rather than libcudart.so. From that point, you should be able to enable trace generation using a config file and can probably extract a fair amount of information simply by examining the trace.
As for generating a trace, you need a config file in the directory from which you launch your program. An example is given here: ( http://code.google.com/p/gpuocelot/source/…t/config.ocelot ). Change line 21 from CudaRuntimeBase to TraceGeneratingCudaRuntime.
For the actual trace simulator, I’ll send the author an email to see if I can get a copy of the code.
Greg
edit: Also, you need to enable GPU devices in the config file and select a GPU rather than the Ocelot Emulator, otherwise you will get an inflated measurement of kernel execution times.