Survey of common CUDA runtime profiles: What is your application like?

Hello CUDA developers,

I am a student trying to get a better understanding of how CUDA is used in real-world systems, an attempt at getting out of theory land. My interest is in resource allocation issues. I am hoping that the community here might be able to give me some insight into the workloads folks typically deal with. Here are my questions:

  1. How is your application’s execution time split between the host CPU(s) and the GPU(s)? Would you say it’s 20/80? 50/50? 80/20?
  2. Does your application run periodically (e.g., in time steps) or in more of a one-shot manner?
  3. Assuming we’re all using multicore systems now, any experiences/comments concerning scalability and/or bottlenecks (many CPUs sharing few GPUs)? Have you ever run into a problem where more than one program wanted to use the GPU?
  4. Finally, what is your field/application?

Thank you very much. Your help would be greatly appreciated.

A wide range of applications is nicely documented at the CUDA Zone.

Thanks, SPWorley. While this is useful information, I am really interested in workload characteristics between the CPU and GPU. I guess I’ll dive into all of those papers and see what I can find. :)

I’ve profiled many of the CUDA SDK examples, but most of them are short apps meant to exercise the GPU. Few of them could make stand-alone applications.

  1. The only thing running on the CPU is reading input data from files and sending it to the GPU. After that, the CPU only issues kernel calls.
  2. A mix of the two, both achieving the same quantitative result.
  3. Not all that applicable, since I’m not using the CPU.
  4. Monte Carlo simulation in (medical) physics to solve Boltzmann’s equation. The application has zero interactivity, which helps for 1).

  1. GPU mostly. The CPU is used only for data I/O and some format conversions.
  2. Time steps.
  3. ?
  4. Audio processing.

  1. Roughly 50/50. Further parts of the application could be ported to the GPU, but with diminishing returns.
  2. Periodically (iterative approximation).
  3. The simulations are so demanding on memory (and CPU) that you would rarely run two at the same time. Instead, they run sequentially on a dedicated headless machine, so there are no allocation problems.
  4. Semiconductor optics.

Thank you for the responses. You’ve all been quite helpful. Though this is a small sample, it is starting to confirm my suspicions. First, the division of labor between the CPU and GPU varies from application to application (though I get the feeling the GPU typically accounts for a significantly larger portion of the execution time than the CPU). Second, the GPU is being used for both batch-like jobs and signal-processing-like jobs. And finally, most applications run in dedicated environments, so competition for the GPU is rare.

Please continue to share any more thoughts! More examples wouldn’t hurt.

Of the CPU work, can you say if this 50% is on a single processor or distributed amongst several (multi-threaded)? And if distributed amongst processors, is this 50% a measure of the CPU work in serial or in parallel?

  1. 99% GPU 1% CPU (maybe even less CPU)

  2. time steps. Millions of them.

??? Not sure what you are asking for. A properly configured cluster will schedule one CPU core along with your GPU job. On multi-core systems, this leaves the other CPU cores available for other CPU-only jobs to run (or they just sit idle). The program driving the GPU typically pegs its CPU core at 100% while polling for the GPU. If you don’t like that behavior, you can enable a less CPU-intensive mode at the cost of reducing your GPU app’s performance (about 10% in my app).

Not really. A cluster scheduler will only put one job on a GPU at a time, and NVIDIA thankfully gives us a compute-exclusive mode to forcibly prevent simultaneous jobs from even starting on a GPU. If you turn compute-exclusive off, you can of course force more than one app onto a single GPU. Each job just runs at a little less than 1/2 speed in that case (assuming there is enough memory on the GPU to accommodate both jobs).

  4. Molecular dynamics

Yes, the CPU part is fully multithreaded, and it’s 50% times four CPUs.

That’s also why I haven’t ported more parts to CUDA: Moving more work to the GPU could at most gain a factor of 2 over the current state.
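
That factor of 2 is what Amdahl's law gives for a 50/50 split. As a short check, let $f$ be the fraction of the current runtime still spent on the CPU (here $f = 0.5$, taken from the post) and $s$ the speedup that part would get from porting. Both symbols are introduced here for illustration only.

$$
S(s) = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - f} = \frac{1}{1 - 0.5} = 2
$$

Even if the remaining CPU part became infinitely fast, the GPU half of the runtime would stay, so the total speedup cannot exceed 2.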

How do I enable this compute-exclusive mode? I have not encountered this in any CUDA documentation or demonstration. Seems like this should be one of the first things mentioned for GPGPU computing…

Use the nvidia-smi utility. It is discussed in the Linux release notes. If you are not using Linux, I don’t know whether you can set compute-exclusive mode.