Kernels launch - parallel or serial?

Hi CUDA guys,

I need your input on a specific topic related to kernel execution. I was talking with a guy who has studied CUDA, and he made the following statement:

“Let’s say that we have 128 processing units. You can run a kernel that only uses one half, and immediately make another, different kernel call that uses the rest of the resources (the other 64), doing another kind of computation.”

According to the SIMD architecture model, I would expect that if I call one kernel that only uses one processing unit, the next kernel call (with different instructions) will have to wait for the previous kernel to finish, even if the first kernel only uses 0.0001% of the GPU resources.

What do you say? I’d appreciate any feedback on this, thanks!!


I believe concurrent kernel execution is restricted to Fermi-based cards. Previous generations do not have this capability, though they can overlap memcopies with kernel execution.
This means that on pre-Fermi cards, the next kernel will have to wait for the first kernel to complete, regardless of the number of MPs used by the first kernel.


That’s not true for the current hardware. Concurrent kernel execution is not yet supported; a kernel must end before another is issued, regardless of how few resources it may use.

What this guy describes is likely how it will work in Fermi-based GPUs, which should be out in Q1 2010.

No, this is not true: you can set different kernels running in parallel from your host code. You would need to call cudaThreadSynchronize() in your host code to make it wait for a kernel to finish; otherwise the next kernel call is set in motion while the other is still running (assuming you have not maxed out the available GPU cores). I’ve just tested this on my MacBook Pro with a GeForce 9600M and found it to work fine.

No, I think you are confused by the asynchronous calling convention in CUDA. Current CUDA devices cannot execute multiple kernels at once. However, kernel calls return immediately, letting you overlap CPU computations with the GPU. The kernels are queued up in the driver and run in sequence, though. This is even true if you issue kernel calls on multiple streams.
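To make the distinction concrete, here is a minimal sketch of what that asynchronous behaviour looks like from host code. The kernel names, grid sizes, and operations are made up for illustration:

```cuda
// Two trivial kernels, just to have something to launch.
__global__ void kernelA(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void kernelB(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void launchBoth(float *d_a, float *d_b, int n)
{
    kernelA<<<(n + 255) / 256, 256>>>(d_a, n); // returns immediately; kernel is queued
    kernelB<<<(n + 255) / 256, 256>>>(d_b, n); // queued behind kernelA, runs after it

    // The CPU is free to do other work here while the GPU drains its queue,
    // but on current hardware the two kernels still run one after the other.

    cudaThreadSynchronize(); // block the host until both kernels have finished
}
```

So both launches return at once, which can look like parallel execution from the host side even though the GPU serializes them.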

So why do we bother with the whole grid of threads stuff? Why not just use an entire device every time? If this is true then roll on Fermi… presumably, though, kernels can run in parallel on different devices, right?

I’m not sure what you mean.
The grid can be much larger or smaller than the number of compute units on the GPU. How do you expect to specify your problem size without it? If you add two vectors, should CUDA assume that they are exactly as long as the number of SPs in the device you happen to be executing on?
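To illustrate that the grid size is a property of the problem rather than of the device, here is a standard vector-add sketch (names are illustrative):

```cuda
// Each thread handles one element; the guard handles grids larger than n.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// The grid is sized from n, with no reference to how many SPs the device has.
void launchVecAdd(const float *a, const float *b, float *c, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads; // round up to cover all n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
}
```

The hardware scheduler then maps however many blocks you asked for onto however many multiprocessors exist.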

Kernels can run in parallel on different devices, yes.

Hmm it seems I’ve misunderstood something here, partly from the now apparently misleading fig 2.2 on page 11 of the programming guide which shows 2 different grids of different sizes / configurations (2x3 and 3x2) apparently co-existing.

Also my own experiments have shown 2 kernels running in parallel on my MacBook, but I’ve got 2 GPUs in it, so that explains that.

Finally it would seem possible to have the same effect as several different kernels in parallel on one device simply by grabbing enough memory for them all, and using conditionals on the thread / block index to determine which code to run.

I’m working hard here to get up to scratch on CUDA so please bear with me and both forgive and correct my misunderstandings!

Not to keep harping on this, but your measurement method or interpretation might be misleading you here.

CUDA does not run kernels on multiple devices unless you do a little work to set this up. When you call your first CUDA function in your program (assuming you are using the runtime API), the library establishes a CUDA context for you, by default on device 0, whichever GPU that happens to be in your computer. All kernels you invoke will execute on this device for the entire life of the host thread. (With the driver API, you can destroy a context and create a new one associated with a different device.) You can pick a different device by calling cudaSetDevice() before any other CUDA function.

To run kernels (different invocations, not one giant grid) on multiple devices, you have to first create 1 host thread per CUDA device, then call cudaSetDevice() in each host thread with a different value. Then each host thread is free to queue up kernels, which will automatically go into the queues of their respective CUDA devices. At the moment, a single threaded program cannot use multiple CUDA devices at the same time.
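A minimal sketch of that one-host-thread-per-device pattern, using pthreads; the kernel, sizes, and two-device count are assumptions for illustration:

```cuda
#include <pthread.h>
#include <stdio.h>

__global__ void work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = i;
}

// Each host thread binds itself to one device, then queues kernels on it.
void *deviceWorker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev); // must come before any other CUDA call in this thread

    int n = 1024;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaThreadSynchronize(); // wait for this device's kernel to finish
    cudaFree(d);
    return NULL;
}

int main(void)
{
    int ids[2] = {0, 1}; // assumes two CUDA devices are present
    pthread_t t[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, deviceWorker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```

The two kernels genuinely run concurrently here because each lands in a different device's queue.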

If this is what you are doing, feel free to tell me to buzz off. :) Just wanted to be clear.

Yes, this is called the “fat kernel” trick, where you run completely disjoint code with an if statement that depends on the value of blockIdx.x. The disadvantage to this approach is that if your two tasks take different amounts of time, your fat kernel will run for the longer of the two tasks, leaving some of the chip idle. In addition, the block scheduler is suboptimal when blocks take very different amounts of time to complete, so it is possible that the multiprocessors will be underutilized even while both tasks are executing.
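A bare-bones sketch of the fat-kernel idea described above (the tasks are placeholders):

```cuda
// "Fat kernel": two disjoint tasks folded into one launch, split on blockIdx.x.
__global__ void fatKernel(float *a, float *b, int n, int splitBlock)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x < splitBlock) {
        // task 1: e.g. process array a
        if (i < n) a[i] += 1.0f;
    } else {
        // task 2: e.g. process array b (note the index offset past task 1's blocks)
        int j = i - splitBlock * blockDim.x;
        if (j < n) b[j] *= 2.0f;
    }
}
```

The launch runs for as long as the slower task, which is exactly the idle-multiprocessor drawback noted above.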

This is why Fermi lifting this restriction will be a big deal. :) I suspect that NVIDIA will expose the multi-kernel ability through their existing CUDA stream concept. Current GPUs already can overlap kernel execution on one stream with a memory copy on another stream. Fermi would just enable kernels to run concurrently on different streams, which is the perfect abstraction for this. What remains to be seen is how they let the programmer control the relative multiprocessor allocation between streams. Probably the easiest way would be a function which let you set the number of multiprocessors a given stream could use. Then you could decide on the relative importance of the tasks you were running and assign multiprocessors to each accordingly. (Hopefully this will make PhysX and 3D coexist happily on one card.)
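For reference, here is a sketch of the overlap that current (pre-Fermi) hardware already supports: a kernel in one stream running concurrently with an async copy in another. The kernel and buffer names are made up; h_b must be pinned memory (allocated with cudaMallocHost) for the copy to actually be asynchronous:

```cuda
__global__ void crunch(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

void overlapCopyAndCompute(float *d_a, float *d_b, const float *h_b, int n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernel in stream 0 can overlap with the host-to-device copy in stream 1.
    crunch<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);
    cudaMemcpyAsync(d_b, h_b, n * sizeof(float),
                    cudaMemcpyHostToDevice, s1);

    cudaThreadSynchronize(); // wait for both streams to drain

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```

If Fermi does expose concurrent kernels through streams, launching two kernels into s0 and s1 above would presumably be all it takes.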

Thanks, it’s starting to make a bit more sense… slowly.

I have a few questions still, though. On my MacBook I run a program calling one kernel on data1; then I change it to run on data2; then I change it to 2 kernel calls, one for data1 and one for data2. The program runs fine and the results are as expected. However, all 3 versions take exactly the same execution time, which led me to believe that it’s running the 2 kernels in parallel. Perhaps the compiler is doing something clever here to merge the 2 functions into 1??

Also, there are machines with more devices than CPU cores, so with my basic knowledge of OpenMP I don’t quite see how to have enough host threads to match the number of devices. Maybe I misunderstand again and it’s possible to switch devices within a thread with the command you mention.

I’ll be trying this this week to see what results I get. I’m about to order several machines and cards… the plan is to have 2 Tesla and 2 Fermi cards per 8-core machine. Maybe I should be getting more Fermi cards instead; this really does seem like a big deal!

It’s not automatically spawning kernels on different devices or anything like that. Your timing methodology is probably just wrong (no cudaThreadSynchronize() before stopping the timer, not timing with CUDA events, something like that).
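For what it’s worth, a minimal sketch of timing a kernel with CUDA events, which measures the GPU’s own clock rather than host wall time (the kernel name and sizes are placeholders):

```cuda
#include <stdio.h>

__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Returns the kernel's execution time in milliseconds.
float timeKernel(float *d, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // wait until the kernel has actually finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Without the cudaEventSynchronize (or a cudaThreadSynchronize before a host-side timer stops), the asynchronous launch makes the kernel look nearly free.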

That’s true, I’m timing the execution of the whole program. I need to look into how to time with CUDA events properly, though I would have expected the sequential execution of the two kernels to have an effect on the overall time of the program.

Thanks everyone for your comments here, I think I’m learning a lot!

How long are the times involved here and how much work is your kernel actually doing? A simple kernel can be so quick that it might contribute a negligible amount of time to the overall program execution. (To calibrate your thinking, a kernel that does nothing takes 10-30 microseconds, and your 9600M can move 20 GB/sec from global memory to the GPU, assuming 80% of theoretical peak.) The compiler cannot merge two kernels together for you.

Putting more CUDA devices than CPU cores into a computer is actually a bit challenging. A quad-core CPU is dirt cheap, but it is very hard to install four GTX 285 cards in one computer (they don’t fit in a standard ATX case, for one thing). You can switch to GTX 295 cards, which have two GPUs each, but then powering 3 or 4 GTX 295s is non-trivial. Even if you run 4 GTX 295 cards in one computer, you can still get pretty good throughput using a Core i7 processor with Hyper-Threading, giving you 8 logical cores to match your 8 CUDA devices.

Also, there is no requirement that you have as many CPU cores as active host threads using CUDA devices. One device per core gives you the lowest latency when getting results from a kernel execution, but for long running kernels (seconds), the extra latency could be negligible. People have run 8 CUDA devices on a quad-core CPU (without hyperthreading) and still gotten good performance for their particular situation. As far as the host OS is concerned, you can make as many threads as you want. You just can’t expect them all to get CPU time simultaneously.

Not to discourage buying Tesla, but are you sure you need that if you are just starting out with CUDA? For large deployments, Tesla is great, but for development workstations, you can usually do just fine with a GTX 285 (and maybe a second, lighter-weight card for the display) and save a bunch of money for your final deployment when you know better what your needs are. :)

Actually I’ve made a toy benchmark some time ago that looked like this:

if (blockIdx.x < val) {
    // do 1000 loops of long number crunching
} else {
    // do 2000 loops of the same number crunching
}
When val was selected to divide the grid in half I got the same total performance as for a slim kernel that does 1500 loops.

It’s a trivial case, but it shows that fat kernels might not be so evil, nor the scheduler so primitive.

I’m part of a university research lab developing autonomous robotics. Our work involves genetic algorithms, neural networks, some physics simulation, and a lot of varied image and audio processing, essentially all running at once. Most of this stuff is suitable for parallel computation, and with real humanoid robotics it MUST run in real time. We were about to place an order for some 12-core servers, or even a 24-core server, as we need more computational power than we have right now, but then I discovered the CUDA option. That original order is now on hold while I take a crash course in CUDA to make sure it can deliver what I think it can. So far it’s looking good, and to take advantage of NVIDIA’s mad science offer we need to place the order by the end of the month (which, given university red tape, means placing the order in the next few days). For universities, if we pay for a Fermi card now, we get a Tesla and then a Fermi when it comes out, and we get to keep the Tesla (unlike non-university customers). This is why I was planning 2 Fermis and 2 Teslas per machine; it’s essentially a buy-one-get-one-free offer.

From my investigations so far, we could expect a 6-12x speedup with the 12- or 24-core servers, but following the CUDA route I’m finding more like 50-100x in some of the neural net applications I’ve written (and I’m sure my code is far from optimal right now), and that’s on my GeForce 8600 GT. Obviously the visual processing is going to work well, so right now this clearly seems the way to go. For some of my work I will need to be running several different programs/kernels to do different visual processing operations and feeding their output to a very large hierarchy of neural networks.

The price of many-core servers is prohibitive, and while the overheads for clusters can work out, they can be a pain too. Right now the graphics card processing option really looks good, but do tell me if you can see a problem in any of this.



Ok, fair enough. I just bring it up because a GTX 285 is just as fast as a Tesla C1060, though with only 1 GB of memory instead of 4 GB. However, a GTX 285 costs about $350, compared to $900-$1800 (depending on what sale is running) for a C1060.

If you can get 50-100x improvement with a GeForce 8600 GT, then it probably means there is something very wrong with your CPU code. An 8600 GT card only has 32 stream processors running at 1.1 GHz (so probably ~70 GFLOPS at best) and 12.8 or 22.4 GB/sec of device memory bandwidth. You might see a factor of 5-10 if your CPU isn’t very recent, but 100x suggests you have a lot of CPU performance left on the table.

No, you are correct, I’m getting 10x not 100x… but still, with only 32 cores and processing many times that in threads, I can expect big things of the Tesla, or even of the 850.