Specifics on performance

Hi, I’d like to ask the experts a few specifics.

First of all: when I launch a global function, does my CPU send a function call per thread to the GPU? Or does it simply send the total number of threads and the function call (with arguments, of course)?

Secondly: what is the best way to store constants on the device so that they can be used by all kernel functions? I was thinking about simply using __device__ in combination with const, but I also noticed that constant memory space in CUDA is not cached, so it can be really slow. Is it better to simply pass these values as function call parameters, or store them some other way? (Perhaps a texture or so.)

If anyone knows, please tell me :)

If I have more questions I’ll ask here

thx in advance

Are you asking about host threads or device threads? A single kernel launch just specifies the grid and block dimensions to the GPU. All device threads are launched by the GPU scheduler. For host threads, a kernel launch in each host thread is separate and will be sent to the GPU(s) independently.
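A minimal sketch to make that concrete (the kernel and sizes here are made up): the host issues one launch with the grid and block dimensions, and the GPU scheduler fans that out into all the device threads.

__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread computes its own index
    if (i < n)
        data[i] *= s;
}

// One call from the host, not one per thread:
// 256 blocks x 256 threads = 65536 device threads from a single launch.
scale<<<256, 256>>>(d_data, 2.0f, 65536);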

First, it is device memory that is not cached, and it is not slow. Device memory is blazing fast (140 GiB/s on GTX 280) assuming you read/write it coalesced.

Constant memory does have a small per-MP cache and is optimized for the case where all threads in a warp access the same element of the constant array (a broadcast). It is good for filter kernels and the like, where coalescing is not easily attainable and you reread the same element over and over. Also, there is no need for the __device__ qualifier. Just declare constant memory:

__constant__ float d_myarray[128];

And copy to it from the host like this:

float h_myarray[128];

// fill out array values

cudaMemcpyToSymbol(d_myarray, h_myarray, sizeof(h_myarray)); // see the CUDA programming guide for the full signature
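And to illustrate the broadcast pattern mentioned above, here is a hypothetical filter kernel where every thread in a warp reads the same constant element at each step, so the reads are served from the constant cache:

__global__ void filter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 128 > n)
        return;                           // skip threads that would read past the end
    float sum = 0.0f;
    for (int k = 0; k < 128; ++k)
        sum += d_myarray[k] * in[i + k];  // every thread reads d_myarray[k]: a broadcast
    out[i] = sum;
}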

For the first question, I was asking about a single host function calling a global function with a huge grid. I already had the feeling that it would issue multiple function calls instead of just a single one with a given block size… it’s why my CPU time goes through the roof when I launch my current app. I guess I’ll have to work with as many device functions as possible and try to avoid anything on the host.

No, calling a global function issues a single function call. The reason your CPU is maxing out is that synchronizing with the GPU (via cudaMemcpy or cudaThreadSynchronize) causes a spin-wait, spinning your CPU until the results come in. I think I heard the spin-waiting will be fixed in CUDA 2.1 (or later). In any case, you can tell your CPU to take a breather by calling the correct sleep() function for your OS. Although when you wake up, you may find that the results arrived some time ago.

But wouldn’t that mean that the current version of CUDA has a CPU bottleneck?

Also, I don’t use cudaMemcpy or cudaThreadSynchronize in my code (up until now). So I’m assuming CUDA does those things implicitly when synchronizing a bunch of global functions?

No. In this case, counter-intuitively, wasting CPU cycles actually makes your program faster (assuming you don’t have a background thread that wants those cycles). While CUDA is spin-waiting, it is polling the GPU to see when the operation is complete so that it can start the next one. Waiting longer between polls or using interrupts will lengthen the time it takes to detect when the operation is done and slow the overall run time.

The queue depth is a few hundred global calls, after which it will sync implicitly.

I understand that, but a faster CPU would mean faster polling, right? So in a way, the speed of your CPU will influence the speed of CUDA running on your GPU. Again, correct me if I’m wrong.

Oh, and thx for the help so far.

I guess in principle, yes. But in practice I don’t notice a difference in performance between a 5-year-old CPU and today’s CPU w/ the same GPU. And my app makes hundreds to thousands of kernel calls per second, with hundreds of these implicit synchronization points per second (copying a single float back from the GPU to make flow control decisions on the CPU).
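For reference, that pattern looks roughly like this (the kernel and the d_residual buffer are hypothetical):

float h_residual;
do {
    iterate<<<grid, block>>>(d_state, d_residual);  // hypothetical solver step writing one float
    // This copy is an implicit synchronization point: it waits for the
    // kernel to finish, then brings one float back for the loop test.
    cudaMemcpy(&h_residual, d_residual, sizeof(float), cudaMemcpyDeviceToHost);
} while (h_residual > 1e-6f);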

I forgot to mention that if regaining the CPU cycles for background tasks is more important than the fastest possible polling, you can write your own polling loop using cudaEventQuery (search the forums, there are a number of threads with examples).
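A minimal sketch of such a loop, assuming a POSIX usleep() (use the equivalent sleep call on your OS):

cudaEvent_t done;
cudaEventCreate(&done);

mykernel<<<grid, block>>>(d_data);  // hypothetical kernel launch
cudaEventRecord(done, 0);           // the event completes when the kernel does

// Poll at a relaxed rate instead of spin-waiting, freeing the CPU between checks.
while (cudaEventQuery(done) == cudaErrorNotReady)
    usleep(100);                    // 100 microseconds between polls

cudaEventDestroy(done);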