Specifics on performance

Hi, I’d like to ask the experts a few specifics.

First of all: when I launch a global function, does my CPU send a function call per thread to the GPU? Or does it simply send the total number of threads and the function call (with arguments, of course)?

Secondly: what is the best way to store constants on the device so that they can be used by all kernel functions? I was thinking about simply using __device__ in combination with const, but I also noticed that constant memory space in CUDA is not cached, so it can be really slow. Is it better to simply pass these values as function call parameters, or store them some other way? (Perhaps a texture or so.)

If anyone knows, please tell me :)

If I have more questions I’ll ask here

thx in advance

Are you asking about host threads or device threads? A single kernel launch just specifies the grid and block dimensions to the GPU. All device threads are launched by the GPU scheduler. For host threads, a kernel launch in each host thread is separate and will be sent to the GPU(s) independently.
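A minimal sketch to make that concrete (the kernel and sizes here are made up): the host issues one launch with the grid and block dimensions, and the GPU scheduler fans that out into all the device threads.

__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread computes its own index
    if (i < n)
        data[i] *= s;
}

// One call from the host, not one per thread:
// 256 blocks x 256 threads = 65536 device threads from a single launch.
scale<<<256, 256>>>(d_data, 2.0f, 65536);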

First, it is device memory that is not cached, and it is not slow. Device memory is blazing fast (140 GiB/s on GTX 280) assuming you read/write it coalesced.

Constant memory does have a small per-MP cache and is optimized for the case where all threads in a warp access the same element of the constant array (a broadcast). It is good for filter kernels and the like, where coalescing is not easily attainable and you reread the same element over and over. Also, there is no need for the __device__ qualifier. Just declare constant memory:

__constant__ float d_myarray[128];

And copy to it from the host like this:

float h_myarray[128];

// fill out array values

cudaMemcpyToSymbol(d_myarray, h_myarray, sizeof(h_myarray)); // see the CUDA programming guide for the full signature
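And to illustrate the broadcast pattern mentioned above, here is a hypothetical filter kernel where every thread in a warp reads the same constant element at each step, so the reads are served from the constant cache:

__global__ void filter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 128 > n)
        return;                           // skip threads that would read past the end
    float sum = 0.0f;
    for (int k = 0; k < 128; ++k)
        sum += d_myarray[k] * in[i + k];  // every thread reads d_myarray[k]: a broadcast
    out[i] = sum;
}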

For the first question, I was asking about a single host function calling a global function with a huge grid. I already had the feeling that it would issue multiple function calls instead of just a single one with a given block size… it’s why my CPU time goes through the roof when I launch my current app. I guess I’ll have to work with as many device functions as possible and try to avoid anything on the host.

No, calling a global function issues a single function call. The reason your CPU is maxing out is that synchronizing with the GPU (via cudaMemcpy or cudaThreadSynchronize) causes a spin-wait, spinning your CPU until the results come in. I think I heard the spin-waiting will be fixed in CUDA 2.1 (or later). In any case, you can tell your CPU to take a breather by calling the correct sleep() function for your OS. Although when you wake up, you may find that the results arrived some time ago.

But wouldn’t that mean that the current version of CUDA has a CPU bottleneck?

Also, I don’t use cudaMemcpy or cudaThreadSynchronize in my code (up until now). So I’m assuming CUDA does those things implicitly when synchronizing a bunch of global functions?

No. In this case, counter-intuitively, wasting CPU cycles actually makes your program faster (assuming you don’t have a background thread that wants those cycles). While CUDA is spin-waiting, it is polling the GPU to see when the operation is complete so that it can start the next one. Waiting longer between polls or using interrupts will lengthen the time it takes to detect when the operation is done and slow the overall run time.

The queue depth is a few hundred global calls, after which it will sync implicitly.

I understand that, but a faster CPU would mean faster polling, right? So in a way, the speed of your CPU will influence the speed of CUDA running on your GPU. Again, correct me if I’m wrong.

Oh, and thx for the help so far.

I guess in principle, yes. But in practice I don’t notice a difference in performance between a 5-year-old CPU and today’s CPU w/ the same GPU. And my app makes hundreds to thousands of kernel calls per second, with hundreds of these implicit synchronization points per second (copying a single float back from the GPU to make flow control decisions on the CPU).
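For reference, that pattern looks roughly like this (the kernel and the d_residual buffer are hypothetical):

float h_residual;
do {
    iterate<<<grid, block>>>(d_state, d_residual);  // hypothetical solver step writing one float
    // This copy is an implicit synchronization point: it waits for the
    // kernel to finish, then brings one float back for the loop test.
    cudaMemcpy(&h_residual, d_residual, sizeof(float), cudaMemcpyDeviceToHost);
} while (h_residual > 1e-6f);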

I forgot to mention that if regaining the CPU cycles for background tasks is more important than the fastest possible polling, you can write your own polling loop using cudaEventQuery (search the forums, there are a number of threads with examples).
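A minimal sketch of such a loop, assuming a POSIX usleep() (use the equivalent sleep call on your OS):

cudaEvent_t done;
cudaEventCreate(&done);

mykernel<<<grid, block>>>(d_data);  // hypothetical kernel launch
cudaEventRecord(done, 0);           // the event completes when the kernel does

// Poll at a relaxed rate instead of spin-waiting, freeing the CPU between checks.
while (cudaEventQuery(done) == cudaErrorNotReady)
    usleep(100);                    // 100 microseconds between polls

cudaEventDestroy(done);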