A few new to CUDA questions

Hi, I’ve just started programming in CUDA C and I’ve been looking through the “CUDA Programming Guide” and the “CUDA Best Practices Guide”, and they have raised several questions.

* Kernel calls are asynchronous (Programming guide 3.2.6)?

This half makes sense to me…

If kernel calls are asynchronous (by that I assume it means it starts the kernel on the GPU and then returns control to host code to allow further commands to be processed) why does host code after a kernel call not execute until the kernel has finished?

e.g.

//Do asynchronous kernel call
kernel<<<blocks,threads>>>();

//The host doesn't execute this until the kernel is done?
someHostFunction();

Apparently this could let you queue kernel calls…

//Do asynchronous kernel calls
kernel1<<<blocks,threads>>>();
kernel2<<<blocks,threads>>>();

//The host doesn't execute this until kernel2 is done?
someHostFunction();

But again this seems confusing, because although kernel2 is called asynchronously, someHostFunction() won’t execute until kernel2 is finished.

* How do you manage queued kernel calls on a device that supports concurrent kernel calls (cudaDeviceProp.concurrentKernels)?

Hopefully someone will be able to resolve my last question. Assuming you can queue kernel calls, what if you want to execute two kernels one after the other on a device that supports concurrent kernel execution? How do you ensure that they are NOT run concurrently?

kernel1<<<blocks,threads>>>();

//run kernel2 when kernel1 is complete.
kernel2<<<blocks,threads>>>();

And conversely, if you did want your kernels to execute concurrently, how would you enforce that?

* Global memory size and alignment requirements (CUDA Programming Guide 5.3.2.1.1)…?

There is a section that reads:

“Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types.”

Is this suggesting that if my data types don’t align, I will read data incorrectly?! That’s really bad… I have a struct defined as follows, which on the system I’m running on is 20 bytes.

typedef struct
{
        double x, y;
        int isNanoparticle;
} DirectorElement;

A linear array of these structs is constructed on the host and then copied onto the device. Is alignment going to cause a problem?

  • I have other problems but I will put those in another post, as they are slightly different.

Thanks in advance.

    Kernel calls are asynchronous. The only times they appear not to be are if you

      Use the visual profiler or Nsight

      Are using a WDDM version of Windows, in which case there is batching to reduce the very high GPU latency associated with the platform

      Call a synchronization primitive to force synchronous behaviour
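    For example, here is a minimal sketch (the kernel and launch configuration are just placeholders) showing the launch returning immediately and cudaDeviceSynchronize() being used as the synchronization primitive that forces the host to wait:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* some device work */ }

int main()
{
    myKernel<<<1, 32>>>();   // launch is asynchronous: control returns to the host immediately
    printf("This can print while the kernel is still running\n");
    cudaDeviceSynchronize(); // blocks the host until all queued GPU work has finished
    printf("The kernel has definitely finished now\n");
    return 0;
}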

    If you have multiple kernel calls following one another, they get queued.

    On devices capable of concurrent kernel execution, the streams mechanism is used to control concurrency. If you launch two kernels into different CUDA streams, they may execute concurrently. If you launch them into the same stream, they are serialized with respect to one another. If you don’t use streams, kernels are implicitly launched into CUDA stream 0, which is always serialized.
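    To make that concrete, here is a rough sketch (kernel bodies and launch configuration are placeholders) of launching into two different streams to allow concurrency, and into the same stream to force serialization:

#include <cuda_runtime.h>

__global__ void kernel1() { }
__global__ void kernel2() { }

int main()
{
    dim3 blocks(4), threads(128);
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different streams: on a device with concurrentKernels these MAY overlap.
    kernel1<<<blocks, threads, 0, s1>>>();
    kernel2<<<blocks, threads, 0, s2>>>();
    cudaDeviceSynchronize();

    // Same stream: kernel2 will not start until kernel1 has finished.
    kernel1<<<blocks, threads, 0, s1>>>();
    kernel2<<<blocks, threads, 0, s1>>>();
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}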

    The alignment passage you quote refers to data that is not aligned on a word boundary, and it should not affect a structure of the type you mention. However, for performance reasons, a structure of arrays is much preferable to an array of structures in CUDA. The GPU has extremely fast (“coalesced”) access for 4-, 8- and 16-byte types, which is much more efficient than reading something like your structure containing 4- and 8-byte members.
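    As a quick illustration of why the access pattern matters (kernel names are made up), with a structure of arrays consecutive threads read consecutive doubles, whereas with your array of structures each thread’s load is separated by the full 20-byte struct stride:

// Assumes the DirectorElement typedef from the question above.

// Array-of-structures read: each thread's load of .x is 20 bytes from its
// neighbour's, so the hardware needs more memory transactions per warp.
__global__ void readAoS(const DirectorElement* lattice, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = lattice[i].x;
}

// Structure-of-arrays read: consecutive threads read consecutive doubles,
// which coalesces into far fewer transactions.
__global__ void readSoA(const double* x, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}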


Thanks, that really clears that up.

Having an array of structures is certainly easier to use for what I’m trying to do. But I suppose I could do…

struct Lattice
{
	double* x;           //array of x
	double* y;           //array of y
	int* isNanoparticle; //array of isNanoparticle
};
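Something like this is roughly how I imagine getting those arrays onto the device; the helper name and the use of std::vector are just for illustration:

#include <vector>
#include <cuda_runtime.h>

// Unpack the existing host array of DirectorElement into three device arrays.
void copyLatticeToDevice(const std::vector<DirectorElement>& host, Lattice& dev)
{
    size_t n = host.size();
    std::vector<double> x(n), y(n);
    std::vector<int> isNano(n);
    for (size_t i = 0; i < n; ++i)
    {
        x[i] = host[i].x;
        y[i] = host[i].y;
        isNano[i] = host[i].isNanoparticle;
    }

    cudaMalloc((void**)&dev.x, n * sizeof(double));
    cudaMalloc((void**)&dev.y, n * sizeof(double));
    cudaMalloc((void**)&dev.isNanoparticle, n * sizeof(int));

    cudaMemcpy(dev.x, x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev.y, y.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev.isNanoparticle, isNano.data(), n * sizeof(int), cudaMemcpyHostToDevice);
}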

It would mean significant restructuring of my program to do this… But if it will give a performance increase then I may try it. My bigger problem at the moment, though, is that my code doesn’t work and cuda-gdb is being pretty useless ( http://forums.nvidia.com/index.php?showtopic=192411 )

Thanks for your quick and useful reply.