A few new to CUDA questions

Hi, I’ve just started programming in CUDA C and I’ve been looking through the “CUDA Programming Guide” and the “CUDA Best Practices Guide”, and they have raised several questions.

* Kernel calls are asynchronous (Programming guide 3.2.6)?

This half makes sense to me…

If kernel calls are asynchronous (by that I assume it means it starts the kernel on the GPU and then returns control to host code to allow further commands to be processed) why does host code after a kernel call not execute until the kernel has finished?

e.g.

//Do asynchronous kernel call
kernel<<<blocks,threads>>>();

//The host doesn't execute this until the kernel is done?
someHostFunction();

Apparently this could let you queue kernel calls…

//Do asynchronous kernel calls
kernel1<<<blocks,threads>>>();
kernel2<<<blocks,threads>>>();

//The host doesn't execute this until kernel2 is done?
someHostFunction();

But again this seems confusing, because although kernel2 is called asynchronously, someHostFunction() won’t execute until kernel2 is finished.

* How do you manage queued kernel calls on a device that supports concurrent kernel calls (cudaDeviceProp.concurrentKernels)?

Hopefully someone will be able to resolve my last question. Assuming you can queue kernel calls, what if you want to execute two kernels one after the other on a device that supports concurrent kernel execution? How do you ensure that they are NOT run concurrently?

kernel1<<<blocks,threads>>>();

//run kernel2 when kernel1 is complete.
kernel2<<<blocks,threads>>>();

And conversely, if you did want your kernels to execute concurrently, how would you enforce that?

* Global memory size and alignment requirements (CUDA Programming Guide 5.3.2.1.1)…?

There is a section that reads:

“Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types.”

Is this suggesting that if my data types don’t align, I will read data incorrectly?! That’s really bad… I have a struct defined as follows, which on the system I’m running on is 20 bytes.

typedef struct
{
        double x, y;
        int isNanoparticle;
} DirectorElement;

A linear array of these structs is constructed on the host and then copied onto the device. Is alignment going to cause a problem?

  • I have other problems but I will put those in another post, as they are slightly different.

Thanks in advance.

    Kernel calls are asynchronous. The only times they appear not to be are if you

      Use the visual profiler or Nsight

      Are using a WDDM version of Windows, in which case there is batching to reduce the very high GPU latency associated with the platform

      Call a synchronization primitive to force synchronous behaviour
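    For example, here is a minimal sketch (the kernel and launch configuration are just placeholders) showing the launch returning immediately and cudaDeviceSynchronize() being used as the synchronization primitive that forces the host to wait:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* some device work */ }

int main()
{
    myKernel<<<1, 32>>>();   // launch is asynchronous: control returns to the host immediately
    printf("This can print while the kernel is still running\n");
    cudaDeviceSynchronize(); // blocks the host until all queued GPU work has finished
    printf("The kernel has definitely finished now\n");
    return 0;
}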

    If you have multiple kernel calls following one another, they get queued.

    On devices capable of concurrent kernel execution, the streams mechanism is used to control concurrency. If you launch two kernels into different CUDA streams, they may execute concurrently. If you launch them into the same stream, they are serialized with respect to one another. If you don’t use streams, kernels are implicitly launched into CUDA stream 0, which is always serialized.
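    To make that concrete, here is a rough sketch (kernel bodies and launch configuration are placeholders) of launching into two different streams to allow concurrency, and into the same stream to force serialization:

#include <cuda_runtime.h>

__global__ void kernel1() { }
__global__ void kernel2() { }

int main()
{
    dim3 blocks(4), threads(128);
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different streams: on a device with concurrentKernels these MAY overlap.
    kernel1<<<blocks, threads, 0, s1>>>();
    kernel2<<<blocks, threads, 0, s2>>>();
    cudaDeviceSynchronize();

    // Same stream: kernel2 will not start until kernel1 has finished.
    kernel1<<<blocks, threads, 0, s1>>>();
    kernel2<<<blocks, threads, 0, s1>>>();
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}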

    The alignment passage you quote refers to data that is not aligned on a word boundary, and it should not affect a structure of the type you mention. However, for performance reasons, a structure of arrays is much preferable to an array of structures in CUDA. The GPU has extremely fast (“coalesced”) access for 4-, 8- and 16-byte types, which is much more efficient than reading something like your structure containing 4- and 8-byte members.
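    As a quick illustration of why the access pattern matters (kernel names are made up), with a structure of arrays consecutive threads read consecutive doubles, whereas with your array of structures each thread’s load is separated by the full 20-byte struct stride:

// Assumes the DirectorElement typedef from the question above.

// Array-of-structures read: each thread's load of .x is 20 bytes from its
// neighbour's, so the hardware needs more memory transactions per warp.
__global__ void readAoS(const DirectorElement* lattice, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = lattice[i].x;
}

// Structure-of-arrays read: consecutive threads read consecutive doubles,
// which coalesces into far fewer transactions.
__global__ void readSoA(const double* x, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}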


Thanks, that really clears that up.

Having an array of structures is certainly easier to use for what I’m trying to do. But I suppose I could do…

struct Lattice
{
	double* x;           //array of x
	double* y;           //array of y
	int* isNanoparticle; //array of isNanoparticle
};
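Something like this is roughly how I imagine getting those arrays onto the device; the helper name and the use of std::vector are just for illustration:

#include <vector>
#include <cuda_runtime.h>

// Unpack the existing host array of DirectorElement into three device arrays.
void copyLatticeToDevice(const std::vector<DirectorElement>& host, Lattice& dev)
{
    size_t n = host.size();
    std::vector<double> x(n), y(n);
    std::vector<int> isNano(n);
    for (size_t i = 0; i < n; ++i)
    {
        x[i] = host[i].x;
        y[i] = host[i].y;
        isNano[i] = host[i].isNanoparticle;
    }

    cudaMalloc((void**)&dev.x, n * sizeof(double));
    cudaMalloc((void**)&dev.y, n * sizeof(double));
    cudaMalloc((void**)&dev.isNanoparticle, n * sizeof(int));

    cudaMemcpy(dev.x, x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev.y, y.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev.isNanoparticle, isNano.data(), n * sizeof(int), cudaMemcpyHostToDevice);
}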

It would mean significant restructuring of my program to do this… But if it will give a performance increase then I may try it. My bigger problem at the moment, though, is that my code doesn’t work and cuda-gdb is being pretty useless ( http://forums.nvidia.com/index.php?showtopic=192411 )

Thanks for your quick and useful reply.