Hi, I’ve just started programming in CUDA C and I’ve been looking through the “CUDA Programming Guide” and the “CUDA Best Practices Guide”, and they have raised several questions.
* Kernel calls are asynchronous (Programming guide 3.2.6)?
This half makes sense to me…
If kernel calls are asynchronous (by which I assume the guide means that the launch starts the kernel on the GPU and then returns control to host code so that further commands can be processed), why does host code after a kernel call not execute until the kernel has finished?
e.g.
//Do asynchronous kernel call
kernel<<<blocks,threads>>>();
//The host doesn't execute this until the kernel is done?
someHostFunction();
Apparently this could let you queue kernel calls…
//Do asynchronous kernel calls
kernel1<<<blocks,threads>>>();
kernel2<<<blocks,threads>>>();
//The host doesn't execute this until kernel2 is done?
someHostFunction();
But again this seems confusing: although kernel2 is called asynchronously, someHostFunction() won’t execute until kernel2 has finished.
* How do you manage queued kernel calls on a device that supports concurrent kernel calls (cudaDeviceProp.concurrentKernels)?
Hopefully someone will be able to resolve my last question. Assuming you can queue kernel calls, what if, say, you want to execute 2 kernels one after the other on a device that supports concurrent kernel calls? How do you ensure that they are NOT executed concurrently?
kernel1<<<blocks,threads>>>();
//run kernel2 when kernel1 is complete.
kernel2<<<blocks,threads>>>();
And conversely, if you did want your kernels to execute concurrently, how would you enforce that?
* Global memory size and alignment requirements (CUDA Programming Guide 5.3.2.1.1)…?
There is a section that reads:
Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results
(off by a few words), so special care must be taken to maintain alignment of the
starting address of any value or array of values of these types.
Is this suggesting that if my data types aren’t aligned, I will read data incorrectly?! That’s really bad… I have a struct defined as follows, which on the system I’m running on is 20 bytes.
typedef struct
{
    double x, y;
    int isNanoparticle;
} DirectorElement;
A linear array of these structs is constructed on the host and then copied onto the device. Is alignment going to cause a problem?
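To make the alignment question concrete, here is a minimal host-side sketch in plain C of how I would inspect the layout my compiler gives this struct. The helper names (element_offset, doubles_stay_aligned, report_layout) are my own and do not come from either guide. It assumes a double is 8 bytes and that the array's base address is itself 8-byte aligned. It only describes the host layout and says nothing about what the device compiler does.

```c
#include <stddef.h>
#include <stdio.h>

typedef struct
{
    double x, y;
    int isNanoparticle;
} DirectorElement;

/* Hypothetical helper: byte offset of element i in a linear array. */
size_t element_offset(size_t i)
{
    return i * sizeof(DirectorElement);
}

/* Hypothetical helper: returns 1 if, given a base address aligned to
 * sizeof(double), the doubles of EVERY array element also start on a
 * sizeof(double) boundary. That needs both member offsets and the
 * element stride to be multiples of sizeof(double). */
int doubles_stay_aligned(void)
{
    return offsetof(DirectorElement, x) % sizeof(double) == 0
        && offsetof(DirectorElement, y) % sizeof(double) == 0
        && sizeof(DirectorElement) % sizeof(double) == 0;
}

/* Prints the host-side layout; the values depend on compiler and ABI. */
void report_layout(void)
{
    printf("sizeof(DirectorElement)  = %zu\n", sizeof(DirectorElement));
    printf("offset of x              = %zu\n", offsetof(DirectorElement, x));
    printf("offset of y              = %zu\n", offsetof(DirectorElement, y));
    printf("offset of isNanoparticle = %zu\n",
           offsetof(DirectorElement, isNanoparticle));
    printf("offset of element 1      = %zu\n", element_offset(1));
    printf("doubles stay aligned     = %s\n",
           doubles_stay_aligned() ? "yes" : "no");
}
```

Calling report_layout() from main() prints the numbers for the host compiler. With a 20-byte struct the stride is not a multiple of 8, so doubles_stay_aligned() returns 0, which is exactly the case I am worried about.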
- I have other problems, but I will put those in another post as those issues are slightly different.
Thanks in advance.