curandGenerate() not returning immediately

new to cuda , and more familiar with opencl, trying to use the curand library, it says curandGenerate is launched “asynchronously” p 5 curand guide, but it doesn’t, http://pastebin.com/8WUBaD5n maybe its due to memcopy not being asynchronously. however, I want both, curandGenerate and cudaMemcpy to run in the background. in other words, or more generally, how can I do parallel cpu gpu computation in cuda?

this is on tesla 2050, linux (SLES11), cudatoolkit 4.1

thanks
T

cudaMemcpy() is indeed a synchronous operation. If you want overlap between CPU work, GPU work, and copies between CPU and GPU, you would want to learn about CUDA streams and cudaMemcpyAsync(). You may find the CUDA API online documentation helpful in that endeavor:

http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html

I am quite sure there are worked examples of CUDA streams, and overlapped CPU / GPU work in the SDK, but I do not offhand know what they are. Maybe a forum participant who is familiar with the SDK example apps can point you (as a guess, look for an app called simpleStreams).

[later:]

Here is a link to an online copy of the simpleStreams app:

Right, I noticed the plain cudaMemcpy is synchronous and now changed it to Async, with setting up a stream as per p31 of cuda programming guide.

Still, curandGenerate and cudaMemcpyAsync return after 10 secods (of creating 240mil random numbers), and do not return immediately, meaning that the function crgen does not return immediately.

http://pastebin.com/CRYeXWwA

T

As far as I can see (apologies if I overlooked something; it’s a bit late in the day for me) all the operations are in the same stream, which means cudaMemcpyAsync() has to wait for curandGenerate() to finish and the situation is exactly the same as before. Only GPU operations (including host / device copies) in different stream can overlap. As the curandGenerate() call is described as asynchronous, give the following a try:

(1) start host timer
(2) call curandGenerate()
(3) perform 10 seconds worth of host-side computation
(4) call cudaThreadSynchronize()
(5) stop host timer

If things work as expected, the total time between the timer calls should be about 10 seconds, indicating that the 10 seconds worth of host-side computation overlapped with the 10 seconds worth of curandGenerate() computation on the device.

makes sense what you say. the actual copy of the rngs should definitively wait til rngs are created (although it could be optimized, too).
what I actually want is to issue the 2 commands (generate and copy) and then forget about them. Issuing them or putting them onto the “command queue” should take no more than a microsecond, and the execution should return. I will check 20 seconds later if they have finished, but I cannot wait 20 seconds doing nothing (on the host/cpu).

I think opencl works like that. If there is a way in cuda I would appreciate any hint.

thanks
T

Kernel launches in CUDA are indeed asynchronous, i.e. the relevant information is pushed into a queue and control returns to the caller almost immediately. I have observed launch times as low as two microseconds for empty kernels. Using a high-precision host-side timer you can observe this by doing

(1) record host timer
(2) launch kernel
(3) record host timer

The time difference between steps (1) and (3) should be in the single-digit microsecond range. Since curandGenerate() is described as an asynchronous call, it should behave in analogous fashion, as it presumably maps to one or several kernel launches internally. I have not used CURAND myself and thus have no firsthand experience. I will check with relevant engineers regarding curandGenerate() behavior, in case there is a documentation error.

I checked with the CURAND engineers and also wrote myself a little test program that demonstrates that curandGenerate operates asynchronously, as documented. The output from the little test app shown below on my somewhat older 64-bit RHEL Linux system with a C2070 is as follows:

generating 240000000 normally distributed DP random numbers

time to launch curandGenerateNormalDouble:  0.000004 seconds

time to execute curandGenerateNormalDouble: 0.235601 seconds

The call to curandGenerateNormalDouble() returned control to the caller after 4 microseconds, while it took the GPU roughly a quarter of a second to finish generating the numbers. Note that the non-blocking action of cudaMemcpyAsync() requires that the host memory pointer points to pinned memory (from a call to cudaMallocHost(), for example) so the GPU can DMA the data directly into this pinned host memory. If a pointer to normal pageable host memory is passed, cudaMemcpyAsync() gracefully degrades, turning into a blocking cudaMemcpy() which in addition requires an internal host-to-host copy to transfer the data from a host-side pinned DMA buffer inside the driver to its final destination.

#include <stdio.h>

#include <stdlib.h>

#include "curand.h"

#define N 240000000

// Macro to catch CUDA errors in CUDA runtime calls

#define CUDA_SAFE_CALL(call)                                          \

do {                                                                  \

    cudaError_t err = call;                                           \

    if (cudaSuccess != err) {                                         \

        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

                 __FILE__, __LINE__, cudaGetErrorString(err) );       \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

} while (0)

// Macro to catch errors in CURAND calls

#define CURAND_SAFE_CALL(call)                                        \

do {                                                                  \

    curandStatus_t err = call;                                        \

    if (CURAND_STATUS_SUCCESS != err) {                               \

        fprintf (stderr, "CURAND error in file '%s' in line %i\n",    \

                 __FILE__, __LINE__);                                 \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

} while (0)

#include <stddef.h>

#include <sys/time.h>

static double second (void)

{

    struct timeval tv;

    gettimeofday(&tv, NULL);

    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;

}

int main (void)

{

    curandGenerator_t gen;

    double* devData;

    double start, stop, stop2;

CUDA_SAFE_CALL (cudaMalloc((void**) &devData, N*sizeof(devData[0])));

CURAND_SAFE_CALL (curandCreateGenerator (&gen, CURAND_RNG_PSEUDO_DEFAULT));

    CURAND_SAFE_CALL (curandSetPseudoRandomGeneratorSeed (gen, 1234ULL));

/* execute twice to warm up CPU caches */

    for (int k = 0; k < 2; k++) {

        start = second();

        curandGenerateNormalDouble(gen, devData, N, 0, 1);

        stop = second();

    }

    CUDA_SAFE_CALL (cudaThreadSynchronize());

    stop2 = second();

printf ("generating %d normally distributed DP random numbers\n", N);

    printf ("time to launch curandGenerateNormalDouble:  %.6f seconds\n", 

            stop-start);

    printf ("time to execute curandGenerateNormalDouble: %.6f seconds\n", 

            stop2-start);

CURAND_SAFE_CALL (curandDestroyGenerator (gen));

    CUDA_SAFE_CALL (cudaFree (devData));

return EXIT_SUCCESS;

}