Differences between host API and device API for CURAND?

For the purposes of gradual migration, it’s essential that I have an RNG algorithm that can produce the same values on the GPU as on the CPU given the same starting input.

I’m a little confused as to why the host API for CURAND doesn’t provide the same algorithm as the device API for producing pseudorandom sequences incrementally; it’s clearly not designed to be used like this:

#include <assert.h>
#include <curand.h>

int main(int argc, char** argv)
{
    curandGenerator_t generator;

    // Note: wrapping calls with side effects in assert() means they
    // disappear if NDEBUG is defined.
    assert(curandCreateGeneratorHost(&generator, CURAND_RNG_PSEUDO_DEFAULT) == CURAND_STATUS_SUCCESS);
    assert(curandSetPseudoRandomGeneratorSeed(generator, 0ULL) == CURAND_STATUS_SUCCESS);
    assert(curandSetGeneratorOffset(generator, 0ULL) == CURAND_STATUS_SUCCESS);
    assert(curandSetGeneratorOrdering(generator, CURAND_ORDERING_PSEUDO_DEFAULT) == CURAND_STATUS_SUCCESS);
    assert(curandGenerateSeeds(generator) == CURAND_STATUS_SUCCESS);

    float* randoms = new float[1];

    // Generate one value per call, mimicking an incremental CPU RNG.
    for (int i = 0; i < 1000000; i++)
    {
        assert(curandGenerateUniform(generator, randoms, 1) == CURAND_STATUS_SUCCESS);
    }

    delete[] randoms;
    assert(curandDestroyGenerator(generator) == CURAND_STATUS_SUCCESS);
}

The above code takes around 15 seconds to produce one million random numbers. In contrast, a single call to curandGenerateUniform that fills an array with one million values takes about 2 milliseconds.

It’s a shame to have to discard this new library, but I can’t see a way of producing on the CPU the same sequence of values that I can produce incrementally on the GPU. For the moment I will continue to pick my way through the “Xorshift RNGs” paper in the hope of replicating the algorithm used on the GPU.

Can anyone shed any light on this?

Hi danuk

The easiest way to get the same results on the GPU and CPU is to use the host interface. You can use curandCreateGenerator() for GPU generation, and curandCreateGeneratorHost() for CPU generation. Both of these generators will produce the same sequence of random numbers. The only difference in the interface is that the output pointer you pass to curandGenerateUniform() will be a pointer to host memory allocated with malloc() for the generator created with curandCreateGeneratorHost(). For a generator created with curandCreateGenerator() the pointer will be to device memory allocated with cudaMalloc().
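For example, here is a minimal sketch of that parity check. The seed of 1234 and count of 8 are arbitrary choices, and error checking is omitted for brevity:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <curand.h>

int main()
{
    const size_t n = 8;

    // Host generator writing to malloc'd memory.
    curandGenerator_t hostGen;
    curandCreateGeneratorHost(&hostGen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(hostGen, 1234ULL);
    float* hostOut = (float*)malloc(n * sizeof(float));
    curandGenerateUniform(hostGen, hostOut, n);

    // Device generator writing to cudaMalloc'd memory.
    curandGenerator_t devGen;
    curandCreateGenerator(&devGen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(devGen, 1234ULL);
    float* devOut;
    cudaMalloc(&devOut, n * sizeof(float));
    curandGenerateUniform(devGen, devOut, n);

    // Copy the device results back and compare; the two columns should match.
    float check[8];
    cudaMemcpy(check, devOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < n; i++)
        printf("%zu: host=%f device=%f\n", i, hostOut[i], check[i]);

    curandDestroyGenerator(hostGen);
    curandDestroyGenerator(devGen);
    cudaFree(devOut);
    free(hostOut);
    return 0;
}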

The device API gives you functions like curand_init() and curand_uniform() that can be called from each device thread. This gives you more control over what’s going on, and you can perhaps use the generated random numbers without ever writing them to memory. But it’s more complicated: if you want to get the same results on the CPU, you have to deal with issues like different thread counts. You might have 2^14 threads on the GPU but just 1 thread on the CPU. There is an “unsupported” way to get CPU versions of curand_init() and curand_uniform(); let me know if you want to do that.
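To make that concrete, here is a rough sketch of a device-API kernel. The kernel monte_carlo_step is a made-up example, not part of CURAND, and the caller must size the results array to the total thread count:

#include <curand_kernel.h>

__global__ void monte_carlo_step(float* results, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // One state per thread; the subsequence argument keeps the
    // threads' streams independent.
    curandState state;
    curand_init(seed, tid, 0, &state);

    // Consume the numbers inline instead of storing them all.
    float acc = 0.0f;
    for (int i = 0; i < 100; i++)
        acc += curand_uniform(&state);

    results[tid] = acc / 100.0f;
}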

I just read the earlier thread about RNGs; it seems like people might be interested in this. To get CPU versions of the device functions like curand_init() and curand_uniform(), do the following:

#define QUALIFIERS __host__ __device__
#include <curand_kernel.h>
#undef QUALIFIERS
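For example, here is a minimal host-only sketch using that trick. Since this path is unsupported, treat it as a sketch; it must still be compiled with nvcc, and the seed/subsequence/offset values are arbitrary:

#define QUALIFIERS __host__ __device__
#include <curand_kernel.h>
#undef QUALIFIERS

#include <cstdio>

int main()
{
    // Same seed / subsequence / offset a GPU thread would use.
    curandState_t state;
    curand_init(1234ULL, 0, 0, &state);

    // curand_uniform() now runs on the CPU and yields the same stream
    // a device thread initialized with these parameters would see.
    for (int i = 0; i < 4; i++)
        printf("%f\n", curand_uniform(&state));
    return 0;
}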

But I still recommend using the host API with curandCreateGenerator() and curandCreateGeneratorHost() if you need the same results on the CPU and GPU; it’s simpler.

Hi Nathan,

We have written a Monte Carlo simulation to compare execution speeds on a CPU and a GPU. The CPU version is written in C#, using the Task Parallel Library to take advantage of multiple cores. The GPU version has a kernel that uses curand_init and curand to generate random numbers. For the CPU version we wrote a managed wrapper class around curandCreateGeneratorHost, curandSetPseudoRandomGeneratorSeed, curandGenerateSeeds, and curandGenerateUniform.

To make the two versions comparable, we need their behavior to be as close to identical as possible. We are worried that using the CURAND device API in the kernel and the CURAND host API on the CPU would actually make a difference. If it does, is there a way out apart from the unsupported way (mentioned above) of using the device API in host code?

One way is to generate the required number of random numbers from the host code using the host API (generating on either the device or the host) and then pass the pointer to the kernel. With this method, though, if the number of random values runs into the millions, memory would become a problem.

Please suggest.

Thanks,
Chandrasekhar

Chandrasekhar -

The simplest method is to do the generating call through the host API, writing either to host memory or to device memory. But as you say, this requires storing the random numbers before processing them, which slows things down. The next easiest way is the unsupported method above, which gives you the same functions on the CPU and GPU for inline generation. It is unsupported only because we didn’t document and test it thoroughly; I don’t expect it to stop working. In future versions of CURAND you may not even need the #define part to get it working.
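Here is a rough sketch of the first (simplest) method, assuming a made-up kernel named consume that post-processes the pre-generated values; error checking is omitted:

#include <cuda_runtime.h>
#include <curand.h>

__global__ void consume(const float* randoms, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = randoms[tid] * 2.0f - 1.0f;  // e.g. map to [-1, 1)
}

int main()
{
    const int n = 1 << 20;  // one million values stay in GPU memory

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 0ULL);

    float *randoms, *out;
    cudaMalloc(&randoms, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Batch-generate up front, then hand the pointer to the kernel.
    curandGenerateUniform(gen, randoms, n);
    consume<<<(n + 255) / 256, 256>>>(randoms, out, n);
    cudaDeviceSynchronize();

    curandDestroyGenerator(gen);
    cudaFree(randoms);
    cudaFree(out);
    return 0;
}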

You get the same numbers from the host library API and the inline device API, but you have to arrange all your threads and states correctly. Using curand_init() and curand() from the device API gives you one stream of random numbers per state. Calling curandGenerate() launches 4096 threads on the device, each of which does curand_init() and then many calls to curand(). If you call the output array from curandGenerate() output, then output[0] is the first random number from thread 0, output[1] is the first from thread 1, …, output[4095] is the first from thread 4095, output[4096] is the second from thread 0, output[4097] is the second from thread 1, and so on. If you launch exactly 4096 threads in your kernel, you can probably arrange things to line up, as in the sketch below. If you use fewer or more threads than 4096, you’ll have to do some reorganizing.
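Putting that together, here is a sketch of reproducing that interleaving on the CPU with the device functions. The subsequence-per-thread call to curand_init() is my assumption about what the library does internally, so verify the output against a real curandGenerate() call:

#define QUALIFIERS __host__ __device__
#include <curand_kernel.h>
#undef QUALIFIERS

#include <cstdio>
#include <vector>

int main()
{
    const int kGenerators = 4096;  // threads curandGenerate() launches
    const int kPerGenerator = 4;   // draws per generator in this demo

    // One state per simulated device thread (assumed seeding scheme).
    std::vector<curandState_t> states(kGenerators);
    for (int t = 0; t < kGenerators; t++)
        curand_init(0ULL, t, 0, &states[t]);

    // Interleave exactly as described above: output[i] comes from
    // generator (i % 4096), draw number (i / 4096).
    std::vector<float> output(kGenerators * kPerGenerator);
    for (int draw = 0; draw < kPerGenerator; draw++)
        for (int t = 0; t < kGenerators; t++)
            output[draw * kGenerators + t] = curand_uniform(&states[t]);

    // First two draws of generator 0 are 4096 slots apart.
    printf("%f %f\n", output[0], output[4096]);
    return 0;
}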