curand host performance

Hi,

Does anyone know whether curand’s host random number generation is competitive with other host implementations (e.g. GSL, boost::random, etc.)?

I ask because I have a device-based application whose performance I am trying to assess relative to a CPU equivalent. At the moment I am testing this by comparing a C implementation that uses CURAND for host random number generation against my CUDA device implementation, and timing the computation. This shows about a 50-100x speed up versus a single CPU core, which looks great - but I am concerned this is not a fair test if CURAND host generation is slow compared to equivalent CPU-based random number generators.

Does anyone have experience with this question? I guess the answer is to implement the comparison using boost::random on the host side, but this is of course extra work :-)
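For anyone curious, the kind of host baseline I have in mind would be something like this (using the standard <random> header as a stand-in for boost::random, which has a near-identical interface; the sample size and seed are illustrative):

#include <cstdio>
#include <chrono>
#include <random>

int main()
{
	const int numDraws = 1000000;               // illustrative sample size
	std::mt19937_64 gen(1);                     // seed 1, as in my CUDA code
	std::normal_distribution<double> dist(0.0, 1.0);

	double sum = 0.0;
	auto t0 = std::chrono::steady_clock::now();
	for (int i = 0; i < numDraws; i++)
		sum += dist(gen);                       // draw standard normals
	auto t1 = std::chrono::steady_clock::now();

	std::chrono::duration<double> secs = t1 - t0;
	std::printf("%d draws in %f s (checksum %f)\n", numDraws, secs.count(), sum);
	return 0;
}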

CURAND “host” generation still uses the GPU.

Are you sure? I’m compiling my curand calls with the VC++ compiler and don’t make any device calls, e.g.

// set up rnd num generator
curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
int seed = 1;
for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }

// ...
// draw rnd number
// index = iteration of simulation (single thread on CPU)
curandStateMRG32k3a localState = devStates[index];
z = curand_normal_double(&localState);
devStates[index] = localState;

I don’t think this utilises the GPU.

Functions like curand_init and curand_normal_double are part of the curand Device API:

http://docs.nvidia.com/cuda/curand/device-api-overview.html#device-api-overview

When I said “host” generation I was referring to usage of the curand Host API:

http://docs.nvidia.com/cuda/curand/host-api-overview.html#host-api-overview
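A minimal sketch of Host API usage looks something like this (error checking omitted; compile with something like nvcc -o t62 t62.cu -lcurand). With curandCreateGenerator the actual generation runs on the GPU and fills device memory - that is what I meant by “host” generation still using the GPU. There is also curandCreateGeneratorHost, which generates on the CPU into host memory.

#include <cuda_runtime.h>
#include <curand.h>
#include <cstdio>

int main()
{
	const size_t n = 4096;                       // n should be even for normal generation
	double* d_out;
	cudaMalloc(&d_out, n * sizeof(double));

	curandGenerator_t gen;
	curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A); // GPU-backed generator
	curandSetPseudoRandomGeneratorSeed(gen, 1ULL);
	curandGenerateNormalDouble(gen, d_out, n, 0.0, 1.0);     // generation runs on the GPU

	double first;
	cudaMemcpy(&first, d_out, sizeof(double), cudaMemcpyDeviceToHost);
	std::printf("first draw: %f\n", first);

	curandDestroyGenerator(gen);
	cudaFree(d_out);
	return 0;
}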

As stated in the first sentence of the Device API section:

"To use the device API, include the file curand_kernel.h in files that define kernels that use cuRAND device functions. "

AFAIK there is no support for calling device API functions from host code. If you have a functional code that demonstrates such usage, I’d certainly be interested in seeing it. Trivial attempts to build a host code around what you have shown produce expected errors for me:

$ cat t61.cu
#include <curand_kernel.h>
#include <curand.h>
#define GRID_SIZE 1
#define BLOCK_SIZE 1
#define numSimulations (GRID_SIZE*BLOCK_SIZE)
int main(){

        curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
        int seed = 1;
        for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }

        int index = 0;
        double z;
                        curandStateMRG32k3a localState = devStates[index];
                        z = curand_normal_double(&localState);
                        devStates[index] = localState;

}
$ nvcc -arch=sm_61 -o t61 t61.cu
t61.cu(13): warning: variable "z" was set but never used

t61.cu(10): error: calling a __device__ function("curand_init") from a __host__ function("main") is not allowed

t61.cu(15): error: calling a __device__ function("curand_normal_double") from a __host__ function("main") is not allowed

t61.cu(13): warning: variable "z" was set but never used

2 errors detected in the compilation of "/tmp/tmpxft_00007811_00000000-5_t61.cpp4.ii".
$

If you build it as a C++ file it works perfectly. E.g. set up your project in CUDA so you get the right include directories, then right-click your .cu file and under General set the Item Type to “C/C++ compiler” (i.e. don’t use nvcc, use cl).
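For example, from a VS developer command prompt, something along these lines should work (illustrative - the exact include path depends on where CUDA is installed, and the file name is just a placeholder):

cl /EHsc /I"%CUDA_PATH%\include" curand_host_test.cpp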

The code below will compile under Visual Studio 2013:

#include "cuda_runtime.h"
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>
#include <stdio.h>
#include <vector>

int main()
{
	int minNumSims = 5000;
	const unsigned BLOCK_SIZE = 256;
	const unsigned GRID_SIZE = (const unsigned)ceil(float(minNumSims) / float(BLOCK_SIZE));
	int numSimulations = BLOCK_SIZE * GRID_SIZE;

	// set up rnd num generator
	curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
	int seed = 1;
	for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }

	// draw rnd number
	std::vector<double>* rndNumbers = new std::vector<double>(numSimulations);

	for (int i = 0; i < numSimulations; i++)
	{
		curandStateMRG32k3a localState = devStates[i];
		(*rndNumbers)[i] = curand_normal_double(&localState);
		devStates[i] = localState;
	}

	delete rndNumbers;
	free(devStates)
    return 0;
}

This is how I verify that the curand-based kernel is producing the right results - by comparing them to the host-based results.
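For reference, the device-side pattern being compared against is the standard Device API usage from the docs linked above - roughly like this (kernel names are illustrative):

#include <curand_kernel.h>

__global__ void setupStates(curandStateMRG32k3a* states, int seed, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) curand_init(seed, i, 0, &states[i]);
}

__global__ void drawNormals(curandStateMRG32k3a* states, double* out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
	{
		curandStateMRG32k3a localState = states[i];  // copy state to registers
		out[i] = curand_normal_double(&localState);  // same call as the host version
		states[i] = localState;                      // write state back
	}
}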

I agree it seems to work, and also does not seem to use the GPU. I’m unfamiliar with this, so can’t offer any further advice.

As you noted: