Hi,
Does anyone know whether curand’s host random number generation is competitive with other host implementations (e.g. GSL, boost::random, etc.)?
I ask because I have a device-based application whose performance I am trying to assess relative to a CPU equivalent. At the moment I am timing a C implementation that uses CURAND’s host random number generation against my CUDA device implementation. This shows about a 50-100x speedup versus a single CPU core, which looks great, but I am concerned this is not a fair test if CURAND host generation is slow compared with equivalent CPU-based random number generators.
Does anyone have experience with this question? I guess the answer is to implement the comparison using boost::random on the host side, but this is of course extra work :-)
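For reference, a minimal sketch of what such a CPU baseline could look like. It uses the C++11 standard library's <random> rather than boost::random (the two have very similar interfaces), and the choice of std::mt19937_64, the helper name time_cpu_normals and the BaselineResult struct are all assumptions for illustration, not anything taken from CURAND:

```cpp
#include <chrono>
#include <cstddef>
#include <random>

// Result of one timing run: elapsed wall-clock seconds and the sample mean.
// The mean is returned so the compiler cannot optimise the draws away.
struct BaselineResult {
    double seconds;
    double mean;
};

// Hypothetical helper (not part of CURAND or the code in this thread):
// draws n standard normal doubles on a single CPU thread and times the loop.
BaselineResult time_cpu_normals(std::size_t n, unsigned long long seed)
{
    std::mt19937_64 engine(seed);                       // assumed engine choice
    std::normal_distribution<double> normal(0.0, 1.0);  // mean 0, stddev 1

    double sum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        sum += normal(engine);
    }
    const auto stop = std::chrono::steady_clock::now();

    BaselineResult result;
    result.seconds = std::chrono::duration<double>(stop - start).count();
    result.mean = (n > 0) ? sum / static_cast<double>(n) : 0.0;
    return result;
}

// Example call from your own main():
//   BaselineResult r = time_cpu_normals(10000000, 1ULL);
//   printf("%f s, mean %f\n", r.seconds, r.mean);
```

Timing the same number of draws through the CURAND-based host loop would then give a like-for-like figure, bearing in mind that a different generator (Mersenne Twister here versus MRG32k3a) will have a different cost per draw.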
CURAND “host” generation still uses the GPU.
Are you sure? I’m calling curand through the VC++ compiler and don’t use any device calls, e.g.
// set up rnd num generator
curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
int seed = 1;
for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }
//...
// draw rnd number
// index = iteration of simulation (single thread on CPU)
curandStateMRG32k3a localState = devStates[index];
z = curand_normal_double(&localState);
devStates[index] = localState;
I don’t think this utilises the GPU.
Functions like curand_init and curand_normal_double are part of the curand Device API:
http://docs.nvidia.com/cuda/curand/device-api-overview.html#device-api-overview
When I said “host” generation I was referring to usage of the curand Host API:
http://docs.nvidia.com/cuda/curand/host-api-overview.html#host-api-overview
As stated in the first sentence of the Device API section:
"To use the device API, include the file curand_kernel.h in files that define kernels that use cuRAND device functions. "
AFAIK there is no support for calling device API functions from host code. If you have working code that demonstrates such usage, I’d certainly be interested in seeing it. Trivial attempts to build host code around what you have shown produce the expected errors for me:
$ cat t61.cu
#include <curand_kernel.h>
#include <curand.h>
#define GRID_SIZE 1
#define BLOCK_SIZE 1
#define numSimulations (GRID_SIZE*BLOCK_SIZE)
int main(){
curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
int seed = 1;
for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }
int index = 0;
double z;
curandStateMRG32k3a localState = devStates[index];
z = curand_normal_double(&localState);
devStates[index] = localState;
}
$ nvcc -arch=sm_61 -o t61 t61.cu
t61.cu(13): warning: variable "z" was set but never used
t61.cu(10): error: calling a __device__ function("curand_init") from a __host__ function("main") is not allowed
t61.cu(15): error: calling a __device__ function("curand_normal_double") from a __host__ function("main") is not allowed
t61.cu(13): warning: variable "z" was set but never used
2 errors detected in the compilation of "/tmp/tmpxft_00007811_00000000-5_t61.cpp4.ii".
$
If you build it as a C++ file it works perfectly. E.g. set up your project as a CUDA project so you get the right include directories, then right-click your .cu file and, under General, set Item Type to C/C++ compiler (i.e. don’t use nvcc, use cl).
The code below will compile under Visual Studio 2013:
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>
#include <math.h>   // ceil
#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <vector>
int main()
{
int minNumSims = 5000;
const unsigned BLOCK_SIZE = 256;
const unsigned GRID_SIZE = (const unsigned)ceil(float(minNumSims) / float(BLOCK_SIZE));
int numSimulations = BLOCK_SIZE * GRID_SIZE;
// set up rnd num generator
curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
int seed = 1;
for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }
// draw rnd number
std::vector<double>* rndNumbers = new std::vector<double>(numSimulations);
for (int i = 0; i < numSimulations; i++)
{
curandStateMRG32k3a localState = devStates[i];
(*rndNumbers)[i] = curand_normal_double(&localState);
devStates[i] = localState;
}
delete rndNumbers;
free(devStates);
return 0;
}
This is how I verify that the curand-based kernel produces the right results: I compare its output against these host-based results.
I agree it seems to work, and also does not seem to use the GPU. I’m unfamiliar with this, so can’t offer any further advice.