curand host performance

Hi,

Does anyone know whether curand’s host random number generation is competitive with other host implementations (e.g. GSL, boost::random, etc.)?

I ask because I have a device-based application whose performance I am trying to assess relative to a CPU equivalent. At the moment I am testing this by comparing a C implementation that uses CURAND for host random number generation against my CUDA device implementation, and timing the computation. This shows about a 50-100x speed up versus a single CPU core, which looks great - but I am concerned this is not a fair test if CURAND host generation is slow compared to equivalent CPU-based random number generators.

Does anyone have experience with this question? I guess the answer is to implement the comparison using boost::random on the host side, but this is of course extra work :-)
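For anyone curious, the kind of host baseline I have in mind would be something like this (using the standard <random> header as a stand-in for boost::random, which has a near-identical interface; the sample size and seed are illustrative):

#include <cstdio>
#include <chrono>
#include <random>

int main()
{
	const int numDraws = 1000000;               // illustrative sample size
	std::mt19937_64 gen(1);                     // seed 1, as in my CUDA code
	std::normal_distribution<double> dist(0.0, 1.0);

	double sum = 0.0;
	auto t0 = std::chrono::steady_clock::now();
	for (int i = 0; i < numDraws; i++)
		sum += dist(gen);                       // draw standard normals
	auto t1 = std::chrono::steady_clock::now();

	std::chrono::duration<double> secs = t1 - t0;
	std::printf("%d draws in %f s (checksum %f)\n", numDraws, secs.count(), sum);
	return 0;
}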

CURAND “host” generation still uses the GPU.

Are you sure? I’m compiling my curand calls with the VC++ compiler and don’t make any device calls, e.g.

// set up rnd num generator
curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
int seed = 1;
for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }

// ...
// draw rnd number
// index = iteration of simulation (single thread on CPU)
curandStateMRG32k3a localState = devStates[index];
z = curand_normal_double(&localState);
devStates[index] = localState;

I don’t think this utilises the GPU.

Functions like curand_init and curand_normal_double are part of the curand Device API:

http://docs.nvidia.com/cuda/curand/device-api-overview.html#device-api-overview

When I said “host” generation I was referring to usage of the curand Host API:

http://docs.nvidia.com/cuda/curand/host-api-overview.html#host-api-overview
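A minimal sketch of Host API usage looks something like this (error checking omitted; compile with something like nvcc -o t62 t62.cu -lcurand). With curandCreateGenerator the actual generation runs on the GPU and fills device memory - that is what I meant by “host” generation still using the GPU. There is also curandCreateGeneratorHost, which generates on the CPU into host memory.

#include <cuda_runtime.h>
#include <curand.h>
#include <cstdio>

int main()
{
	const size_t n = 4096;                       // n should be even for normal generation
	double* d_out;
	cudaMalloc(&d_out, n * sizeof(double));

	curandGenerator_t gen;
	curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A); // GPU-backed generator
	curandSetPseudoRandomGeneratorSeed(gen, 1ULL);
	curandGenerateNormalDouble(gen, d_out, n, 0.0, 1.0);     // generation runs on the GPU

	double first;
	cudaMemcpy(&first, d_out, sizeof(double), cudaMemcpyDeviceToHost);
	std::printf("first draw: %f\n", first);

	curandDestroyGenerator(gen);
	cudaFree(d_out);
	return 0;
}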

As stated in the first sentence of the Device API section:

"To use the device API, include the file curand_kernel.h in files that define kernels that use cuRAND device functions. "

AFAIK there is no support for calling device API functions from host code. If you have a functional code that demonstrates such usage, I’d certainly be interested in seeing it. Trivial attempts to build a host code around what you have shown produce expected errors for me:

$ cat t61.cu
#include <curand_kernel.h>
#include <curand.h>
#define GRID_SIZE 1
#define BLOCK_SIZE 1
#define numSimulations (GRID_SIZE*BLOCK_SIZE)
int main(){

        curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
        int seed = 1;
        for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }

        int index = 0;
        double z;
                        curandStateMRG32k3a localState = devStates[index];
                        z = curand_normal_double(&localState);
                        devStates[index] = localState;

}
$ nvcc -arch=sm_61 -o t61 t61.cu
t61.cu(13): warning: variable "z" was set but never used

t61.cu(10): error: calling a __device__ function("curand_init") from a __host__ function("main") is not allowed

t61.cu(15): error: calling a __device__ function("curand_normal_double") from a __host__ function("main") is not allowed

t61.cu(13): warning: variable "z" was set but never used

2 errors detected in the compilation of "/tmp/tmpxft_00007811_00000000-5_t61.cpp4.ii".
$

If you build it as a C++ file it works perfectly. E.g. set up your project in CUDA so you get the right include directories, then right-click your .cu file and under General set the Item Type to “C/C++ compiler” (i.e. don’t use nvcc, use cl).
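For example, from a VS developer command prompt, something along these lines should work (illustrative - the exact include path depends on where CUDA is installed, and the file name is just a placeholder):

cl /EHsc /I"%CUDA_PATH%\include" curand_host_test.cpp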

The code below will compile under Visual Studio 2013:

#include "cuda_runtime.h"
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>
#include <stdio.h>
#include <vector>

int main()
{
	int minNumSims = 5000;
	const unsigned BLOCK_SIZE = 256;
	const unsigned GRID_SIZE = (const unsigned)ceil(float(minNumSims) / float(BLOCK_SIZE));
	int numSimulations = BLOCK_SIZE * GRID_SIZE;

	// set up rnd num generator
	curandStateMRG32k3a* devStates = (curandStateMRG32k3a*)malloc(BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
	int seed = 1;
	for (int i = 0; i < numSimulations; i++) { curand_init(seed, i, 0, &(devStates[i])); }

	// draw rnd number
	std::vector<double>* rndNumbers = new std::vector<double>(numSimulations);

	for (int i = 0; i < numSimulations; i++)
	{
		curandStateMRG32k3a localState = devStates[i];
		(*rndNumbers)[i] = curand_normal_double(&localState);
		devStates[i] = localState;
	}

	delete rndNumbers;
	free(devStates)
    return 0;
}

This is how I verify that the curand-based kernel is producing the right results - by comparing them to the host-based results.
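For reference, the device-side pattern being compared against is the standard Device API usage from the docs linked above - roughly like this (kernel names are illustrative):

#include <curand_kernel.h>

__global__ void setupStates(curandStateMRG32k3a* states, int seed, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n) curand_init(seed, i, 0, &states[i]);
}

__global__ void drawNormals(curandStateMRG32k3a* states, double* out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
	{
		curandStateMRG32k3a localState = states[i];  // copy state to registers
		out[i] = curand_normal_double(&localState);  // same call as the host version
		states[i] = localState;                      // write state back
	}
}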

I agree it seems to work, and also does not seem to use the GPU. I’m unfamiliar with this, so can’t offer any further advice.

As you noted: