curand_init sequence number problem

danielfedailarsen · December 27, 2017, 5:42pm

Hi!

I’ve got a problem when trying to initialize a number of curandStates in a kernel. I have the following code:

#include "cuda_runtime.h"
#include "curand_kernel.h"
#include <stdio.h> //Needed for printf in the kernel
#include <time.h>

//Defined elsewhere, but put here for readability
#define DISPLAY_WIDTH 1920
#define DISPLAY_HEIGHT 1080

__global__ void CurandSetup(curandState* states, const unsigned long seed, const int width, const int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x; //Range(0,2073599)
	
	curand_init(seed, idx, 0, &states[idx]); //Problem seems to be here
	if (x == 1919 && y == 1079)
	{
		printf("Yey, it works\n"); //If this prints, I think it works
	}
}

curandState* CurandInit()
{
	curandState* states;
	cudaMallocManaged(&states, DISPLAY_WIDTH * DISPLAY_HEIGHT * sizeof(curandState));
	
	const int block_x = 32;
	const int block_y = 32;
	const dim3 block(block_x, block_y);
	const int grid_x = (DISPLAY_WIDTH + block_x - 1) / block_x; //=60
	const int grid_y = (DISPLAY_HEIGHT + block_y - 1) / block_y; //=34
	const dim3 grid(grid_x, grid_y);
	cudaError_t err = cudaGetLastError();
	CurandSetup <<<grid, block>>> (states, time(NULL), DISPLAY_WIDTH, DISPLAY_HEIGHT);
	cudaDeviceSynchronize();
	err = cudaGetLastError(); //Success when run in "Release" in Visual Studio

	return states;
}

I have an OpenGL window, and I want a dedicated curandState for each pixel, since I spawn one thread per pixel in my main kernel. After reading the docs and looking at several samples, the above is what I came up with. However, for some reason it fails: nothing is printed after the initialization for x=1919 and y=1079, and Visual Studio’s debugger labels the memory as “Unable to read memory”. I’ve spent hours trying to find the issue, and it seems to be the idx passed to curand_init that causes the problem. If I pass 0 it works fine, but if I pass the highest value that can occur in my setting (2073599 = 1920*1080 - 1), it fails. I can’t find any explanation for why this happens…

All insight is greatly appreciated, thanks :)

Robert_Crovella · December 27, 2017, 9:11pm

When I run your code, it prints out “Yey, it works”.

However, your kernel takes 22 seconds to run on my GTX 960.

If you haven’t disabled the Windows WDDM TDR timeout, you will certainly be hitting that.
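A timeout of this kind is easier to spot when the launch and the synchronization are checked separately. The following is a minimal sketch of that pattern; `CheckKernel`, `DummyKernel` and `RunDummy` are hypothetical names used only for illustration, and the sketch makes no claim about which specific error code a TDR reset reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a long-running kernel such as CurandSetup.
__global__ void DummyKernel(int* out)
{
	if (blockIdx.x == 0 && threadIdx.x == 0) { *out = 42; }
}

// Hypothetical helper: call right after a <<<...>>> launch.
// Reports launch-time and run-time errors separately.
cudaError_t CheckKernel(const char* name)
{
	// Errors from the launch itself (bad configuration, too many resources).
	cudaError_t err = cudaGetLastError();
	if (err != cudaSuccess)
	{
		fprintf(stderr, "%s: launch failed: %s\n", name, cudaGetErrorString(err));
		return err;
	}
	// Errors raised while the kernel was running.
	err = cudaDeviceSynchronize();
	if (err != cudaSuccess)
	{
		fprintf(stderr, "%s: execution failed: %s\n", name, cudaGetErrorString(err));
	}
	return err;
}

// Runs the stand-in kernel and copies its single result back to the host.
cudaError_t RunDummy(int* result)
{
	int* d_out = nullptr;
	cudaError_t err = cudaMalloc(&d_out, sizeof(int));
	if (err != cudaSuccess) { return err; }
	DummyKernel<<<1, 32>>>(d_out);
	err = CheckKernel("DummyKernel");
	if (err == cudaSuccess)
	{
		err = cudaMemcpy(result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
	}
	cudaFree(d_out);
	return err;
}
```

Checking the return value of cudaDeviceSynchronize directly, rather than only calling cudaGetLastError afterwards, keeps the failing step visible in the log.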

danielfedailarsen · December 27, 2017, 10:54pm

Indeed you are correct. Never thought this kernel would take that long; on my 850M it took 60 seconds. Thank you :)

By taking a look at the code for curand_init, it is clear that due to the way the sequence number is used in for-loops in the subcall to _skipahead_sequence_scratch, the larger the sequence number, the longer it is going to take.

I guess the next question is whether this is the “best” way of doing this. The reason I want a curandState for each pixel is that otherwise there will be trouble when multiple curand_uniform calls are made on the same curandState simultaneously. There are of course ways to solve this, but they would likely cost performance at generation time, which is undesirable for my application (path tracing). Any thoughts?
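For illustration, a minimal sketch of how a per-pixel state is typically consumed, so that no two threads ever touch the same curandState. `AccumulateSamples`, `sums` and `samples` are hypothetical names, and `states` is assumed to have been initialized beforehand by a setup kernel such as CurandSetup.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical consumer kernel: one thread per pixel, one state per thread.
__global__ void AccumulateSamples(curandState* states, float* sums,
                                  const int width, const int height, const int samples)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x;

	// Work on a local copy so global memory is touched only twice per thread.
	curandState local = states[idx];
	float sum = 0.0f;
	for (int i = 0; i < samples; ++i)
	{
		sum += curand_uniform(&local); // uniform float in (0, 1]
	}
	sums[idx] = sum;

	// Store the advanced state so the next launch continues the sequence.
	states[idx] = local;
}
```

Because each thread reads and writes only `states[idx]`, no synchronization between threads is needed.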

Robert_Crovella · December 27, 2017, 11:20pm

Correct, you wouldn’t want to do that.

Is there some reason you don’t just use a different seed for each state, instead?

When I do this:

curand_init(seed+idx, 0, 0, &states[idx]);

the kernel runs in 1.7ms.

Using large values of either subsequence or offset will result in increased computation times for curand_init. There might be reasons why you would want to pick a particular subsequence or offset, but I can’t discern them from what you’ve provided so far. Using a different seed for each thread should give a different sequence for each thread.

http://docs.nvidia.com/cuda/curand/device-api-overview.html#pseudorandom-sequences

“Different seeds are guaranteed to produce different starting states and different sequences.”
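A minimal sketch of the seed-per-thread variant with a simple event-based timer follows. `CurandSetupSeeded` and `TimeSeededSetup` are hypothetical names, and the 32x8 block shape is an arbitrary choice for the sketch. Whether distinct seeds are statistically adequate depends on the application.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Same layout as CurandSetup, but every thread gets its own seed and
// subsequence 0, so curand_init has no skip-ahead work to do.
__global__ void CurandSetupSeeded(curandState* states, const unsigned long long seed,
                                  const int width, const int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x;
	curand_init(seed + idx, 0, 0, &states[idx]);
}

// Hypothetical helper: launches the kernel and returns its run time in
// milliseconds, or a negative value if the launch or execution failed.
float TimeSeededSetup(curandState* states, unsigned long long seed, int width, int height)
{
	const dim3 block(32, 8);
	const dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	CurandSetupSeeded<<<grid, block>>>(states, seed, width, height);
	cudaEventRecord(stop);

	cudaError_t launchErr = cudaGetLastError();      // launch-time errors
	cudaError_t syncErr = cudaEventSynchronize(stop); // run-time errors

	float ms = -1.0f;
	if (launchErr == cudaSuccess && syncErr == cudaSuccess)
	{
		cudaEventElapsedTime(&ms, start, stop);
	}
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return ms;
}
```

Calling the helper once with a cudaMalloc'ed array of width * height states gives a number that can be compared against the subsequence-based setup on the same device.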

Also, I think you know this already, but CUDA code generally runs faster in “Release” projects than in the corresponding “Debug” projects in VS. The debug projects add the -G switch, which generally results in slower device code.

njuffa · December 27, 2017, 11:45pm

Different seeds imply different sequences but not uncorrelated streams of pseudo-random numbers. Where that is needed, proper offsetting or leapfrogging needs to be applied.

The Philox generator typically offers a favorable balance between the cost of properly offsetting many independent uncorrelated PRNG streams and the cost of PRNG generation itself that makes it suitable for many applications.

danielfedailarsen · December 28, 2017, 12:12am

I see; since most examples I saw used a thread’s global id for the sequence number, I just assumed that was the best way to go. For my use of generating new paths on the hemisphere of a point on a mesh and its normal, I think just using a different seed will work fine. Thank you both :)

On my 850M, making the switch you mention got me down to 29ms, which is practically irrelevant as the kernel is only called once at start-up. And yes, I’ve been running in Release mode. When running in Debug, I get cudaErrorLaunchOutOfResources from the error check after the kernel call, as seen in the above code. From what I’ve read, this is likely because the registers the kernel uses per thread, multiplied by the number of threads spawned, exceed the number of registers available in Debug mode, correct?

The output from compiling with -Xptxas="-v":

Used 63 registers, 6704 bytes cumulative stack size, 340 bytes cmem[0]

Robert_Crovella · December 28, 2017, 12:25am

Yes, I would normally think the out of resources error is due to a registers-per-thread issue. However, the 850M is a compute capability 5.0 device (I believe) and it should have 64K registers per SM, so 63 registers per thread should not be an issue. It may be some other issue (stack/memory, etc.), unless that is the compiler output from release mode instead of debug mode.
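One way to see what a given build actually needs is to ask the runtime. With 64K registers available to a block, a 1024-thread block (32x32) leaves at most 64 registers per thread, so a build that needs more than the 63 reported above could fail to launch. A minimal sketch, where `SetupKernel` and `BlockFits` are hypothetical names:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical setup kernel whose resource usage is being inspected.
__global__ void SetupKernel(curandState* states, unsigned long long seed, int n)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n) { curand_init(seed + idx, 0, 0, &states[idx]); }
}

// Hypothetical helper: returns true if a block of `threads` threads is within
// the limit the runtime reports for this kernel in the current build.
bool BlockFits(int threads)
{
	cudaFuncAttributes attr;
	if (cudaFuncGetAttributes(&attr, SetupKernel) != cudaSuccess) { return false; }
	printf("registers/thread: %d, local bytes/thread: %zu, max threads/block: %d\n",
	       attr.numRegs, attr.localSizeBytes, attr.maxThreadsPerBlock);
	return threads <= attr.maxThreadsPerBlock;
}
```

Running `BlockFits(32 * 32)` in both Debug and Release builds shows whether the reported limit differs between the two configurations.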

Based on the suggestion by njuffa, I took your original code and the only modification I made was to replace each instance of curandState with curandStatePhilox4_32_10_t, and the kernel execution time dropped from ~23s to ~4ms. It seems that should give you uncorrelated sequences with relatively short setup time.
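The substitution described above might look like the following sketch. `CurandSetupPhilox` and `DrawOne` are hypothetical names; the curand_init call keeps one seed for all threads and uses the pixel index as the subsequence, exactly as in the original kernel.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Setup kernel with the Philox state type in place of curandState.
__global__ void CurandSetupPhilox(curandStatePhilox4_32_10_t* states,
                                  const unsigned long long seed,
                                  const int width, const int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x;
	// One seed for all threads; the pixel index selects the subsequence.
	curand_init(seed, idx, 0, &states[idx]);
}

// Hypothetical consumer: draws one uniform float per pixel.
__global__ void DrawOne(curandStatePhilox4_32_10_t* states, float* out,
                        const int width, const int height)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= width || y >= height) { return; }
	int idx = y * width + x;
	curandStatePhilox4_32_10_t local = states[idx];
	out[idx] = curand_uniform(&local); // uniform float in (0, 1]
	states[idx] = local;               // keep the advanced state
}
```

The device-side generation calls such as curand_uniform are overloaded for the Philox state type, so the consuming kernel only needs its state type changed.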

I didn’t benchmark actual generation times. I can’t comment on other differences between Philox and XORWOW, but there is the documentation.

njuffa · December 28, 2017, 12:40am

The trade-offs between XORWOW and Philox are as follows:

XORWOW provides the fastest generation of any generator provided by CURAND. It also provides the lowest-quality pseudo-random numbers (as measured by standard test suites of randomness) of the generators offered, but these are still sufficient for most applications. It features expensive skip-ahead (offsetting) and leap-frogging.

Philox generates pseudo-random numbers more slowly than XORWOW, but is not the slowest generator offered by CURAND. The quality of its output is in the middle of the generators offered by CURAND and should be suitable for all applications except possibly the most demanding ones (for which Mersenne Twister is indicated). Offsetting and leap-frogging are very cheap, cheaper than with any other generator offered by CURAND.

In situations where the execution time cost is dominated by offsetting / leap-frogging of the generators (in order to provide uncorrelated PRNG streams) such as this one, Philox is the obvious generator to try. I don’t offhand recall the trade-offs in terms of state size between XORWOW and Philox but I believe the difference is minor.
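The state sizes can be checked directly on the installed toolkit. A minimal host-side sketch, with `StateBytes` and `PrintStateSizes` as hypothetical helper names; no particular sizes are assumed here.

```cuda
#include <cstddef>
#include <cstdio>
#include <curand_kernel.h>

// Bytes needed to hold one generator state of type State per pixel.
template <typename State>
size_t StateBytes(size_t width, size_t height)
{
	return width * height * sizeof(State);
}

// Prints the per-state size and the total for a 1920x1080 display.
void PrintStateSizes()
{
	printf("XORWOW: %zu bytes per state, %zu bytes for 1920x1080\n",
	       sizeof(curandState), StateBytes<curandState>(1920, 1080));
	printf("Philox: %zu bytes per state, %zu bytes for 1920x1080\n",
	       sizeof(curandStatePhilox4_32_10_t),
	       StateBytes<curandStatePhilox4_32_10_t>(1920, 1080));
}
```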

danielfedailarsen · December 28, 2017, 9:10am

Thanks for all the info, very much appreciated :)
