Why do I need setup_kernel for curand states?

Hello All,

Why do I need a separate kernel just to set up the curand state before generating values with curand_uniform? Merging the two kernels into one seems like it would be more efficient. Why is the setup stage needed, then?

__global__ void setup_kernel(curandState *state, unsigned long seed)
{
    int id = threadIdx.x;
    curand_init(seed, id, 0, &state[id]);  // one state per thread
}

__global__ void generate(curandState *globalState)
{
    int ind = threadIdx.x;
    curandState localState = globalState[ind];   // copy state to a local
    float RANDOM = curand_uniform(&localState);  // draw a number (unused here)
    globalState[ind] = localState;               // write the state back
}
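
For reference, here is roughly how I launch them (a sketch; nThreads and the seed are just examples):

const int nThreads = 16;
curandState *d_states;
cudaMalloc(&d_states, nThreads * sizeof(curandState));
setup_kernel<<<1, nThreads>>>(d_states, 1234);  // expensive init, done once
generate<<<1, nThreads>>>(d_states);            // can be launched repeatedly;
generate<<<1, nThreads>>>(d_states);            // the states persist across calls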

Thanks,

You can combine them. Try it.

Hi txbob,

I can combine them.
However, after combining them into a single kernel, printf()-ing the random value inside the kernel doesn't work. If I keep them as separate kernels, the printf() works. Very strange behavior.

__global__ void generate(unsigned long seed, float* data) 
{
    int id = threadIdx.x;
    curandState state;
    curand_init ( seed, id, 0, &state);
    float ranv = curand_uniform( &state );
    printf("tid %d : %f\n", id,ranv );
    data[id] = ranv; 
}

Seems to work for me:

$ cat t1372.cu
#include <curand_kernel.h>
#include <stdio.h>

__global__ void generate(unsigned long seed, float* data)
{
    int id = threadIdx.x;
    curandState state;
    curand_init ( seed, id, 0, &state);
    float ranv = curand_uniform( &state );
    printf("tid %d : %f\n", id,ranv );
    data[id] = ranv;
}


int main(){
    const int n = 16;
    float *d_data;
    cudaMalloc(&d_data, n*sizeof(float));
    generate<<<1,16>>>(1234, d_data);
    cudaDeviceSynchronize();
}

$ nvcc -o t1372 t1372.cu -lcurand
$ ./t1372
tid 0 : 0.145468
tid 1 : 0.820181
tid 2 : 0.550399
tid 3 : 0.294830
tid 4 : 0.914733
tid 5 : 0.868979
tid 6 : 0.321921
tid 7 : 0.782857
tid 8 : 0.011302
tid 9 : 0.285450
tid 10 : 0.781606
tid 11 : 0.233840
tid 12 : 0.679064
tid 13 : 0.282442
tid 14 : 0.629903
tid 15 : 0.121223
$

CUDA 9.2, CentOS7, Tesla V100

Hi txbob,

Thanks for verifying it for me.
There is a small difference between my kernel and yours: the state is declared as “curandStateMRG32k3a” in my case and “curandState” in yours.
If I use “curandStateMRG32k3a”, it doesn't print anything. If I use “curandState”, it works like a charm. I wonder why. Any ideas?

__global__ void kern_rng_using_cuRand(float *data, unsigned int seed, int N)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < N)
    {
        //curandStateMRG32k3a state;  // prev: did not print
        curandState state;            // now: working!
        curand_init(seed, tid, 0, &state);

        float a1 = curand_uniform(&state);
        printf("%d: \t %f\n", tid, a1);
        data[tid] = a1;
    }
}

Thanks!

I just used the exact same kernel code that you posted in your comment #3 above.

I took my code that I posted, and the only change I made was to the state variable:

curandStateMRG32k3a state;

and it still works in a similar fashion.

Interesting!

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

// Utilities and system includes
#include <helper_functions.h>  // helper for shared functions common to CUDA Samples
#include <helper_cuda.h>       // helper for CUDA error handling

#include <cuda_runtime.h>
#include <curand.h>
#include <curand_kernel.h>

const int ga_seed = 87651111;

using namespace std;

__global__ void kern_rng_using_cuRand(float *data, unsigned int seed, int N)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < N)
    {
        curandStateMRG32k3a state;  // prev: did not print
        //curandState state;        // now: working!
        curand_init(seed, tid, 0, &state);

        float a1 = curand_uniform(&state);
        printf("%d: \t %f\n", tid, a1);
        data[tid] = a1;
    }
}

///////////////////////////////////////////////////////////////////////////////
// Main program
///////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float *data;

    int N = 32;
    size_t N_bytes = sizeof(float) * N;

    cudaMallocManaged((void**)&data, N_bytes);

    int blksize = 1024;
    int grdsize = (N + blksize - 1) / blksize;

    //-------------------------------------------------------------------------
    // rng using cuRand
    //-------------------------------------------------------------------------

    cudaEventRecord(start);

    kern_rng_using_cuRand<<<grdsize, blksize>>>(data, ga_seed, N);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float timer_ms = 0.f;
    cudaEventElapsedTime(&timer_ms, start, stop);
    printf("[cuRand] \t %f (ms)\n", timer_ms);

    cudaFree(data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaDeviceReset();
}

Could you help me rerun the test using the code above?

Please add proper CUDA error checking to your code, and make sure there are no reported errors, before asking others for help.

Your code is reporting an error that you are ignoring.
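
For example, a minimal checking pattern looks like this (a sketch; the checkCudaErrors()/getLastCudaError() helpers from the CUDA samples do essentially the same thing):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// minimal CUDA error-checking macro (sketch)
#define CUDA_CHECK(call) do {                                   \
    cudaError_t err_ = (call);                                  \
    if (err_ != cudaSuccess) {                                  \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                cudaGetErrorString(err_), __FILE__, __LINE__);  \
        exit(EXIT_FAILURE);                                     \
    }                                                           \
} while (0)

// after a kernel launch, check both the launch and the execution:
kern_rng_using_cuRand<<<grdsize, blksize>>>(data, ga_seed, N);
CUDA_CHECK(cudaGetLastError());       // catches launch errors (e.g. error 7)
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution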

OK. I admit an error popped up after I added getLastCudaError(“Kernel execution failed”).

It reports too many resources requested for launch.

um_rand.cu(93) : getLastCudaError() CUDA error : Kernel execution failed : (7) too many resources requested for launch.

It is a simple kernel: one block with 1024 threads is launched, and only 32 threads are actively running. When I compiled the program, there were no warnings at all. It is very interesting!

Then I checked the resource usage by adding the “-res-usage” option.

ptxas info : 77712 bytes gmem, 72 bytes cmem[3]
ptxas info : Compiling entry function ‘_Z21kern_rng_using_cuRandPfji’ for ‘sm_70’
ptxas info : Function properties for _Z21kern_rng_using_cuRandPfji
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 88 registers, 368 bytes cmem[0], 80 bytes cmem[2]

I found that such a small kernel needs 88 registers per thread! The register limit per SM is 65536 on V100. Therefore, if the driver statically partitions the resources, at most 65536 / 88 = 744 threads can be resident per SM; if the driver dynamically partitioned the resources for the active threads (N = 32), there should be enough.
There is no shared memory usage for this kernel.

After I changed curandStateMRG32k3a back to curandState, the kernel used far fewer registers: usage dropped from 88 to 54, and the program runs as expected without any issues.

ptxas info : 77712 bytes gmem, 72 bytes cmem[3]
ptxas info : Compiling entry function ‘_Z21kern_rng_using_cuRandPfji’ for ‘sm_70’
ptxas info : Function properties for _Z21kern_rng_using_cuRandPfji
6456 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 54 registers, 368 bytes cmem[0]

Again, I think this is bizarre behavior.

Thanks for the suggestion!

With proper error checking, it looks like you’ve made a lot of progress.

The compiler makes the register-usage decision at compile time, but the registers are reserved per block at launch time: a block of 1024 threads at 88 registers per thread needs 1024 × 88 = 90112 registers, which exceeds the 65536 available on an SM, so the launch fails no matter how few threads are active inside the kernel. There are various ways to limit per-thread register usage, such as a compiler switch (-maxrregcount) or use of launch_bounds.

If you google for these you’ll find many descriptions of their use, and of course they are documented (maxrregcount in the nvcc manual, launch_bounds in the CUDA programming guide.)
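
For example, a __launch_bounds__ sketch for the kernel above (the bound of 1024 matches the block size you launch with, which caps the compiler at about 65536 / 1024 = 64 registers per thread):

// sketch: promise the compiler this kernel may be launched with up to
// 1024 threads per block, so it must keep register usage low enough to fit
__global__ void __launch_bounds__(1024)
kern_rng_using_cuRand(float *data, unsigned int seed, int N)
{
    // ... body unchanged ...
}

Or on the nvcc command line, applied to all kernels in the file: nvcc -maxrregcount=64 .... Note that squeezing registers this way may introduce spills to local memory.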

Why do you think these observations constitute bizarre behavior? Different PRNGs in CURAND tend to require different amounts of resources, including registers. I am reasonably certain that when you use curandState instead of curandStateMRG32k3a, you are effectively switching to the simplest PRNG supplied by CURAND, XORWOW. Check the documentation to make sure, as my memory is hazy.

The CURAND documentation should also contain a brief explanation about the trade-offs between the various PRNGs offered.

I have tried this code and it works fine, but when I set n to a very large number (7200000), it collapses after some printf iterations and pops: unspecified launch failure cudaGetLastError().
How can I get rid of this error? There is not much info about it.

Maybe just a kernel timeout. It is not clear what you changed in the code, what operating system you are running, what GPU you are running on, or whether that GPU is also hosting a display.

Sorry. Here's more information: I'm using a 940M on Windows 10, not hosting a display. Here's the code I'm trying to fix:

__global__
void generate(unsigned long seed, float* data)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curandState state;
    curand_init(seed, id, 0, &state);
    float ranv = curand_uniform(&state);
    data[id] = ranv;
}

void operations(const size_t numRows, const size_t numCols)
{
    // numCols=3000, numRows=2400

    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize(ceil(1.0f*numCols / blockSize.x), ceil(1.0f*numRows / blockSize.y));

    int numpixels = numCols * numRows;
    const int n = numpixels;  // numpixels=7200000

    float *d_data;
    checkCudaErrors(cudaMalloc(&d_data, n * sizeof(float)));

    generate<<<gridSize, blockSize, 0>>>(1234, d_data);
    checkCudaErrors(cudaDeviceSynchronize());
}

When I use n=16 like you did, with generate<<<1,16>>>(1234, d_data);, everything works fine. But when I use n=7200000 in my case, with generate<<<gridSize, blockSize>>>(1234, d_data);, it pops the following error when it reaches that line: unspecified launch failure cudaDeviceSynchronize(). Also, when the grid dimensions are smaller it seems to work fine too. Maybe the problem is that the dimensions are too big? I can't find anything else.

A 940M on Windows is subject to kernel duration limits. (It is effectively “hosting a display” whether you realize it or not.) You may be hitting a WDDM TDR timeout. Please google that.
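
If you want to experiment, the settings live in the registry (documented by Microsoft; quoting from memory, so verify before changing anything):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers
    TdrDelay  (REG_DWORD)  timeout in seconds before the driver resets (default 2)
    TdrLevel  (REG_DWORD)  0 disables timeout detection entirely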

Thank you so much Robert. I found information about TDR and disabled it temporarily, and now my program works perfectly (with about 1150 ms of execution time).

Now, I changed the generate kernel a little, so it can generate many more random numbers (which I need to use later):

__global__ void generate(unsigned long seed, float* data, int numRows, int numCols)
{
    const int2 thread_2D_pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
                                         blockIdx.y * blockDim.y + threadIdx.y);

    // guard: the grid is rounded up, so threads past the image edge must exit
    if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows)
        return;

    const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x;

    curandState state;
    curand_init(seed, thread_1D_pos, 0, &state);
    float ranv = curand_uniform(&state);
    data[thread_1D_pos] = ranv;
}

With this new number of threads my program works fine, but the execution time is about 772600 ms (almost 13 minutes). I guess that's a normal time given the large number of random numbers to generate(?)

Are you building a debug project? If so, you should switch to building a release project.

Yes, I’m building a Release project.

I don’t have a system with a 940M to test on.

The curand_init function is time consuming. The curand_uniform function should be much quicker. If you need to generate a large number of random numbers, depending on your specific needs, it may be much more efficient to use a smaller grid to generate the numbers, and put a loop in your kernel so that each thread generates multiple random numbers. Each thread calls curand_init once, then calls curand_uniform multiple times, populating multiple places in your random number array in memory.

Unfortunately this forum thread kind of started off on a bad idea “can I combine the init and generate functions into a single kernel?” The answer is yes, but that may or may not be the best way to go, performance wise, depending on your needs.

In your last posting where you mention the 13 minute execution time, you don’t indicate how many numbers you are generating, but in most cases you should be able to generate multiple numbers per thread and do much better than 13 minutes, I think.
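
Something along these lines (a sketch, untested; the grid size and names are illustrative, and n is the total count of numbers you want):

#include <curand_kernel.h>

__global__ void generate_many(unsigned long seed, float *data, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    curandState state;
    curand_init(seed, id, 0, &state);      // expensive: once per thread

    // grid-stride loop: each thread produces many numbers
    for (int i = id; i < n; i += stride)
        data[i] = curand_uniform(&state);  // cheap: called repeatedly
}

// launched with a modest fixed grid, e.g.:
// generate_many<<<64, 256>>>(1234, d_data, n);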

I tried your idea of using fewer threads in my kernel, calling curand_init just once per thread, and adding a loop inside for curand_uniform. It worked perfectly: my execution time is now 4 seconds instead of 13 minutes. I can even enable TDR without getting timeout errors now. Thank you so much for your help Robert!