CURAND acting strangely

If I run this code (pretty similar to the sample in the CURAND documentation):

__global__ void test_kernel ( curandState *globalRand, float *randoms, int N_V )
{
//global index
int id = blockIdx.x * blockDim.x + threadIdx.x;

if ( id >= N_V )
return;

// i intend to launch the kernel multiple times, so i preserve the state
curandState local_state = globalRand[ id ];

randoms[ id ] = curand_uniform ( &local_state );

globalRand[ id ] = local_state; // with this line commented out, everything works fine
}

//BLOCK_SIZE is 1024 -> running on gtx 570

block_num = N_V / BLOCK_SIZE + 1;

test_kernel<<<block_num, BLOCK_SIZE>>>( globalRand, d_randomi, N_V );

But even if I run the kernel ONLY ONCE, for big N_V (about 3000 and higher) the output (when I printf randoms on the host after the kernel finishes) decreases toward 0 and then takes negative values.

I would appreciate any help,
using 64 bit windows 7, VS Pro, CUDA 3.2, 570 GTX

I don’t know how curand works. I could only point out a few trivial things:

  1. block_num = (N_V +512)/ BLOCK_SIZE;
  2. Make sure that you have used cudaThreadSynchronize after the cudaMemcpy

Thank you, but the problem remains.

Notice that if I remove the last line in the kernel ( globalRand[id] = local_state ), the program works fine, returning numbers from 0 to 1.

in curand doc it says: Distribution functions ( like curand_uniform ) may use any number
of unsigned integer values from a basic generator. The number of values
consumed is not guaranteed to be fixed.

What does that mean? Is it possible that every thread consumes a few neighbouring globalRand entries (like globalRand[id+1] or globalRand[id-1]), so that some sort of conflict occurs while updating globalRand?

I think the most correct would be num_block = (N + threadsPerBlock - 1) / threadsPerBlock

Your answer returns 0 for N_V < 512 (if you noticed that my BLOCK_SIZE is 1024). If not, I think you wrote the same thing as I did (only in parentheses).

You’re right. Sorry for the wrong advice. I always had the impression that integer division rounded to the nearest integer.

I don’t know how the compiler handles your code, though I suspect it is affected by the size of the curandState struct. Possibly the statement you commented out forces the compiler to do something very silly that crashes your kernel. You could use cuobjdump to check; reading the PTX output of the compiler might also give you a clue. Did you check the error from your kernel launch? I suspect it reports an unknown error or something like that.

Also, you could attempt the following code

__global__ void test_kernel ( curandState *globalRand, float *randoms, int N_V )
{
//global index
int id = blockIdx.x * blockDim.x + threadIdx.x;

if ( id >= N_V )
return;

// do not copy it to local mem or a register, just leave it in global mem
// (pointer arithmetic already scales by sizeof(curandState), so do not
// multiply by sizeof yourself)
curandState *global_state = globalRand + id;

randoms[ id ] = curand_uniform ( global_state );
}

Also, if curandState is not something huge, and if you are able to fit 1024 of them into 48 KB, you could declare them explicitly in shared memory and copy the global states into shared memory when your kernel launches. Though you will be limited to a block count less than or equal to your number of multiprocessors.
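A rough sketch of that shared-memory idea (the kernel name and the block size of 256 are my assumptions; I have not measured whether staging through shared memory actually helps here):

```cuda
#include <curand_kernel.h>

#define BLOCK_SIZE 256  // 256 states * ~48 B each = ~12 KB, well under 48 KB

// Hypothetical sketch: stage each thread's generator state in shared memory
// for the duration of the launch, then write it back so a later launch resumes.
__global__ void test_kernel_shared ( curandState *globalRand, float *randoms, int N_V )
{
    __shared__ curandState s_state[ BLOCK_SIZE ];

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if ( id >= N_V )
        return;

    s_state[ threadIdx.x ] = globalRand[ id ];          // global -> shared

    randoms[ id ] = curand_uniform ( &s_state[ threadIdx.x ] );

    globalRand[ id ] = s_state[ threadIdx.x ];          // shared -> global
}
```

Each thread touches only its own slot, so no __syncthreads() is needed; note that shared memory does not persist across launches, so the copy in and out has to happen every launch, not just the first.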

Does curand_uniform modify the parameter passed in?

I use CURAND in basically this way without any trouble. How are you initializing globalRand before calling test_kernel?

This is what i did:

curandState *globalRand;

cudaMalloc( (void**)&globalRand, N_V*sizeof(int) );

test_kernel<<< …

instead of : cudaMalloc( (void**)&globalRand, N_V*sizeof(curandState) )

The program worked for small numbers, so I didn’t think the error could be in the initialization.

Thank you and sorry for bothering.

Do you call curand_init() though? Unless I missed something in the code you sent, you’re not calling it, and I believe you need to call it beforehand.

Also, I’m pretty sure you should be doing something like this instead (despite what your last post said):

curandState *globalRand;

cudaMalloc((void **)&globalRand, N_V*sizeof(curandState));

// next: init_kernel

// finally: test_kernel

where init_kernel looks something like this:

__global__ void init_kernel (curandState * state)

{

  int threadIndex = ((threadIdx.y * blockDim.x) + threadIdx.x) + (blockDim.x * blockDim.y * blockIdx.x);

// each thread gets the same seed, a different sequence number, and an offset of 1000

  curand_init(SEED, threadIndex, 1000, &state[threadIndex]);

}

I imagine (but don’t know for sure) that it worked for a small number of threads because you weren’t going outside its memory bounds or something like that. Also, if you weren’t initializing it, that would be another possible reason.

As an aside, as you probably know already, you should be checking your error codes returned from your cudaXXX functions. I didn’t do it above solely for brevity.

Hope this helps,

Matt
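One common way to check those error codes is a wrapper macro like the following (a sketch of the usual pattern, not from the thread; on CUDA 3.2 you would use cudaThreadSynchronize rather than the later cudaDeviceSynchronize):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wraps a cudaXXX call and aborts with a readable message on failure.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage sketch:
//   CUDA_CHECK( cudaMalloc((void **)&globalRand, N_V * sizeof(curandState)) );
//   test_kernel<<<block_num, BLOCK_SIZE>>>(globalRand, d_randomi, N_V);
//   CUDA_CHECK( cudaGetLastError() );          // catches launch errors
//   CUDA_CHECK( cudaThreadSynchronize() );     // catches execution errors
```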

As mentioned by sinclair, you absolutely have to call curand_init() for all of your globalRand elements, otherwise the results are undefined. I do this the same way that he shows.

I called curand_init() before, pretty much the same way as described above; I just accidentally forgot to write it out.

Works fine now, thanks!

EDIT: Was your previous post stating that you had used sizeof(int) (incorrectly), but that once you switched it to sizeof(curandState), it started working? If so, then I just misread your post, sorry!

So what did you change then (for the benefit of the rest of us :) )?

Matt

As I described above, while doing cudaMalloc() for globalRand, I put cudaMalloc( (void **)&globalRand, N_V*sizeof(int) ) instead of N_V*sizeof(curandState). I still don’t know why it worked for numbers smaller than about 3000.

yes, that’s it :)

I think you missed my edit to my previous post :), where I asked if what you just said in this post is what you changed (and it is). Sorry about missing that!

My theory for why it would work for few threads is that you weren’t running out of your memory bounds.

Matt

Anybody have an idea about this situation?

__global__ void device_random_init ( curandState *globalRand, int seed, int N_V )
{
int id = blockIdx.x * blockDim.x + threadIdx.x;

if ( id >= N_V )
return;

curand_init ( seed, id, 0, &globalRand[id] );
}

and in main i call it like:

int main()
{
curandState *globalRand;
cudaMalloc( (void **)&globalRand, 500*1000*sizeof(curandState) );
device_random_init<<<500, 1000>>>( globalRand, rand(), N_V );

… }

It crashes for 1000 * 1000 threads and works for 500 * 1000 (that is the highest number without crashing).

I work on a GTX 570 (max block size 1024, not 512), and it is also used for the display, so is it possible the kernel is taking too long and gets suspended?
Kernels that crash take approx. 5 seconds before crashing.

thanks

Do you have your Xserver enabled? Because if you do, I’m pretty sure the “watchdog” timer times out at 5 seconds and kills your kernel (i.e. you can’t have a kernel that runs longer than 5 seconds). So that would explain your problem.
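If it is the watchdog, one common workaround is to split the initialization across several shorter launches so no single kernel runs near the limit. A sketch under assumed names (the chunking helper is mine; curand_init's sequence argument keeps each state's stream globally unique across chunks):

```cuda
#include <curand_kernel.h>

__global__ void device_random_init ( curandState *globalRand, int seed,
                                     int offset, int n )
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if ( id >= n )
        return;
    // sequence number offset + id stays globally unique across all chunks
    curand_init ( seed, offset + id, 0, &globalRand[ offset + id ] );
}

// Host side: initialize 'total' states in chunks of 'chunk' threads each,
// so every launch finishes well under the ~5 s watchdog limit.
void init_in_chunks ( curandState *globalRand, int seed, int total, int chunk )
{
    for ( int offset = 0; offset < total; offset += chunk ) {
        int n = ( total - offset < chunk ) ? ( total - offset ) : chunk;
        device_random_init<<< ( n + 255 ) / 256, 256 >>>( globalRand, seed,
                                                          offset, n );
        cudaThreadSynchronize();  // let each chunk finish before the next launch
    }
}
```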