CURAND acting strangely

If I run this code (pretty similar to the sample in the CURAND documentation):

__global__ void test_kernel ( curandState *globalRand, float *randoms, int N_V )
{
//global index
int id = blockIdx.x * blockDim.x + threadIdx.x;

if ( id >= N_V )
return;

// i intend to launch the kernel multiple times, so i preserve the state
curandState local_state = globalRand[ id ];

randoms[ id ] = curand_uniform ( &local_state );

globalRand[ id ] = local_state; // with this line commented out, everything works fine
}

//BLOCK_SIZE is 1024 -> running on gtx 570

block_num = N_V / BLOCK_SIZE + 1;

test_kernel<<<block_num, BLOCK_SIZE>>>( globalRand, d_randomi, N_V );

But even if I run the kernel ONLY ONCE, for big N_V (about 3000 and higher) the output (when I printf randoms on the host after the kernel finishes) decreases toward 0 and then takes negative values.

I would appreciate any help,
using 64 bit windows 7, VS Pro, CUDA 3.2, 570 GTX

I don’t know how curand works. I could only point out a few trivial things:

  1. block_num = (N_V +512)/ BLOCK_SIZE;
  2. Make sure that you have used cudaThreadSynchronize after the cudaMemcpy

Thank you, but the problem remains.

Notice that if I remove the last line in the kernel ( globalRand[id] = local_state ), the program works fine, returning numbers from 0 to 1.

in curand doc it says: Distribution functions ( like curand_uniform ) may use any number
of unsigned integer values from a basic generator. The number of values
consumed is not guaranteed to be fixed.

What does that mean? Is it possible that every thread consumes a few neighbouring globalRand entries (like globalRand[id+1] or globalRand[id-1]), so that some sort of conflict occurs while updating globalRand?

I think the most correct would be num_block = (N + threadsPerBlock - 1) / threadsPerBlock

Your answer returns 0 for N_V < 512 (if you noticed that my BLOCK_SIZE is 1024). If not, I think you wrote the same thing as I did (only in parentheses).

You’re right. Sorry for the wrong advice. I always had the impression that integer division rounded to the nearest integer.

I don’t know how the compiler handles your code, though I suspect it is affected by the size of the curandState struct. Possibly the statement you commented out forces the compiler to do something very silly that crashes your kernel. You could use cuobjdump to check; reading the PTX output of the compiler might also give you a clue. Did you check the error from your kernel launch? I suspect it reports an unknown error or something like that.

Also, you could attempt the following code

__global__ void test_kernel ( curandState *globalRand, float *randoms, int N_V )
{
//global index
int id = blockIdx.x * blockDim.x + threadIdx.x;

if ( id >= N_V )
return;

// do not copy it to local mem or a register, just leave it in global mem
// (pointer arithmetic already scales by sizeof(curandState), so do not
// multiply by sizeof yourself)
curandState *global_state = globalRand + id;

randoms[ id ] = curand_uniform ( global_state );
}

Also, if curandState is not something huge, and if you are able to fit 1024 of them into 48 KB, you could declare them explicitly in shared memory and copy the global states into shared memory when your kernel launches. Though you will be limited to a block count less than or equal to your number of multiprocessors.
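A rough sketch of that shared-memory idea (the kernel name and the block size of 256 are my assumptions; I have not measured whether staging through shared memory actually helps here):

```cuda
#include <curand_kernel.h>

#define BLOCK_SIZE 256  // 256 states * ~48 B each = ~12 KB, well under 48 KB

// Hypothetical sketch: stage each thread's generator state in shared memory
// for the duration of the launch, then write it back so a later launch resumes.
__global__ void test_kernel_shared ( curandState *globalRand, float *randoms, int N_V )
{
    __shared__ curandState s_state[ BLOCK_SIZE ];

    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if ( id >= N_V )
        return;

    s_state[ threadIdx.x ] = globalRand[ id ];          // global -> shared

    randoms[ id ] = curand_uniform ( &s_state[ threadIdx.x ] );

    globalRand[ id ] = s_state[ threadIdx.x ];          // shared -> global
}
```

Each thread touches only its own slot, so no __syncthreads() is needed; note that shared memory does not persist across launches, so the copy in and out has to happen every launch, not just the first.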

Does curand_uniform modify the parameter passed in?

I use CURAND in basically this way without any trouble. How are you initializing globalRand before calling test_kernel?

This is what i did:

curandState *globalRand;

cudaMalloc( (void**)&globalRand, N_V*sizeof(int) );

test_kernel<<< …

instead of : cudaMalloc( (void**)&globalRand, N_V*sizeof(curandState) )

The program worked for small numbers, so I didn’t think the error could be in the initialization.

Thank you and sorry for bothering.

Do you call curand_init() though? Unless I missed something in the code you sent, you’re not calling it, and I believe you need to call it beforehand.

Also, I’m pretty sure you should be doing something like this instead (despite what your last post said):

curandState *globalRand;

cudaMalloc((void **)&globalRand, N_V*sizeof(curandState));

// next: init_kernel

// finally: test_kernel

where init_kernel looks something like this:

__global__ void init_kernel (curandState * state)

{

  int threadIndex = ((threadIdx.y * blockDim.x) + threadIdx.x) + (blockDim.x * blockDim.y * blockIdx.x);

// each thread gets the same seed, a different sequence number, and an offset of 1000

  curand_init(SEED, threadIndex, 1000, &state[threadIndex]);

}

I imagine (but don’t know for sure) that it worked for a small number of threads because you weren’t going outside its memory bounds or something like that. Also, if you weren’t initializing it, that would be another possible reason.

As an aside, as you probably know already, you should be checking your error codes returned from your cudaXXX functions. I didn’t do it above solely for brevity.

Hope this helps,

Matt
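One common way to check those error codes is a wrapper macro like the following (a sketch of the usual pattern, not from the thread; on CUDA 3.2 you would use cudaThreadSynchronize rather than the later cudaDeviceSynchronize):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wraps a cudaXXX call and aborts with a readable message on failure.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage sketch:
//   CUDA_CHECK( cudaMalloc((void **)&globalRand, N_V * sizeof(curandState)) );
//   test_kernel<<<block_num, BLOCK_SIZE>>>(globalRand, d_randomi, N_V);
//   CUDA_CHECK( cudaGetLastError() );          // catches launch errors
//   CUDA_CHECK( cudaThreadSynchronize() );     // catches execution errors
```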

As mentioned by sinclair, you absolutely have to call curand_init() for all of your globalRand elements, otherwise the results are undefined. I do this the same way that he shows.

I called curand_init() before, pretty much the same way as described above; I just accidentally forgot to write it out.

Works fine now, thanks!

EDIT: Was your previous post stating that you had used sizeof(int) (incorrectly), but that once you switched it to sizeof(curandState), it started working? If so, then I just misread your post, sorry!

So what did you change then (for the benefit of the rest of us :) )?

Matt

As I described above, while doing cudaMalloc() for globalRand, I put cudaMalloc( (void **)&globalRand, N_V*sizeof(int) ) instead of N_V*sizeof(curandState). I still don’t know why it worked for numbers smaller than about 3000.

yes, that’s it :)

I think you missed my edit to my previous post :), where I asked if what you just said in this post is what you changed (and it is). Sorry about missing that!

My theory for why it would work for few threads is that you weren’t running out of your memory bounds.

Matt

Anybody have an idea about this situation?

__global__ void device_random_init ( curandState *globalRand, int seed, int N_V )
{
int id = blockIdx.x * blockDim.x + threadIdx.x;

if ( id >= N_V )
return;

curand_init ( seed, id, 0, &globalRand[id] );
}

and in main i call it like:

int main()
{
curandState *globalRand;
cudaMalloc( (void **)&globalRand, 500*1000*sizeof(curandState) );
device_random_init<<<500, 1000>>>( globalRand, rand(), N_V );

… }

It crashes for 1000 * 1000 threads and works for 500 * 1000 (that is the highest number without crashing).

I work on a GTX 570 (max block size 1024, not 512), and it is also used for the display, so is it possible the kernel is taking too long and gets suspended?
Kernels that crash take approx. 5 seconds before crashing.

thanks

Do you have your Xserver enabled? Because if you do, I’m pretty sure the “watchdog” timer times out at 5 seconds and kills your kernel (i.e. you can’t have a kernel that runs longer than 5 seconds). So that would explain your problem.
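If it is the watchdog, one common workaround is to split the initialization across several shorter launches so no single kernel runs near the limit. A sketch under assumed names (the chunking helper is mine; curand_init's sequence argument keeps each state's stream globally unique across chunks):

```cuda
#include <curand_kernel.h>

__global__ void device_random_init ( curandState *globalRand, int seed,
                                     int offset, int n )
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if ( id >= n )
        return;
    // sequence number offset + id stays globally unique across all chunks
    curand_init ( seed, offset + id, 0, &globalRand[ offset + id ] );
}

// Host side: initialize 'total' states in chunks of 'chunk' threads each,
// so every launch finishes well under the ~5 s watchdog limit.
void init_in_chunks ( curandState *globalRand, int seed, int total, int chunk )
{
    for ( int offset = 0; offset < total; offset += chunk ) {
        int n = ( total - offset < chunk ) ? ( total - offset ) : chunk;
        device_random_init<<< ( n + 255 ) / 256, 256 >>>( globalRand, seed,
                                                          offset, n );
        cudaThreadSynchronize();  // let each chunk finish before the next launch
    }
}
```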