but, even if i run kernel ONLY ONCE or more times for big N_V ( about 3000 and higher ) i get output ( when i printf randoms in host after kernel finished ) which is decreasing to 0 and after that getting negative values
I would appreciate any help,
using 64 bit windows 7, VS Pro, CUDA 3.2, 570 GTX
notice that if i remove last line in kernel ( globalRand[id] = local_state ) program works fine, returning numbers from 0 to 1.
in curand doc it says: Distribution functions ( like curand_uniform ) may use any number
of unsigned integer values from a basic generator. The number of values
consumed is not guaranteed to be fixed.
what does that mean? is it posible that every thread takes a few neighbour globalRand ( like globalRand[id+1], or[id-1] ), so it comes to some sort of conflict while updating globalRand?
I think the most correct would be num_block = (N + threadsPerBlock – 1) / threadsPerBlock
your answer returns 0 for N_V < 512 ( if you noticed that my BLOCK_SIZE is 1024 ). if not, i think you wrote the same thing as I ( only in parenthesis )
You’re right. Sorry for the wrong advice given. I always had the impression that integer division was rounded towards the nearest.
I don’t know how the compiler handles your code. Though I suspect it is affected by the size of a curandState struct. The possibility is that the statement which you commented out is forcing the compiler to do some very silly thing which crashes your kernel. You could use cuobjdump to check it out. Or maybe sometimes reading the ptx output of the compiler could also give you some clue. Did you check the error from your kernel launch? I suspect there exists an unknown error or something like that.
Also, you could attempt the following code
__global__ void test_kernel ( curandState* globalRand, float *randoms, int N_V )
{
//global index
int id = blockIdx.x * blockDim.x + threadIdx.x;
if ( id >= N_V )
return;
curandState* global_state = globalRand + id*sizeof(curandState) ; //do not copy it to local mem or register, just leave it in global mem
randoms[ id ] = curand_uniform ( global_state );
}
Also, if curandState is not something huge, and if you are able to fit 1024 of them into 48KB,you could just declare them explicitly in shared memory and then copy global states to shared memory the first time your kernel is launched. Though you will be limited to using a block number less than or equal to your number of MPs.
Does curand_uniform modify the parameter passed in?
__global__ void init_kernel (curandState * state)
{
int threadIndex = ((threadIdx.y * blockDim.x) + threadIdx.x) + (blockDim.x * blockDim.y * blockIdx.x);
// each thread get the same seed, different sequence number, and an offset of 1000
curand_init(SEED, threadIndex, 1000, &state[threadIndex]);
}
I imagine (but don’t know for sure) that it worked for a small number of threads because you weren’t going outside it’s memory bounds or something like that. Also, if you weren’t initializing it, that would be another possible reason.
As an aside, as you probably know already, you should be checking your error codes returned from your cudaXXX functions. I didn’t do it above solely for brevity.
As mentioned by sinclair, you absolutely have to call curand_init() for all of your globalRand elements, otherwise the results are undefined. I do this the same way that he shows.
EDIT: Was your previously posting stating that you had previously used sizeof(int) (incorrectly), but that once you switched it to sizeof(curandState), it started working? If so, then I just misread your post, sorry!
So what did you change then (for the benefit of the rest of us :) )?
As i described above, while doing cudaMalloc() for globalRand, i put cudaMalloc ( (void **)&globalRand, N_Vsizeof(int) ), instead N_Vsizeof(curandState). Still don’t know why it worked for numbers smaller than about 3000.
I think you missed my edit to my previous post :), where I asked if what you just said in this post is what you changed (and it is). Sorry about missing that!
My theory for why it would work for few threads is that you weren’t running out of your memory bounds.
it crashes for ( 1000*1000 ) and works for 500 * 1000 ( that is highest number without crashing ).
i work on gtx 570 ( max block size 1024, not 512 ) and it is also used for display , so is it possible kernel is taking too long and it got suspended?
kernels that crash take aprox 5 seconds before crashing
Do you have you Xserver enabled? Because if you do, I’m pretty sure the “watchdog” timer times out at 5 seconds and crashes your kernel (i.e. you can’t have a kernel that runs long than 5 seconds. So that would explain your problem.