Hi all,
I am very new to CUDA, so any help is much appreciated. Although my question is about curand, please let me know if you spot any fundamental flaws in how I am doing things.
I have the following task:
“Given a list of N numbers and a rate 0 < R < 100, randomly zero out R% of them.”
My included solution works, and I would like some feedback on how I am handling my curandState(s).
It is quite simple; the kernel I use for this is:
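__global__ void dropr(float *A, curandState *globalstate, uint64_t N, float R)
{
    // 64-bit index so the comparison with N is not signed/unsigned
    uint64_t i = (uint64_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        // Work on a local copy of this element's generator state
        curandState localstate = globalstate[i];
        // curand_uniform returns a float in (0, 1]; zero A[i] with probability of about R
        A[i] *= curand_uniform(&localstate) < R ? 0.0f : 1.0f;
        // Store the advanced state so the next launch draws new numbers
        globalstate[i] = localstate;
    }
}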
This kernel must be launchable any number of times while my application runs (it is a dropout layer in an ANN).
My problem is that my main application is written purely in C, and I struggled a lot to launch this exact kernel from my C code because, as far as I understand it, the curand API can only be compiled as C++. So again, let me know if I got something very wrong.
I came up with the following solution:
- 3 files:
- main C code “testDRP.c”
- Intermediate C++ code used to initialize curandState “launcher.cc”
- CUDA kernels code “cudaDRP.cu”
testDRP.c calls functions in launcher.cc, which in turn launch kernels in cudaDRP.cu.
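For comparison with the three-file layout above, here is a rough sketch of a different route that needs no per-element curandState at all. It assumes cuRAND's host API from curand.h (curandCreateGenerator, curandSetPseudoRandomGeneratorSeed, curandGenerateUniform, curandDestroyGenerator), which as far as I know is a plain C interface, and it assumes linking against -lcurand. The kernel name apply_mask, the seed and the sizes are made up for illustration and are not from the attached files.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>
#include <curand.h>

// Hypothetical kernel: zero a[i] wherever the pre-generated sample u[i] is below r.
__global__ void apply_mask(float *a, const float *u, uint64_t n, float r)
{
    uint64_t i = (uint64_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n && u[i] < r) {
        a[i] = 0.0f;
    }
}

int main()
{
    const uint64_t n = 1 << 16;   // illustrative size
    const float r = 0.3f;         // drop rate as a fraction, i.e. R/100
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    float *d_a = nullptr, *d_u = nullptr;
    if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&d_u, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    std::vector<float> h(n, 1.0f);
    cudaMemcpy(d_a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The host API fills a device buffer with uniform samples in (0, 1].
    curandGenerator_t gen;
    if (curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT) != CURAND_STATUS_SUCCESS ||
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL) != CURAND_STATUS_SUCCESS ||
        curandGenerateUniform(gen, d_u, n) != CURAND_STATUS_SUCCESS) {
        fprintf(stderr, "curand host API call failed\n");
        return 1;
    }

    apply_mask<<<blocks, threads>>>(d_a, d_u, n, r);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "apply_mask failed\n");
        return 1;
    }

    cudaMemcpy(h.data(), d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    uint64_t zeros = 0;
    for (uint64_t i = 0; i < n; ++i) {
        zeros += (h[i] == 0.0f);
    }
    printf("zeroed %llu of %llu elements\n",
           (unsigned long long)zeros, (unsigned long long)n);

    curandDestroyGenerator(gen);
    cudaFree(d_u);
    cudaFree(d_a);
    return 0;
}
```

The trade-off in this sketch is an extra buffer of n floats for the samples, in exchange for not having to allocate, seed and persist n generator states.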
The way I understand curand (and correct me if I got it completely wrong), each thread that calls a curand function needs its own curandState, because the curand generators are not thread-safe. Is this interpretation correct?
So “launcher”, in its “init()” function, allocates enough memory for n curandState(s), one for each of the n numbers to be processed. I use a global variable to store all these states, so every random number is drawn from its own curandState.
“launcher.cc”
curandState *globalstate;
extern "C" void init(int n, int *res)
{
    *res = 1;
    cudaError_t stat;
    stat = cudaMalloc((void **)&globalstate, n*sizeof(curandState));
    if (stat != cudaSuccess) {
        fprintf(stderr, "cuda malloc failed\n");
        *res = 0;
        return;
    }
    curand_init(globalstate, n);
    if (cudaSuccess != cudaDeviceSynchronize()) { //Is this needed?
        fprintf(stderr, "curand init failed\n");
        *res = 0;
    }
}
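The state handling described above (allocate once, initialise once, then reuse the same states on every launch) could look roughly like the sketch below. It is not taken from the attached files: the names setup_states and dropout, the seed 1234 and the array size are invented for illustration. It relies on the device-API calls curand_init(seed, sequence, offset, &state) and curand_uniform(&state) from curand_kernel.h. The point to notice is that the dropout kernel copies the advanced state back to global memory; without that copy, every launch would start from the same state and produce the same mask.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical setup kernel: one state per element, same seed,
// a distinct sequence number per element so the streams differ.
__global__ void setup_states(curandState *states, uint64_t n, unsigned long long seed)
{
    uint64_t i = (uint64_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        curand_init(seed, i, 0, &states[i]);
    }
}

// Hypothetical dropout kernel that persists the advanced generator state.
__global__ void dropout(float *a, curandState *states, uint64_t n, float r)
{
    uint64_t i = (uint64_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        curandState local = states[i];
        a[i] *= (curand_uniform(&local) < r) ? 0.0f : 1.0f;
        states[i] = local;  // write back, otherwise the next launch repeats this draw
    }
}

int main()
{
    const uint64_t n = 1 << 16;   // illustrative size
    const float r = 0.3f;         // drop rate as a fraction, i.e. R/100
    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    float *d_a = nullptr;
    curandState *d_states = nullptr;
    if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&d_states, n * sizeof(curandState)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Initialise the states once, before any dropout launch.
    setup_states<<<blocks, threads>>>(d_states, n, 1234ULL);

    // Two launches, each applied to a fresh array of ones.
    std::vector<float> ones(n, 1.0f), first(n), second(n);
    std::vector<float> *out[2] = { &first, &second };
    for (int k = 0; k < 2; ++k) {
        cudaMemcpy(d_a, ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        dropout<<<blocks, threads>>>(d_a, d_states, n, r);
        cudaMemcpy(out[k]->data(), d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "a kernel or copy failed\n");
        return 1;
    }

    uint64_t zeros = 0, differ = 0;
    for (uint64_t i = 0; i < n; ++i) {
        zeros  += (first[i] == 0.0f);
        differ += (first[i] != second[i]);
    }
    printf("zeros in first launch: %llu, positions that differ: %llu\n",
           (unsigned long long)zeros, (unsigned long long)differ);

    cudaFree(d_states);
    cudaFree(d_a);
    return 0;
}
```

If the write-back line is removed, the two masks should come out identical, which is an easy way to check whether the states are really being advanced between launches.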
Finally, the actual computation is started in launcher.cc’s launch() function, which passes the array, the list of global states, the length of the array, and a rate.
launcher.cc
extern "C" void launch(float *a, uint64_t n, float r)
{
    // a is a device pointer to float, matching curand_drand's first parameter
    curand_drand(a, globalstate, n, r);
}
cudaDRP.cu
__host__ void curand_drand(float *A,
                           curandState *globalstate,
                           uint64_t L,
                           float R)
{
    int threadsPerBlock = 256;
    // Round up so that every element gets a thread
    int blocksPerGrid = (L + threadsPerBlock - 1) / threadsPerBlock;
    // <<<grid, block>>>: number of blocks first, then threads per block
    dropr<<<blocksPerGrid, threadsPerBlock>>>(A, globalstate, L, R);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "Cuda call failed\n");
    }
}
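A minimal sketch of the launch-and-check pattern used above, written against a trivial stand-in kernel so that it is self-contained. The names scale and launch_scale are hypothetical. In the <<<grid, block>>> syntax the number of blocks comes first and the threads per block second; cudaGetLastError() reports problems with the launch itself (for example an invalid configuration), while cudaDeviceSynchronize() waits for the kernel and reports errors raised while it ran.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical): multiply every element by s.
__global__ void scale(float *a, uint64_t n, float s)
{
    uint64_t i = (uint64_t)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        a[i] *= s;
    }
}

// Hypothetical helper: launch the kernel and return the first error seen.
static cudaError_t launch_scale(float *d_a, uint64_t n, float s)
{
    if (n == 0) {
        return cudaSuccess;  // a grid of zero blocks is not a valid launch
    }
    const unsigned int threadsPerBlock = 256;
    const unsigned int blocksPerGrid =
        (unsigned int)((n + threadsPerBlock - 1) / threadsPerBlock);  // ceiling division

    scale<<<blocksPerGrid, threadsPerBlock>>>(d_a, n, s);

    cudaError_t err = cudaGetLastError();  // errors from the launch itself
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();        // errors raised while the kernel ran
}

int main()
{
    const uint64_t n = 1000;  // deliberately not a multiple of 256
    float *d_a = nullptr;
    if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_a, 0, n * sizeof(float));

    cudaError_t err = launch_scale(d_a, n, 2.0f);
    printf("launch_scale: %s\n", cudaGetErrorString(err));

    cudaFree(d_a);
    return err == cudaSuccess ? 0 : 1;
}
```

Returning the cudaError_t to the caller, instead of only printing, lets a C entry point such as init() or launch() pass the failure on to the main program.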
To compile the attached files (testDRP.c, launcher.cc, cudaDRP.cu):
nvcc -g -c -o cudaDRP.o cudaDRP.cu
gcc -Wall -pedantic -c -g -o launcher.o launcher.cc
gcc -Wall -pedantic -o testDRP testDRP.c launcher.o cudaDRP.o -lcudart
./testDRP N R/100 | grep "0" | wc -l
You should get an answer close to N*(R/100).
Thanks in advance for any help or comments.