I’m using the following code, but the result is NaN for large inputs, while small inputs work fine:
__global__ void fastComputeActivation(const float *input, float *output, int width, int height, int depth, const int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int row = (i % (width * height)) / width;
    int col = (i % (width * height)) % width;
    int d = i / (width * height);
    if (i < size)
    {
        int idx = d + depth * row;
        int inIdx = d * height * width + row * width;
        if (col == 0) output[idx] = 0;
        if (col == 0)
            output[idx] = sigmoid(output[idx]);
    }
}
I am new to CUDA. Does __syncthreads() work across blocks?
I believe it does not, so I want something similar to __syncthreads() that acts as a global barrier across the whole grid.
Or are there other problems in my code?
I want to note a few things:
__syncthreads() is a barrier that every thread of a block must reach before any of them can move on. If you put it inside an if statement that can split the threads of a block, the behavior is undefined: the kernel may hang or produce wrong results. Instead, close the if clause before the synchronization point and open another clause with the same condition after it.
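As a minimal sketch of that pattern (not the poster's kernel), the example below keeps the barrier at the top level of the kernel and puts the conditional work in two separate clauses around it. The kernel `blockSumKernel`, the helper `blockSumsOnDevice`, and the per-block sum it computes are hypothetical and exist only for illustration; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel: each block sums its own slice of `input`.
__global__ void blockSumKernel(const float *input, float *blockSums, int size)
{
    extern __shared__ float tile[];  // one float per thread of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // First clause: only in-range threads read from input.
    float v = 0.0f;
    if (i < size) {
        v = input[i];
    }
    tile[threadIdx.x] = v;

    // The barrier sits outside any condition, so every thread reaches it.
    __syncthreads();

    // Second clause: opened again after the barrier.
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int t = 0; t < blockDim.x; ++t) sum += tile[t];
        blockSums[blockIdx.x] = sum;
    }
}

// Hypothetical host helper: returns one partial sum per block.
std::vector<float> blockSumsOnDevice(const std::vector<float> &in, int threads)
{
    assert(!in.empty() && threads > 0);
    int size = static_cast<int>(in.size());
    int blocks = (size + threads - 1) / threads;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, size * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, in.data(), size * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, size);

    std::vector<float> out(blocks);
    cudaMemcpy(out.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Out-of-range threads still write a zero into shared memory and still hit the barrier, so the last, partially filled block behaves the same as the others.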
Why don’t you use threads in the y and z directions (threadIdx.y and threadIdx.z)? They are there to help you think in the natural dimensions of the problem.
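To illustrate, here is a small sketch that reads col, row, and d directly from the x, y, and z thread coordinates instead of decoding them from a flat index with % and /. The kernel `scaleVolume` (which just doubles every element), the helper `scaleVolumeOnDevice`, and the 8x8x4 block shape are assumptions chosen for the example; the memory layout matches the one in the question (d * height * width + row * width + col).

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel: doubles every element of a width x height x depth volume.
__global__ void scaleVolume(const float *input, float *output,
                            int width, int height, int depth)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int d   = blockIdx.z * blockDim.z + threadIdx.z;

    // Each dimension gets its own bounds check.
    if (col < width && row < height && d < depth) {
        int idx = d * height * width + row * width + col;
        output[idx] = 2.0f * input[idx];
    }
}

// Hypothetical host helper that launches the kernel on a 3D grid.
std::vector<float> scaleVolumeOnDevice(const std::vector<float> &in,
                                       int width, int height, int depth)
{
    assert(!in.empty() && static_cast<int>(in.size()) == width * height * depth);
    size_t bytes = in.size() * sizeof(float);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(8, 8, 4);  // 256 threads per block (an arbitrary choice)
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y,
              (depth + block.z - 1) / block.z);
    scaleVolume<<<grid, block>>>(dIn, dOut, width, height, depth);

    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```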
From what I understand of the code, you first add up the entries of each row and then apply a function to the sum. The second __syncthreads() looks like it is meant to guarantee that the sum is complete. However, __syncthreads() only synchronizes the threads of one block. Say two blocks have threads working on the same row: the first block adds its entries, synchronizes, moves on, and applies the function to the current partial sum. Then the second block gets scheduled, adds its entries to the result of the function, and reapplies the function, so you get unexpected results. I suggest you assign each row to a single block (possibly more than one row per block, depending on the size of the input) and have the block do a reduction over the row, then apply the function, then move on to its next row. Multiple blocks can also cooperate on one reduction, but you’ll have to be more careful.
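A minimal sketch of the one-row-per-block idea follows, assuming the same input and output layout as the question. Since the post does not show `sigmoid`, it is assumed here to be the logistic function; `rowSumSigmoid`, `rowSumSigmoidOnDevice`, and the block size of 128 (a power of two, which the tree reduction needs) are hypothetical choices. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

// Assumption: sigmoid is the logistic function (its definition is not in the post).
__device__ float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

// One block per (d, row) pair; launch with height * depth blocks.
__global__ void rowSumSigmoid(const float *input, float *output,
                              int width, int height, int depth)
{
    extern __shared__ float partial[];  // one float per thread
    int d = blockIdx.x / height;
    int row = blockIdx.x % height;
    int inIdx = d * height * width + row * width;

    // Each thread adds up a strided share of the row.
    float sum = 0.0f;
    for (int col = threadIdx.x; col < width; col += blockDim.x)
        sum += input[inIdx + col];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory; blockDim.x must be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();  // outside the if, so all threads reach it
    }

    // Only this block touched the row, so the sum is complete here.
    if (threadIdx.x == 0)
        output[d + depth * row] = sigmoid(partial[0]);
}

// Hypothetical host helper; returns height * depth activations.
std::vector<float> rowSumSigmoidOnDevice(const std::vector<float> &in,
                                         int width, int height, int depth)
{
    assert(!in.empty() && static_cast<int>(in.size()) == width * height * depth);
    const int threads = 128;  // power of two
    int rows = height * depth;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, in.size() * sizeof(float));
    cudaMalloc(&dOut, rows * sizeof(float));
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);

    rowSumSigmoid<<<rows, threads, threads * sizeof(float)>>>(dIn, dOut, width, height, depth);

    std::vector<float> out(rows);
    cudaMemcpy(out.data(), dOut, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Because no other block ever writes to the same output entry, __syncthreads() is enough here and no grid-wide barrier is needed.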
Whichever way you choose to implement it, the idea is to make 100% sure that the sum is complete before the function is applied. You seem to realize this, of course, but the current implementation does not guarantee it.
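If you would rather keep one thread per input element, one simple way to get the effect of a global barrier is to split the work into two kernel launches: kernels issued to the same stream run one after the other, so the second kernel only starts once every block of the first has finished. The sketch below assumes the accumulation is done with atomicAdd (the accumulation line is not shown in the post) and that `sigmoid` is the logistic function; `accumulateRows`, `applySigmoid`, and `activationTwoPass` are hypothetical names. A barrier inside a single kernel is also possible with cooperative groups (grid.sync() with a cooperative launch), but that is more involved.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

// Assumption: sigmoid is the logistic function (its definition is not in the post).
__device__ float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

// Pass 1: every thread adds its element into the sum of its row.
__global__ void accumulateRows(const float *input, float *output,
                               int width, int height, int depth, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        int row = (i % (width * height)) / width;
        int d = i / (width * height);
        // atomicAdd is an assumption; it keeps concurrent updates from being lost.
        atomicAdd(&output[d + depth * row], input[i]);
    }
}

// Pass 2: runs only after pass 1 has completed for the whole grid.
__global__ void applySigmoid(float *output, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) output[i] = sigmoid(output[i]);
}

// Hypothetical host helper; returns height * depth activations.
std::vector<float> activationTwoPass(const std::vector<float> &in,
                                     int width, int height, int depth)
{
    assert(!in.empty() && static_cast<int>(in.size()) == width * height * depth);
    const int threads = 256;
    int size = static_cast<int>(in.size());
    int rows = height * depth;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, size * sizeof(float));
    cudaMalloc(&dOut, rows * sizeof(float));
    cudaMemcpy(dIn, in.data(), size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, rows * sizeof(float));  // zero the sums before any kernel runs

    accumulateRows<<<(size + threads - 1) / threads, threads>>>(dIn, dOut, width, height, depth, size);
    applySigmoid<<<(rows + threads - 1) / threads, threads>>>(dOut, rows);

    std::vector<float> out(rows);
    cudaMemcpy(out.data(), dOut, rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Zeroing the output with cudaMemset before the first launch also removes the race between the thread that resets a sum and threads in other blocks that are already adding to it.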
All of this assumes, of course, that I understood your code properly, haha.