Inconsistent results for reduction, except while printf or cudamemcheck

Hello,

I’m going absolutely crazy with this problem. First of all here is my code :
Kernel :

__global__ void train_network(Quaternion * __restrict__ dW, Quaternion * __restrict__ dX, Quaternion *dB, Quaternion *dOutput, Quaternion *dGradient, Quaternion *dTarget, Quaternion *dSum, int *topolo    gy, int size, int weightSize){
 25 
 26     
 27     QNN qnn(topology, size, weightSize);
 28     qnn.feed_forward(dW,dX,dB,dOutput);
 29     qnn.output_error(dOutput,dTarget,dGradient, dSum);
 30 }

output_error :

74 __device__ void QNN::output_error(Quaternion * dOutput, Quaternion * dTarget, Quaternion *dGradient, Quaternion *dSum){
 75 
 76     //extern __shared__ Quaternion test[];
 77 
 78     //exponential(dOutput);
 79 
 80     exponential_sum_reduction(dOutput,dGradient);
 81     __syncthreads();
 82     exponential_sum_reduction(dGradient,dSum);
 83 
 84 
 85 }

and the reduction :

__device__ void QNN::exponential_sum_reduction(Quaternion * dIn, Quaternion * dOut){
 35 
 36 
 37     extern __shared__ Quaternion x_shared[];
 38 
 39     unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;
 40     unsigned int tid = threadIdx.x;
 41     const Quaternion empty(0.0f,0.0f,0.0f,0.0f);
 42 
 43     x_shared[tid] = empty;
 44     __syncthreads();
 45     if (gid < topology){ 
 46         x_shared[tid] = dIn[gid];
 47        
 48
 49         //x_shared[tid].exponential_equal();
 50 
 51     }
 52 
 53 
 54     __syncthreads();
 55 
 56     for(int i=blockDim.x/2; i>0; i>>=1){
 57 
 58         if(tid<i){
 59 
 60             //printf("Thread %d, got %d\n", i, blockIdx.x);
 61             x_shared[tid] += x_shared[tid + i];
 62         }
 63         __syncthreads();
 64     }
 65 
 66     if( tid == 0){
 67 
 68         //printf("bef Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);  
 69         dOut[blockIdx.x] = x_shared[0];
 70         //printf("af Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);
 71     }
 72 }

My problem is that when i use printf, everything is ok. But when i just run the app without memcheck or printf to get the result, i got inconsistent results. Sometimes it’s the good result, sometimes not … I’m thinking of a shared_memory problem but how ? And why ? If i put only ONE printf on the code it works … Interresting thing : if i only do one reduction, result is ok and consistent … Problem only occur whith the second reduction.

I realy don’t know HOW to catch up my mistake cause i can’t check my variables when the error happens …

Thanks a lot

Are you reserving enough memory on kernel launch for the extern shared Quaternion x_shared[]; declaration?
Are you using any other extern shared variables, including within other device functions?
Note that all of these declarations are mapped to the same memory address.

Thanks for you reply,
For your first question, yes i’m sure.
For the second another extern shared x_shared, is declared in feed_forward. But those two functions aren’t running at the same time.

Oh, and you have a race condition in your code:

x_shared[tid] += x_shared[tid + i];

reads from and writes to the same variables (from different threads) without synchronization. You need to separate reading and writing using a local variable, and place one of the __syncthreads() in between.

Wow, yes you’re right. But i tested something like :

for(int i=blockDim.x/2; i>0; i>>=1){
 56 
 57         if(tid<i){
 58 
 59             __syncthreads();
 60             Quaternion local(x_shared[tid].q);
 61             __syncthreads();
 62             local += x_shared[tid + i];
 63             __syncthreads();
 64             x_shared[tid] = local;
 65             
 66             //printf("Thread %d, got %d\n", i, blockIdx.x);
 67             //x_shared[tid] += x_shared[tid + i];
 68         }
 69         __syncthreads();
 70     }

And it still doesn’t work.

Thanks

Now that code has a problem that not all threads encounter the same (number of) __syncthreads().

I don’t think there is a race condition there (at that particular line of code). So “fixing” it won’t help. If you were going to use that latest code mod (not sure why you would) as tera pointed out, you have invalid usage of __syncthreads(), and in addition the transformation is not logically correct.

Going from this:

x_shared[tid] += x_shared[tid + i];

to this:

local += x_shared[tid + i];
            x_shared[tid] = local;

is not correct. It should be:

local = x_shared[tid + i];
            x_shared[tid] += local;

If you actually provided a short, complete code that demonstrates the problem, somebody could probably fix your reduction issue with a single posting.

Well but if a thread is stuck to a __syncthreads(), he just waits for the other threads to reach this same barrier no ? So if a tid > i, he just waits for all the tid < i to end the if instruction ? Or in the if tid < i are waiting for tid > i to come to this barrier, where they can’t come ?

all threads in the thread block should participate in __syncthreads. so, try that:

__syncthreads();
Quaternion local;
__syncthreads();
if(tid<i) local = x_shared[tid].q;
__syncthreads();
if(tid<i) local += x_shared[tid + i];
__syncthreads();
if(tid<i) x_shared[tid] = local;
__syncthreads();

it waits all threads to arrive to the same BARRIER unstruction (__syncthreads() call translated to this single instruction). But some other threads doesn’t execute this code path at all! So you can wait forever :)

dup

Well even with that solution, it doesn’t works …

I can’t give a “small” code but a complete one yes :

http://gitlia.univ-avignon.fr/uapv1201947/cuqnn

Currently my reduction looks like :

__device__ void QNN::exponential_sum_reduction(Quaternion * dIn, Quaternion * dOut){
 35 
 36 
 37     extern __shared__ Quaternion x_shared[];
 38 
 39     unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;
 40     unsigned int tid = threadIdx.x;
 41     Quaternion empty(0.0f,0.0f,0.0f,0.0f);
 42 
 43     if (gid < topology){
 44         x_shared[tid] = dIn[gid];
 45         //printf("Thread%d, on layer num %d,%d,%f\n", threadIdx.x, layerNum, biasIterator+(threadIdx.x + m * BLOCK_SIZE), dX[biasIterator+(threadIdx.x + m * BLOCK_SIZE)].q.x);
 46 
 47         //x_shared[tid].exponential_equal();
 48 
 49     }else
 50         x_shared[tid] = empty;
 51 
 52     Quaternion local(0.0f,0.0f,0.0f,0.0f);
 53     __syncthreads();
 54 
 55     for(int i=blockDim.x/2; i>0; i>>=1){
 56 
 57         if(tid<i)
 58             local = x_shared[tid + i];
 59         __syncthreads();
 60         if(tid<i)
 61             x_shared[tid] += local;
 62             //x_shared[tid] += x_shared[tid + i];
 63         __syncthreads();
 64     }
 65 
 66     __syncthreads();
 67     if( tid == 0){
 68 
 69         //printf("bef Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);  
 70         dOut[blockIdx.x] = x_shared[0];
 71         //printf("af Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);
 72     }
 73 }

But after reflexion and as stated by txbob,

x_shared[tid] += x_shared[tid + i] shouldn’t gives a race condition

Is there a reason why you can’t use something like Thrust to solve your reductions? Implementing your own reduction algorithm may ultimately be a needless exercise if currently existing solutions can suffice.

Thrust is only valable ffom host no ?. Plus is it possible to override the standard type with my quaternion type ? Thanks

thrust algorithms can be called from device code

https://devblogs.nvidia.com/parallelforall/power-cpp11-cuda-7/
http://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix

yes, thrust can generally work on arbitrary types

http://stackoverflow.com/questions/29597224/sorting-packed-vertices-with-thrust

Well, i’ll check that and give you some feedback ! Thanks a lot !

Thanks it works ! You are awesome ! For those interrested in, here is the final code :

84 __device__ void QNN::output_error(Quaternion * dOutput, Quaternion * dTarget, Quaternion *dGradient){    
 89     Quaternion sum(0.0f,0.0f,0.0f,0.0f);
 90     sum = thrust::reduce(thrust::device, dOutput, dOutput+topology, sum);
 91     __syncthreads();
 98 }

Thrust is pretty awesome.

It’s kind of just like C++'s STL in the sense that it’ll typically be fast enough (if you’re using it correctly) that you won’t need to worry about implementing a lot of common and basic algorithms.

Yes, but i was stuck on the impossibility of calling algorithm from device functions…

keep in mind that unless you compile with the necessary elements for a CDP environment, and meet other requirements, your call to thrust::reduce from device code will operate within the context of a single device thread.