Inconsistent results for reduction, except while printf or cudamemcheck

Zenfhou · September 9, 2016, 1:15pm

Hello,

I’m going absolutely crazy with this problem. First of all here is my code :
Kernel :

__global__ void train_network(Quaternion * __restrict__ dW, Quaternion * __restrict__ dX, Quaternion *dB, Quaternion *dOutput, Quaternion *dGradient, Quaternion *dTarget, Quaternion *dSum, int *topolo    gy, int size, int weightSize){
 25 
 26     
 27     QNN qnn(topology, size, weightSize);
 28     qnn.feed_forward(dW,dX,dB,dOutput);
 29     qnn.output_error(dOutput,dTarget,dGradient, dSum);
 30 }

output_error :

74 __device__ void QNN::output_error(Quaternion * dOutput, Quaternion * dTarget, Quaternion *dGradient, Quaternion *dSum){
 75 
 76     //extern __shared__ Quaternion test[];
 77 
 78     //exponential(dOutput);
 79 
 80     exponential_sum_reduction(dOutput,dGradient);
 81     __syncthreads();
 82     exponential_sum_reduction(dGradient,dSum);
 83 
 84 
 85 }

and the reduction :

__device__ void QNN::exponential_sum_reduction(Quaternion * dIn, Quaternion * dOut){
 35 
 36 
 37     extern __shared__ Quaternion x_shared[];
 38 
 39     unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;
 40     unsigned int tid = threadIdx.x;
 41     const Quaternion empty(0.0f,0.0f,0.0f,0.0f);
 42 
 43     x_shared[tid] = empty;
 44     __syncthreads();
 45     if (gid < topology){ 
 46         x_shared[tid] = dIn[gid];
 47        
 48
 49         //x_shared[tid].exponential_equal();
 50 
 51     }
 52 
 53 
 54     __syncthreads();
 55 
 56     for(int i=blockDim.x/2; i>0; i>>=1){
 57 
 58         if(tid<i){
 59 
 60             //printf("Thread %d, got %d\n", i, blockIdx.x);
 61             x_shared[tid] += x_shared[tid + i];
 62         }
 63         __syncthreads();
 64     }
 65 
 66     if( tid == 0){
 67 
 68         //printf("bef Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);  
 69         dOut[blockIdx.x] = x_shared[0];
 70         //printf("af Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);
 71     }
 72 }

My problem is that when i use printf, everything is ok. But when i just run the app without memcheck or printf to get the result, i got inconsistent results. Sometimes it’s the good result, sometimes not … I’m thinking of a shared_memory problem but how ? And why ? If i put only ONE printf on the code it works … Interresting thing : if i only do one reduction, result is ok and consistent … Problem only occur whith the second reduction.

I realy don’t know HOW to catch up my mistake cause i can’t check my variables when the error happens …

Thanks a lot

tera · September 9, 2016, 2:08pm

Are you reserving enough memory on kernel launch for the extern shared Quaternion x_shared[]; declaration?
Are you using any other extern shared variables, including within other device functions?
Note that all of these declarations are mapped to the same memory address.

Zenfhou · September 9, 2016, 2:13pm

Thanks for you reply,
For your first question, yes i’m sure.
For the second another extern shared x_shared, is declared in feed_forward. But those two functions aren’t running at the same time.

tera · September 9, 2016, 2:17pm

Oh, and you have a race condition in your code:

x_shared[tid] += x_shared[tid + i];

reads from and writes to the same variables (from different threads) without synchronization. You need to separate reading and writing using a local variable, and place one of the __syncthreads() in between.

Zenfhou · September 9, 2016, 2:35pm

Wow, yes you’re right. But i tested something like :

for(int i=blockDim.x/2; i>0; i>>=1){
 56 
 57         if(tid<i){
 58 
 59             __syncthreads();
 60             Quaternion local(x_shared[tid].q);
 61             __syncthreads();
 62             local += x_shared[tid + i];
 63             __syncthreads();
 64             x_shared[tid] = local;
 65             
 66             //printf("Thread %d, got %d\n", i, blockIdx.x);
 67             //x_shared[tid] += x_shared[tid + i];
 68         }
 69         __syncthreads();
 70     }

And it still doesn’t work.

Thanks

tera · September 9, 2016, 2:42pm

Now that code has a problem that not all threads encounter the same (number of) __syncthreads().

Robert_Crovella · September 9, 2016, 2:42pm

I don’t think there is a race condition there (at that particular line of code). So “fixing” it won’t help. If you were going to use that latest code mod (not sure why you would) as tera pointed out, you have invalid usage of __syncthreads(), and in addition the transformation is not logically correct.

Going from this:

x_shared[tid] += x_shared[tid + i];

to this:

local += x_shared[tid + i];
            x_shared[tid] = local;

is not correct. It should be:

local = x_shared[tid + i];
            x_shared[tid] += local;

If you actually provided a short, complete code that demonstrates the problem, somebody could probably fix your reduction issue with a single posting.

Zenfhou · September 9, 2016, 2:45pm

Well but if a thread is stuck to a __syncthreads(), he just waits for the other threads to reach this same barrier no ? So if a tid > i, he just waits for all the tid < i to end the if instruction ? Or in the if tid < i are waiting for tid > i to come to this barrier, where they can’t come ?

BulatZiganshin · September 9, 2016, 2:46pm

all threads in the thread block should participate in __syncthreads. so, try that:

__syncthreads();
Quaternion local;
__syncthreads();
if(tid<i) local = x_shared[tid].q;
__syncthreads();
if(tid<i) local += x_shared[tid + i];
__syncthreads();
if(tid<i) x_shared[tid] = local;
__syncthreads();

BulatZiganshin · September 9, 2016, 2:53pm

it waits all threads to arrive to the same BARRIER unstruction (__syncthreads() call translated to this single instruction). But some other threads doesn’t execute this code path at all! So you can wait forever :)

BulatZiganshin · September 9, 2016, 2:53pm

dup

Zenfhou · September 9, 2016, 3:10pm

Well even with that solution, it doesn’t works …

I can’t give a “small” code but a complete one yes :

http://gitlia.univ-avignon.fr/uapv1201947/cuqnn

Currently my reduction looks like :

__device__ void QNN::exponential_sum_reduction(Quaternion * dIn, Quaternion * dOut){
 35 
 36 
 37     extern __shared__ Quaternion x_shared[];
 38 
 39     unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;
 40     unsigned int tid = threadIdx.x;
 41     Quaternion empty(0.0f,0.0f,0.0f,0.0f);
 42 
 43     if (gid < topology){
 44         x_shared[tid] = dIn[gid];
 45         //printf("Thread%d, on layer num %d,%d,%f\n", threadIdx.x, layerNum, biasIterator+(threadIdx.x + m * BLOCK_SIZE), dX[biasIterator+(threadIdx.x + m * BLOCK_SIZE)].q.x);
 46 
 47         //x_shared[tid].exponential_equal();
 48 
 49     }else
 50         x_shared[tid] = empty;
 51 
 52     Quaternion local(0.0f,0.0f,0.0f,0.0f);
 53     __syncthreads();
 54 
 55     for(int i=blockDim.x/2; i>0; i>>=1){
 56 
 57         if(tid<i)
 58             local = x_shared[tid + i];
 59         __syncthreads();
 60         if(tid<i)
 61             x_shared[tid] += local;
 62             //x_shared[tid] += x_shared[tid + i];
 63         __syncthreads();
 64     }
 65 
 66     __syncthreads();
 67     if( tid == 0){
 68 
 69         //printf("bef Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);  
 70         dOut[blockIdx.x] = x_shared[0];
 71         //printf("af Thread %d, got %f\n", blockIdx.x, dOut[blockIdx.x].q.x);
 72     }
 73 }

But after reflexion and as stated by txbob,

x_shared[tid] += x_shared[tid + i] shouldn’t gives a race condition

MutantJohn · September 9, 2016, 6:13pm

Is there a reason why you can’t use something like Thrust to solve your reductions? Implementing your own reduction algorithm may ultimately be a needless exercise if currently existing solutions can suffice.

Zenfhou · September 9, 2016, 6:31pm

Thrust is only valable ffom host no ?. Plus is it possible to override the standard type with my quaternion type ? Thanks

Robert_Crovella · September 9, 2016, 7:07pm

thrust algorithms can be called from device code

[url]https://devblogs.nvidia.com/parallelforall/power-cpp11-cuda-7/[/url]
[url]sorting - How to use Thrust to sort the rows of a matrix? - Stack Overflow

yes, thrust can generally work on arbitrary types

[url]c++ - Sorting packed vertices with thrust - Stack Overflow

Zenfhou · September 9, 2016, 7:18pm

Well, i’ll check that and give you some feedback ! Thanks a lot !

Zenfhou · September 9, 2016, 10:08pm

Thanks it works ! You are awesome ! For those interrested in, here is the final code :

84 __device__ void QNN::output_error(Quaternion * dOutput, Quaternion * dTarget, Quaternion *dGradient){    
 89     Quaternion sum(0.0f,0.0f,0.0f,0.0f);
 90     sum = thrust::reduce(thrust::device, dOutput, dOutput+topology, sum);
 91     __syncthreads();
 98 }

MutantJohn · September 10, 2016, 4:10pm

Thrust is pretty awesome.

It’s kind of just like C++'s STL in the sense that it’ll typically be fast enough (if you’re using it correctly) that you won’t need to worry about implementing a lot of common and basic algorithms.

Zenfhou · September 10, 2016, 11:11pm

Yes, but i was stuck on the impossibility of calling algorithm from device functions…

Robert_Crovella · September 11, 2016, 12:16am

keep in mind that unless you compile with the necessary elements for a CDP environment, and meet other requirements, your call to thrust::reduce from device code will operate within the context of a single device thread.

Topic		Replies	Views
Using Thrust to sort Unified Memory Buffer? GPU-Accelerated Libraries	8	5061	May 7, 2015
why result varied based on different number of threads per block? CUDA Programming and Performance	8	1930	March 1, 2011
matrix multiply reduction CUDA Programming and Performance	41	35533	January 15, 2011
help with kernel synchronization? CUDA Programming and Performance	22	13898	August 26, 2010
Reduction questions(newbie-ish) CUDA Programming and Performance	7	1792	January 14, 2009
Memory problem? ...incredible slowdown CUDA Programming and Performance	29	16290	January 30, 2011
Unified Memory Behavior... CUDA Programming and Performance	4	3401	August 11, 2014
CUDA Fortran matrix-multiply 10x slower than CUDA C version Legacy PGI Compilers	5	6891	July 14, 2010
2D reduction using CUDA The use a cuda and cublas library for a 2D simple reduction CUDA Programming and Performance	11	4397	February 7, 2012
How to efficiently sort 5 arrays of integers? CUDA Programming and Performance	7	1162	June 19, 2015

Inconsistent results for reduction, except while printf or cudamemcheck

Related topics