I’m currently trying to implement a softmax function in CUDA to handle the last layer of my neural network. The softmax function is defined as follows: https://en.wikipedia.org/wiki/Softmax_function
I need to deal with the output of my feed-forward device function, which gives me a matrix containing the values of every node of the neural network, for example:
[layer0[n0,n1,n2],layer1[n0,n1],layer2[n0,n1,n2]]
with no direct knowledge of the beginning index of layer2. To compute the softmax, I first need to sum all of layer2’s nodes (exponentiated), and then divide each of them by this sum. I don’t know how to do this in a parallel way. Currently every thread performs the same sequential sum to obtain this value, and it takes far too long. Here is the code:

int outputSize = topology[size-1]; // topology is an int* storing the NN topology (number of neurons per layer); size is the number of layers
int i = topology[0];
for(int tmp = 1; tmp < size-1; ++tmp)
    i += topology[tmp]; // i is now the beginning index of the last layer
int bound = i + outputSize; // one past the last item
while(i < bound){
    sum += dIn[i].exponential();
    i++;
}

There is no coalesced memory access AND lots of global memory traffic… plus, there is no need for every thread to redo this sum. I’ve looked at reductions, but I think a reduction would only be better if my last layer is large enough, and in a classification task the output layer isn’t that large. I know that the last step is to divide each node by this value, which can be parallelized easily, but how do I compute this first sum efficiently, and where should I store the value?

If the last layer is small enough that a reduction or other optimized technique is not worthwhile, then I’m not sure why you’re asking this question or why you are attaching such importance to this one operation.

Nevertheless, the standard advice would be to perform operations in a thread-parallel way on the GPU, rather than have a single thread do all the work of a summing operation, for example.

The two basic thread-parallel approaches would be to use atomics to generate the sum, or a classical parallel reduction. To make the computed sum most easily available across threads, one would usually store it in global memory somewhere. If you want to use atomics, global atomics are pretty fast on Kepler and beyond. For a parallel reduction, one would usually do the sweeping within shared memory and put the final result in global memory, where it can be used by other threads for the subsequent division operation.

Thanks txbob for your reply. I was considering doing this part of the calculation on the host, but I’m not sure the time spent on the data copy would be worth it. I’ll look into atomic operations, but in terms of parallel reduction, if I understand correctly, you need as many reduction calls as there are levels? The first reduction gives you the sum of each block, and the second the sum of those results?

I’m still having some trouble figuring out how to use shared memory when each thread only has a single float or int to contribute.

For parallel reduction, at most two reduction kernel calls are necessary, and in fact in most cases it can be done with a single kernel call using a threadblock-draining approach.

For basic parallel reduction, I suggest you study the CUDA sample code and accompanying tutorial:

__device__ void QNN::exponential_sum_reduction(Quaternion * __restrict__ dIn, Quaternion * dOut){

    __shared__ Quaternion x_shared[BLOCK_SIZE];

    const unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;
    const unsigned int tid = threadIdx.x;
    const Quaternion empty(0.0f, 0.0f, 0.0f, 0.0f);

    if (gid < topology){ // topology gives the number of elements in dIn
        x_shared[tid] = dIn[gid];
        //x_shared[tid].exponential_equal();
    }
    else{
        x_shared[tid] = empty;
    }
    __syncthreads();

    for(int i = blockDim.x/2; i > 0; i >>= 1){
        if(tid < i){
            x_shared[tid] += x_shared[tid + i];
        }
        __syncthreads();
    }

    if(tid == 0){
        dOut[blockIdx.x] = x_shared[0];
    }
}

But it doesn’t work… I tried to just copy/paste the optimized solution from http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf but I don’t want to use templates for the moment (maybe later?). When I replaced the blockSize template parameter with my BLOCK_SIZE I got varying results (a race condition problem?). Anyway, back to my code: I always get a result that is close to the correct one but still wrong. And I can’t understand why it almost gives the right answer when I’m only running the reduction once (it should have to run twice before giving the correct result, no?)…

I just want a device function which takes two pointers and gives me the sum, without looping outside this function…

I checked with cuda-memcheck and nothing went wrong in this function (there was a memory out-of-bounds in another function). My problem is quite simple, I think, for someone who knows CUDA. I have a Quaternion *table (a Quaternion is a hyper-complex number; all the operators have been implemented), and I want to do a softmax on this Quaternion *table. It can hold anywhere from 2 Quaternions up to 2048 Quaternions. Doing a softmax implies:
1 - Summing the exponentials of each element in *table.
2 - Dividing each element by this sum.

So I first need to do this sum, or a reduction of this *table, as fast as possible. That’s where I am. The code is the same as the one posted above.


If I understand this reduction (inspired by the official CUDA presentation cited above), it performs a sum within each block and stores the per-block result in *dOut. But the size of *dOut should then be the number of blocks, right? For example, if I process a 1024-element *dIn with 64 threads per block, *dOut should contain 16 values, and I would need to sum those values a second time to get the true total?

In fact I really just want to sum *dIn in one shot and have the result stored in global memory so that all the other threads can access it (or maybe a local copy; it’s just a float, so…).

I can’t use atomic ops, because I’m dealing with custom objects :/