Implementing a softmax in CUDA?

hello,

I’m currently trying to implement a softmax function in CUDA to handle the last layer of my neural network. The softmax function is defined as follows:
https://en.wikipedia.org/wiki/Softmax_function
I need to deal with the output of my feed-forward device function, which gives me a matrix containing the values of every node of the neural network, for example:
[layer0[n0,n1,n2],layer1[n0,n1],layer2[n0,n1,n2]]
with no direct way of knowing the beginning index of layer2. To do the softmax, I first need to sum all of layer2’s nodes (exponentiated), and then divide each of them by this value. I don’t know how to do this in a parallel way. Currently every thread does a sequential sum to get this value, and it takes far too long. Here is the code:

int outputSize = topology[size-1]; // topology is an int* storing the NN topology (number of neurons per layer); size is the number of layers
int i = topology[0];
for(int tmp = 1; tmp < size-1; ++tmp)
   i += topology[tmp];          // i is now the beginning index of the last layer
int bound = i + outputSize;     // one past the last item of the last layer
while(i < bound){
   sum += dIn[i].exponential(); // every thread walks the whole last layer sequentially
   i++;
}

There is no coalesced memory access AND there are big global memory accesses… PLUS, there is no need to do this sum in every thread, right? I’ve looked at some reductions, but I think a reduction is only worthwhile if the last layer is large enough, and in a classification task the output layer isn’t that large. I know that the last thing to do is divide each node by this value, so that part can be parallelized easily (see the sketch below), but how do I do this first sum efficiently, and where do I store the value?
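For reference, here is roughly what I have in mind for the division step; it’s just a sketch, assuming the exponentiated values are already in dOut and the sum is already sitting in a single float in global memory (softmax_divide, dSum and n are placeholder names):

__global__ void softmax_divide(float *dOut, const float *dSum, int n)
{
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    if (gid < n) {
        // every thread reads the same sum from global memory
        // and normalizes its own element of the last layer
        dOut[gid] /= *dSum;
    }
}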

Thanks a lot !

If the last layer is small enough that a reduction or other optimized technique is not worthwhile, then I’m not sure why you’re asking this question or why you are attaching such importance to this one operation.

Nevertheless, the standard advice would be to perform operations in a thread-parallel way on the GPU, rather than have a single thread do all the work of a summing operation, for example.

The two basic thread-parallel approaches would be to use atomics to generate the sum, or a classical parallel reduction. To make the computed sum most easily available across threads, usually one would store it in global memory somewhere. If you want to use atomics, then global atomics are pretty fast on Kepler and beyond. For a parallel reduction, one would usually do the sweeping within shared memory and put the final result in global memory, where it can be used by other threads for the subsequent division operation.
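As a rough illustration of the atomics approach (just a sketch, assuming a plain float array of length n and a single-float accumulator in global memory that was zeroed before the launch; the kernel and variable names are placeholders):

__global__ void exp_sum_atomic(const float * __restrict__ dIn, float *dSum, int n)
{
    int gid = threadIdx.x + blockIdx.x * blockDim.x;
    if (gid < n) {
        // each thread adds the exponential of its own element
        // to the single accumulator in global memory
        atomicAdd(dSum, expf(dIn[gid]));
    }
}

The reduction variant replaces the atomicAdd with a per-block sweep in shared memory, and only the block totals ever touch global memory.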

Thanks txbob for your reply. I was considering doing this part of the calculation on the host, but I’m not sure the time spent on data copies would be worth it. I’ll look into atomic operations, but in terms of parallel reduction, if I understand correctly you need as many reduction calls as there are passes? The first reduction gives you the sum of each block and the second the sum of those results?

I’m still having trouble figuring out how to use shared memory when each thread only has a single float or int to contribute.

For parallel reduction, at most two reduction kernel calls are necessary, and in most cases it can be done with a single kernel call using a threadblock-draining approach.

For basic parallel reduction, I suggest you study the CUDA sample code and accompanying tutorial:

https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

For the threadblock-draining approach that avoids the second kernel call, review the threadfenceReduction CUDA sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#threadfencereduction
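The essential pattern in that sample looks roughly like this; this is a loose sketch, not the actual sample code, and it assumes float data, a power-of-two block size equal to BLOCK, and a dPartial buffer with one entry per block:

#define BLOCK 256   // assumed threads per block; must be a power of two

__device__ unsigned int retirementCount = 0;

__global__ void sum_single_pass(const float * __restrict__ dIn, float *dPartial, float *dResult, int n)
{
    __shared__ float sdata[BLOCK];
    __shared__ bool amLast;
    unsigned int tid = threadIdx.x;
    unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;

    // ordinary per-block reduction in shared memory
    sdata[tid] = (gid < n) ? dIn[gid] : 0.0f;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) dPartial[blockIdx.x] = sdata[0];

    // make the partial sum visible to other blocks, then check
    // whether this block is the last one to finish
    __threadfence();
    if (tid == 0) amLast = (atomicInc(&retirementCount, gridDim.x) == gridDim.x - 1);
    __syncthreads();

    // the last block drains the per-block partial sums
    if (amLast) {
        float s = 0.0f;
        for (unsigned int i = tid; i < gridDim.x; i += blockDim.x)
            s += dPartial[i];
        sdata[tid] = s;
        __syncthreads();
        for (unsigned int i = blockDim.x / 2; i > 0; i >>= 1) {
            if (tid < i) sdata[tid] += sdata[tid + i];
            __syncthreads();
        }
        if (tid == 0) {
            *dResult = sdata[0];   // final sum, readable by later kernels
            retirementCount = 0;   // reset so the kernel can be launched again
        }
    }
}

The last block to retire does the final sweep over the per-block partials, so the whole sum is produced by a single kernel launch.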

Thanks for your advice!

Hey,

I did this:

__device__ void QNN::exponential_sum_reduction(Quaternion * __restrict__ dIn, Quaternion * dOut){

    __shared__ Quaternion x_shared[BLOCK_SIZE];

    const unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;
    const unsigned int tid = threadIdx.x;
    const Quaternion empty(0.0f,0.0f,0.0f,0.0f);

    if (gid < topology){ // topology gives the number of elements in dIn
        x_shared[tid] = dIn[gid];
        //x_shared[tid].exponential_equal();
    }
    else{
        x_shared[tid] = empty;
    }
    __syncthreads();

    for(int i=blockDim.x/2; i>0; i>>=1){
        if(tid<i){
            x_shared[tid] += x_shared[tid + i];
        }
        __syncthreads();
    }

    if( tid == 0){
        dOut[blockIdx.x] = x_shared[0];
    }
}

But it doesn’t work… I tried to just copy/paste the optimized solution from http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf but I don’t want to use templates for the moment (maybe later?), and when I replaced the blockSize template parameter by my BLOCK_SIZE I got varying results (race condition problem?). Anyway, back to my code: it always gives me a result that is close to the correct sum but never exactly right, and I can’t understand why it gets so close when I only run it once (shouldn’t it have to run twice before giving me the right result?).

I just want a device function that takes two pointers and gives me the sum, without having to loop outside this function…

Thanks !

What happens if you run your code with cuda-memcheck?

Another approach would be to provide a short, complete code that demonstrates the problem you are having.

I checked with cuda-memcheck and nothing went wrong in this function (there was an out-of-bounds memory access in another function). My problem is quite simple, I think, for someone who knows CUDA. I have a Quaternion *table (Quaternion means hypercomplex number; all the operators have been implemented), and I want to do a softmax on this Quaternion *table. It can be 2 Quaternions as well as 2048 Quaternions. Doing a softmax implies:
1 - Do a sum of the exponentials of each element in *table.
2 - Divide each element by this sum.

So I first need to do a sum, or a reduction, of this *table as fast as possible. That’s where I am. The code is the same as the one posted above.


If I understand this reduction correctly (inspired by the official CUDA presentation mentioned above), it will compute a sum per block and store the results in *dOut, so the size of *dOut should be the number of blocks, right? For example, if I process a 1024-element *dIn with 64 threads per block, *dOut should contain 16 values, and I would need to sum those values one more time to get the true total?
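If so, I guess the two-pass version would look something like this (just a sketch of what I understand, using a hypothetical float kernel instead of my Quaternion device function, with 1024 elements and 64 threads per block as in the example above):

#define BLOCK_SIZE 64   // threads per block, power of two

// hypothetical float version of the per-block reduction above
__global__ void reduce_sum(const float * __restrict__ dIn, float *dOut, int n)
{
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;
    unsigned int gid = threadIdx.x + blockIdx.x * blockDim.x;

    sdata[tid] = (gid < n) ? dIn[gid] : 0.0f;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) dOut[blockIdx.x] = sdata[0];
}

void sum_two_pass(const float *dIn, float *dPartial, float *dSum)
{
    // first pass: 16 blocks, one partial sum per block
    reduce_sum<<<16, BLOCK_SIZE>>>(dIn, dPartial, 1024);
    // second pass: a single block folds the 16 partials into dSum[0]
    reduce_sum<<<1, BLOCK_SIZE>>>(dPartial, dSum, 16);
}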

In fact, I really just want to sum *dIn in one shot and have the value stored in global memory so that all the other threads can access it (or maybe a local copy, it’s just a float so…).

I can’t use atomic ops, because I’m dealing with custom objects :/

Thanks txbob !