1 over x 1/x

Who has ever seen this? I have a piece of code and when I use this

prob0=1.0f;

prob1=1.0f/prob0;

prob1 becomes 1 as you expect. But if I use

prob0=exp(0.0f);

prob1=1.0f/prob0;

prob1 becomes 0?

What is going on?

I add prob1 with a atomicAdd to a memory location and read out that memory. All entries are 0 in the second case and have the expected number in the first case.

exp is a double precision function. Does your gpu support it? What happens if you use expf?

I have a gtx 460, which should also support double precision, but if I try expf(0.0f) I still get zeros

the strange thing is that when I use

prob1=exp(0.0f)

no problem

but

prob0=exp(0.0f)

prob1=1.0/prob0

problem

prob1=exp(-0.0f); also no problem, which is effectively the same as above

The thing is that I want to calculate

prob1=1.0f/(1.0f+exp(-dp));

which also returns 0

It’s hard to say what exactly is going on, most likely something else is wrong in the code. But I want to point out that the code

prob0=1.0f;
prob1=1.0f/prob0;

will really be optimized by the NVCC compiler to

prob0=1.0f;
prob1=1.0f;

Whereas the second version will perform all calculations (exp and division).

If you have a small self-contained app that reproduces the problem, please post it. I did a quick check with the app below and got the expected results:

prob0 = 1.000000000e+000

prob1 = 1.000000000e+000

prob2 = 5.000000000e-001

#include <stdio.h>

#include <stdlib.h>

__global__ void exp_trial (float dp, float *res) 

{

    float prob0, prob1, prob2;

    prob0 = expf(0.0f);

    prob1 = 1.0f/prob0;

    prob2 = 1.0f/(1.0f+expf(-dp));

    res[0] = prob0;

    res[1] = prob1;

    res[2] = prob2;

}

int main (void)

{

    cudaError_t cudaStat;

    float *res_d = 0;

    float res[3];

    cudaStat = cudaMalloc ((void**)&res_d, sizeof(res));

    if (cudaStat != cudaSuccess) {

        printf ("cudaMalloc failed: %s\n", cudaGetErrorString (cudaStat));

        return EXIT_FAILURE;

    }

    exp_trial<<<1,1>>>(0.0f, res_d);

    cudaStat = cudaMemcpy (res, res_d, sizeof(res), cudaMemcpyDeviceToHost);

    if (cudaStat != cudaSuccess) {

        printf ("cudaMemcpy failed: %s\n", cudaGetErrorString (cudaStat));

        return EXIT_FAILURE;

    }

    cudaStat = cudaFree (res_d);

    if (cudaStat != cudaSuccess) {

        printf ("cudaFree failed: %s\n", cudaGetErrorString (cudaStat));

        return EXIT_FAILURE;

    }

    printf ("prob0 = % 16.9e\n", res[0]);

    printf ("prob1 = % 16.9e\n", res[1]);

    printf ("prob2 = % 16.9e\n", res[2]);

    return EXIT_SUCCESS;

}

It is definitely not the case that exp() is never working (also in my code no problems in other places), it is a combination with something else.

This is the whole kernel routine that still fails, when I take the maximum amount of other code out. I noticed that if I take the condition out of the atomicAdd (if (isword0)), it no longer fails. Since d_probs is initialized with zero, it would suggest that the condition is not satisfied and this causes the zero output and not the exp. However, I can leave the condition in and replace prob1=1.0/exp(0.0f) by prob1=1.0f and numbers do show up in d_probs. This means the condition is met. Failure requires both the condition to be there and the exp function.

__global__ void perplexity_lbl(int loops, float* d_predRw, size_t d_predRwp, float *d_qk, size_t d_qkp, unsigned int* d_qki, float* d_probs, unsigned int* d_wordnumbermatrix, unsigned int mspaceplus1, unsigned int num_leaf_cells, unsigned int* d_leaf_cell_table, unsigned int num_words)

{  unsigned int index;

   unsigned int qvector;

   unsigned int word0;

   unsigned int word1;

   bool one_branche;

   unsigned int prev_qvec;

   unsigned int right_word;

   float prob0, prob1, prob;

   bool isword0, isword1;

const unsigned int word_num = blockIdx.y;

   isword0=false;

   isword1=false;

   right_word=d_wordnumbermatrix[word_num];

__syncthreads();

   for (int i=0;i<loops;++i) {

      index=(blockIdx.x * loops + i) * blockDim.x + threadIdx.x; //0..num_leaf_cells-1

      if (index<num_leaf_cells) {//for all q nodes with leafs in tree

         qvector=d_leaf_cell_table[index];

         prev_qvec=d_qki[3*qvector];

         word0=d_qki[3*qvector+1];

         word1=d_qki[3*qvector+2];

         if ((word0>0x80000000) && ((word0 & 0x7fffffff)==right_word)) {

            isword0=true;

         }

         if ((word1>0x80000000) && ((word1 & 0x7fffffff)==right_word)) {

            isword1=true;

         }

         if (isword0 || isword1) {

            prob1=1.0f/(1.0f+exp(0.0f));

            if (isword0) {

               atomicAdd(&d_probs[word_num], prob1);

            }

            if (isword1) {

               atomicAdd(&d_probs[word_num], prob1);

            }

         }

      }

   }

}

The kernel routine is called like this

unsigned int lpw;

   cudaError_t err;

   blcks=min((int)ceil((num_leaf_cells)/(float)(THREADS)), MAXBLOCKS);

   dim3 dimBlock(THREADS,1,1);

   dim3 dimGrid(blcks,1024,1);

   lpw=(int)ceil(num_leaf_cells/float(THREADS*blcks));

   printf("num_leaf_cells=%d\n", num_leaf_cells);

   cudaMemset(d_probs, 0, 1024*sizeof(d_probs[0]));

   err=cudaGetLastError();

   if (err!=0)  printf("cuda error memset %s\n", cudaGetErrorString(err));

   perplexity_lbl<<<dimGrid, dimBlock>>>(lpw, d_predRw, d_predRwp, d_qk, d_qkp, d_qki, d_probs, d_wordnumbermatrix, mspaceplus1, num_leaf_cells, d_leaf_cell_table, last_word-first_word);

   err=cudaGetLastError();

   if (err!=0)  printf("cuda error perplexity %s\n", cudaGetErrorString(err));

   cutilSafeCall(cudaThreadSynchronize());

   err=cudaGetLastError();

   if (err!=0)  printf("cuda error perplexity%s\n", cudaGetErrorString(err));

   cutilSafeCall(cudaMemcpy(h_probs, d_probs, 1024*sizeof(float), cudaMemcpyDeviceToHost));

unfortunately to really run it independently, you need a lot of arrays to be intialized.

I don’t see anything wrong with that code. Maybe atomicAdd is somehow a culprit? You don’t need that many atomicAdd’s anyway. I’d create a local float variable that’s initialized to 0, do all additions to that variable, and then in the end call atomicAdd once to write the result into d_probs.

gee, that’s a great suggestion, thanks a lot. First time I got the atomicadd out of the loop, it still failed. This version however, does work.

__global__ void perplexity_lbl(int loops, float* d_predRw, size_t d_predRwp, float *d_qk, size_t d_qkp, unsigned int* d_qki, float* d_probs, unsigned int* d_wordnumbermatrix, unsigned int mspaceplus1, unsigned int num_leaf_cells, unsigned int* d_leaf_cell_table, unsigned int num_words)

{  unsigned int index;

   unsigned int qvector;

   unsigned int word0;

   unsigned int word1;

   bool one_branche;

   unsigned int prev_qvec;

   unsigned int right_word;

   float prob0, prob1, prob;

   float probtot;

const unsigned int word_num = blockIdx.y;

   right_word=d_wordnumbermatrix[word_num];

   probtot=0.0f;

__syncthreads();

   for (int i=0;i<loops;++i) {

      index=(blockIdx.x * loops + i) * blockDim.x + threadIdx.x; //0..num_leaf_cells-1

      if (index<num_leaf_cells) {//for all q nodes with leafs in tree

         qvector=d_leaf_cell_table[index];

         prev_qvec=d_qki[3*qvector];

         word0=d_qki[3*qvector+1];

         word1=d_qki[3*qvector+2];

         if ((word0>0x80000000) && ((word0 & 0x7fffffff)==right_word)) {

            probtot+=1.0f/(1.0f+expf(0.0f));

         }

         if ((word1>0x80000000) && ((word1 & 0x7fffffff)==right_word)) {

            probtot+=(1.0f-1.0f/(1.0f+expf(0.0f)));

         }

      }

   }

   atomicAdd(&d_probs[word_num], probtot);

}

It started working when I took the intermediate variable isword0/isword1 out. Still weird.

Just noticed that you didn’t reset isword0 and isword1 to false each loop, it seems that you should.

Another excellent spot. Strange thing is that this would cause isword0/isword1 to be set to true somewhere and you would expect additions taking place from then on. It seemed however that addition were not taking place. I’ll try what happens if I correct it though.

It had something to do with it, because the code

probtot=0.0f;

   for (int i=0;i<loops;++i) {

      index=(blockIdx.x * loops + i) * blockDim.x + threadIdx.x; //0..num_leaf_cells-1

      if (index<num_leaf_cells) {//for all q nodes with leafs in tree

         qvector=d_leaf_cell_table[index];

         prev_qvec=d_qki[3*qvector];

         word0=d_qki[3*qvector+1];

         word1=d_qki[3*qvector+2];

         isword0=((word0>0x80000000) && ((word0 & 0x7fffffff)==right_word));

         isword1=((word1>0x80000000) && ((word1 & 0x7fffffff)==right_word));

         if (isword0 || isword1) {

            prob1=1.0f/(exp(0.0f)+1.0f);

            probtot+=prob1;

         }

      }

   }

   atomicAdd(&d_probs[word_num], probtot);

works. Maybe a loop

word_num=blockIdx.y;

for (unsigned int i=0;i<10;i++) {

a=1.0f/exp(0.0f);

atomicAdd(&d_probs[word_num], a);

}

fails because atomicAdds are stacked on top of each other in the same thread. What the relation ship with the exp function is, is not clear.

isword0=((word0>0x80000000) && ((word0 & 0x7fffffff)==right_word));         

isword1=((word1>0x80000000) && ((word1 & 0x7fffffff)==right_word));

is different from

if ((word0>0x80000000) && ((word0 & 0x7fffffff)==right_word)) {            

    isword0=true;         

}         

if ((word1>0x80000000) && ((word1 & 0x7fffffff)==right_word)) {            

    isword1=true;         

}

it is the same as

if ((word0>0x80000000) && ((word0 & 0x7fffffff)==right_word)) {            

    isword0=true;         

}else{

    isword0=false;         

}        

if ((word1>0x80000000) && ((word1 & 0x7fffffff)==right_word)) {            

    isword1=true;         

}else{

    isword1=false;

}

You are right, LSChien, I changed that after hamster143 noticed isword0 and isword1 had to be reset to false every iteration. This is another way of doing that.
It seems to solve the problem, but not explain it, since the original implementation should set them to true on the first hit and keep adding after that moment. Instead it returned 0.0 and only in the case the value to be added comes from a 1/exp(0.0f).