indexing problem in CUDA kernel function

Hi,

I have a problem in my CUDA kernel function.
This is the function

//GPU Kernel
__global__ void CalcProbKernel(float* Mu, float* Sigma, float* MLEConst1_minus_LogSigma,
                               int NbrOfRobotPos,
                               float* AnodeData, int* BestRobotPos, int ArraySize) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= ArraySize)
        return;

    float Help;
    float Prob;
    float BestProb = -1e10f;
    BestRobotPos[idx] = 0;
    int index1, index2;
    int Cntr1, Cntr2;
    float LocalMu, LocalSigma, LocalMLEConst1, LocalAnodeData;

    for (Cntr1 = 0; Cntr1 < NbrOfRobotPos; Cntr1++) {
        Prob = 0;
        for (Cntr2 = 0; Cntr2 < 64; Cntr2++) {
            index1 = Cntr1*64 + Cntr2;
            index2 = Cntr2 + idx*64;

            LocalMu = Mu[index1];
            LocalSigma = Sigma[index1];
            LocalMLEConst1 = MLEConst1_minus_LogSigma[index1];
            LocalAnodeData = AnodeData[index2];

            Help = (LocalAnodeData - LocalMu) / LocalSigma;
            Prob = Prob + LocalMLEConst1 - 0.5f * Help * Help;
        }

        if (Prob > BestProb) {
            BestProb = Prob;
            BestRobotPos[idx] = Cntr1;
        }
    }
}

For each thread, it searches for the robot-position index with the highest probability. The problem is that it crashes (Windows warning: display driver stopped responding).
If I change

index1 = Cntr1*64+Cntr2;

to

index1 = Cntr1*64+1;

then it does not crash.

In fact, it works with every constant ConstantNbr in the range 0 to 63:
index1 = Cntr1*64+ConstantNbr;

It seems as if Cntr2 causes some problem. I am sure that the indices in the kernel function stay in the appropriate range.
Has anybody got an idea?

Thanks
Karel D

What about changing index2 = Cntr2+idx*64; as well? You said you changed only one line. Did you try changing the second line too?

How long does it take to run?

Does it work if you give it a smaller array?

Hello Kbam,

That’s interesting: it works with smaller arrays.
If the arrays (Mu,…) have 10000 elements then it works but in my case the array should have 40000 elements and that doesn’t work.
I guess there must be a memory problem.

It’s strange, because the cudaMalloc and cudaMemcpy calls all return a success status:

if ( cudaMalloc( (void**)&GPUMu, 64*sizeof(float)*NbrOfRobotPos ) != cudaSuccess )
    mexErrMsgTxt("GPUMLE: GPUMu: Memory allocating failure on the GPU.\n");

if ( cudaMemcpy( GPUMu, Mu, 64*sizeof(float)*NbrOfRobotPos, cudaMemcpyHostToDevice) != cudaSuccess)
    mexErrMsgTxt("GPUMLE: Mu: cudaMemcpy failure\n");

I have added printf("ERROR string: %s\n", cudaGetErrorString(cudaGetLastError())); to my code.
With a small array it says: no error code,
but with a big array it says: invalid device pointer.

Any idea?

Thanks and best regards,
Karel

Check index2 and the allocation of AnodeData.

Hi,

I fixed index2 at zero and recompiled my code.
The first two times that I ran the code it worked. The third time it crashed. Sometimes it crashes, sometimes it does not.

I have also changed my kernel call:

CalcProbKernel<<< grid, block >>>(GPUMu,GPUSigma,GPUMLEConst1_minus_LogSigma,
                                  NbrOfRobotPos,
                                  GPUAnodeData,                                          
                                  GPUBestRobotPos,                                              
                                  ParallelGPUProc);
                                  
for (int i=0;i<1000;i++) printf(".");   // <-- added

 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 cudaMemcpy( BestRobotPos, GPUBestRobotPos, ParallelGPUProc*sizeof(int), cudaMemcpyDeviceToHost);   // <-- added
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 

After the kernel call there is a pause during which the loop prints dots on the screen. This makes it easier to locate the problem. I have now noticed that the crash occurs during the memcpy. Note that it does not always crash.
If it crashes, I get different errors:

  • the launch timed out and was terminated
  • no error
  • invalid device pointer

My biggest problem is that I think that there is something wrong with the GPU hardware or driver because the results are not consistent.

Thanks for helping,
Karel

Try to run the program in emulation mode and debug it. Check the allocation of the arrays. Also try to run the memcpy without running the kernel.

Hi,

I found out that the problem is related to the watchdog timeout. (http://stackoverflow.com/questions/497685/how-do-you-get-around-the-maximum-cuda-run-time)
I’m currently trying to split up my kernel function.

Best regards,
Karel

You can also disable the watchdog timer.

Hi,

My CUDA program works (and it’s at least 10x faster than the CPU version).
The problem was the watchdog timer. I have split up my kernel function and now I don’t have the problem anymore.
I have read about disabling the watchdog, but most people advise keeping it enabled.

thanks and best regards,
Karel
