indexing problem in CUDA kernel function

Hi,

I have a problem in my CUDA kernel function.
This is the function

//GPU Kernel
__global__ void CalcProbKernel(float* Mu, float* Sigma, float* MLEConst1_minus_LogSigma,
                               int NbrOfRobotPos,
                               float* AnodeData, int* BestRobotPos, int ArraySize) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= ArraySize)
        return;

    float Help;
    float Prob;
    float BestProb = -1e10f;
    BestRobotPos[idx] = 0;
    int index1, index2;
    int Cntr1, Cntr2;
    float LocalMu, LocalSigma, LocalMLEConst1, LocalAnodeData;

    for (Cntr1 = 0; Cntr1 < NbrOfRobotPos; Cntr1++) {
        Prob = 0;
        for (Cntr2 = 0; Cntr2 < 64; Cntr2++) {
            index1 = Cntr1*64 + Cntr2;
            index2 = Cntr2 + idx*64;

            LocalMu = Mu[index1];
            LocalSigma = Sigma[index1];
            LocalMLEConst1 = MLEConst1_minus_LogSigma[index1];
            LocalAnodeData = AnodeData[index2];

            Help = (LocalAnodeData - LocalMu) / LocalSigma;
            Prob = Prob + LocalMLEConst1 - 0.5f * Help * Help;
        }

        if (Prob > BestProb) {
            BestProb = Prob;
            BestRobotPos[idx] = Cntr1;
        }
    }
}

For each thread, it searches for the robot-position index with the highest probability. The problem is that it crashes (Windows warning: display driver stopped responding).
If I change

index1 = Cntr1*64+Cntr2;

to

index1 = Cntr1*64+1;

then it does not crash.

In fact, it works with every constant ConstantNbr in the range 0 to 63:
index1 = Cntr1*64+ConstantNbr;

It seems as if Cntr2 causes some problem. I am sure that the indices in the kernel function stay in the appropriate range.
Has anybody got an idea?

Thanks
Karel D

What about changing index2 = Cntr2+idx*64; as well? You said you changed only one line. Did you try changing the second line too?

How long does it take to run?

Does it work if you give it a smaller array?

Hello Kbam,

That’s interesting: it works with smaller arrays.
If the arrays (Mu,…) have 10000 elements then it works but in my case the array should have 40000 elements and that doesn’t work.
I guess there must be a memory problem.

It’s strange, because the cudaMalloc and cudaMemcpy calls all return a success status:

if ( cudaMalloc( (void**)&GPUMu, 64*sizeof(float)*NbrOfRobotPos ) != cudaSuccess )
    mexErrMsgTxt("GPUMLE: GPUMu: Memory allocating failure on the GPU.\n");

if ( cudaMemcpy( GPUMu, Mu, 64*sizeof(float)*NbrOfRobotPos, cudaMemcpyHostToDevice) != cudaSuccess)
    mexErrMsgTxt("GPUMLE: Mu: cudaMemcpy failure\n");

I have added printf("ERROR string: %s\n", cudaGetErrorString(cudaGetLastError())); to my code.
With a small array it says: no error code,
but with a big array it says: invalid device pointer.

Any idea?

Thanks and best regards,
Karel

Check index2 and the allocation of AnodeData.

Hi,

I fixed index2 at zero and recompiled my code.
The first two times that I ran the code it worked. The third time it crashed. Sometimes it crashes, sometimes it does not.

I have also changed my kernel call:

CalcProbKernel<<< grid, block >>>(GPUMu,GPUSigma,GPUMLEConst1_minus_LogSigma,
                                  NbrOfRobotPos,
                                  GPUAnodeData,                                          
                                  GPUBestRobotPos,                                              
                                  ParallelGPUProc);
                                  
for (int i=0;i<1000;i++) printf(".");   // <-- added

 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 cudaMemcpy( BestRobotPos, GPUBestRobotPos, ParallelGPUProc*sizeof(int), cudaMemcpyDeviceToHost);   // <-- added
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 
 printf("%s\n",cudaGetErrorString(cudaGetLastError())); 

After the kernel call there is a pause during which the loop prints dots on the screen. This makes it easier to locate the problem. I have now noticed that the crash occurs during the memcpy. Note that it does not always crash.
If it crashes, I get different errors:

  • the launch timed out and was terminated
  • no error
  • invalid device pointer

My biggest problem is that I think that there is something wrong with the GPU hardware or driver because the results are not consistent.

Thanks for helping,
Karel

Try to run the program in emulation mode and debug it. Check the allocation of the arrays. Also try to run the memcpy without running the kernel.

Hi,

I found out that the problem is related to the watchdog timeout. (http://stackoverflow.com/questions/497685/how-do-you-get-around-the-maximum-cuda-run-time)
I’m currently trying to split up my kernel function.

Best regards,
Karel

You can also disable the watchdog timer.

Hi,

My CUDA program works (and it’s at least 10x faster than the CPU version).
The problem was the watchdog timer. I have split up my kernel function and now I don’t have the problem anymore.
I have read about disabling the watchdog, but most people advise keeping it enabled.

thanks and best regards,
Karel
