"Failed to read the virtual PC on CUDA device."

MrNightLifeLover · May 27, 2011, 12:15pm

I’m getting this error in cuda-gdb (on Linux). Quickly googling it returned nothing?? What does it exactly mean? Does PC stand for program counter? What is most likely the cause for this error?

mkaushik · May 27, 2011, 6:00pm

MrNightLifeLover, which cuda toolkit version are you using? Can you paste the commands you entered when you hit this error?

mkaushik · May 27, 2011, 6:08pm

Also, which GPU?

mkaushik · May 31, 2011, 4:14pm

Also, try using the latest 4.0 toolkit, quite a few such bugs were fixed in the 4.0 release:

MrNightLifeLover · June 1, 2011, 6:44am

Toolkit : The newest one. GPU: Fermi (GTX 470) The bug had something to do with accessing invalid pointers, don’t remember exactly… But what does the error mean? Does PC mean program counter or something?

mkaushik · June 1, 2011, 7:28am

It means cuda-gdb, for some reason, could not figure out the Program Counter address that the warps were on (in the block in focus). It should not be caused by accessing invalid pointers.

If you could produce an app that reproduces this bug, I might be able to dig further.

chazix · July 11, 2011, 12:33pm

I found this message easily reproducible with the newdelete example in the SDK. I build newdelete, launch cuda-gdb on it, and just do “run”, and I see this message. It doesn’t always happen at the same point, but it has happened at least once in every launch.

I entered a bug report earlier today (or at least, I think I did - there doesn’t seem to be any immediate confirmation).

Regards

chazix · July 11, 2011, 12:40pm

Ah, just got an email with the incident report ID: 850072

fji · September 9, 2011, 5:55pm

Any new bugfix or findings on this issue?

Feng

nic21 · October 5, 2011, 8:55pm

I am using cuda tollkit 4.0 . I have a kernel that does image convolution. My image contains a lot of pixels which have value -3.40282002e+38 (FLT_MIN). When I read the image value in the convolution loop, I check if it is equal to FLT_MIN. If it is, I don’t compute.

    for (int yi = yi1; yi <= yi2; yi++)
    {
        w = *ptr++;    
    float v = in->m_array[yi*in->XSize()+xi];
    if (v!= UNKNOWN)
   {
 

        res += w * v;
        len += v * v;
   }
   
    }
    }

When I used the debugger, i could see that even v was UNKNOWN, it continued with res computation. Why is this ?
Anyway, I changed it to

    for (int xi = xi1; xi <= xi2; xi++)
    {
    for (int yi = yi1; yi <= yi2; yi++)
    {
        w = *ptr++;    
    float v = in->m_array[yi*in->XSize()+xi];
    if (v == UNKNOWN) return;
           
        res += w * v;
        len += v * v;
   
    }
    }

In this case, when v = UNKNOWN, the debugger gives an error message “Failed to read the virtual PC on CUDA device 0 (error=10)”. Why is this happening ? help needed asap !

mkaushik · October 5, 2011, 9:37pm

Usual questions first: Which GPU/Driver/CUDA version?

nic21 · October 5, 2011, 9:52pm

These are the details :

cuda toolkit 4.0

tesla M2090

nic21 · October 5, 2011, 10:21pm

I just made all of those values -100 and ran the kernel. I got the PC error again during the third iteration of yi loop … I printed out xi1, xi2, yi2, yi1, in->m_array and ptr using device printf and they seem right. I am not sure what is causing this error.

mkaushik · October 5, 2011, 10:25pm

Can you provide a stand-alone app that reproduces the problem? It’s very difficult to diagnose the problem with just that code snippet.

nic21 · October 6, 2011, 12:32am

Hi kaushik

Thank you for replying. Here is a simplified version of a part of the code that causes the problem.

Nandhini

*/

include

include <stdio.h>

using namespace std;

struct Template {

int    image_ydim;

int    image_xdim;

int    image_fdim;

float  image_ybegin;

float  image_ywidth;

float  image_xbegin;

float  image_xwidth;

float *image_array;

__host__ Template(int ydim, int xdim, int fdim, float ybegin, float ywidth, float xbegin, float xwidth)

{	

		image_ydim  = ydim;

		image_xdim  = xdim;

		image_fdim  = fdim;

		image_ybegin = ybegin;

		image_ywidth = ywidth;

		image_xbegin = xbegin;

		image_xwidth = xwidth;

	image_array = (float *)malloc(ydim * xdim * fdim * sizeof(float));

	

}

__host__ Template()

{

}

 __host__ __device__ int Ydim () { return image_ydim ; }

 __host__ __device__ int Xdim () { return image_xdim ; }

__host__ __device__ int Fdim () { return image_fdim ; };

__device__  float Ywidth() { return image_ywidth; };

__device__  float Xwidth(){ return image_xwidth; };

__device__ float Ypoint(int i)

{

return image_ybegin + (float)i * image_ywidth;

}

__device__ float Xpoint(int i) 

{

 return image_xbegin + (float)i * image_xwidth;

}

__device__ bool fieldx()

{

	// function

	//modify values of xi1 xi2

}

__device__ bool fieldy()

{

	// function

	//modify values of yi1 yi2

}

};

global void FilterComputeKernel(Template *in, Template *out_d, float * gabors_d, int scale, int filter_size)

{

const int bx = blockIdx.x;

const int by = blockIdx.y;

const int tx = threadIdx.x;

const int ty = threadIdx.y;

int block_size = blockDim.x;

int block_num = gridDim.x;

int block_nuimage_orientation = gridDim.y;

int xpos = (bx % block_nuimage_orientation) * block_size + tx;   //modulo because blocks for all filters are launched in parallel

int ypos = (by % block_nuimage_orientation) * block_size + ty; 

	

if ((xpos < out_d->Xdim()) && (ypos < out_d->Ydim())) // check to avoid unnecessary computation of threads with invalid values. 

{

	float xc = out_d->Xpoint(xpos);

	float yc = out_d->Ypoint(ypos);	

	

 int yi1, yi2, xi1, xi2;





bool c2 = in->fieldy(yc, filter_size, yi1, yi2);  

bool c1 = in->fieldx(xc, filter_size, xi1, xi2);

	 	





    if (c1 && c2)

    {

float res = 0.0f;

        float len = 0.0f;

float *ptr = gabors_d + (int)(floorf(bx / block_nuimage_orientation)) * filter_size * filter_size;

    float w;

	

        int flag=0;





	    for (int yi = yi1; yi <= yi2; yi++)

	    {

           for (int xi = xi1; xi <= xi2; xi++)

	       {

		  w = *ptr++; 

	  float v = in->image_array[yi*in->Xdim()+xi];

	  if (v == UNKNOWN)

              {

	 	flag = 1;

                    break;

              }

          else

	  {

	        res += w * v;

		        len += v * v;

	  }

	       }

	    }



	    if (flag==0)

    {

	        res = fabsf(res);

	        if (len > 0.0f) res /= sqrtf(len);

    	int f =(int)floorf(bx/block_nuimage_orientation);

	        out_d->image_array[f*out_d->Xdim()*out_d->Ydim() + ypos*out_d->Xdim() + xpos] = res;



	     }



       } 

  }

}

void FilterCompute(Template *in, Template *out_d, Template *out, float *gabors_d, int scale, const int opt)

{

int block_dim = BLOCK_DIMENSION; //number of threads 

int block_size = (out->Xdim() > block_dim)? block_dim : out->Xdim();

int block_nuimage_orientation = ((out->Xdim() + block_dim-1 )/block_dim);  // dimension of the image Template / number of threads per block = number of blocks for one orientation. this value is multiplied by total number of filters to get the total number of blocks launched for one scale. 



int block_num = block_nuimage_orientation * out->Fdim();



int meimage_size = out->Fdim() * out->Xdim() * out->Ydim() * sizeof(float);

dim3 grid(block_num,block_nuimage_orientation);

dim3 block(block_size, block_size);

cout<<"launching output kernels"<<endl;

gaborFilterComputeKernel<<<grid, block>>>(in, out_d, gabors_d, scale,output_edge_sz);

cudaThreadSynchronize();

}

int main() {

// Generate gabor filter 

float *gabors;

gabors= FilterGenerate(output_edge_sz, 0.3f, 5.6410f, 4.5128f, n_filters);



float *gabors_d;

int meimage_size = output_edge_sz * output_edge_sz * n_filters * sizeof(float);

cudaMalloc((void**) &gabors_d, meimage_size);

cudaMemcpy(gabors_d, gabors, meimage_size, cudaMemcpyHostToDevice);

Template *input_d[nsi];

Template *output_d[noutput];



Template *input_p[nsi];

Template *output_p[noutput];





Template *input[ninput];

input[ 0] = new Template(scale[0],  scale[0],  1, v1,v2,v3,v4);

input[ 1] = new Template(scale[1],  scale[1],  1, x1,x2,x3,x4);

input[ 2] = new Template(scale[2],  scale[2],  1, k1,k2,k3,k4);

input[ 3] = new Template(scale[3],  scale[3],  1, z1,z2,z3,z4);





for (int s = 0; s < ninput; s++) input[s]->SetTemplate(image_pyramid[s]);

float *p_array[nsi];	

for (int s = 0; s < ninput; s++) 

{

input_p[s] = new struct Template();



cudaMalloc( (void**)&p_array[s], scale[s]*scale[s]*sizeof(float));

cudaMemcpy(p_array[s], input[s]->image_array, (scale[s]*scale[s]*sizeof(float)) ,cudaMemcpyHostToDevice);

input_p[s]->image_xdim = input[s]->image_xdim;

input_p[s]->image_ydim = input[s]->image_ydim;

input_p[s]->image_fdim = input[s]->image_fdim;

input_p[s]->image_ybegin = input[s]->image_ybegin;

input_p[s]->image_ywidth = input[s]->image_ywidth;

input_p[s]->image_xbegin = input[s]->image_xbegin;

input_p[s]->image_xwidth = input[s]->image_xwidth; 

input_p[s]->image_array = p_array[s];



cudaMalloc((void**)&input_d[s], sizeof(Template)); 

cudaMemcpy(input_d[s], input_p[s], sizeof(Template) , cudaMemcpyHostToDevice);

}





Template *output[noutput];

output[ 0] = new Template(output_scale[0] , output_scale[0] , n_filters, a1,a2,a3,a4);

output[ 1] = new Template(output_scale[1] , output_scale[1] , n_filters, b1,b2,b3,b4);

output[ 2] = new Template(output_scale[2] , output_scale[2] , n_filters, c1,c2,c3,c4);

output[ 3] = new Template(output_scale[3] , output_scale[3] , n_filters, d1,d2,d3,d4);



for (int s = 0; s < noutput; s++) 

{

output_p[s] = new struct Template();

float *p_array;

cudaMalloc( (void**)&p_array, output_scale[s]*output_scale[s]*n_filters*sizeof(float));

cudaMemcpy(p_array, output[s]->image_array, (output_scale[s]*output_scale[s]*n_filters*sizeof(float)) ,cudaMemcpyHostToDevice);

output_p[s]->image_xdim = output[s]->image_xdim;

output_p[s]->image_ydim = output[s]->image_ydim;

output_p[s]->image_fdim = output[s]->image_fdim;

output_p[s]->image_ybegin = output[s]->image_ybegin;

output_p[s]->image_ywidth = output[s]->image_ywidth;

output_p[s]->image_xbegin = output[s]->image_xbegin;

output_p[s]->image_xwidth = output[s]->image_xwidth; 

output_p[s]->image_array = p_array;

cudaMalloc((void**)&output_d[s], sizeof(Template)); 

cudaMemcpy(output_d[s], output_p[s], sizeof(Template), cudaMemcpyHostToDevice);

}

// call kernel function 

for (int s = 0; s < nsi; s++) {

FilterCompute(input_d[s],output_d[s],output[s],gabors_d,s,1);





}

return 0;

}

mkaushik · October 6, 2011, 1:01am

Thanks, I’m looking into it.

nic21 · October 6, 2011, 2:01pm

Did you find anything ?

nic21 · October 9, 2011, 4:39pm

can anyone please reply ?? When I run the program without the debugger, my outputs aren’t correct and I would like to know if it is because of this . Please help.

nic21 · October 9, 2011, 6:10pm

Does this error affect the correct execution of the program ?
Here is what I observed in the GDB – It occurs only sometimes and it can occur at any line in the kernel. I was able to step through the execution of one thread completely without the error. All memory accesses were correct and all my values were correct and the convolution was done correctly. I focussed on thread 7,0,0 in block 0,0,0. when I was stepping through the code the gdb shifted focus to some other thread only once. After that I tried to do the same thread and other threads many times, but I always kept getting the error.
I also did “set cuda memcheck on” before running but it does not give any other information except the error. I am not able to find any other information about this error. But this is very frustrating.