"Failed to read the virtual PC on CUDA device."

I’m getting this error in cuda-gdb (on Linux). Quickly googling it returned nothing?? What does it exactly mean? Does PC stand for program counter? What is most likely the cause for this error?

MrNightLifeLover, which cuda toolkit version are you using? Can you paste the commands you entered when you hit this error?

Also, which GPU?

Also, try using the latest 4.0 toolkit, quite a few such bugs were fixed in the 4.0 release:

Toolkit : The newest one. GPU: Fermi (GTX 470) The bug had something to do with accessing invalid pointers, don’t remember exactly… But what does the error mean? Does PC mean program counter or something?

It means cuda-gdb, for some reason, could not figure out the Program Counter address that the warps were on (in the block in focus). It should not be caused by accessing invalid pointers.

If you could produce an app that reproduces this bug, I might be able to dig further.

I found this message easily reproducible with the newdelete example in the SDK. I build newdelete, launch cuda-gdb on it, and just do “run”, and I see this message. It doesn’t always happen at the same point, but it has happened at least once in every launch.

I entered a bug report earlier today (or at least, I think I did - there doesn’t seem to be any immediate confirmation).

Regards

Ah, just got an email with the incident report ID: 850072

Any new bugfix or findings on this issue?

Feng

I am using cuda tollkit 4.0 . I have a kernel that does image convolution. My image contains a lot of pixels which have value -3.40282002e+38 (FLT_MIN). When I read the image value in the convolution loop, I check if it is equal to FLT_MIN. If it is, I don’t compute.

    for (int yi = yi1; yi <= yi2; yi++)
    {
        w = *ptr++;    
    float v = in->m_array[yi*in->XSize()+xi];
    if (v!= UNKNOWN)
   {
 

        res += w * v;
        len += v * v;
   }
   
    }
    }

When I used the debugger, i could see that even v was UNKNOWN, it continued with res computation. Why is this ?
Anyway, I changed it to

    for (int xi = xi1; xi <= xi2; xi++)
    {
    for (int yi = yi1; yi <= yi2; yi++)
    {
        w = *ptr++;    
    float v = in->m_array[yi*in->XSize()+xi];
    if (v == UNKNOWN) return;
           
        res += w * v;
        len += v * v;
   
    }
    }

In this case, when v = UNKNOWN, the debugger gives an error message “Failed to read the virtual PC on CUDA device 0 (error=10)”. Why is this happening ? help needed asap !

Usual questions first: Which GPU/Driver/CUDA version?

These are the details :

cuda toolkit 4.0

tesla M2090

I just made all of those values -100 and ran the kernel. I got the PC error again during the third iteration of yi loop … I printed out xi1, xi2, yi2, yi1, in->m_array and ptr using device printf and they seem right. I am not sure what is causing this error.

Can you provide a stand-alone app that reproduces the problem? It’s very difficult to diagnose the problem with just that code snippet.

Hi kaushik

Thank you for replying. Here is a simplified version of a part of the code that causes the problem.

Nandhini

*/

include

include

include <stdio.h>

using namespace std;

struct Template {

int    image_ydim;

int    image_xdim;

int    image_fdim;

float  image_ybegin;

float  image_ywidth;

float  image_xbegin;

float  image_xwidth;

float *image_array;

__host__ Template(int ydim, int xdim, int fdim, float ybegin, float ywidth, float xbegin, float xwidth)

{	

		image_ydim  = ydim;

		image_xdim  = xdim;

		image_fdim  = fdim;

		image_ybegin = ybegin;

		image_ywidth = ywidth;

		image_xbegin = xbegin;

		image_xwidth = xwidth;

	image_array = (float *)malloc(ydim * xdim * fdim * sizeof(float));

	

}

__host__ Template()

{

}

 __host__ __device__ int Ydim () { return image_ydim ; }

 __host__ __device__ int Xdim () { return image_xdim ; }

__host__ __device__ int Fdim () { return image_fdim ; };

__device__  float Ywidth() { return image_ywidth; };

__device__  float Xwidth(){ return image_xwidth; };

__device__ float Ypoint(int i)

{

return image_ybegin + (float)i * image_ywidth;

}

__device__ float Xpoint(int i) 

{

 return image_xbegin + (float)i * image_xwidth;

}

__device__ bool fieldx()

{

	// function

	//modify values of xi1 xi2

}

__device__ bool fieldy()

{

	// function

	//modify values of yi1 yi2

}

};

global void FilterComputeKernel(Template *in, Template *out_d, float * gabors_d, int scale, int filter_size)

{

const int bx = blockIdx.x;

const int by = blockIdx.y;

const int tx = threadIdx.x;

const int ty = threadIdx.y;

int block_size = blockDim.x;

int block_num = gridDim.x;

int block_nuimage_orientation = gridDim.y;

int xpos = (bx % block_nuimage_orientation) * block_size + tx;   //modulo because blocks for all filters are launched in parallel

int ypos = (by % block_nuimage_orientation) * block_size + ty; 

	

if ((xpos < out_d->Xdim()) && (ypos < out_d->Ydim())) // check to avoid unnecessary computation of threads with invalid values. 

{

	float xc = out_d->Xpoint(xpos);

	float yc = out_d->Ypoint(ypos);	

	

 int yi1, yi2, xi1, xi2;





bool c2 = in->fieldy(yc, filter_size, yi1, yi2);  

bool c1 = in->fieldx(xc, filter_size, xi1, xi2);

	 	





    if (c1 && c2)

    {		

float res = 0.0f;

        float len = 0.0f;

float *ptr = gabors_d + (int)(floorf(bx / block_nuimage_orientation)) * filter_size * filter_size;

    float w;

	

        int flag=0;





	    for (int yi = yi1; yi <= yi2; yi++)

	    {

           for (int xi = xi1; xi <= xi2; xi++)

	       {

		  w = *ptr++; 

	  float v = in->image_array[yi*in->Xdim()+xi];

	  if (v == UNKNOWN)

              {

	 	flag = 1;

                    break;

              }

          else

	  {

	        res += w * v;

		        len += v * v;

	  }

	       }

	    }



	    if (flag==0)

    {

	        res = fabsf(res);

	        if (len > 0.0f) res /= sqrtf(len);

    	int f =(int)floorf(bx/block_nuimage_orientation);

	        out_d->image_array[f*out_d->Xdim()*out_d->Ydim() + ypos*out_d->Xdim() + xpos] = res;



	     }



       } 

  }

}

void FilterCompute(Template *in, Template *out_d, Template *out, float *gabors_d, int scale, const int opt)

{

int block_dim = BLOCK_DIMENSION; //number of threads 

int block_size = (out->Xdim() > block_dim)? block_dim : out->Xdim();

int block_nuimage_orientation = ((out->Xdim() + block_dim-1 )/block_dim);  // dimension of the image Template / number of threads per block = number of blocks for one orientation. this value is multiplied by total number of filters to get the total number of blocks launched for one scale. 



int block_num = block_nuimage_orientation * out->Fdim();



int meimage_size = out->Fdim() * out->Xdim() * out->Ydim() * sizeof(float);

dim3 grid(block_num,block_nuimage_orientation);

dim3 block(block_size, block_size);

cout<<"launching output kernels"<<endl;

gaborFilterComputeKernel<<<grid, block>>>(in, out_d, gabors_d, scale,output_edge_sz);

cudaThreadSynchronize();

}

int main() {

// Generate gabor filter 

float *gabors;

gabors= FilterGenerate(output_edge_sz, 0.3f, 5.6410f, 4.5128f, n_filters);



float *gabors_d;

int meimage_size = output_edge_sz * output_edge_sz * n_filters * sizeof(float);

cudaMalloc((void**) &gabors_d, meimage_size);

cudaMemcpy(gabors_d, gabors, meimage_size, cudaMemcpyHostToDevice);

Template *input_d[nsi];

Template *output_d[noutput];



Template *input_p[nsi];

Template *output_p[noutput];





Template *input[ninput];

input[ 0] = new Template(scale[0],  scale[0],  1, v1,v2,v3,v4);

input[ 1] = new Template(scale[1],  scale[1],  1, x1,x2,x3,x4);

input[ 2] = new Template(scale[2],  scale[2],  1, k1,k2,k3,k4);

input[ 3] = new Template(scale[3],  scale[3],  1, z1,z2,z3,z4);





for (int s = 0; s < ninput; s++) input[s]->SetTemplate(image_pyramid[s]);

float *p_array[nsi];	

for (int s = 0; s < ninput; s++) 

{

input_p[s] = new struct Template();



cudaMalloc( (void**)&p_array[s], scale[s]*scale[s]*sizeof(float));

cudaMemcpy(p_array[s], input[s]->image_array, (scale[s]*scale[s]*sizeof(float)) ,cudaMemcpyHostToDevice);

input_p[s]->image_xdim = input[s]->image_xdim;

input_p[s]->image_ydim = input[s]->image_ydim;

input_p[s]->image_fdim = input[s]->image_fdim;

input_p[s]->image_ybegin = input[s]->image_ybegin;

input_p[s]->image_ywidth = input[s]->image_ywidth;

input_p[s]->image_xbegin = input[s]->image_xbegin;

input_p[s]->image_xwidth = input[s]->image_xwidth; 

input_p[s]->image_array = p_array[s];



cudaMalloc((void**)&input_d[s], sizeof(Template)); 

cudaMemcpy(input_d[s], input_p[s], sizeof(Template) , cudaMemcpyHostToDevice);

}





Template *output[noutput];

output[ 0] = new Template(output_scale[0] , output_scale[0] , n_filters, a1,a2,a3,a4);

output[ 1] = new Template(output_scale[1] , output_scale[1] , n_filters, b1,b2,b3,b4);

output[ 2] = new Template(output_scale[2] , output_scale[2] , n_filters, c1,c2,c3,c4);

output[ 3] = new Template(output_scale[3] , output_scale[3] , n_filters, d1,d2,d3,d4);



for (int s = 0; s < noutput; s++) 

{

output_p[s] = new struct Template();

float *p_array;

cudaMalloc( (void**)&p_array, output_scale[s]*output_scale[s]*n_filters*sizeof(float));

cudaMemcpy(p_array, output[s]->image_array, (output_scale[s]*output_scale[s]*n_filters*sizeof(float)) ,cudaMemcpyHostToDevice);

output_p[s]->image_xdim = output[s]->image_xdim;

output_p[s]->image_ydim = output[s]->image_ydim;

output_p[s]->image_fdim = output[s]->image_fdim;

output_p[s]->image_ybegin = output[s]->image_ybegin;

output_p[s]->image_ywidth = output[s]->image_ywidth;

output_p[s]->image_xbegin = output[s]->image_xbegin;

output_p[s]->image_xwidth = output[s]->image_xwidth; 

output_p[s]->image_array = p_array;

cudaMalloc((void**)&output_d[s], sizeof(Template)); 

cudaMemcpy(output_d[s], output_p[s], sizeof(Template), cudaMemcpyHostToDevice);

}

// call kernel function 

for (int s = 0; s < nsi; s++) {

FilterCompute(input_d[s],output_d[s],output[s],gabors_d,s,1);





}

return 0;

}

Thanks, I’m looking into it.

Did you find anything ?

can anyone please reply ?? When I run the program without the debugger, my outputs aren’t correct and I would like to know if it is because of this . Please help.

Does this error affect the correct execution of the program ?
Here is what I observed in the GDB – It occurs only sometimes and it can occur at any line in the kernel. I was able to step through the execution of one thread completely without the error. All memory accesses were correct and all my values were correct and the convolution was done correctly. I focussed on thread 7,0,0 in block 0,0,0. when I was stepping through the code the gdb shifted focus to some other thread only once. After that I tried to do the same thread and other threads many times, but I always kept getting the error.
I also did “set cuda memcheck on” before running but it does not give any other information except the error. I am not able to find any other information about this error. But this is very frustrating.