Deep subroutine use causes cudaError_enum?

Hi all-

I have been successfully running and developing code, but now have hit a road block. As I have added more code, my .cu file is about 2,900 lines now, I am now getting the following runtime error:

First-chance exception at 0x7c812afb in SNEAK.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0117eca4…

The code does run in emulation mode fine, and I have a good bit of ‘printf’ type logging of its operation in that mode.

To debug I have been commenting out or bypassing subroutines and can get it to run (without the correct results of course). In one case, I found that a certain routine was a problem. I changed this subroutine that set two float3 values from their pointers and returned a true/false value into one that returns a struct containing the two float3 values and a Bool, and then the runtime would continue beyond this point.

However, it then seems to then have a similar problem at another subroutine, in this case it has about 12 arguments. Putting a ‘return’ right at this routine’s entry point doesn’t help …

… so I am wondering if I am running into a capability limit of some type. The kernel in question has the following parameters:

1>ptxas info : Used 60 registers, 864+288 bytes lmem, 64+52 bytes smem, 204 bytes cmem[1]

and my specs:
8800 GTS 640MB RAM
WinXP
CUDA 2.1
Visual Studio 2005 Pro

Thanks for any comment on this!

To add: but if i instead comment out the subroutine call entirely, instead of instantly returning within the routine, the code execution does proceed.

So this really seems like some capability limit. I know that CUDA inlines subroutines, so do I just have too much code???

CUDA 2.1 is almost a year old at this point–can you see if the issue reproduces in 3.0?

I did try version 2.3 and had the same problem, will try 3.0 now …

OK I updated to the Beta 3 version on a different computer. This is a Vista 64 platform. Same problem.

There is one other thing I noticed on both though. Using the typical thread id code:

uint i = blockDim.x * blockIdx.x + threadIdx.x;

if I put in

if( i != 0 ) return;

in the kernel in question then it does run for that thread to completion. Puzzling?

OK, we’ve figured out a bit more on this … we’ve been working on other areas of the code so it took awhile :D

I’ve summarized the pertinent things below:

[codebox]DECLARATIONS:

float3 hullpoints[8];

int nhullpoints = 0;

Bool bch = convexhull_jarvis3Dto2D(npoints, points, miny_index, &nhullpoints, hullpoints ); 

float3 boxpts[4];

Vertex ipa[9];	// there are never more than 8, we need one more

Vertex ipb[5];	// one more than 4

EMU DEBUGGER:

  •   hullpoints	0x000000000c91f860 {x=1928.2489 y=730.61542 z=50.805321 }	__cuda_emu::float3 [8]
    
  •   boxpts		0x000000000c91f650 {x=1900.0000 y=700.00000 z=0.00000000 }	__cuda_emu::float3 [4]
    

CALL:

float intersectarea = intersectionarea(hullpoints, nhullpoints, boxpts, 4,

				   minxall, maxxall, 

				   minyall, maxyall,

				   ipa,

				   ipb);

FUNCTION DEFINITION:

__device__ __host__ float intersectionarea(float3 * a, int na, float3 * b, int nb,

				   float minxall, float maxxall, 

				   float minyall, float maxyall,

				   Vertex *ipa, // must be sized to na+1

				   Vertex *ipb)	// must be sized to nb+1

VARIABLE PTRS CORRUPTED (a is hullpoints and b is boxpts):

  •   a	0x0000000044e38000 {x=??? y=??? z=??? }	__cuda_emu::float3 *
    
  •   b	0x0000000000000002 {x=??? y=??? z=??? }	__cuda_emu::float3 *
    

[/codebox]

Using the EMU version, we’ve been looking at the code execution …

Two arrays are declared, hullpoints and boxpts, along with others. These are initialized (not shown) and seem to have valid pointer values, as seen in the first debugger listing. A function declaration is shown called intersectionarea(). The first and third parameters of this function are the hullpoints and boxpts arguments. However, once inside the function, the pointer values have changed and in the case of the ‘b’ argument.

This is now on Windows 7 using VS 2008 and CUDA 2.3 (we tried Beta 3 too before per above). If we comment out the call, the code runs (albeit without the downstream effects of just setting intersectionarea to 0). With the code in, the driver crashes and then recovers. We also notice that the release version compile takes a long time.

I also note that intersectionarea itself uses a few routines, etc. But my concern here is what seems to be bad argument passing. I am assuming that its valid to look at these in the watch list.

Any ideas of how to bypass this problem would be great

OK, I made a few changes to the code and now get this for the Release version (the debugEMU version compiles ok):

[codebox]1>------ Build started: Project: SNEAK, Configuration: Release x64 ------

1>Compiling CUDA code

1>gpuray.cu

1>gpuray.cu(3105): warning: variable “soffset1” was declared but never referenced

1>gpuray.cu(3106): warning: variable “soffset2” was declared but never referenced

1>gpuray.cu(3146): warning: variable “hsoffset1” was declared but never referenced

1>gpuray.cu(3147): warning: variable “hsoffset2” was declared but never referenced

1>gpuray.cu(3130): warning: variable “nwedges” was declared but never referenced

1>gpuray.cu(59): warning: variable “use_adv_gpu” was set but never used

1>tmpxft_00001180_00000000-3_gpuray.cudafe1.gpu

1>tmpxft_00001180_00000000-8_gpuray.cudafe2.gpu

[b]1>### Assertion failure at line 123 of …/…/be/cg/NVISA/expand.cxx:

1>### Compiler Error in file C:\Users\Dad\AppData\Local\Temp/tmpxft_00001180_00000000-9_gpuray.cpp3.i during Code_Expansion phase:

1>### unexpected mtype[/b]

1>nvopencc ERROR: C:\CUDA\bin64/…/open64/lib//be.exe returned non-zero status 1

1>Linking…

1>LINK : fatal error LNK1181: cannot open input file ‘.\Release\gpuray.obj’[/codebox]The code that was changed is:[codebox]//

// The following routines are for the area intersection routine.

//

typedef long long Hp;

typedef struct{long x; long y;} Ipoint;

typedef struct{long mn; long mx;} Rng;

typedef struct{Ipoint ip; Rng rx; Rng ry; short in;} Vertex;

device host void ia_fit(float3 * x, int cx, Vertex * ix,

							int fudge, float mid,

							float minxall, float sclx, 

							float minyall, float scly )

{

/* Original code, undoubtedly wrong as elements that are used later on are not initialized.

int c=cx; 

while(c--)

{

	ix[c].ip.x = (long)((x[c].x - minxall)*sclx - mid) & ~7L | fudge | c&1;

	ix[c].ip.y = (long)((x[c].y - minyall)*scly - mid) & ~7L | fudge;

}

ix[0].ip.y += cx&1;

ix[cx] = ix[0];

c=cx; 

while(c--)

{

	Rng rng1 = {ix[c].ip.x,ix[c+1].ip.x};

	Rng rng2 = {ix[c+1].ip.x,ix[c].ip.x};

	Rng rng3 = {ix[c].ip.y,ix[c+1].ip.y};

	Rng rng4 = {ix[c+1].ip.y,ix[c].ip.y};

	ix[c].rx = ix[c].ip.x < ix[c+1].ip.x ? rng1 : rng2;

	ix[c].ry = ix[c].ip.y < ix[c+1].ip.y ? rng3 : rng4;

	ix[c].in = 0;

}

*/

// My corrected code ...

Rng rng0 = {0L,0L};

int c=cx; 

while(c >= 0)

{

	ix[c].ip.x = (long)((x[c].x - minxall)*sclx - mid) & ~7L | fudge | c&1;

	ix[c].ip.y = (long)((x[c].y - minyall)*scly - mid) & ~7L | fudge;

	ix[c].rx = rng0;	// makes sure that ix[cx] when copied from ix[0] has initialized rx,ry and in elements

	ix[c].ry = rng0;

	ix[c].in = 0;

	c--;

}

ix[0].ip.y += cx&1;

ix[cx] = ix[0];	

c=cx; 

while(c >= 0)

{

	Rng rng1 = {ix[c].ip.x,ix[c+1].ip.x};

	Rng rng2 = {ix[c+1].ip.x,ix[c].ip.x};

	Rng rng3 = {ix[c].ip.y,ix[c+1].ip.y};

	Rng rng4 = {ix[c+1].ip.y,ix[c].ip.y};

	ix[c].rx = ix[c].ip.x < ix[c+1].ip.x ? rng1 : rng2;

	ix[c].ry = ix[c].ip.y < ix[c+1].ip.y ? rng3 : rng4;

	c--;

}

}

device host Hp ia_area(Ipoint a, Ipoint p, Ipoint q)

{

return (Hp)p.x*q.y - (Hp)p.y*q.x + (Hp)a.x*(p.y - q.y) + (Hp)a.y*(q.x - p.x);

}

device host void ia_cntrib(Hp *s, Ipoint f, Ipoint t, short w)

{

(*s) += (Hp)w*(t.x-f.x)*(t.y+f.y)/2;

}

device host int ia_ovl(Rng p, Rng q)

{

return p.mn < q.mx && q.mn < p.mx;

}

device host void ia_cross(Hp *s, Vertex * a, Vertex * b, Vertex * c, Vertex * d,

							  float a1, float a2, float a3, float a4)

{

float r1 = a1/(a1+a2);

float r2 = a3/(a3+a4);

Ipoint ipoint1 = {(long)(a->ip.x + r1*(b->ip.x - a->ip.x)), (long)(a->ip.y + r1*(b->ip.y - a->ip.y))};

Ipoint ipoint2 = {(long)(c->ip.x + r2*(d->ip.x - c->ip.x)), (long)(c->ip.y + r2*(d->ip.y - c->ip.y))};

ia_cntrib(s, ipoint1, b->ip, 1);

ia_cntrib(s, d->ip, ipoint2, 1);

++a->in; 

--c->in;

}

device host void ia_inness(Hp *ss, Vertex * P, int cP, Vertex * Q, int cQ)

{

int s=0;

int c=cQ; 

Ipoint p = P[0].ip;

while(c-- > 0 )  // added > 0 to avoid accessing -1 index. 

{

	if(Q[c].rx.mn < p.x && p.x < Q[c].rx.mx)

	{

		int sgn = 0 < ia_area(p, Q[c].ip, Q[c+1].ip);

		s += sgn != Q[c].ip.x < Q[c+1].ip.x ? 0 : (sgn?-1:1); 

	}

}

for(int j=0; j<cP; ++j)

{

	if(s != 0) 

		ia_cntrib(ss, P[j].ip, P[j+1].ip, s);

	s += P[j].in;

}

}

device host float intersectionarea(float3 * a, int na, float3 * b, int nb,

									   float minxall, float maxxall, 

									   float minyall, float maxyall,

									   Vertex *ipa, // must be sized to na+1

									   Vertex *ipb)	// must be sized to nb+1

{ // see http://www.cap-lore.com/MathPhys/IP/

// find the area of intersection of two simple polygons.

// the arguments are float3 types, but only x,y dimensions are pertinent.

// a zero return means that they do not overlap.

// also note that points must be wound the same way (both lists clockwise or counter)

if(na < 3 || nb < 3) return 0.0f;	// less than three points do not an area make!

const float gamut = 500000000.0f;

const float mid = gamut/2.0f;

float rngx = maxxall - minxall;

float sclx = gamut/rngx;

float rngy = maxyall - minyall;

float scly = gamut/rngy;

ia_fit(a, na, ipa, 0, mid, minxall, sclx, minyall, scly); 

ia_fit(b, nb, ipb, 2, mid, minxall, sclx, minyall, scly); 

float ascale = sclx*scly;



Hp s = 0L; 

/*

fprintf(flog,“\nQ1 **************\n”);

for(int jj=0;jj< na+1; jj++ )

fprintf(flog,"Q1 ipa[%d] => ip={x=%ld,y=%ld}, rx={mn=%ld,mx=%ld}, ry={mn=%ld,mx=%ld}, in=%hd\n", jj, ipa[jj].ip.x,ipa[jj].ip.y, ipa[jj].rx.mn,ipa[jj].rx.mx, ipa[jj].ry.mn,ipa[jj].ry.mx, ipa[jj].in);

fprintf(flog,“Q1\n”);

for(int jj=0;jj< nb+1; jj++ )

fprintf(flog,"Q1 ipb[%d] => ip={x=%ld,y=%ld}, rx={mn=%ld,mx=%ld}, ry={mn=%ld,mx=%ld}, in=%hd\n", jj, ipb[jj].ip.x,ipb[jj].ip.y, ipb[jj].rx.mn,ipb[jj].rx.mx, ipb[jj].ry.mn,ipb[jj].ry.mx, ipb[jj].in);

*/

for(int j=0; j<na; ++j) 

	for(int k=0; k<nb; ++k)

		if(ia_ovl(ipa[j].rx, ipb[k].rx) && ia_ovl(ipa[j].ry, ipb[k].ry))

		{

			Hp a1 = -ia_area(ipa[j].ip, ipb[k].ip, ipb[k+1].ip);

			Hp a2 = ia_area(ipa[j+1].ip, ipb[k].ip, ipb[k+1].ip);

			

			int o = a1<0; 

			if(o == a2<0)

			{

				Hp a3 = ia_area(ipb[k].ip, ipa[j].ip, ipa[j+1].ip);

				Hp a4 = -ia_area(ipb[k+1].ip, ipa[j].ip, ipa[j+1].ip);

				if(a3<0 == a4<0) 

				{

					if(o) ia_cross(&s, &ipa[j], &ipa[j+1], &ipb[k], &ipb[k+1], 

								   (float)a1, (float)a2, (float)a3, (float)a4);

					else ia_cross(&s, &ipb[k], &ipb[k+1], &ipa[j], &ipa[j+1],

								  (float)a3, (float)a4, (float)a1, (float)a2);

				}

			}

		}

ia_inness(&s, ipa, na, ipb, nb); 

ia_inness(&s, ipb, nb, ipa, na);

/*

fprintf(flog,“Q1\n”);

for(int jj=0;jj< na+1; jj++ )

fprintf(flog,"Q1 ipa[%d] => ip={x=%ld,y=%ld}, rx={mn=%ld,mx=%ld}, ry={mn=%ld,mx=%ld}, in=%hd\n", jj, ipa[jj].ip.x,ipa[jj].ip.y, ipa[jj].rx.mn,ipa[jj].rx.mx, ipa[jj].ry.mn,ipa[jj].ry.mx, ipa[jj].in);

fprintf(flog,“Q1\n”);

for(int jj=0;jj< nb+1; jj++ )

fprintf(flog,"Q1 ipb[%d] => ip={x=%ld,y=%ld}, rx={mn=%ld,mx=%ld}, ry={mn=%ld,mx=%ld}, in=%hd\n", jj, ipb[jj].ip.x,ipb[jj].ip.y, ipb[jj].rx.mn,ipb[jj].rx.mx, ipb[jj].ry.mn,ipb[jj].ry.mx, ipb[jj].in);

*/

return s/ascale;

}

[/codebox]which determines area of overlapping polygons as modified from http://www.cap-lore.com/MathPhys/IP/

OK, I fixed the excpetion mentioned in the last post via searching for ‘unexpected mtype’ here … still can’t figure out the core problem though.

OK we’ve narrowed down the problem, it happens in this function:

[codebox]device host void ia_cntrib(Hp *s, Ipoint f, Ipoint t, short w)

{

(*s) += (Hp)w*(t.x-f.x)*(t.y+f.y)/2;

}

[/codebox]when the value of *s is attempted to be incremented. s is ultimately a register value as it is an automatic var from intersectionarea(), maybe its not being updated as such? The manual says automatic vars could be in registers or shared mem …

I hate to be the only one posting here, but maybe my attempts will help someone else.

I assume you got the “Advisory: cannot tell what pointer refers to, assuming global memory” message on compile?

No, did not get the advisory.

That’s weird, you probably should have…

If you follow the code I posted before, the following applies:

  1. the type Hp is long long

  2. a local variable ‘s’ is declared in intersectionarea as type Hp and initialized to 0

  3. a pointer to this variable is passed into ia_cross and then into ia_cntrib.

  4. ia_cntrib attempts to update *s which is where we crash.

I tried changing the type to just long and have the same problem. The code works in EMU mode.

I could send you the .cu file if that helps

This bold code (var increment/decrements are causing the driver to crash. There isn’t anything logically wrong with this, but maybe the compiler thinks the variables are in global memory and not automatic vars?[codebox]device host Hp ia_cross(Vertex * a, Vertex * b, Vertex * c, Vertex * d,

							  float a1, float a2, float a3, float a4)

{

float r1 = a1/(a1+a2);

float r2 = a3/(a3+a4);

Ipoint ipoint1 = {(long)(a->ip.x + r1*(b->ip.x - a->ip.x)), (long)(a->ip.y + r1*(b->ip.y - a->ip.y))};

Ipoint ipoint2 = {(long)(c->ip.x + r2*(d->ip.x - c->ip.x)), (long)(c->ip.y + r2*(d->ip.y - c->ip.y))};

Hp s = ia_cntrib(ipoint1, b->ip, 1);

  s += ia_cntrib(d->ip, ipoint2, 1);

// this causes problems in CUDA:

[b] ++a->in;

--c->in;[/b]

return s;

}[/codebox]I also tried commenting these lines out and instead doing the increment/decrement at the point of the call, but that is also in a subroutine from the original declaration of the vars.

Are there attributes I can put on these to tell the compiler how to treat? Are all pointers assumed to be in global memory?

I guess this has me totally stumped right now.

After a lot more trial and tribulation, for some reason – in most cases – CUDA simply can’t read the element ‘in’ from our Vertex type.[codebox]typedef struct{Ipoint ip; Rng rx; Rng ry; short in;} Vertex;

[/codebox]We’ve changed its type from short to int, moved it within the struct, tried changing the typename and its variable name (thinking it might conflict with built in types somehow), etc. but almost any attempted read of its value will crash the driver. You can assign values to it but not read it mostly. For example[codebox]Vertex ipa[9];

ipa[0].in = 6;

short vv = ipa[0].in;[/codebox]will crash the driver. The only place we seem to be able to read it is in the ia_inness function[codebox]device host Hp ia_inness(Vertex * P, int cP, Vertex * Q, int cQ)

{

Hp ss = 0L;

for(int j=0; j<cP; ++j)

{

	if(s != 0) 

		ss += ia_cntrib(P[j].ip, P[j+1].ip, s);

	s += <b>P[j].in;</b>

}

return ss;

}[/codebox]We thought that this may work since it is accessed as an array, rather than as a pointer to a single element as in ia_cross, which is where the problem is. But we changed ia_cross to access elements as parts of arrays and it still doesn’t work.

Of course, from failure of the first example, you can’t locally declare an array of these Vertex structs and then assign a value to the first one’s ‘in’ and then read it without failure.

Could we have reach some compiler or execution limit? We just can’t figure this out at all[codebox]1>ptxas info : Compiling entry function ‘_Z18computeRays_kernelPvS_S_S_jjfbjj6float3’

1>ptxas info : Used 60 registers, 976+0 bytes lmem, 68+16 bytes smem, 284 bytes cmem[1][/codebox]The only other recourse seems to be to preallocate space for these Vertex elements for each thread so that they are definitely in global memory.