Weird pointer arithmetic bug?

Hi,

I’m trying to debug an “unspecified launch failure” that I encounter on two different machines, each with a Tesla C2050, both running CUDA 4.0 and driver 270.41.19.

However, the failure doesn’t occur when the code is compiled with -G; this makes it harder to find out exactly where the error is coming from using cuda-gdb. Any hints?

Moreover, the exact same binary runs smoothly on an older computer with a Tesla C1060 (CUDA 4.0 and driver 270.41.19).

Using cuda-memcheck I get an invalid read:

========= Invalid global read of size 8
========= at 0x00000800 in p4a_wrapper_main
========= by thread (0,0,0) in block (2,0,0)
========= Address 0x320a80800 is out of bounds

========= ERROR SUMMARY: 1 error

OK, so I tried to debug my kernel using device printf, launching the kernel with only 1 thread (so that I avoid any memory race issue):

printf("cxph : %p - Cxm : %d - (cxph+Cxm) : %p - &cxph[Cxm] : %p\n", cxph, Cxm, (cxph+Cxm), &cxph[Cxm]);

Which outputs:

cxph : 0x220c80a00 - Cxm : 256 - (cxph+Cxm) : 0x320c81200 - &cxph[Cxm] : 0x320c81200

Hmm… it seems that Fermi has a hard time doing simple pointer arithmetic?

Am I missing something? I might have to take a look at the PTX, but since it works (without printf) on a C1060, the issue may rather be in the driver’s JIT compiler…

Any hints on this issue would be welcome :-)
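
For reference, here is a minimal sketch of the debug kernel (cxph and Cxm are the names from my real code, the rest is just boilerplate):

#include <cstdio>

// One thread, one block: no memory races can be involved.
__global__ void debug_kernel(double *cxph, int Cxm)
{
    printf("cxph : %p - Cxm : %d - (cxph+Cxm) : %p - &cxph[Cxm] : %p\n",
           cxph, Cxm, (void *)(cxph + Cxm), (void *)&cxph[Cxm]);
}

// launched as: debug_kernel<<<1, 1>>>(d_cxph, 256);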

What you are seeing is almost certainly due to an out-of-bounds shared memory access. Fermi has much better hardware protection for shared memory than older designs did (after all, an out-of-bounds shared memory access potentially means L2 cache and global memory corruption in Fermi). The reason it goes away in debug builds is that the compiler spills all shared memory to local memory so that the debugger can have access to it during kernel execution.

As for the specifics of why, without a repro case it is impossible to say more.

I don’t see your point with shared memory and L2, since I thought shared memory was closer to the SM than L2, and shared memory accesses are not cached in L2.

Moreover, my kernel doesn’t make use of shared memory. I can try to extract the kernel from the framework to provide a reproducible test case; in the meantime, here is the kernel (run on one thread).

for (iy = 0; iy <= Cym-1; iy += 1) {
    for (ix = 0; ix <= Cxm-1; ix += 1) {
        *(clf+iy+257*iz) = *(Ex+ix+257*iy+66049*iz)-*(Ex+ix+257*iy+66049*iz+257)+*(Ey+ix+257*iy+66049*iz+1)-*(Ey+ix+257*iy+66049*iz);
        *(tmp+iy+257*iz) = *(cymh+iy)/(*(cyph+iy))**(Bza+ix+257*iy+66049*iz)-ch/(*(cyph+iy))**(clf+iy+257*iz);
        *(Hz+ix+257*iy+66049*iz) = *(cxmh+ix)/(*(cxph+ix))**(Hz+ix+257*iy+66049*iz)+mui**(czp+iz)/(*(cxph+ix))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+ix))**(Bza+ix+257*iy+66049*iz);
        *(Bza+ix+257*iy+66049*iz) = *(tmp+iy+257*iz);
    }
    *(clf+iy+257*iz) = *(Ex+257*iy+Cxm+66049*iz)-*(Ex+257*iy+Cxm+66049*iz+257)+*(Ry+iy+257*iz)-*(Ey+257*iy+Cxm+66049*iz);
    *(tmp+iy+257*iz) = *(cymh+iy)/(*(cyph+iy))**(Bza+257*iy+Cxm+66049*iz)-ch/(*(cyph+iy))**(clf+iy+257*iz);
    *(Hz+257*iy+Cxm+66049*iz) = *(cxmh+Cxm)/(*(cxph+Cxm))**(Hz+257*iy+Cxm+66049*iz)+mui**(czp+iz)/(*(cxph+Cxm))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+Cxm))**(Bza+257*iy+Cxm+66049*iz);
    *(Bza+257*iy+Cxm+66049*iz) = *(tmp+iy+257*iz);
    for (ix = 0; ix <= Cxm-1; ix += 1) {
        *(clf+iy+257*iz) = *(Ex+ix+257*Cym+66049*iz)-*(Ax+ix+257*iz)+*(Ey+ix+257*Cym+66049*iz+1)-*(Ey+ix+257*Cym+66049*iz);
        *(tmp+iy+257*iz) = *(cymh+Cym)/(*(cyph+iy))**(Bza+ix+257*iy+66049*iz)-ch/(*(cyph+iy))**(clf+iy+257*iz);
        *(Hz+ix+257*Cym+66049*iz) = *(cxmh+ix)/(*(cxph+ix))**(Hz+ix+257*Cym+66049*iz)+mui**(czp+iz)/(*(cxph+ix))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+ix))**(Bza+ix+257*Cym+66049*iz);
        *(Bza+ix+257*Cym+66049*iz) = *(tmp+iy+257*iz);
    }
    *(clf+iy+257*iz) = *(Ex+Cxm+257*Cym+66049*iz)-*(Ax+Cxm+257*iz)+*(Ry+Cym+257*iz)-*(Ey+Cxm+257*Cym+66049*iz);
    *(tmp+iy+257*iz) = *(cymh+Cym)/(*(cyph+Cym))**(Bza+Cxm+257*Cym+66049*iz)-ch/(*(cyph+Cym))**(clf+iy+257*iz);
    *(Hz+Cxm+257*Cym+66049*iz) = *(cxmh+Cxm)/(*(cxph+Cxm))**(Hz+Cxm+257*Cym+66049*iz)+mui**(czp+iz)/(*(cxph+Cxm))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+Cxm))**(Bza+Cxm+257*Cym+66049*iz);
    *(Bza+Cxm+257*Cym+66049*iz) = *(tmp+iy+257*iz);
}

L1 cache, sorry, that was a typo. Shared memory and L1 cache are physically the same thing in Fermi. There is 64 KB per SM, which is configured either as 16 KB shared memory and 48 KB L1 cache, or 48 KB shared memory and 16 KB L1 cache.
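
For completeness, the split can be selected through the runtime API (a sketch; “kernel” stands for any __global__ function):

// Prefer 48 KB L1 / 16 KB shared memory for the whole device:
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
// Or prefer 48 KB shared memory / 16 KB L1 for one particular kernel:
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);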

OK, now I get it :-)
Anyway, is the compiler able to use part of the shared memory automatically for some purpose? Because, as stated before, I don’t make any explicit use of shared memory.

And it would explain the issue I reported using printf.

What is wrong with the printf() output? Seems perfectly reasonable to me.

Please find attached the source code to reproduce the bug.

You can compile it with:

"nvcc bug_pointer.cu -o bug_pointer -arch=sm_20"

In this version, the bug also occurs with -G, and in cuda-gdb (with memcheck) I get CUDA_EXCEPTION_1:
“Lane Illegal Address”
bug_pointer.cu (3.94 KB)

Hmm… these are doubles, so an offset of 256 elements is 2048 bytes, i.e. 0x800 in hex. I get:

cxph : 0x220c80a00 - Cxm : 256 - (cxph+Cxm) : 0x320c81200 - &cxph[Cxm] : 0x320c81200

There’s a… 0x100000800 offset instead of 0x800!! I would expect to get:

cxph : 0x220c80a00 - Cxm : 256 - (cxph+Cxm) : 0x220c81200 - &cxph[Cxm] : 0x220c81200

(note the first digit, which changes from 2 to 3)

On the CPU, as expected, the very same printf shows the correct offset:

cxph : 0x7fff8ed3a460 - Cxm : 256 - (cxph+Cxm) : 0x7fff8ed3ac60 - &cxph[Cxm] : 0x7fff8ed3ac60
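
i.e. the host-side check is simply (a sketch, with a 257-element array as in the kernel):

#include <cstdio>
int main(void)
{
    double cxph[257];
    int Cxm = 256;
    /* 256 doubles = 256 * 8 bytes = 0x800 bytes, so (cxph+Cxm)
       must be exactly 0x800 above cxph */
    printf("cxph : %p - Cxm : %d - (cxph+Cxm) : %p - &cxph[Cxm] : %p\n",
           (void *)cxph, Cxm, (void *)(cxph + Cxm), (void *)&cxph[Cxm]);
    return 0;
}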

Your repro case is passing host pointers to the device code. Surely that isn’t what you mean to do, is it?

Oops, sorry, apparently I was still in 32-bit mode (ignoring the difference in the upper word…). That indeed looks like a bug. I’d guess you could work around it by compiling for 32 bit with nvcc -m32, although it’s certainly something that should get fixed.
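
For example, something like (based on your earlier command line):

nvcc -m32 bug_pointer.cu -o bug_pointer -arch=sm_20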

ooooops ™ ;-)

Sorry, I did a quick extraction; it seems it was too quick :-(

Here is the correct code. I compile it with -arch=sm_13 so that it can run on the C1060; the same binary triggers the bug on a C2050.

bug_pointer.cu (4.88 KB)

After installing some 32-bit libraries, it stops complaining about that, but CUDA is not happy anyway:

“File bug_pointer.cu - Line 109 - kernel execution : CUDA driver version is insufficient for CUDA runtime version”

I can see the out of bounds error with -arch=sm_13, but not with -arch=sm_20.

If I compile with "-arch=sm_13 --ptxas-options=-v", the compiler informs me that the kernel uses "152+16 bytes smem" (though it’s not clear to me why it would need to use shared memory here). With "-arch=sm_20 --ptxas-options=-v", no shared memory is used.

It could be a compiler bug. Maybe, with too many registers in use, the compiler tries to spill some registers into shared memory(?), but it is unaware that, on Fermi, the pointer size is 40 bits, not 32 bits.

The kernel argument list is stored in shared memory in pre-Fermi cards.

For sm_1x, the first 16 bytes of shared memory contain the grid/block configuration. This data is followed by the kernel arguments.
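
That would also account for the "152+16 bytes smem" reported above: assuming a 64-bit build (8-byte device pointers) and the argument list shown in the kernel earlier (14 pointers, 4 ints, 2 doubles), the argument block packs to exactly 152 bytes after alignment. Roughly (a sketch of the sm_1x shared memory layout):

// bytes   0..15  : grid/block configuration (the "+16")
// bytes  16..19  : int iz            (followed by 4 bytes of padding)
// bytes  24..31  : double *Ax
//   ... remaining pointers, ints and doubles in declaration order ...
// bytes 160..167 : double *tmp
// => argument block of 152 bytes, matching the ptxas report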

Thanks for the information :-)

Anyway, it doesn’t help me much with my issue. I will probably file a bug report, but I’m unsure whether it’s related to nvcc, since it works on a C1060; it might rather be related to the JIT in the driver?

Ah, I see. It still looks like a compiler bug, though: any tweaking of the code that reduces the number of registers significantly below 60 makes the problem go away.

Here’s a reduced version of the same kernel that still shows the out-of-bounds read:

__global__ void kernel_bugged(int iz, double *Ax, double *Bza, int Cxm, int Cym, int Cz,
                              double *Ex, double *Ey, double *Hz, double *Ry, double ch,
                              double *clf, double *cxmh, double *cxph, double *cymh,
                              double *cyph, double *czm, double *czp, double mui, double *tmp)
{
    int ix, iy;

    for (iy = 0; iy <= 1; iy += 1)
    {
        tmp[iy+257*iz] = 0;
        Hz[257*iy+Cxm+66049*iz] = 0;

        for (ix = 0; ix <= Cxm-1; ix += 1)
        {
            *(clf+iy+257*iz) =
                Ax[ix+257*iz]
                +Ey[ix+257*Cym+66049*iz];
            *(Hz+ix+257*Cym+66049*iz) =
                cxmh[ix]/cxph[ix]*Hz[ix+257*Cym+66049*iz]
                -czm[iz]/cxph[ix]*Bza[ix+257*Cym+66049*iz];
        }

        clf[iy+257*iz] =
            Ex[Cxm+257*Cym+66049*iz]
            +Ax[Cxm+257*iz]
            +Ry[Cym+257*iz]
            +Ey[Cxm+257*Cym+66049*iz];
        *(tmp+iy+257*iz) = cymh[Cym]/cyph[Cym]*Bza[Cxm+257*Cym+66049*iz];
        *(Hz+Cxm+257*Cym+66049*iz) =
            cxmh[Cxm]/cxph[Cxm]*Hz[Cxm+257*Cym+66049*iz]
            +czp[iz]/cxph[Cxm]*tmp[iy+257*iz]
            +czm[iz]/cxph[Cxm]*Bza[Cxm+257*Cym+66049*iz];
    }
}

Removing any single term from any of the lines or even replacing any remaining *(xxx+yyy) with xxx[yyy] makes the out-of-bounds read go away.
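
For example, changing just the left-hand side of one store (the two forms are strictly equivalent in C):

// as posted: triggers the out-of-bounds read
*(Hz+ix+257*Cym+66049*iz) =
    cxmh[ix]/cxph[ix]*Hz[ix+257*Cym+66049*iz]
    -czm[iz]/cxph[ix]*Bza[ix+257*Cym+66049*iz];
// semantically identical, but the problem goes away
Hz[ix+257*Cym+66049*iz] =
    cxmh[ix]/cxph[ix]*Hz[ix+257*Cym+66049*iz]
    -czm[iz]/cxph[ix]*Bza[ix+257*Cym+66049*iz];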

Here is the same code, updated with a second version of the same kernel using non-linearized array accesses. It’s interesting to note that compiling with sm_20 I got 63 registers for each of the kernels, but with sm_13 the linearized version uses only 62 registers while the non-linearized one uses 65 registers.

Anyway, the non-linearized version runs smoothly on both the C1060 and the C2050 (with 65 registers), while the linearized version breaks on the C2050 only (compiled with either sm_13 or sm_20).
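
By “non-linearized” I mean letting the compiler do the index arithmetic through a 3-D array type instead of computing the offsets by hand; roughly (a sketch, with the same 257x257 planes):

// linearized form:     *(Hz + ix + 257*iy + 66049*iz)
// non-linearized form: view the buffer as [iz][iy][ix], 257x257 planes
double (*Hz3)[257][257] = (double (*)[257][257])Hz;
Hz3[iz][iy][ix] = cxmh[ix] / cxph[ix] * Hz3[iz][iy][ix];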
bug_pointer.cu (7.36 KB)