Weird pointer arithmetic bug?

Hi,

I’m trying to debug an “unspecified launch failure” that I encounter on two different machines, each with a Tesla C2050, both running CUDA 4.0 and driver 270.41.19.

However, the failure doesn’t occur when the code is compiled with -G; this makes it harder to find out exactly where the error is coming from using cuda-gdb. Any hints?

Moreover, the exact same binary runs smoothly on an older computer with a Tesla C1060 (CUDA 4.0 and driver 270.41.19).

Using cuda-memcheck I get an invalid read:

========= Invalid global read of size 8
========= at 0x00000800 in p4a_wrapper_main
========= by thread (0,0,0) in block (2,0,0)
========= Address 0x320a80800 is out of bounds

========= ERROR SUMMARY: 1 error

OK, so I tried to debug my kernel using device printf, launching the kernel with only 1 thread (so that I avoid any memory race issue):

printf("cxph : %p - Cxm : %d - (cxph+Cxm) : %p - &cxph[Cxm] : %p\n", cxph, Cxm, (cxph+Cxm), &cxph[Cxm]);

Which outputs:

cxph : 0x220c80a00 - Cxm : 256 - (cxph+Cxm) : 0x320c81200 - &cxph[Cxm] : 0x320c81200

Hmm… it seems that Fermi has a hard time doing simple pointer arithmetic?

Am I missing something? I might have to take a look at the PTX, but since it works (without printf) on a C1060, the issue may rather be in the driver’s JIT compiler…

Any hints on this issue would be welcome :-)
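
For reference, here is a minimal sketch of the debug kernel (cxph and Cxm are the names from my real code, the rest is just boilerplate):

#include <cstdio>

// One thread, one block: no memory races can be involved.
__global__ void debug_kernel(double *cxph, int Cxm)
{
    printf("cxph : %p - Cxm : %d - (cxph+Cxm) : %p - &cxph[Cxm] : %p\n",
           cxph, Cxm, (void *)(cxph + Cxm), (void *)&cxph[Cxm]);
}

// launched as: debug_kernel<<<1, 1>>>(d_cxph, 256);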

What you are seeing is almost certainly due to an out-of-bounds shared memory access. Fermi has much better hardware protection for shared memory than older designs did (after all, an out-of-bounds shared memory access potentially means L2 cache and global memory corruption in Fermi). The reason it goes away in debug builds is that the compiler spills all shared memory to local memory so that the debugger can have access to it during kernel execution.

As for the specifics of why, without a repro case it is impossible to say more.

I don’t see your point with shared memory and L2, since I thought shared memory was closer to the SM than L2, and shared memory accesses are not cached in L2.

Moreover, my kernel doesn’t make use of shared memory. I can try to extract the kernel from the framework to provide a reproducible test case; in the meantime, here is the kernel (run on one thread).

for (iy = 0; iy <= Cym-1; iy += 1) {
    for (ix = 0; ix <= Cxm-1; ix += 1) {
        *(clf+iy+257*iz) = *(Ex+ix+257*iy+66049*iz)-*(Ex+ix+257*iy+66049*iz+257)+*(Ey+ix+257*iy+66049*iz+1)-*(Ey+ix+257*iy+66049*iz);
        *(tmp+iy+257*iz) = *(cymh+iy)/(*(cyph+iy))**(Bza+ix+257*iy+66049*iz)-ch/(*(cyph+iy))**(clf+iy+257*iz);
        *(Hz+ix+257*iy+66049*iz) = *(cxmh+ix)/(*(cxph+ix))**(Hz+ix+257*iy+66049*iz)+mui**(czp+iz)/(*(cxph+ix))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+ix))**(Bza+ix+257*iy+66049*iz);
        *(Bza+ix+257*iy+66049*iz) = *(tmp+iy+257*iz);
    }
    *(clf+iy+257*iz) = *(Ex+257*iy+Cxm+66049*iz)-*(Ex+257*iy+Cxm+66049*iz+257)+*(Ry+iy+257*iz)-*(Ey+257*iy+Cxm+66049*iz);
    *(tmp+iy+257*iz) = *(cymh+iy)/(*(cyph+iy))**(Bza+257*iy+Cxm+66049*iz)-ch/(*(cyph+iy))**(clf+iy+257*iz);
    *(Hz+257*iy+Cxm+66049*iz) = *(cxmh+Cxm)/(*(cxph+Cxm))**(Hz+257*iy+Cxm+66049*iz)+mui**(czp+iz)/(*(cxph+Cxm))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+Cxm))**(Bza+257*iy+Cxm+66049*iz);
    *(Bza+257*iy+Cxm+66049*iz) = *(tmp+iy+257*iz);
    for (ix = 0; ix <= Cxm-1; ix += 1) {
        *(clf+iy+257*iz) = *(Ex+ix+257*Cym+66049*iz)-*(Ax+ix+257*iz)+*(Ey+ix+257*Cym+66049*iz+1)-*(Ey+ix+257*Cym+66049*iz);
        *(tmp+iy+257*iz) = *(cymh+Cym)/(*(cyph+iy))**(Bza+ix+257*iy+66049*iz)-ch/(*(cyph+iy))**(clf+iy+257*iz);
        *(Hz+ix+257*Cym+66049*iz) = *(cxmh+ix)/(*(cxph+ix))**(Hz+ix+257*Cym+66049*iz)+mui**(czp+iz)/(*(cxph+ix))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+ix))**(Bza+ix+257*Cym+66049*iz);
        *(Bza+ix+257*Cym+66049*iz) = *(tmp+iy+257*iz);
    }
    *(clf+iy+257*iz) = *(Ex+Cxm+257*Cym+66049*iz)-*(Ax+Cxm+257*iz)+*(Ry+Cym+257*iz)-*(Ey+Cxm+257*Cym+66049*iz);
    *(tmp+iy+257*iz) = *(cymh+Cym)/(*(cyph+Cym))**(Bza+Cxm+257*Cym+66049*iz)-ch/(*(cyph+Cym))**(clf+iy+257*iz);
    *(Hz+Cxm+257*Cym+66049*iz) = *(cxmh+Cxm)/(*(cxph+Cxm))**(Hz+Cxm+257*Cym+66049*iz)+mui**(czp+iz)/(*(cxph+Cxm))**(tmp+iy+257*iz)-mui**(czm+iz)/(*(cxph+Cxm))**(Bza+Cxm+257*Cym+66049*iz);
    *(Bza+Cxm+257*Cym+66049*iz) = *(tmp+iy+257*iz);
}

L1 cache, sorry, that was a typo. Shared memory and L1 cache are physically the same thing in Fermi. There is 64 KB per SM, which is configured either as 16 KB shared memory and 48 KB L1 cache, or 48 KB shared memory and 16 KB L1 cache.
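
For completeness, the split can be selected through the runtime API (a sketch; “kernel” stands for any __global__ function):

// Prefer 48 KB L1 / 16 KB shared memory for the whole device:
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
// Or prefer 48 KB shared memory / 16 KB L1 for one particular kernel:
cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferShared);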

OK, now I get it :-)
Anyway, is the compiler able to use part of the shared memory automatically for some purpose? Because, as stated before, I don’t make any explicit use of shared memory.

And it would explain the issue I reported using printf.

What is wrong with the printf() output? Seems perfectly reasonable to me.

Please find attached the source code to reproduce the bug.

You can compile it with:

"nvcc bug_pointer.cu -o bug_pointer -arch=sm_20"

In this version, the bug also occurs with -G, and in cuda-gdb (with memcheck) I get CUDA_EXCEPTION_1:
“Lane Illegal Address”
bug_pointer.cu (3.94 KB)

Hmm… these are doubles, so an offset of 256 elements is 2048 bytes, i.e. 0x800 in hex. I get:

cxph : 0x220c80a00 - Cxm : 256 - (cxph+Cxm) : 0x320c81200 - &cxph[Cxm] : 0x320c81200

There’s a… 0x100000800 offset instead of 0x800!! I would expect to get:

cxph : 0x220c80a00 - Cxm : 256 - (cxph+Cxm) : 0x220c81200 - &cxph[Cxm] : 0x220c81200

(note the first digit, which changes from 2 to 3)

On the CPU, as expected, the very same printf shows the correct offset:

cxph : 0x7fff8ed3a460 - Cxm : 256 - (cxph+Cxm) : 0x7fff8ed3ac60 - &cxph[Cxm] : 0x7fff8ed3ac60
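
i.e. the host-side check is simply (a sketch, with a 257-element array as in the kernel):

#include <cstdio>
int main(void)
{
    double cxph[257];
    int Cxm = 256;
    /* 256 doubles = 256 * 8 bytes = 0x800 bytes, so (cxph+Cxm)
       must be exactly 0x800 above cxph */
    printf("cxph : %p - Cxm : %d - (cxph+Cxm) : %p - &cxph[Cxm] : %p\n",
           (void *)cxph, Cxm, (void *)(cxph + Cxm), (void *)&cxph[Cxm]);
    return 0;
}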

Your repro case is passing host pointers to the device code. Surely that isn’t what you mean to do, is it?

Oops, sorry, apparently I was still in 32-bit mode (ignoring the difference in the upper word…). That indeed looks like a bug. I’d guess you could work around it by compiling for 32 bit with nvcc -m32, although it’s certainly something that should get fixed.
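
For example, something like (based on your earlier command line):

nvcc -m32 bug_pointer.cu -o bug_pointer -arch=sm_20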

ooooops ™ ;-)

Sorry, I did a quick extraction; it seems it was too quick :-(

Here is the correct code. I compile it with -arch=sm_13 so that it can run on the C1060; the same binary triggers the bug on a C2050.

bug_pointer.cu (4.88 KB)

After installing some 32-bit libraries, it stops complaining about that, but CUDA is not happy anyway:

“File bug_pointer.cu - Line 109 - kernel execution : CUDA driver version is insufficient for CUDA runtime version”

I can see the out of bounds error with -arch=sm_13, but not with -arch=sm_20.

If I compile with "-arch=sm_13 --ptxas-options=-v", the compiler informs me that the kernel uses "152+16 bytes smem" (though it’s not clear to me why it would need to use shared memory here). With "-arch=sm_20 --ptxas-options=-v", no shared memory is used.

It could be a compiler bug. Maybe, with too many registers in use, the compiler tries to spill some registers into shared memory(?), but it is unaware that, on Fermi, the pointer size is 40 bits, not 32 bits.

The kernel argument list is stored in shared memory in pre-Fermi cards.

For sm_1x, the first 16 bytes of shared memory contain the grid/block configuration. This data is followed by the kernel arguments.
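
That would also account for the "152+16 bytes smem" reported above: assuming a 64-bit build (8-byte device pointers) and the argument list shown in the kernel earlier (14 pointers, 4 ints, 2 doubles), the argument block packs to exactly 152 bytes after alignment. Roughly (a sketch of the sm_1x shared memory layout):

// bytes   0..15  : grid/block configuration (the "+16")
// bytes  16..19  : int iz            (followed by 4 bytes of padding)
// bytes  24..31  : double *Ax
//   ... remaining pointers, ints and doubles in declaration order ...
// bytes 160..167 : double *tmp
// => argument block of 152 bytes, matching the ptxas report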

Thanks for the information :-)

Anyway, it doesn’t help me much with my issue. I will probably file a bug report, but I’m unsure whether it’s related to nvcc, since it works on a C1060; it might rather be related to the JIT in the driver?

Ah, I see. It still looks like a compiler bug, though: any tweaking of the code that reduces the number of registers significantly below 60 makes the problem go away.

Here’s a reduced version of the same kernel that still shows the out-of-bounds read:

__global__ void kernel_bugged(int iz, double *Ax, double *Bza, int Cxm, int Cym, int Cz,
                              double *Ex, double *Ey, double *Hz, double *Ry, double ch,
                              double *clf, double *cxmh, double *cxph, double *cymh,
                              double *cyph, double *czm, double *czp, double mui, double *tmp)
{
    int ix, iy;

    for (iy = 0; iy <= 1; iy += 1)
    {
        tmp[iy+257*iz] = 0;
        Hz[257*iy+Cxm+66049*iz] = 0;

        for (ix = 0; ix <= Cxm-1; ix += 1)
        {
            *(clf+iy+257*iz) =
                Ax[ix+257*iz]
                +Ey[ix+257*Cym+66049*iz];
            *(Hz+ix+257*Cym+66049*iz) =
                cxmh[ix]/cxph[ix]*Hz[ix+257*Cym+66049*iz]
                -czm[iz]/cxph[ix]*Bza[ix+257*Cym+66049*iz];
        }

        clf[iy+257*iz] =
            Ex[Cxm+257*Cym+66049*iz]
            +Ax[Cxm+257*iz]
            +Ry[Cym+257*iz]
            +Ey[Cxm+257*Cym+66049*iz];
        *(tmp+iy+257*iz) = cymh[Cym]/cyph[Cym]*Bza[Cxm+257*Cym+66049*iz];
        *(Hz+Cxm+257*Cym+66049*iz) =
            cxmh[Cxm]/cxph[Cxm]*Hz[Cxm+257*Cym+66049*iz]
            +czp[iz]/cxph[Cxm]*tmp[iy+257*iz]
            +czm[iz]/cxph[Cxm]*Bza[Cxm+257*Cym+66049*iz];
    }
}

Removing any single term from any of the lines or even replacing any remaining *(xxx+yyy) with xxx[yyy] makes the out-of-bounds read go away.
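
For example, changing just the left-hand side of one store (the two forms are strictly equivalent in C):

// as posted: triggers the out-of-bounds read
*(Hz+ix+257*Cym+66049*iz) =
    cxmh[ix]/cxph[ix]*Hz[ix+257*Cym+66049*iz]
    -czm[iz]/cxph[ix]*Bza[ix+257*Cym+66049*iz];
// semantically identical, but the problem goes away
Hz[ix+257*Cym+66049*iz] =
    cxmh[ix]/cxph[ix]*Hz[ix+257*Cym+66049*iz]
    -czm[iz]/cxph[ix]*Bza[ix+257*Cym+66049*iz];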

Here is the same code, updated with a second version of the same kernel using non-linearized array accesses. It’s interesting to note that compiling with sm_20 I got 63 registers for each of the kernels, but with sm_13 the linearized version uses only 62 registers while the non-linearized one uses 65 registers.

Anyway, the non-linearized version runs smoothly on both the C1060 and the C2050 (with 65 registers), while the linearized version breaks on the C2050 only (compiled with either sm_13 or sm_20).
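
By “non-linearized” I mean letting the compiler do the index arithmetic through a 3-D array type instead of computing the offsets by hand; roughly (a sketch, with the same 257x257 planes):

// linearized form:     *(Hz + ix + 257*iy + 66049*iz)
// non-linearized form: view the buffer as [iz][iy][ix], 257x257 planes
double (*Hz3)[257][257] = (double (*)[257][257])Hz;
Hz3[iz][iy][ix] = cxmh[ix] / cxph[ix] * Hz3[iz][iy][ix];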
bug_pointer.cu (7.36 KB)