Why does code fail when moved from a compute capability 1.x card to a 2.0 card? Memory allocation fails on Tesla card

My code runs very well on a GeForce 9400 GPU on a Mac and also on a GeForce GTX 280 under Windows 7, but when I moved it to a Windows machine with a Tesla C2050 card and to a Linux server with two Tesla C2070 cards, I ran into problems. The compilation succeeds, but the code no longer runs on either machine. Even cudaMalloc fails, with the error message: unspecified launch failure.

I know the Tesla C2070 card has compute capability 2.0, so in the makefile I used the flags -arch sm_20 and -ftz=false -prec-div=true -prec-sqrt=true to ensure IEEE compliance.

I used the newest version of CUDA, 4.1.28, on the Mac and Windows machines, and CUDA 4.0 on the Linux server.

Is there any difference in memory allocation between 2.0 cards and 1.x cards? How can I solve my problem?
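(Worth noting: an "unspecified launch failure" reported by cudaMalloc is usually a sticky error left over from an earlier, unchecked kernel launch, not a failure of the allocation itself. A minimal checking pattern is sketched below; the kernel and sizes are made up for illustration, not taken from the poster's code.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to illustrate the checking pattern.
__global__ void myKernel(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *d_data = NULL;

    // Check every runtime API call: an error reported here may in fact
    // have been produced by an earlier kernel that was never checked.
    cudaError_t err = cudaMalloc((void **)&d_data, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // cudaGetLastError catches launch-configuration errors;
    // cudaDeviceSynchronize surfaces errors from the kernel body itself.
    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_data);
    return 0;
}
```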

Thanks a lot.

Use cuda-memcheck to look for out-of-bounds accesses in shared memory.

Thank you for your instant reply.

I had thought of this method, but I do not know how to use it. The webpage says: Running CUDA-MEMCHECK on your application is easy; simply pass your application's name as a parameter to CUDA-MEMCHECK on the command line.

But my CUDA code is inside MATLAB MEX functions, and I call these compiled MEX files from MATLAB. How can I use CUDA-MEMCHECK in this case? Thank you so much.

Now I know how to use it: compile the MATLAB file to an .exe file and then pass that as the parameter to CUDA-MEMCHECK.
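(For readers following along, the steps look roughly like this; the file names are placeholders, not the poster's actual files.)

```shell
# Compile the MATLAB entry point to a standalone executable with the
# MATLAB Compiler (mcc), then run that executable under cuda-memcheck.
mcc -m mymain.m -o mymain
cuda-memcheck ./mymain
```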

But the checking result is strange, and I cannot understand it. It shows the address error happening in the latter function. Yet when I run on the Linux machine, it shows the error "unspecified launch failure" in the first MEX file, which only allocates CUDA memory.

And the strangest thing: why was there no such address error on the older cards with compute capability 1.x?

The hardware of sm_1x devices is not as thorough as the hardware of sm_2x devices when it comes to checking for out of bounds accesses. If I recall correctly, this applies to out-of-bounds shared memory accesses in particular.
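(A contrived illustration of the kind of bug sm_1x hardware can let slide, not the poster's code: an off-by-one read past the end of a shared memory array.)

```cuda
__global__ void sumPairs(const float *in, float *out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // BUG: tile[threadIdx.x + 1] is out of bounds for threadIdx.x == 255.
    // On sm_1x parts such a read often silently returns garbage; on
    // Fermi (sm_2x), and under cuda-memcheck, it can be flagged as an
    // invalid access or trigger a launch failure.
    if (i < n)
        out[i] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}
```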

I ran cuda-memcheck several times, and each time I got a different address error.

For example:

========= Invalid global read of size 4

========= at 0x00000178 in vecProject

========= by thread (0,0,0) in block (8,5,0)

========= Address 0x00660014 is out of bounds


========= ERROR SUMMARY: 1 error

The next time it is

========= Invalid global read of size 4

========= at 0x00000178 in vecProject

========= by thread (0,0,0) in block (7,2,0)

========= Address 0x00660014 is out of bounds


========= ERROR SUMMARY: 1 error

Is this right? I checked my code but I do not know how to find the error. Any suggestions? Thanks.

Yes, this is right. The order of execution of threads is undefined and can vary from run to run, so each time a different thread hits the stray read first.
However, in both cases the read happens at the same instruction, and even accesses the same memory location.

Compile your program with the [font=“Courier New”]-G[/font] option to nvcc to include debugging information and rerun cuda-memcheck to see the problematic line of your kernel.
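(Something along these lines; the file names are placeholders.)

```shell
# -G embeds device debug information, so cuda-memcheck can map the
# failing address back to a source line (at the cost of disabling
# device code optimizations).
nvcc -G -g -arch=sm_20 -o myprog myprog.cu
cuda-memcheck ./myprog
```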

Sometimes I wish this overly STRICT memory checking could be turned off if desired.

Why? Would you prefer your code silently returned wrong results?