Program works on GTX 260 but not on GTX 480

Hello,

I have a rather complex CUDA program for simulating shallow water flows that works perfectly on a GTX 260 card but deviates from the correct results, and even freezes, on a GTX 480 card. Since I have a reduction kernel, I have applied the volatile modification explained in the Fermi Compatibility Guide, but the program still produces incorrect results or freezes. Apart from the volatile issue, is there any other reason that might explain why a program works on a 260 card but not on a 480?

The program is implemented entirely in double precision and uses the CUDA runtime API. I'm on WinXP with CUDA 3.1 and compile with "-arch=sm_20" when targeting the 480 card.

Thanks.

The program probably contains incorrect synchronization. Fermi behaves differently here, so the problems show up quickly and the program freezes. Use atomics, etc.

Hi Cudouer. The biggest architectural difference between the GTX 260 and GTX 480 is the cache structure, and that's exactly where I'd start looking for your problem. On the GTX 260, any global memory accesses you perform go out to main memory. The GTX 480, however, has an L1 and L2 cache hierarchy. One thing that trips up a lot of people is that the L1 cache is not coherent across SMs; L2 is coherent. Based on that, my guess for the most likely cause of problems is threads from different blocks reading and writing the same global memory location. On a GTX 260, one thread can write to global memory and the result eventually becomes visible to threads in other blocks. On a GTX 480, however, a thread can write to a global memory location, but the data may sit in the L1 cache of that particular SM until it gets evicted to L2. This can cause problems: for example, if other threads write to the same memory location, data could be lost!

You mentioned you have a reduction kernel and that you use the volatile keyword, which is good. But that isn't going to solve cross-block communication issues. As Lev already said, there is likely some sort of synchronization problem or race condition in your code. For cross-block data communication, you definitely need to use atomics.

Take a look at section G.4.2 in the CUDA programming guide for instructions on how to disable L1 caching of global memory accesses. See if that solves your problem; then you can debug from there.
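
To illustrate what cross-block communication with atomics can look like in double precision, here is a minimal sketch (made-up kernel and variable names, not your code). Since sm_20 has no native double-precision atomicAdd, it uses the atomicCAS-based workaround from the CUDA programming guide:

[codebox]
// atomicAdd for double, built on 64-bit atomicCAS (the workaround given in the
// CUDA C Programming Guide for devices without native double-precision atomics).
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// Hypothetical kernel: thread 0 of each block adds that block's partial sum to
// a single global total. Without the atomic, the read-modify-write from
// different blocks races, and on Fermi the non-coherent L1 makes the race
// much more likely to show up.
__global__ void accumulateBlockSums(const double *blockPartials, double *total)
{
    if (threadIdx.x == 0)
        atomicAddDouble(total, blockPartials[blockIdx.x]);
}
[/codebox]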

Thanks very much for your comments. I've disabled the L1 cache (compiling with -arch=sm_20 -Xptxas -dlcm=cg) but it didn't help; the same thing happens. As far as I know, there are no global memory read/write dependencies between threads of different blocks. Apart from the reduction kernel, in all the other kernels each thread is independent of the others and writes its results to global memory at the end of the kernel. I use exactly the same synchronization scheme in other shallow water simulation programs and they work fine on the 480 card. The only difference is that this program is more computationally intensive and one kernel has many register spills to local memory. At one specific point, each thread calls the rg Eispack routine (translated to C with the f2c utility, along with its dependent functions) to get the eigenvalues and eigenvectors of a 6x6 matrix. I suspect the problem on the 480 may be related to this, but I'm not sure.

Try another version of the SDK. By the way, what about the compiler's debug mode?

Anecdotal thought: out-of-bounds shared memory accesses are handled differently between GT200 and GF100. On GT200 the out-of-bounds accesses may not be detected… I almost suspect they "wrap around" and only the low bits of the address are checked. On GF100, Bad Things happen… either corruption, all zeros, or kernel termination.

This bit me once when my code seemed to work on GT200 but not on GF100. It was a bug in my code, but GT200's behavior hid the problem better. (This is a BAD characteristic, by the way; you want failures to be obvious!)

Anyway, that’s just anecdotal and I’ve never deeply analyzed it. But check your shared memory accesses, make sure you’re always in bounds.
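
To make that concrete, here is a made-up minimal example (not from the original code) of the kind of shared memory overrun that GT200 can hide and GF100 tends to punish:

[codebox]
__global__ void stencil(const float *in, float *out, int n)
{
    __shared__ float s[256];                 // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    if (gid < n) s[tid] = in[gid];
    __syncthreads();

    // BUG: when tid == 255 this reads s[256], one element past the end.
    // GT200 may silently return garbage; GF100 can corrupt data or kill the kernel.
    if (gid < n)
        out[gid] = 0.5f * (s[tid] + s[tid + 1]);

    // Fix: size the array with a halo element (float s[256 + 1]) and load it
    // explicitly, or guard the access with (tid + 1 < blockDim.x).
}
[/codebox]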

I also had a problem when moving my code to Fermi… with a reduction! I found out why: it is a synchronization issue. Before Fermi, all threads in the same warp were assumed to be synchronized, so it was not necessary to synchronize them explicitly (see the reduction example in the CUDA SDK). On Fermi this no longer seems to hold, so I had to add synchronization. The following is the new reduction code (last part):

[codebox]
if(it < 256) {
    itu=it+256;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();
if(it < 128) {
    itu=it+128;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();
if(it < 64) {
    itu=it+64;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();
// last warp
if(it < 32) {
    itu=it+32;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 16) {
    itu=it+16;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 8) {
    itu=it+8;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 4) {
    itu=it+4;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 2) {
    itu=it+2;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
// last one
if(it == 0) {
    itu=it+1;
    i[it]+=i[itu];
    m[it]+=m[itu];
    ISum[blockIdx.y] = i[0];
    MSum[blockIdx.y] = m[0];
}
[/codebox]

I even tried removing the last sync, without success!

Yves

So… Fermi’s warps are not really “warps”? Oops… meaning a block with 32 threads still needs __syncthreads()?

I am not a CUDA expert, but this is my experience: I had to add these __syncthreads() calls in my reduction kernels!

Maybe others have experience with reductions on Fermi?

Yves

Very interesting. This may also be a compiler issue.

It isn’t a compiler issue, and warps are still warps on Fermi. See the CUDA 3.1 “Fermi Compatibility Guide”, section 1.2.2, for details. The short version is that Fermi doesn’t have instructions that operate directly on shared memory; it loads into and stores from registers instead, so unless you do something explicit (using volatile), compiler optimization can keep values that should be in shared memory in registers, and that breaks the implicit intra-warp synchronization a lot of code relies on for things like reductions.
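
For what it's worth, here is a minimal sketch of how the volatile fix typically looks for the final warp, following the SDK reduction pattern (the name sdata is made up, not Yves' i/m arrays). It avoids the extra __syncthreads() calls inside the last warp:

[codebox]
// Final warp of a shared-memory block reduction (block size >= 64).
// Declaring the shared pointer volatile forces every read and write to go
// through shared memory instead of staying in a register, which is what the
// implicit warp-synchronous code relies on. Called as:
//     if (tid < 32) warpReduce(sdata, tid);
__device__ void warpReduce(volatile double *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
[/codebox]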