Program works on GTX 260 but not on GTX 480

Hello,

I have a rather complex CUDA program for simulating shallow water flows that works perfectly on a GTX 260 card but deviates from the correct results, and even freezes, on a GTX 480 card. Since I have a reduction kernel, I have applied the volatile modification explained in the Fermi Compatibility Guide, but the program still produces incorrect results or freezes. Apart from the volatile issue, is there any other reason that might explain why a program works on a 260 card but not on a 480?

The program is implemented entirely in double precision and uses the CUDA runtime API. I'm on WinXP with CUDA 3.1 and compile with "-arch=sm_20" when targeting the 480 card.

Thanks.

The program probably contains incorrect synchronization. Fermi behaves differently here, so the problems show up quickly and the program freezes. Use atomics, etc.

Hi Cudouer. The biggest architectural difference between the GTX 260 and GTX 480 is the cache structure, and that's exactly where I'd start looking for your problem. On the GTX 260, any global memory accesses you perform go out to main memory. The GTX 480, however, has an L1 and L2 cache hierarchy. One thing that trips up a lot of people is that the L1 cache is not coherent across SMs; L2 is coherent. Based on that, my guess for the most likely cause of problems is threads from different blocks reading and writing the same global memory location. On a GTX 260, one thread can write to global memory and the result eventually becomes visible to threads in other blocks. On a GTX 480, however, a thread can write to a global memory location, but the data may sit in the L1 cache of that particular SM until it gets evicted to L2. This can cause problems: for example, if other threads write to the same memory location, data could be lost!

You mentioned you have a reduction kernel and that you use the volatile keyword, which is good. But that isn't going to solve cross-block communication issues. As Lev already said, there is likely some sort of synchronization problem or race condition in your code. For cross-block data communication, you definitely need to use atomics.

Take a look at section G.4.2 in the CUDA programming guide for instructions on how to disable L1 caching of global memory accesses. See if that solves your problem; then you can debug from there.
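
To illustrate what cross-block communication with atomics can look like in double precision, here is a minimal sketch (made-up kernel and variable names, not your code). Since sm_20 has no native double-precision atomicAdd, it uses the atomicCAS-based workaround from the CUDA programming guide:

[codebox]
// atomicAdd for double, built on 64-bit atomicCAS (the workaround given in the
// CUDA C Programming Guide for devices without native double-precision atomics).
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// Hypothetical kernel: thread 0 of each block adds that block's partial sum to
// a single global total. Without the atomic, the read-modify-write from
// different blocks races, and on Fermi the non-coherent L1 makes the race
// much more likely to show up.
__global__ void accumulateBlockSums(const double *blockPartials, double *total)
{
    if (threadIdx.x == 0)
        atomicAddDouble(total, blockPartials[blockIdx.x]);
}
[/codebox]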

Thanks very much for your comments. I've disabled the L1 cache (compiling with -arch=sm_20 -Xptxas -dlcm=cg) but it didn't help; the same thing happens. As far as I know, there are no global memory read/write dependencies between threads of different blocks. Apart from the reduction kernel, in all the other kernels each thread is independent of the others and writes its results to global memory at the end of the kernel. I use exactly the same synchronization scheme in other shallow water simulation programs and they work fine on the 480 card. The only difference is that this program is more computationally intensive and one kernel has many register spills to local memory. At one specific point, each thread calls the rg Eispack routine (translated to C with the f2c utility, along with its dependent functions) to get the eigenvalues and eigenvectors of a 6x6 matrix. I suspect the problem on the 480 may be related to this, but I'm not sure.

Try another version of the SDK. By the way, what about the compiler's debug mode?

Anecdotal thought: out-of-bounds shared memory accesses are handled differently between GT200 and GF100. On GT200 the out-of-bounds accesses may not be detected… I almost suspect they "wrap around" and only the low bits of the address are checked. On GF100, Bad Things happen… either corruption, all zeros, or kernel termination.

This bit me once when my code seemed to work on GT200 but not on GF100. It was a bug in my code, but GT200's behavior hid the problem better. (This is a BAD characteristic, by the way; you want failures to be obvious!)

Anyway, that’s just anecdotal and I’ve never deeply analyzed it. But check your shared memory accesses, make sure you’re always in bounds.
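
To make that concrete, here is a made-up minimal example (not from the original code) of the kind of shared memory overrun that GT200 can hide and GF100 tends to punish:

[codebox]
__global__ void stencil(const float *in, float *out, int n)
{
    __shared__ float s[256];                 // assumes blockDim.x == 256

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    if (gid < n) s[tid] = in[gid];
    __syncthreads();

    // BUG: when tid == 255 this reads s[256], one element past the end.
    // GT200 may silently return garbage; GF100 can corrupt data or kill the kernel.
    if (gid < n)
        out[gid] = 0.5f * (s[tid] + s[tid + 1]);

    // Fix: size the array with a halo element (float s[256 + 1]) and load it
    // explicitly, or guard the access with (tid + 1 < blockDim.x).
}
[/codebox]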

I also had a problem when moving my code to Fermi… with a reduction! I found out why: it is a synchronization issue. Before Fermi, all threads in the same warp were assumed to be synchronized, so it was not necessary to synchronize them explicitly (see the reduction example in the CUDA SDK). On Fermi this no longer seems to hold, so I had to add synchronization. The following is the new reduction code (last part):

[codebox]
if(it < 256) {
    itu=it+256;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();
if(it < 128) {
    itu=it+128;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();
if(it < 64) {
    itu=it+64;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();
// last warp
if(it < 32) {
    itu=it+32;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 16) {
    itu=it+16;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 8) {
    itu=it+8;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 4) {
    itu=it+4;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
if(it < 2) {
    itu=it+2;
    i[it]+=i[itu];
    m[it]+=m[itu];
}
__syncthreads();     // for FERMI
// last one
if(it == 0) {
    itu=it+1;
    i[it]+=i[itu];
    m[it]+=m[itu];
    ISum[blockIdx.y] = i[0];
    MSum[blockIdx.y] = m[0];
}
[/codebox]

I even tried removing the last sync, without success!

Yves

So… Fermi’s warps are not really “warps”? Oops… meaning a block with 32 threads still needs __syncthreads()?

I am not a CUDA expert, but this is my experience: I had to add these __syncthreads() calls in my reduction kernels!

Maybe others have experience with reductions on Fermi?

Yves

Very interesting. This may also be a compiler issue.

It isn’t a compiler issue, and warps are still warps on Fermi. See the CUDA 3.1 “Fermi Compatibility Guide”, section 1.2.2, for details. The short version is that Fermi doesn’t have instructions that operate directly on shared memory; it loads into and stores from registers instead, so unless you do something explicit (using volatile), compiler optimization can keep values that should be in shared memory in registers, and that breaks the implicit intra-warp synchronization a lot of code relies on for things like reductions.
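
For what it's worth, here is a minimal sketch of how the volatile fix typically looks for the final warp, following the SDK reduction pattern (the name sdata is made up, not Yves' i/m arrays). It avoids the extra __syncthreads() calls inside the last warp:

[codebox]
// Final warp of a shared-memory block reduction (block size >= 64).
// Declaring the shared pointer volatile forces every read and write to go
// through shared memory instead of staying in a register, which is what the
// implicit warp-synchronous code relies on. Called as:
//     if (tid < 32) warpReduce(sdata, tid);
__device__ void warpReduce(volatile double *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
[/codebox]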