Results differ when compiled with sm_10 and sm_20

Hi,

I have two versions of the same code: one uses global memory, and the other uses shared memory as an optimization.

The shared-memory version produces wrong results when compiled with sm_20; compiled with sm_10, its output is correct.

It seems like there is a problem with the thread scheduler, but I don’t see any race condition in my code.

I have reduced the code to a minimal version and attached it together with the necessary input file.

Thanks in advance.

Didi

Some more explanations:

The main computation is the following; it is very simple:

_sh_block_data[_idy][_idx] = data[indexdata];    // each thread loads its own element
if (_idy == 3) {                                 // threads in row 3 also load the three rows above
    _sh_block_data[_idy - 2][_idx] = data[indexdata - widthdata * 2];
    _sh_block_data[_idy - 1][_idx] = data[indexdata - widthdata];
    _sh_block_data[_idy - 3][_idx] = data[indexdata - widthdata * 3];
}
__syncthreads();

float dz = _sh_block_data[_idy - 2][_idx];
cornerness[indexcornerness] = dz * w[0];

In both versions of the code, if I replace the last statement with

    cornerness[indexcornerness] = w[0];

or

    cornerness[indexcornerness] = dz;

then the results match, but with the product dz * w[0] they do not.

Specs for the device:

Device 0: “Tesla C2050”

Major revision number: 2

Minor revision number: 0

Total amount of global memory: 2817720320 bytes

Number of multiprocessors: 14

Number of cores: 112

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes

Clock rate: 1.15 GHz

Concurrent copy and execution: Yes
sharedmem-fermi-bug.tar.gz (4.28 MB)

Strangely, if I put a synchronization point between these two statements, I get correct results, even though there should be no need for a sync point there:

float dz = _sh_block_data[_idy - 2][_idx];
cornerness[indexcornerness] = dz * w[0];
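
That is, the following version, with a __syncthreads() that should be redundant, produces correct results:

float dz = _sh_block_data[_idy - 2][_idx];
__syncthreads();   // should not be needed, but with it the results are correct
cornerness[indexcornerness] = dz * w[0];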

Best I can tell from looking at the code in the attached archive, the code shown above is located inside a divergent region, so not all threads can reach the __syncthreads(). __syncthreads() must be invoked from within non-divergent code; otherwise, the behavior is undefined. The typical way to achieve that in the presence of divergent if-statements is to precompute the condition, then test for it repeatedly. So instead of

cond = ...;
if (cond) {
    work1;
    __syncthreads();   // undefined: not every thread of the block reaches this barrier
    work2;
}

one uses

cond = ...;
if (cond) {
    work1;
}
__syncthreads();       // reached unconditionally by every thread of the block
if (cond) {
    work2;
}
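
Applied to a tile-loading kernel like the one above, the restructuring might look like the sketch below. This is only a sketch: the condition inside and the variables ix, iy, width, and height are hypothetical stand-ins for whatever outer condition makes the region divergent in the actual code.

// Sketch only: 'inside', ix, iy, width, and height are hypothetical
// stand-ins for the outer divergent condition of the real kernel.
bool inside = (ix < width) && (iy < height);
if (inside) {                                      // work1: populate the shared tile
    _sh_block_data[_idy][_idx] = data[indexdata];
    if (_idy == 3) {
        _sh_block_data[_idy - 3][_idx] = data[indexdata - widthdata * 3];
        _sh_block_data[_idy - 2][_idx] = data[indexdata - widthdata * 2];
        _sh_block_data[_idy - 1][_idx] = data[indexdata - widthdata];
    }
}
__syncthreads();                                   // now reached by all threads of the block
if (inside) {                                      // work2: consume the shared tile
    float dz = _sh_block_data[_idy - 2][_idx];
    cornerness[indexcornerness] = dz * w[0];
}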

[later:]

The relevant section of the CUDA C Programming Guide is B.6, which states:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

Thanks for the comment.

I have tried what you suggested and moved the synchronization point outside of the if-statement, but the result is still not correct. In fact, the default input size in the code is divisible by the number of thread blocks, so all the threads should execute the if-statement.

I have attached the code that synchronizes threads outside of the if-statement as suggested.

Any ideas?
sharedmem-fermi-bug.tar.gz (4.28 MB)

You can try lowering the backend optimizations via -Xptxas -O{2|1|0}; the default is -O3. If the problem goes away as you lower the optimization level one step at a time, this could be indicative of a compiler issue (though it doesn’t have to be).
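
For example (hypothetical command lines; adjust to match the flags the makefile actually passes):

nvcc -arch=sm_20 -Xptxas -O2 -o test test.cu
nvcc -arch=sm_20 -Xptxas -O1 -o test test.cu
nvcc -arch=sm_20 -Xptxas -O0 -o test test.cu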

Is the attached code self-contained with respect to building and running (it seems to require a data file)? Is there an easy way to tell whether the output is correct or not? Based on the header files included, it looks like you are developing on Linux? Do you observe the unexpected behavior with CUDA 4.0, or with an older toolchain?

Hi,

The attached code is self-contained and should compile.

I have included the input data file called Engine.raw as well.

I think the makefile uses test.cu as the input program. Sorry about that.

So, before you type make,

cp test_shared.cu test.cu

The output prints an l2norm value for checking correctness.

I am using CUDA 3.2; I have not yet upgraded to 4.0.

I tried changing the backend optimization flag but the problem didn’t go away.

I finally got around to building this on a 64-bit Linux system, with CUDA 4.0. The app builds fine, and even with compiler warnings cranked to the max I don’t see any issues reported beyond a few “unused variable” warnings. When I run, I do not see any differences between fully optimized and totally unoptimized builds (other than the reported performance). I don’t have a pre-sm_20 device readily available for use with this machine, but I tried building for sm_10 and sm_13 with JITting to sm_20, which again did not change the results.

I noticed that the GFLOPS number reported (here: C2050) seems impossibly high and suspected that the app may be silently failing somewhere, but best I can tell from a quick inspection of the source code, all CUDA API calls are checked, as is the kernel invocation. The bogus GFLOPS number may be a red herring for all I know; maybe you simply trimmed down the app for use as a repro case without adjusting the GFLOPS computation?

In summary, I see no obvious issues with either the CUDA toolchain or the driver. You stated that the app runs fine on an sm_10 platform. Have you tried building for sm_10, then moving the working app to the sm_20 platform? (With JITting, this should work fine.) Have you tried running this with cuda-memcheck to see whether there are any issues with wayward pointers or out-of-bounds accesses?
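
For the JIT route, a build along the following lines (the file names are placeholders) embeds PTX for compute_10, which the driver then JIT-compiles for the sm_20 device at load time:

nvcc -gencode arch=compute_10,code=compute_10 -o test test.cu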

Here is the output from one of my runs; the other runs looked essentially identical as far as max and l2norm were concerned.

~/[...]/r4.0/sharedmem-fermi-bug $ test_shared

Memory is successfully allocated.

Data is successfully loaded.

Computation done

:N 262 M 262 K 262, iteration 1

:max:   4.334999918938e-01, l2norm:   6.258840858936e-02

Program         Time(sec)       Gflops

==========================================

Point3D   0.003         14752.404

Done

Yes, I didn’t adjust the reported GFLOPS number after I modified the original code for use as a repro case. Sorry if that led to some confusion.

The output you got is correct. On the Fermi C2050, compiled with sm_20, I get the following result, where the l2norm is different from what you get.

Maybe if I upgrade from CUDA 3.2 to 4.0, the problem will disappear. I just wanted to make sure that there is no race condition in my code.

Memory is successfully allocated.

Data is successfully loaded.

Computation done

:N 262 M 262 K 262, iteration 1

:max:   4.334999918938e-01, l2norm:   6.075446680188e-02

Program  	Time(sec)	Gflops

==========================================

Point3D   0.004 	11913.049

By the way, how do I run cuda-memcheck?

Thanks.

Didem

Basically, you pass the name of your app and its command-line arguments to the tool:

cuda-memcheck [your-program's-name] [your-program's-args]

There are also a few options, as I recall. For details, please see cuda-memcheck.pdf, which is part of the distribution packages. You should be able to find it in the same directory as the other CUDA documents (.pdf files).
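
For this app, that would presumably be:

cuda-memcheck ./test_shared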

No errors with cuda-memcheck. I guess that’s a good sign :)

========= CUDA-MEMCHECK

Memory is successfully allocated.

Data is successfully loaded.

Computation done

:N 262 M 262 K 262, iteration 1

:max:   4.334999918938e-01, l2norm:   6.075446680188e-02

Program  	Time(sec)	Gflops

==========================================

Point3D   0.110 	404.915

Done

========= ERROR SUMMARY: 0 errors
