Results differ when compiled with sm_10 and sm_20

Hi,

I have two versions of the same code: one uses global memory, and the other uses shared memory as an optimization.

The shared-memory version produces wrong results when compiled with sm_20; compiled with sm_10, its output is correct.

It seems like there is a problem with the thread scheduler, but I don’t see any race condition in my code.

I have reduced the code to a minimal version and attached it together with the necessary input file.

Thanks in advance.

Didi

Some more explanations:

The main computation is the following; it is very simple:

_sh_block_data[_idy][_idx] = data[indexdata];    // each thread loads its own element
if (_idy == 3) {                                 // threads in row 3 also load the three rows above
    _sh_block_data[_idy - 2][_idx] = data[indexdata - widthdata * 2];
    _sh_block_data[_idy - 1][_idx] = data[indexdata - widthdata];
    _sh_block_data[_idy - 3][_idx] = data[indexdata - widthdata * 3];
}
__syncthreads();

float dz = _sh_block_data[_idy - 2][_idx];
cornerness[indexcornerness] = dz * w[0];

In both versions of the code, if I replace the last statement with

    cornerness[indexcornerness] = w[0];

or

    cornerness[indexcornerness] = dz;

then the results match, but with the product dz * w[0] they do not.

Specs for the device:

Device 0: “Tesla C2050”

Major revision number: 2

Minor revision number: 0

Total amount of global memory: 2817720320 bytes

Number of multiprocessors: 14

Number of cores: 112

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 32768

Warp size: 32

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes

Clock rate: 1.15 GHz

Concurrent copy and execution: Yes
sharedmem-fermi-bug.tar.gz (4.28 MB)

Strangely, if I put a synchronization point between these two statements, I get correct results, even though there should be no need for a sync point there:

float dz = _sh_block_data[_idy - 2][_idx];
cornerness[indexcornerness] = dz * w[0];
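
That is, the following version, with a __syncthreads() that should be redundant, produces correct results:

float dz = _sh_block_data[_idy - 2][_idx];
__syncthreads();   // should not be needed, but with it the results are correct
cornerness[indexcornerness] = dz * w[0];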

Best I can tell from looking at the code in the attached archive, the code shown above is located inside a divergent region, so not all threads can reach the __syncthreads(). __syncthreads() must be invoked from within non-divergent code; otherwise, the behavior is undefined. The typical way to achieve that in the presence of divergent if-statements is to precompute the condition, then test for it repeatedly. So instead of

cond = ...;
if (cond) {
    work1;
    __syncthreads();   // undefined: not every thread of the block reaches this barrier
    work2;
}

one uses

cond = ...;
if (cond) {
    work1;
}
__syncthreads();       // reached unconditionally by every thread of the block
if (cond) {
    work2;
}
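
Applied to a tile-loading kernel like the one above, the restructuring might look like the sketch below. This is only a sketch: the condition inside and the variables ix, iy, width, and height are hypothetical stand-ins for whatever outer condition makes the region divergent in the actual code.

// Sketch only: 'inside', ix, iy, width, and height are hypothetical
// stand-ins for the outer divergent condition of the real kernel.
bool inside = (ix < width) && (iy < height);
if (inside) {                                      // work1: populate the shared tile
    _sh_block_data[_idy][_idx] = data[indexdata];
    if (_idy == 3) {
        _sh_block_data[_idy - 3][_idx] = data[indexdata - widthdata * 3];
        _sh_block_data[_idy - 2][_idx] = data[indexdata - widthdata * 2];
        _sh_block_data[_idy - 1][_idx] = data[indexdata - widthdata];
    }
}
__syncthreads();                                   // now reached by all threads of the block
if (inside) {                                      // work2: consume the shared tile
    float dz = _sh_block_data[_idy - 2][_idx];
    cornerness[indexcornerness] = dz * w[0];
}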

[later:]

The relevant section of the CUDA C Programming Guide is B.6, which states:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

Thanks for the comment.

I have tried what you suggested and moved the synchronization point outside of the if-statement, but the result is still not correct. In fact, the default input size in the code is divisible by the number of thread blocks, so all the threads should execute the if-statement.

I have attached the code that synchronizes threads outside of the if-statement as suggested.

Any ideas?
sharedmem-fermi-bug.tar.gz (4.28 MB)

You can try lowering the backend optimizations via -Xptxas -O{2|1|0}; the default is -O3. If the problem goes away as you lower the optimization level one step at a time, this could be indicative of a compiler issue (though it doesn’t have to be).
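
For example (hypothetical command lines; adjust to match the flags the makefile actually passes):

nvcc -arch=sm_20 -Xptxas -O2 -o test test.cu
nvcc -arch=sm_20 -Xptxas -O1 -o test test.cu
nvcc -arch=sm_20 -Xptxas -O0 -o test test.cu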

Is the attached code self-contained with respect to building and running (it seems to require a data file)? Is there an easy way to tell whether the output is correct or not? Based on the header files included, it looks like you are developing on Linux? Do you observe the unexpected behavior with CUDA 4.0, or with an older toolchain?

Hi,

The attached code is self-contained and should compile.

I have included the input data file called Engine.raw as well.

I think the makefile uses test.cu as the input program. Sorry about that.

So, before you type make,

cp test_shared.cu test.cu

The output prints an l2norm value for checking correctness.

I am using CUDA 3.2; I have not yet upgraded to 4.0.

I tried changing the backend optimization flag but the problem didn’t go away.

I finally got around to building this on a 64-bit Linux system, with CUDA 4.0. The app builds fine, and even with compiler warnings cranked to the max I don’t see any issues reported beyond a few “unused variable” warnings. When I run, I do not see any differences between fully optimized and totally unoptimized builds (other than the reported performance). I don’t have a pre-sm_20 device readily available for use with this machine, but I tried building for sm_10 and sm_13 with JITting to sm_20, which again did not change the results.

I noticed that the GFLOPS number reported (here: C2050) seems impossibly high and suspected that the app may be silently failing somewhere, but best I can tell from a quick inspection of the source code, all CUDA API calls are checked, as is the kernel invocation. The bogus GFLOPS number may be a red herring for all I know; maybe you simply trimmed down the app for use as a repro case without adjusting the GFLOPS computation?

In summary, I see no obvious issues with either the CUDA toolchain or the driver. You stated that the app runs fine on an sm_10 platform. Have you tried building for sm_10, then moving the working app to the sm_20 platform? (With JITting, this should work fine.) Have you tried running this with cuda-memcheck to see whether there are any issues with wayward pointers or out-of-bounds accesses?
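
For the JIT route, a build along the following lines (the file names are placeholders) embeds PTX for compute_10, which the driver then JIT-compiles for the sm_20 device at load time:

nvcc -gencode arch=compute_10,code=compute_10 -o test test.cu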

Here is the output from one of my runs; the other runs looked essentially identical as far as max and l2norm were concerned.

~/[...]/r4.0/sharedmem-fermi-bug $ test_shared

Memory is successfully allocated.

Data is successfully loaded.

Computation done

:N 262 M 262 K 262, iteration 1

:max:   4.334999918938e-01, l2norm:   6.258840858936e-02

Program         Time(sec)       Gflops

==========================================

Point3D   0.003         14752.404

Done

Yes, I didn’t adjust the reported GFLOPS number after I modified the original code for use as a repro case. Sorry if that led to some confusion.

The output you got is correct. On the Fermi C2050, compiled with sm_20, I get the following result, where the l2norm is different from what you get.

Maybe if I upgrade from CUDA 3.2 to 4.0, the problem will disappear. I just wanted to make sure that there is no race condition in my code.

Memory is successfully allocated.

Data is successfully loaded.

Computation done

:N 262 M 262 K 262, iteration 1

:max:   4.334999918938e-01, l2norm:   6.075446680188e-02

Program  	Time(sec)	Gflops

==========================================

Point3D   0.004 	11913.049

By the way, how do I run cuda-memcheck?

Thanks.

Didem

Basically, you pass the name of your app and its command-line arguments to the tool:

cuda-memcheck [your-program's-name] [your-program's-args]

There are also a few options, as I recall. For details, please see cuda-memcheck.pdf, which is part of the distribution packages. You should be able to find it in the same directory as the other CUDA documents (.pdf files).
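
For this app, that would presumably be:

cuda-memcheck ./test_shared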

No errors with cuda-memcheck. I guess that’s a good sign :)

========= CUDA-MEMCHECK

Memory is successfully allocated.

Data is successfully loaded.

Computation done

:N 262 M 262 K 262, iteration 1

:max:   4.334999918938e-01, l2norm:   6.075446680188e-02

Program  	Time(sec)	Gflops

==========================================

Point3D   0.110 	404.915

Done

========= ERROR SUMMARY: 0 errors
