Peculiar Shared Memory Behavior - NEED HELP! A test program appears to have two arrays sharing values

Hello, I need some help understanding why the following program gives erroneous results. I was working on a lengthy program when a change I made started giving very odd results. I couldn’t find any fault in the logic so I began testing. I removed all extraneous code until it was reduced to the following very simple program. You will notice that the kernel only runs one thread so there is no chance of unexpected parallel interactions occurring. Also, notice that there is a lot of weird and unnecessary code in the kernel. Those lines were left as they are because changing them causes the code to produce the correct result (I have no idea why that is). In fact, making any ONE of the following changes will cause this program to give a correct answer:

1. Uncomment the last __syncthreads()
2. Remove the first __syncthreads()
3. Remove either part of the if statement
4. Replace any threadIdx.x with 0 (which is what it evaluates to anyway)
5. Swap the order of the two lines of code that assign values to localHasPrimes and localHasPi
6. Swap the order of the two lines of code that assign values to hasPi and hasPrimes
7. Remove the last assignment for hasPrimes
8. Assign a constant value to hasPrimes instead
9. Assign a constant value to the temp variable

Essentially any modification to the code that is already written within the kernel will cause it to behave correctly. However, the incorrect result persists when I insert new code, add variables, loop over some of the existing code, move some of the code into functions, increase the size and number of arrays, or change the number of threads and blocks.

#include <stdio.h>

#include <cuda.h>

__global__ void kernel(float *, float *);

int main(void)

{

	float *hasPrimes, *hasPi, hostHasPrimes[1], hostHasPi[1];

	hostHasPrimes[0] = 2357;

	cudaMalloc(&hasPrimes, sizeof(float));

	cudaMalloc(&hasPi, sizeof(float));

	cudaMemcpy(hasPrimes, hostHasPrimes, sizeof(float), cudaMemcpyHostToDevice);

	kernel <<<1, 1>>> (hasPrimes, hasPi);

	cudaThreadSynchronize();

	cudaMemcpy(hostHasPi, hasPi, sizeof(float), cudaMemcpyDeviceToHost);

	cudaMemcpy(hostHasPrimes, hasPrimes, sizeof(float), cudaMemcpyDeviceToHost);

	printf("Primes: %f\n", hostHasPrimes[0]);

	printf("Pi:     %f\n", hostHasPi[0]);

	cudaFree(hasPrimes);

	cudaFree(hasPi);

	printf("\nPress <ENTER> to exit.\n");

	getchar();

	return 0;

}

__global__ void kernel(float *hasPrimes, float *hasPi)

{

	__shared__ float localHasPrimes[1];

	__shared__ float localHasPi[1];

	float tempThatDoesntMatter;

	localHasPrimes[0] = hasPrimes[0];

	localHasPi[threadIdx.x] = 3.141593;

	if (threadIdx.x)

		localHasPrimes[threadIdx.x] = 111317;

	__syncthreads();

	tempThatDoesntMatter = hasPrimes[0];

	hasPi[0] = localHasPi[threadIdx.x];

	hasPrimes[0] = tempThatDoesntMatter;

	//__syncthreads();

}

The correct result should be:

Primes: 2357.000000

Pi:     3.141593

Press <ENTER> to exit.

However, this is printed instead:

Primes: 2357.000000

Pi:     2357.000000

Press <ENTER> to exit.

This code has been run and verified on two different machines. They have different graphics cards, but both tests were done on a Windows 7 x64 machine using Visual Studio 2008 and CUDA driver version 3.2. The error only occurs when the code is compiled for architecture sm_20. I believe that is all the information that I have. Any and all help will be greatly appreciated. I really want to know just what on earth is going on with this program. Thanks!

I would think it’s ptxas’s problem. Can you show us the output of cuobjdump on this kernel?

Uh… no? I am sorry but I don’t know what cuobjdump is or how to use it. You have the full source code and I have confirmed that this error happens on multiple machines so you should be able to reproduce it on your own.

There are a few things I don’t know: your device architecture and your toolkit version.


I built it for everything from sm_11 through sm_21 using the 32-bit 4.0 ptxas and only saw one part that might cause an incorrect value for hasPi[0], but I certainly did not expect the value to be 2357…

I don’t have a card to test actual execution on right now… Maybe you can tell us a bit more about your machine.

I am not exactly sure what you are asking for. I told you that this needs to be compiled for the 2.0 architecture, so you can be sure the two GPUs I tested it on are compute capability 2.0 or higher (which doesn’t leave a lot of possibilities). I ran it on a GTX 465 first, which has compute capability 2.0. I also ran it on a GTS 450, which has compute capability 2.1. Do you want the code names for the cores? They are GF100 and GF106, respectively. Please let me know if I am not providing what you need to know. As far as I can tell, I have told you everything now.

I just confirmed the same behavior on a Tesla C2050 (compute capability 2.0), once again with target architecture 2.0 and CUDA driver 3.2, on a different Windows 7 x64 machine. Thanks for taking the time to help me out!

Oh, I see, you don’t have a GPU to test on. Well, I registered as a developer with NVIDIA. Hopefully they will get back to me soon so I can run cuobjdump (once registered, how do I get cuobjdump?). I have three different machines that I have tested this on. They are all rather different, except that they are all Windows 7 64-bit machines. They have an Intel Core i7 930, an AMD Phenom II X4 965, and an AMD Phenom II X4 970; 12 GB, 4 GB, and 8 GB of RAM; and Service Pack 1, SP0, and SP0, respectively. The CUDA driver version matches the toolkit in all three cases; we are using 3.2 (because 4.0 is only an RC). Is any of this information really helpful? I am rather sure the machine doesn’t matter (beyond being 64-bit Windows with driver 3.2) because these machines are all quite different. Thanks again for the help.
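For what it’s worth, here is a small standalone snippet I can run on each machine to confirm the device and version information directly. This is just a minimal sketch using the standard runtime queries (cudaGetDeviceProperties, cudaDriverGetVersion, cudaRuntimeGetVersion); it is separate from the failing program.

#include <stdio.h>

#include <cuda_runtime.h>

int main(void)

{

	int driverVersion = 0, runtimeVersion = 0, deviceCount = 0;

	// Driver and runtime (toolkit) versions are reported as e.g. 3020 for 3.2.
	cudaDriverGetVersion(&driverVersion);
	cudaRuntimeGetVersion(&runtimeVersion);
	printf("Driver: %d, Runtime: %d\n", driverVersion, runtimeVersion);

	// Compute capability of every visible device.
	cudaGetDeviceCount(&deviceCount);
	for (int i = 0; i < deviceCount; ++i)
	{
		cudaDeviceProp prop;
		cudaGetDeviceProperties(&prop, i);
		printf("Device %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
	}

	return 0;

}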

You should really try with 4.0 RC2 quickly just to make sure it hasn’t already been fixed.

Thanks for the suggestion, but these machines aren’t mine; they are shared among several people within the department. I am not about to go messing with the drivers and potentially breaking something just to test whether this is actually the fault of the toolkit. I was hoping someone could tell me how I am making a mistake in the code. If this really is caused by a problem with the toolkit, I would think it is a very serious issue, given how trivial the code is.

Alright, so I have no idea how to use cuobjdump or if I am doing any of this correctly. However, I have the below .ptx file. If that is not what you want, please tell me how to go about getting it. Thanks!

.version 2.2

.target sm_20

// compiled with C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin/../open64/lib//be.exe

// nvopencc 3.2 built on 2010-11-04

//-----------------------------------------------------------

// Compiling XXX.cpp3.i (C:/Users/XXX/AppData/Local/Temp/ccBI#.a05840)

//-----------------------------------------------------------

//-----------------------------------------------------------

// Options:

//-----------------------------------------------------------

//  Target:ptx, ISA:sm_20, Endian:little, Pointer Size:64

//  -O3	(Optimization level)

//  -g0	(Debug level)

//  -m2	(Report advisories)

//-----------------------------------------------------------

.file	1	"XXX.cudafe2.gpu"

.file	2	"c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include\crtdefs.h"

.file	3	"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include\crt/device_runtime.h"

.file	4	"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include\host_defines.h"

.file	5	"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include\builtin_types.h"

.file	6	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\device_types.h"

.file	7	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\driver_types.h"

.file	8	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\surface_types.h"

.file	9	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\texture_types.h"

.file	10	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\vector_types.h"

.file	11	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\builtin_types.h"

.file	12	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\host_defines.h"

.file	13	"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include\device_launch_parameters.h"

.file	14	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\crt\storage_class.h"

.file	15	"c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include\time.h"

.file	16	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\texture_fetch_functions.h"

.file	17	"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include\common_functions.h"

.file	18	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\math_functions.h"

.file	19	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\math_constants.h"

.file	20	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\device_functions.h"

.file	21	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\sm_11_atomic_functions.h"

.file	22	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\sm_12_atomic_functions.h"

.file	23	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\sm_13_double_functions.h"

.file	24	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\sm_20_atomic_functions.h"

.file	25	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\sm_20_intrinsics.h"

.file	26	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\surface_functions.h"

.file	27	"c:\program files\nvidia gpu computing toolkit\cuda\v3.2\include\math_functions_dbl_ptx3.h"

.file	28	"c:/Users/XXX/Desktop/XXX/XXX.cu"

.entry _Z6kernelPfS_ (

	.param .u64 __cudaparm__Z6kernelPfS__hasPrimes,

	.param .u64 __cudaparm__Z6kernelPfS__hasPi)

{

.reg .u32 %r<4>;

.reg .u64 %rd<10>;

.reg .f32 %f<7>;

.reg .pred %p<3>;

.shared .align 4 .b8 __cuda___cuda_local_var_96302_31_non_const_localHasPrimes0[4];

.shared .align 4 .b8 __cuda___cuda_local_var_96303_31_non_const_localHasPi4[4];

.loc	28	33	0

$LDWbegin__Z6kernelPfS_:

.loc	28	39	0

ld.param.u64 	%rd1, [__cudaparm__Z6kernelPfS__hasPrimes];

ldu.global.f32 	%f1, [%rd1+0];

st.shared.f32 	[__cuda___cuda_local_var_96302_31_non_const_localHasPrimes0+0], %f1;

.loc	28	40	0

mov.u32 	%r1, %tid.x;

cvt.u64.u32 	%rd2, %r1;

mul.wide.u32 	%rd3, %r1, 4;

mov.u64 	%rd4, __cuda___cuda_local_var_96303_31_non_const_localHasPi4;

add.u64 	%rd5, %rd3, %rd4;

mov.f32 	%f2, 0f40490fdc;     	// 3.14159

st.shared.f32 	[%rd5+0], %f2;

mov.u32 	%r2, 0;

setp.eq.u32 	%p1, %r1, %r2;

@%p1 bra 	$Lt_0_1026;

.loc	28	43	0

mov.f32 	%f3, 0f47d96a80;     	// 111317

mov.u64 	%rd6, __cuda___cuda_local_var_96302_31_non_const_localHasPrimes0;

add.u64 	%rd7, %rd3, %rd6;

st.shared.f32 	[%rd7+0], %f3;

$Lt_0_1026:

.loc	28	45	0

bar.sync 	0;

.loc	28	47	0

ldu.global.f32 	%f4, [%rd1+0];

.loc	28	49	0

ld.shared.f32 	%f5, [%rd5+0];

ld.param.u64 	%rd8, [__cudaparm__Z6kernelPfS__hasPi];

st.global.f32 	[%rd8+0], %f5;

.loc	28	50	0

st.global.f32 	[%rd1+0], %f4;

.loc	28	53	0

exit;

$LDWend__Z6kernelPfS_:

} // _Z6kernelPfS_

Playing around with it, I notice that when I comment or uncomment that final __syncthreads() line, only one line of the PTX changes:

ldu.global.f32 %f4, [%rd1+0];

becomes the following when I uncomment the final __syncthreads():

ld.global.f32 %f4, [%rd1+0];

Do you have any idea what this means?

The ‘ld’ instruction means ‘load’, and ‘ldu’ means ‘load uniform’. The ‘ldu’ instruction is more efficient when a number of threads (I think it’s all threads in a warp, but don’t quote me on that) read from the same memory address, provided the data is read-only. Before your last call to __syncthreads(), nvcc is using the more efficient ‘ldu’ for the following line, since all threads are reading from the same element of the array.

tempThatDoesntMatter = hasPrimes[0];

If you notice this line near the beginning of the kernel:

ld.param.u64    %rd1, [__cudaparm__Z6kernelPfS__hasPrimes];

That’s loading the address (in global memory) of ‘hasPrimes’ into the register %rd1. So, the line:

ld.global.f32 %f4, [%rd1+0];

loads the element at address %rd1+0 (i.e., hasPrimes[0]).
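Just to illustrate the situation ‘ldu’ is designed for — read-only data loaded from an address that is the same for every thread in the warp — here is a toy kernel (only a sketch of mine; the kernel and parameter names are not from your program) where the compiler can legitimately use it:

// Every thread reads the same element of an array that the kernel never
// writes, so the compiler may turn the load of in[0] into an 'ldu'.
__global__ void scaleByFirst(const float * __restrict__ in, float *out)
{
	float scale = in[0];	// same address for all threads, declared read-only
	out[threadIdx.x] = scale * (float)threadIdx.x;
}

I would guess nvcc decided your tempThatDoesntMatter = hasPrimes[0] read fit that pattern, though whether it is really safe to treat hasPrimes as read-only there is another question, since the kernel writes it afterwards.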

Well, it seems there is absolutely nothing unusual about the assembly at all. From what I can tell when I traced through it, it should perform the correct calculation. The only thing I am still wondering about is what the addresses of those shared memory arrays are. I am able to pass in a 3rd array that grabs certain values so the CPU can print them. However, grabbing the addresses of the arrays causes the program to correct itself. So we have a refrigerator-light sort of situation: looking inside requires opening the door, which turns on the light, so it is impossible to tell whether the light is actually off while the door is shut.

Except I am able to get it to tell me that the address of localHasPrimes is 0 while it is still giving the incorrect result. Actually, that number is interesting in and of itself. When I find all of the addresses, the program tells me this (which is different):

local hasPrimes: 4
local hasPi: 0
global hasPrimes: 84934656
global hasPi: 84935168

Is it possible that the compiler is not allocating the arrays at different addresses? I am not even sure that would be sufficient to explain this behavior, though. The assembly seems to suggest that the pi value should overwrite the primes value, not the other way around.
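For completeness, this is roughly what I mean by “passing in a 3rd array”: an instrumented copy of the kernel with an extra debug buffer that the host copies back and prints. It is only a sketch (the kernelDebug name and the debugOut parameter are my additions), and, as I said above, adding this instrumentation is itself enough to make the original symptom disappear.

// Instrumented variant of the kernel: writes the observed addresses into
// debugOut so the host can cudaMemcpy them back and print them.
// Note: the values for the shared arrays may come back as raw shared-memory
// offsets or as generic addresses, depending on how the compiler converts
// the pointers.
__global__ void kernelDebug(float *hasPrimes, float *hasPi, unsigned long long *debugOut)
{
	__shared__ float localHasPrimes[1];
	__shared__ float localHasPi[1];

	debugOut[0] = (unsigned long long)(size_t)&localHasPrimes[0];
	debugOut[1] = (unsigned long long)(size_t)&localHasPi[0];
	debugOut[2] = (unsigned long long)(size_t)hasPrimes;
	debugOut[3] = (unsigned long long)(size_t)hasPi;
}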

Are you on 32-bit toolkit or 64-bit toolkit?

My machine is 32-bit and so is my toolkit. My ptxas version is 4.0, v0.2.1221… not sure if this is RC1 or RC2…

The great thing is, I’m now reproducing your result. Amazing.

It’s a bug.

Here’s the assembly code produced for sm_21 and sm_20, using the nvcc version mentioned above, for your unmodified kernel.

What a bloody mess:

/*0000*/ 	/*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

/*0008*/ 	/*0x00001de440000000*/ 	NOP CC.T;

/*0010*/ 	/*0x80001de428004000*/ 	MOV R0, c [0x0] [0x20];

/*0018*/ 	/*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;

/*0020*/ 	/*0xfc21dc03190e0000*/ 	ISETP.EQ.U32.AND P0, pt, R2, RZ, pt;

/*0028*/ 	/*0x00001c8588000000*/ 	LDU R0, [R0];

/*0030*/ 	/*0x08209e036000c000*/ 	SHL R2, R2, 0x2;

/*0038*/ 	/*0x70011de21901243f*/ 	MOV32I R4, 0x40490fdc;

/*0040*/ 	/*0x000161e2191f65aa*/ 	@!P0 MOV32I R5, 0x47d96a80;

/*0048*/ 	/*0x1020dc034800c000*/ 	IADD R3, R2, 0x4;

/*0050*/ 	/*0x03f01c85c9000000*/ 	STS [0x0], R0;

/*0058*/ 	/*0x00216085c9000000*/ 	@!P0 STS [R2], R5;

/*0060*/ 	/*0x10211c85c9000000*/ 	STS [R2+0x4], R4;

/*0068*/ 	/*0xffffdc0450ee0000*/ 	BAR.RED.POPC RZ, RZ;

/*0070*/ 	/*0x80011de428004000*/ 	MOV R4, c [0x0] [0x20];

/*0078*/ 	/*0x90015de428004000*/ 	MOV R5, c [0x0] [0x24];

/*0080*/ 	/*0x13f09c05ab000000*/ 	LDS_LDU.32.32 R0, R2, [0x0], [R4+0x0]; // this has to be wrong no matter what the semantics are

/*0088*/ 	/*0x00501c8590000000*/ 	ST [R5], R0;

/*0090*/ 	/*0x00409c8590000000*/ 	ST [R4], R2;

/*0098*/ 	/*0x00001de780000000*/ 	EXIT;

I think it’s probably nvcc’s problem (or whatever it is that produces the PTX code for ptxas), because when I previously built your kernel using nvcc and ptxas separately, the code was correct.

nvcc std.cu -ptx ... // ... is some other command which certainly doesn't matter here

ptxas -arch sm_20 std.ptx -o std20.cubin

Problem code produced when I used this line instead:

nvcc std.cu ...

It turns out the separate calls to nvcc and ptxas also generate problematic code:

/*0000*/ 	/*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

	/*0008*/ 	/*0x80001de428004000*/ 	MOV R0, c [0x0] [0x20];

	/*0010*/ 	/*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;

	/*0018*/ 	/*0x70011de21901243f*/ 	MOV32I R4, 0x40490fdc;

	/*0020*/ 	/*0xfc1fdc03207e0000*/ 	IMAD.U32.U32 RZ, R1, RZ, RZ;

	/*0028*/ 	/*0x00001c8580000000*/ 	LD R0, [R0];

	/*0030*/ 	/*0x0020dc037000c060*/ 	BFE.U32 R3, R2, 0x1800;

	/*0038*/ 	/*0xfc21dc03190e0000*/ 	ISETP.EQ.U32.AND P0, pt, R2, RZ, pt;

	/*0040*/ 	/*0x08309e036000c000*/ 	SHL R2, R3, 0x2;

	/*0048*/ 	/*0x000161e2191f65aa*/ 	@!P0 MOV32I R5, 0x47d96a80;

	/*0050*/ 	/*0x1020dc034800c000*/ 	IADD R3, R2, 0x4;

	/*0058*/ 	/*0x03f01c85c9000000*/ 	STS [0x0], R0;

	/*0060*/ 	/*0x00216085c9000000*/ 	@!P0 STS [R2], R5;

	/*0068*/ 	/*0x10211c85c9000000*/ 	STS [R2+0x4], R4;

	/*0070*/ 	/*0xffffdc0450ee0000*/ 	BAR.RED.POPC RZ, RZ;

	/*0078*/ 	/*0x80011de428004000*/ 	MOV R4, c [0x0] [0x20];

	/*0080*/ 	/*0x00301c85c1000000*/ 	LDS R0, [R3]; // warp scheduler doesn't check shared mem dependency, does it? Then with a single warp this is doomed to be wrong

	/*0088*/ 	/*0x90015de428004000*/ 	MOV R5, c [0x0] [0x24];

	/*0090*/ 	/*0x00409c8580000000*/ 	LD R2, [R4];

	/*0098*/ 	/*0x00501c8590000000*/ 	ST [R5], R0;

	/*00a0*/ 	/*0x00409c8590000000*/ 	ST [R4], R2;

	/*00a8*/ 	/*0x00001de780000000*/ 	EXIT;

Is my toolkit RC2? I think I downloaded RC2, but I don’t remember if I installed it… If mine is RC2, then maybe someone should file a bug report.

… I think mine is RC2, because my nvcc.exe is digitally signed on 25 Mar.

I am on the 3.2 64-bit toolkit. Are you saying this happens with 4.0 as well?

Yes, it happens with 4.0 RC2 as well.
Well, looks like I’ll just file the bug report myself.

:thumbsdown: I can’t submit the bug report. After clicking submit, the site says something cannot be displayed.

I can submit the bug report instead, if you cannot. Where do I go and what do I say and do? Thanks again for all of your help!

I just did it again, with success.
You can report bugs on this site: https://nvdeveloper.nvidia.com/

So you submitted the bug report? Thanks so much! Let’s hope they can fix this. For now, I have rewritten my code to use less shared memory so it can run on architecture 1.0. It is a workaround that will let me get by, at least.
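For anyone else who hits this before the fix: in the reduced test program, any single change from my original list avoids the bad code, and the least invasive one is simply enabling the final __syncthreads() (item 1 in that list). Here is the test kernel with only that change; I can’t promise this is a reliable workaround in a larger program, but it does make the reduced case print the expected values on my machines.

__global__ void kernel(float *hasPrimes, float *hasPi)
{
	__shared__ float localHasPrimes[1];
	__shared__ float localHasPi[1];
	float tempThatDoesntMatter;

	localHasPrimes[0] = hasPrimes[0];
	localHasPi[threadIdx.x] = 3.141593;

	if (threadIdx.x)
		localHasPrimes[threadIdx.x] = 111317;

	__syncthreads();

	tempThatDoesntMatter = hasPrimes[0];
	hasPi[0] = localHasPi[threadIdx.x];
	hasPrimes[0] = tempThatDoesntMatter;

	__syncthreads();	// previously commented out; enabling it changes the generated load
}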

Hi, we have just published the CUDA 4.0 production release. I have verified that this bug is fixed in the new version. Could you please download it and check again on your side? Please let me know whether it is fixed for you.

You can download it from: http://developer.nvidia.com/cuda-toolkit-40

Thank you!