Different results using different memory types

So guys, I have two code snippets that give different results even though they perform the same computation.

P.S. I am using device emulation mode just to compare the results with the CPU version (I have different code that runs on the actual device, and it gives me the exact same values / errors).

Snippet 1 (Result Incorrect) (Result Achieved 0.0026914661)

float total = 0;

// Multiply column vector with our input vector
for (int i = 0; i < columnSize_; i++)
{
	total += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];
}

device_outputVector[0] += total;

OR

shared_buff[threadIdx.x] = 0;

__syncthreads();

// Multiply column vector with our input vector
for (int i = 0; i < columnSize_; i++)
{
	shared_buff[threadIdx.x] += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];
}

device_outputVector[0] += shared_buff[threadIdx.x];

Snippet 2 (Result Correct) (Result Achieved 0.0026918666)

for (int i = 0; i < columnSize_; i++)
{
	device_outputVector[0] += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];
}

Although the results are pretty close to each other, this minor discrepancy between them is causing havoc in my algorithm, and I can't seem to pinpoint what's wrong.

device_outputVector is accessed by all threads without any atomics in place…

As I said, I am running that sample code in device emulation mode, where each thread runs sequentially. I have different code for the actual CUDA architecture, which makes use of warp-level reductions and atomic float addition, yet I still get the same results I observe with the code posted above.

What I didn’t mention is that there are multiple threads per block (256 threads) and multiple blocks in a single dimension (300+).

So I tried a new piece of code and it gives the correct results, but I can’t understand what is going wrong. Why am I getting different results?

shared_buff[threadIdx.x] = 0;

// Block until all threads in the block have written their data to shared mem
__syncthreads();

// Multiply column vector with our input vector
for (int i = 0; i < columnSize_; i++)
{
	shared_buff[threadIdx.x] += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)] * shared_values[threadIdx.x*columnSize_ + i];

	device_outputVector[2] += shared_buff[threadIdx.x];

	shared_buff[threadIdx.x] = 0;
}

Disassemble the code produced by the compiler and see what the differences are.

At the risk of looking like a complete idiot: how do you disassemble the code? I’m using Visual Studio. (I’m pretty desperate right now :P)

Nvm, I think I found how, but I’m not much of an assembly guy, so it will pretty much look like Chinese to me.

Sorry, no idea about Microsoft’s toolchain. You could use nvcc to compile to ptx and look at that instead (given the device is doing the same thing).
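For example, something along these lines should get you the PTX to look at (the file name kernel.cu is just a placeholder for wherever your kernel lives):

nvcc -ptx kernel.cu -o kernel.ptx    # emit PTX only
nvcc -keep kernel.cu                 # keep all intermediate files (.ptx, .cubin, ...) next to the build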

float total = 0;

000000013F85B7B3  xorps	   xmm5,xmm5 

000000013F85B7B6  movss	   dword ptr [total],xmm5 

000000013F85B7BC  mov		 dword ptr [i],0 

000000013F85B7C4  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+221h (13F85B7D1h) 

000000013F85B7C6  mov		 eax,dword ptr [i] 

000000013F85B7CA  add		 eax,1 

000000013F85B7CD  mov		 dword ptr [i],eax 

000000013F85B7D1  mov		 eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B7D7  cmp		 dword ptr [i],eax 

000000013F85B7DB  jge		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+28Eh (13F85B83Eh) 

		//Multiply column vector with our input vector

		#pragma unroll 

		for (int i = 0; i < columnSize_; i++)

		{

			total  += (float)(device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i]  - 1 )]*shared_values[threadIdx.x*columnSize_ + i]);

000000013F85B7DD  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B7E3  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B7EA  add		 eax,dword ptr [i] 

000000013F85B7EE  mov		 ecx,eax 

000000013F85B7F0  mov		 rax,qword ptr [shared_rowIDs] 

000000013F85B7F5  mov		 eax,dword ptr [rax+rcx*4] 

000000013F85B7F8  sub		 eax,1 

000000013F85B7FB  movsxd	  rcx,eax 

000000013F85B7FE  mov		 rax,qword ptr [device_inputVector] 

000000013F85B806  movss	   xmm1,dword ptr [rax+rcx*4] 

000000013F85B80B  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B811  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B818  add		 eax,dword ptr [i] 

000000013F85B81C  mov		 ecx,eax 

000000013F85B81E  mov		 rax,qword ptr [shared_values] 

000000013F85B823  movss	   xmm0,dword ptr [rax+rcx*4] 

000000013F85B828  mulss	   xmm1,xmm0 

000000013F85B82C  movss	   xmm0,dword ptr [total] 

000000013F85B832  addss	   xmm0,xmm1 

000000013F85B836  movss	   dword ptr [total],xmm0 

		}

000000013F85B83C  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+216h (13F85B7C6h) 

				

		device_outputVector[0]  +=  (float)total;

000000013F85B83E  mov		 rax,qword ptr [device_outputVector] 

000000013F85B846  movss	   xmm0,dword ptr [rax] 

000000013F85B84A  addss	   xmm0,dword ptr [total] 

000000013F85B850  mov		 rax,qword ptr [device_outputVector] 

000000013F85B858  movss	   dword ptr [rax],xmm0 

000000013F85B85C  mov		 dword ptr [i],0 

000000013F85B864  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+2C1h (13F85B871h) 

000000013F85B866  mov		 eax,dword ptr [i] 

000000013F85B86A  add		 eax,1 

000000013F85B86D  mov		 dword ptr [i],eax 

000000013F85B871  mov		 eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B877  cmp		 dword ptr [i],eax 

000000013F85B87B  jge		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+33Fh (13F85B8EFh) 

		//atomicAdd(&device_outputVector[0], (total) );

				

		

		for (int i = 0; i < columnSize_; i++)

		{

			//device_outputVector[(shared_resultOrder[threadIdx.x*columnSize_ + i])]  += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i]  - 1 )]*shared_values[threadIdx.x*columnSize_ + i];

			device_outputVector[1]  += device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i]  - 1 )]*shared_values[threadIdx.x*columnSize_ + i];

000000013F85B87D  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B883  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B88A  add		 eax,dword ptr [i] 

000000013F85B88E  mov		 ecx,eax 

000000013F85B890  mov		 rax,qword ptr [shared_rowIDs] 

000000013F85B895  mov		 eax,dword ptr [rax+rcx*4] 

000000013F85B898  sub		 eax,1 

000000013F85B89B  movsxd	  rcx,eax 

000000013F85B89E  mov		 rax,qword ptr [device_inputVector] 

000000013F85B8A6  movss	   xmm1,dword ptr [rax+rcx*4] 

000000013F85B8AB  mov		 eax,dword ptr [threadIdx (13F8D8EB8h)] 

000000013F85B8B1  imul		eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B8B8  add		 eax,dword ptr [i] 

000000013F85B8BC  mov		 ecx,eax 

000000013F85B8BE  mov		 rax,qword ptr [shared_values] 

000000013F85B8C3  movss	   xmm0,dword ptr [rax+rcx*4] 

000000013F85B8C8  mulss	   xmm1,xmm0 

000000013F85B8CC  mov		 rax,qword ptr [device_outputVector] 

000000013F85B8D4  movss	   xmm0,dword ptr [rax+4] 

000000013F85B8D9  addss	   xmm0,xmm1 

000000013F85B8DD  mov		 rax,qword ptr [device_outputVector] 

000000013F85B8E5  movss	   dword ptr [rax+4],xmm0 

		}

000000013F85B8EA  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+2B6h (13F85B866h) 

000000013F85B8EF  mov		 dword ptr [i],0 

000000013F85B8F7  jmp		 _Z26blockMultiplyColumnPerGridPfS_S_PiS0_+354h (13F85B904h) 

000000013F85B8F9  mov		 eax,dword ptr [i] 

000000013F85B8FD  add		 eax,1 

000000013F85B900  mov		 dword ptr [i],eax 

000000013F85B904  mov		 eax,dword ptr [columnSize_ (13F8D8EF8h)] 

000000013F85B90A  cmp		 dword ptr [i],eax 

000000013F85B90E  jge		 000000013F85B9DB

That’s the assembly, but I’m pretty much lost in it.

Are you sure you are not running into problems with the non-associativity of floating point additions?
How large is columnSize_?

Edit: Just realized that this was your original question. So I think yes, that’s the problem. What results do you get with double precision?

The column is 400,000 elements long, which is then split across 256 threads per block with around 300 blocks total,

where each thread calculates a sub-part of the actual column and then adds its result to the total.

But even if it is the float associativity problem, I am running the sample code in device emulation mode, which would run on the CPU in a sequential manner, wouldn’t it?

Yes, but you still have different orders in the different samples. In the cases that you have labeled as incorrect, you have done blockwise sums and an extra reduction of the block sums, while in the cases labeled as correct, you’ve directly summed up all values.

And with 400,000 values, the discrepancy can easily be explained by rounding errors violating float associativity. You’d either need to make sure you sum these in order of (roughly) increasing absolute value, or use double precision.
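A tiny host-side illustration of the effect (made-up numbers, nothing to do with your kernel): in single precision, the grouping of the additions changes the result.

#include <cstdio>

int main()
{
    // A large value followed by many small ones: the small terms are lost
    // when added to the running total one by one, but survive when summed
    // among themselves first.
    float big = 1.0e8f;
    float small = 1.0f;

    float left_to_right = big;
    for (int i = 0; i < 1000; ++i)
        left_to_right += small;          // each 1.0f is below the rounding granularity of 1e8f

    float small_first = 0.0f;
    for (int i = 0; i < 1000; ++i)
        small_first += small;            // sums to exactly 1000
    float grouped = big + small_first;   // 100001000

    printf("%f vs %f\n", left_to_right, grouped);
    return 0;
}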

Thanks mate, I really needed that explanation since it was driving me crazy. I’ll try using double precision and see how it goes.

Kahan summation might be another useful option to explore too.
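A minimal sketch of the idea, applied to your inner loop (reusing the variable names from your snippets; this is a sketch, not tested code):

float sum = 0.0f;
float c = 0.0f;                          // running compensation for lost low-order bits

for (int i = 0; i < columnSize_; i++)
{
	float term = device_inputVector[(shared_rowIDs[threadIdx.x*columnSize_ + i] - 1)]
	           * shared_values[threadIdx.x*columnSize_ + i];
	float y = term - c;                  // apply the compensation to the new term
	float t = sum + y;                   // big + small: low-order bits of y may be lost here
	c = (t - sum) - y;                   // recover what was lost
	sum = t;
}

Note that aggressive floating-point optimizations (e.g. fast-math style reassociation) can optimize the compensation away.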

Any reading material you would suggest?

Actually, yes. This article appeared in Dr. Dobb’s 15 years ago, and I still have a very old, coffee-stained copy of it in an easy-to-reach place. It draws heavily on an older ACM paper which I think should be mandatory reading:

@article{103163,

 author = {Goldberg, David},

 title = {What every computer scientist should know about floating-point arithmetic},

 journal = {ACM Comput. Surv.},

 volume = {23},

 number = {1},

 year = {1991},

 issn = {0360-0300},

 pages = {5--48},

 doi = {http://doi.acm.org/10.1145/103162.103163},

 publisher = {ACM},

 address = {New York, NY, USA},

 }

So, one question guys: for an atomic double addition function, can I recycle the float one, or is there a built-in function in CUDA?

Atomic double operations are difficult as there are no atomic operations on 8-byte words in CUDA. You’d need a semaphore to provide exclusive access to the variable. I’d strongly suggest not to take that direction.

Rather, just save the individual block sums and start an extra kernel for the final addition.
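A minimal sketch of that two-pass approach (names like blockSums and the launch configuration are placeholders, not your code):

// Pass 1: each block reduces its partial results and writes the block sum
// to blockSums[blockIdx.x] (e.g. via the shared-memory reduction you already have).
// Pass 2: a small follow-up kernel adds the per-block sums together.
__global__ void reduceBlockSums(const float *blockSums, float *result, int numBlocks)
{
	if (blockIdx.x == 0 && threadIdx.x == 0)
	{
		float total = 0.0f;
		for (int b = 0; b < numBlocks; ++b)   // a serial loop is fine for a few hundred blocks
			total += blockSums[b];
		result[0] = total;
	}
}

// launched as: reduceBlockSums<<<1, 1>>>(d_blockSums, d_result, numBlocks);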

Before I implement the kernel to reduce the threads, do you think this code I found from Atomic Double Add will work?

static __inline__ __device__ double atomicAdd(double *addr, double val)
{
	double old = *addr, assumed;

	do {
		assumed = old;
		// Reinterpret the doubles as 64-bit integers so atomicCAS can be used;
		// the swap only succeeds if *addr still holds the value we assumed.
		old = __longlong_as_double(atomicCAS((unsigned long long int *)addr,
		                                     __double_as_longlong(assumed),
		                                     __double_as_longlong(val + assumed)));
	} while (assumed != old);

	return old;
}

I stand corrected.
I seemed to remember that there are no 64-bit atomic operations in CUDA. Checking the programming guide, I see they are there for compute capability 1.2 and above; they just need to operate on global memory. So yes, you can use any of those atomic adds.