Global Sum on Multiprocessor

Hi, All,

it seems that I still do not understand what a warp means :(

I wrote a global sum: it sums a value v (equal to threadIdx.x) over all threads and saves the result in s[bx], where bx=blockIdx.x.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cutil.h>

__global__ void gsum(int NTHREADS, int NBLOCKS, int *s)
{
  int bx = blockIdx.x;  // Block index
  int tx = threadIdx.x; // Thread index
  int i;
  float z;
  __shared__ float v[1000];

  v[tx] = tx + NTHREADS*NTHREADS*bx;
  __syncthreads();

  z = 0;
  for (i = 0; i < NTHREADS; i++)
    z += v[i];
  __syncthreads();

  s[bx] = z;
  __syncthreads();
  return;
}

int main(int nArg, char **ppArg)
{
  int NTHREADS, NBLOCKS, *S, *d_S, i;

  sscanf(ppArg[1], "%d", &NTHREADS);
  sscanf(ppArg[2], "%d", &NBLOCKS);

  dim3 threads(NTHREADS, 1);
  dim3 grid(NBLOCKS, 1);

  CUT_DEVICE_INIT();

  S = (int*)malloc(sizeof(int)*NBLOCKS);
  CUDA_SAFE_CALL(cudaMalloc((void**)&d_S, sizeof(int)*NBLOCKS));

  gsum<<<grid, threads>>>(NTHREADS, NBLOCKS, d_S);
  CUDA_SAFE_CALL(cudaMemcpy(S, d_S, sizeof(int)*NBLOCKS, cudaMemcpyDeviceToHost));

  for (i = 0; i < NBLOCKS; i++)
  {
    printf("%d ", S[i]);
    if (S[i] != (NTHREADS*(NTHREADS-1))/2 + NTHREADS*NTHREADS*NTHREADS*i)
      printf("Error (%d)", S[i] - (NTHREADS*(NTHREADS-1))/2 - NTHREADS*NTHREADS*NTHREADS*i);
  }
  printf("\n");

  return 0;
}

It works correctly for a small number of threads (<=16) and sometimes fails for bigger ones. It seems that I did not understand something. Is it right that __syncthreads() synchronizes only a subpool of the threads running in one block, and that this pool is exactly one warp?

Thank you

Sincerely

Ilghiz

__syncthreads() is applied to all the threads of a block, whether they belong to the same warp or not.

A warp is 32 threads that belong to the same threadblock. Threads 0-31 form warp 0, threads 32-63 form warp 1, etc. Check the programming guide for how thread IDs and their distribution among warps are determined when using 2D or 3D threadblocks.
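For example, inside a kernel the warp index follows directly from the linearized thread ID. A minimal sketch (assuming the warp size of 32 on current hardware; warpIndex is just an illustrative helper, not part of any API):

// Illustrative helper: compute this thread's warp index.
// For 2D/3D blocks the thread ID is linearized as
// x + y*Dx + z*Dx*Dy, as described in the programming guide.
__device__ int warpIndex()
{
  int linearId = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
  return linearId / 32; // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
}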

Paulius

Hi, Paulius,

thank you for your reply…

Yes, I considered this. However, with

  1. NTHREADS=32, NBLOCKS=640,

  2. NTHREADS=64, NBLOCKS=72,

  3. NTHREADS=128, NBLOCKS=10,

  4. NTHREADS=256, NBLOCKS=3,

  5. NTHREADS=512, NBLOCKS=2

I got incorrect computations, so it seemed that several blocks were not synchronized.

Yes, I saw it in the manual, but it does not explain my problem.

Sincerely

PS: I forgot the includes in the source above; I have added them now so that everybody can reproduce this on their own computer.

Ilghiz

Blocks are definitely not guaranteed to be synchronized, and your computations should not rely on their synchronization.
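If you need one grand total over all blocks, a safe pattern is to have each block write its partial sum (as gsum already does) and combine the partials after the kernel has finished, for example on the host. A minimal sketch, reusing S and NBLOCKS from the code above (total is just an illustrative variable):

// Host side: combine the per-block partial sums after the kernel.
// No inter-block synchronization is needed; the cudaMemcpy above
// already waits for the kernel to complete.
int total = 0;
for (i = 0; i < NBLOCKS; i++)
  total += S[i]; // one partial sum per block
printf("grand total = %d\n", total);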

Paulius

Hi,

I’ve gone through your code, and the problem is that you accumulate and store each thread’s sum in a float variable. If you change the types of v and z to int, you will get exact results.
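Concretely, only two declarations in the kernel need to change; a sketch of the fix:

int z;                  // was: float z; keeps the accumulator exact
__shared__ int v[1000]; // was: __shared__ float v[1000];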

This is an issue with the IEEE 754 standard, not with the hardware or CUDA. There are many integers that can be stored exactly in int format but not in float. An excellent reference on the intricacies of the floating-point format is David Goldberg’s paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic”; Google gives many links to PDFs.
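A quick host-side illustration of the problem (a minimal sketch; 16777217 = 2^24 + 1 is the smallest positive integer a float cannot store exactly):

#include <stdio.h>

int main(void)
{
  int   i = 16777217;   // 2^24 + 1
  float f = (float)i;   // rounds to the nearest representable float
  printf("%d -> %.1f\n", i, f); // prints: 16777217 -> 16777216.0
  return 0;
}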

Paulius

Hi, Paulius,

yes, and I do not expect any synchronization between blocks, only WITHIN one block!

Sincerely

Ilghiz

Sorry for this thread, you are right! I have been doing this type of summation for the last 20 years, but always with doubles (not floats), and I just forgot that a float can only index integers exactly up to 2^24…

Sincerely

Ilghiz