2D reduction using CUDA The use a cuda and cublas library for a 2D simple reduction

Hi,

i’m actually working on parallelization of a small CPU+host code which implies 2 big arrays:

#define NPTS1 10000

#define NPTS2 10000

typedef struct pt3D {

    float x;

    float y;

    float z;

} pt3D;

pt3D cloud1[NPTS1];

pt3D cloud2[NPTS2];

For the moment these arrays are initialized with random values

and we want to get the sum of the euclidean distances (without the sqrt) of each single cloud1 point regarding all cloud2

for(i=0; i<NPTS1; i++) {

        for(j=0; j<NPTS2; j++) {

            sumCPU +=  (cloud1[i].x - cloud2[j].x) * (cloud1[i].x - cloud2[j].x) +

                    (cloud1[i].y - cloud2[j].y) * (cloud1[i].y - cloud2[j].y) + 

                    (cloud1[i].z - cloud2[j].z) * (cloud1[i].z - cloud2[j].z);

        }

     }

sumCPU contains our final wanted result. We work in single precision float.

I’m facing troubles coding cuda, i decomposed the problem in 2 steps:

1/ One kernel will do the calculus of all the distances and store the result in a big array sum_d of size NPTS1NPTS2sizeof(float) in global memory (for the moment i don’t look for anykind of optimization i just want to have something accurate)

__global__ void calc_dist(pt3D * c1, pt3D * c2, float* sum_d)

{

unsigned int i=blockDim.x*blockIdx.x + threadIdx.x;

  unsigned int j=blockDim.y*blockIdx.y + threadIdx.y;

  if(i < NPTS1 and j < NPTS2)

    {

      sum_d[i*NPTS2 + j] = (c1[i].x-c2[j].x)*(c1[i].x-c2[j].x) + 

	                  (c1[i].y-c2[j].y)*(c1[i].y-c2[j].y) + 

	                  (c1[i].z-c2[j].z)*(c1[i].z-c2[j].z);

    }  

}

on a grid with blocksPerGrid(NPTS1/16+1,NPTS2/16+1) and threadsPerBlock(16,16) parameters

2/ i use cublas library on the sum_d array to get the actual sum result in an pointer float sumGPUp

status = cublasSasum(handle,NPTS1*NPTS2,sum_d,1,sumGPUp);

first problem:

regarding values of NPTS1 and NPTS2 i get problems with final results, at the moment i tested the code for NPTS1 and NPTS2 multiple of 10,

if NPTS1<1000 and NPTS2 <10000 there is no big troubles

GPU TIME : 0.010257 seconds

CPU TIME : 0.020978 seconds

GAIN : 2.045197

relative error CPU/GPU: 1.247682 %

STOP GPU: 4.548740e+15

STOP CPU: 4.492685e+15

but if i take NPTS1=NPTS2=10000:

GPU TIME : 0.061120 seconds

CPU TIME : 0.205745 seconds

GAIN : 3.366221

relative error CPU/GPU: <b>96.780182 %</b>

STOP GPU: 4.501012e+16

STOP CPU: 2.287330e+16

The final goal is to use NPTS1=NPTS2=1000 000 !

second problem:

facing troubles to use cublasSasum function on a NPTS1*NPTS2= 1 000 000 000 000 floats size, i don’t know what is the limit to use this cublas function.

Thanks a lot for you precious help and feedback,

Pascal

    Assuming this really is what you want to compute, why not sum the two arrays individually and then just take the difference of the sums on the CPU?

    Check return codes of all CUDA calls for errors. NPTS1=NPTS2=10000 needs about 1,2GB for the intermediate array. I don’t know which GPU you are using, but chances are you don’t have enough memory and the allocation is failing.

    Storing the intermediate result (the differences) and reading it back is by far the most expensive operation of this computation. As you already write your own custom kernel, why not just do the reduction inside the kernel? At the very least, each thread should compute multiple results, sum these (reductions within one thread are simple and fast) and write only one float. This reduces both the bandwidth needed and the exorbitant size of the intermediate array.

    []To further minimize the bandwidth requirements of the kernel, you should switch to a tiled approach, where each block reads just 2N input values to operate on N[sup]2[/sup] pairs.

Thank you very much for your precious feedback,

I’m not sure to get what you mean, since i need the sum between one element of the first array with all the element of the second and so on ?

I guess 4 bytes per float then => 400 Mb for NPTS1=NPTS2=10 000 (since i get a float array at the end and not a pt3D array) but its true that for 1000 000 i will encounter issues regarding that point.

Regarding errors, for NPTS1=NPTS2=10 000 i don’t have errors but still final computation is twice what i expected, therefore i was guessing on maybe there is a mistake in my kernel or in the grid definition?

I’m actually not really sure how to proceed 2D reduction for a single thread? i read the excellent 1D reduction guide from Mike Harris and SDK CUDA but found it hard with my basics knowledge in cuda to do transpose it in a 2D case. What i tried was to associate the NPTS2 CPU loop to each thread:

unsigned int i=blockDim.x*blockIdx.x + threadIdx.x;

 __global__ void calc_dist(pt3D * c1, pt3D * c2, float* sum_d)

{

unsigned int i=blockDim.x*blockIdx.x + threadIdx.x;

  if(i < NPTS1)

    {

      sum_d[i]=0;

      for(unsigned int j = 0;j<NPTS2;j++)

	{

	  sum_d[i] += (c1[i].x-c2[j].x)*(c1[i].x-c2[j].x) + 

	              (c1[i].y-c2[j].y)*(c1[i].y-c2[j].y) + 

	               (c1[i].z-c2[j].z)*(c1[i].z-c2[j].z);

	}

    }  

}

for a one dimensional grid on i with NPTS1/256 +1 blocks and 256 threadsPerBlock. But its not working great again for high values of NPTS1 and NPTS2, i guess because c2[j] element is accessed by multiple threads simultaneously, that reduces the size of final sum_d array to NPTS1 float element.

Actually i want to adapt this kernel from the SDK to my 2D case but without success for the moment:

reduce0(float *g_idata, float *g_odata, unsigned int n)

{

    extern __shared__ float sdata[];

// load shared mem

    unsigned int tid = threadIdx.x;

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

sdata[tid] = (i < n) ? g_idata[i] : 0;

__syncthreads();

// do reduction in shared mem

    for(unsigned int s=1; s < blockDim.x; s *= 2) {

        // modulo arithmetic is slow!

        if ((tid % (2*s)) == 0) {

            sdata[tid] += sdata[tid + s];

        }

        __syncthreads();

    }

// write result for this block to global mem

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

i guess this part is related to brent theorem but for the moment i just want the basics to work and i will go further then!

Again thank you and waiting for your reply,

Pascal

    I’m thinking of

    although you’d need to check whether single precision is still sufficient as this formula is more susceptible to loss of accuracy.

    Ah you are right, it’s 400Mb as you are not storing the components of the vector individually. Still you get the idea…

    What is the fundamental difference between a 1D reduction and a 2D reduction that keeps you from just using the 1D code?

    In the code you give, you can move the load of [font=“Courier New”]c1[i][/font] out of the loop (assuming the compiler doesn’t do it already).

    The code from the SDK example is also what I’m thinking of ultimately. Just sum as many terms as possible within each thread and defer the final reduction between the threads of the block to the very end.

    I’m just talking of avoiding unnecessary reloads of data from global memory. Instead of having each thread load [font=“Times New Roman”]x[sub]m[/sub][/font] and [font=“Times New Roman”]y[sub]n[/sub][/font], resulting in [font=“Times New Roman”]M[/font]×[font=“Times New Roman”]N[/font] loads, it is sufficient to load [font=“Times New Roman”]x[sub]m[/sub][/font] and [font=“Times New Roman”]y[sub]n[/sub][/font] once per block, resulting in [font=“Times New Roman”]M[/font]+[font=“Times New Roman”]N[/font] loads per block.

okay ! thank you again for all these good tips, i will come back with new results, i will give a shot to the implementation of formula in single precision with cublas premade routines!

Sincerely,

Pascal

Hi,

i implemented the given formula with great success using 12 CUBLAS routines for the calculus of the 12 simple sums, and it works really fine with so much speed-up compared to the global 2D sum CPU version. For N=M= 100 000, my float arrays are initialized with specific rand values.

START CPU
CPU TOTAL TIME: 101.711845 seconds
STOP CPU : 1.799312e+19
START GPU: REDUCTION Version0

EXECUTION GPU
GPU TOTAL TIME : 0.012736 seconds
CPU TOTAL TIME : 101.711845 seconds
GAIN : 7986.309082
Relative Error CPU/GPU: 0.000015 %
STOP GPU: 1.799312e+19
STOP CPU: 1.799312e+19

But now my supervisor (i’m in internship, using cuda for scientific programming) wants me to add the square root to all elements within the sum, decomposition in simple sum is not possible anymore i guess? Since i have to reduce sum((sum(sqrt( (x_n - y_m) )²)))?

I will give a shot for the kernel 1D reduction adaptation.

Thank you for your feedback!

Pascal

With the square root, only approximate simplification is possible using multipole expansion. But using the other tips, you should arrive at a reasonably fast implementation.

Hi,

today i just implemented that kernel which seems to be quite efficient already:

__global__ void reduction_v0(pt3D * c1, pt3D * c2, float* sum_d)

	{

	 float s=0.0;

	 unsigned int i=blockDim.x*blockIdx.x+threadIdx.x;

	 __syncthreads();

	 if(i < NPTS1)

	   {

	  for(unsigned int j =0; j<NPTS2;j++)

	    {

	      s += sqrt((c1[i].x-c2[j].x)*(c1[i].x-c2[j].x) + 

                        (c1[i].y-c2[j].y)*(c1[i].y-c2[j].y) + 

			(c1[i].z-c2[j].z)*(c1[i].z-c2[j].z));

	    }

	 

	  sum_d[i]=s;

	   }

	}

Each thread is doing the calculus for one element of the NPTS1 cloud point, and the result is stored in a global memory array sum_d. This array is then copied on the host side, and a simple NPTS1 loop is made on the host side array to compute the final value.

Guess this version is REALLY not optimal at all but is ‘enough for the moment’ since our performance goal is reached.

GPU implementation speeds-up by ~40 CPU double loop implementation, for high dimensional NPTS1 and NPTS2 (>= 100 000 and <= 10 000 000).

just as an example of results i get for NPTS1=NPTS2=200 000

CPU TIME: 183.53 seconds

GPU TIME: 4.35 seconds

NPTS1=NPTS2=200 000

Relative error %: 0,000152%

SPEED-UP FACTOR: 42,1908046

Pascal

For such big dimentions maybe there is no other suitable way to do this since the GPU should be fully used.
I think that with the maximum block size you would have to launch the kernel multiple times for it to work.

Actually it seems that with the simple implementation i made (previous post), i can use like 1 000 000 size for my 2 cloud of points and i still get accurate results.

But i have a questions on ‘kernel limitations’:

  • do you know if there is a defined ‘maximum spent time’ for a thread execution ? i mean, if i keep improving size i have troubles and i was wondering if Streaming Processors include a sort of limitation in time to run a single thread ? or more generally Is the SIMT scheduler having a limitation on time to run a Warp on a SM ?

Maybe the answer is obvious but i don’t really get it for the moment!

Thank you,

Pascal

If your GPU is used for displaying the GUI as well, there is a watchdog timer that triggers after about 2 to 5 seconds to prevent the computer from becoming unusable.

EDIT: The timer is reset on every kernel launch (or batch of kernels of Windows WDDM) though, there is no additional limit per thread/warp/block/whatever.

okay! i guess it triggers regarding execution of first warp ? (the time 2-5 seconds will be the max time for each thread ? if the thread requires more time it crashes ?)

And in fact it is a device propery and i guess we can switch it off in need. Here is the device query of the computer i’m actually working on, simple cluster of 2 Quadro 5000 FX nvidia cards.

UDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: “Quadro 5000”
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2239 MBytes (2347958272 bytes)
(11) Multiprocessors x (32) CUDA Cores/MP: 352 CUDA Cores
GPU Clock Speed: 1.03 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “Quadro 5000”
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2239 MBytes (2348220416 bytes)
(11) Multiprocessors x (32) CUDA Cores/MP: 352 CUDA Cores
GPU Clock Speed: 1.03 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Quadro 5000, Device = Quadro 5000
[deviceQuery] test results…
PASSED

Press ENTER to exit…

Pascal