# Sum vectors

Hi everyone, I must sum two vectors and save the result in a third vector, and each thread must do only one sum. This is simple, but I don't know how to do it when the size of my vectors is larger than the maximum number of threads (in my case 33,553,920 threads). What do you do? Must I call the kernel function several times in a `for` loop? But then I don't know how to organize the data when I return to the host code :(. I hope for your reply, it is very important for my thesis. Thank you!

The number of possible threads is 1024 x 65535 x 65535 x 64 (for a compute capability 2.0 device), which is about 100,000 times more than the number you wrote. You can make each thread do more than one addition with a simple loop inside the kernel. Please post the code you tried that did not work; from your message I could not understand what the problem is.
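The "simple loop inside the kernel" is usually written as a grid-stride loop, so a fixed-size grid can cover a vector of any length. A minimal CUDA sketch of the idea (the kernel name and `float` element type are placeholders, not the poster's actual code):

```cuda
// Grid-stride loop: each thread handles its global index, then jumps
// ahead by the total number of launched threads until the end of the data.
__global__ void sumVectStrided(const float *a, const float *b, float *c,
                               unsigned long int n)
{
    unsigned long int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long int stride = gridDim.x * blockDim.x;

    for (unsigned long int i = tid; i < n; i += stride)
        c[i] = a[i] + b[i];
}
```

With this pattern the grid size no longer has to match the vector length; a launch such as `sumVectStrided<<<1024, 512>>>(a, b, c, n)` handles any `n`.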

The code is:

```
__global__ void sumVect(QVECT *Vett, QVECT *VettRis, unsigned long int N, unsigned long int N2){

    // Variable declarations on the device
    int tid, sh, linIdx, i, z, j;

    // Linear thread index within the block, then within the whole grid
    sh = threadIdx.x + blockDim.x * threadIdx.y;
    tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;

    // Algorithm to compute the right (i, j) indices from the thread id
    if (tid < N2){

        linIdx = N2 - tid;

        i = int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));

        z = (N + N - 1 - i) * i;

        j = tid - z/2 + 1 + i;

        if (i == j){
            i = i - 1;
            j = N - 1;
        }

        VettRis[tid] = Vett[i] + Vett[j];
    }
}
```

and call with:

sumVect<<<dim3(2048,32,1),dim3(32,16,1)>>>(QVect_Dev, QVect_Dev_Ris,N,coefBin);

I must use 2D blocks and 512 threads per block; this is the optimal number of threads to exploit 100% of the cores of the GPU of a Tesla S2050, as written in the "CUDA_Occupancy_Calculator.xls" document on the CUDA download page.

I see. I already replied on the other topic you opened.
Depending on the problem you could get better performance with lower occupancy. What is important is the total time to complete.

Two suggestions:

1. If you keep the blocks the same, use a loop inside the kernel to do more operations per thread, but this will result in a different number of registers and different occupancy.
2. You can also launch kernels on different parts of the vectors and use streams to submit them. This would require fewer modifications.

My global function must not have loops; it must have constant complexity. So your second case is the good one for me. But I don't know which changes to make :(

First you need to look at your problem and see whether you can divide the work into independent parts. For example, suppose you have a simple addition C[i]=A[i]+B[i]; and divide the work into 2 parts: you would have 2 calls, where the first call computes from 0 to N/2-1 and the second from N/2 to N-1. If the work is enough to fill the GPU you do not need streams; you can make the calls one after the other. How you divide it will depend on your specific problem.

Some suggestions:

Do not get hung up on maximum occupancy. Sometimes lower occupancy can give better performance. In practice you need to experiment a little to see which occupancy is better.

In order to reduce the probability of errors, in the kernel use tid = blockDim.xb * blockIdx.x + gridSize.x blockIdx.y + sh; instead of tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;

sorry, what is "b" in tid = blockDim.xb * blockIdx.x + gridSize.x blockIdx.y + sh;? And gridSize is not defined

Sorry for the typos. Here is corrected:

tid = blockDim.x*blockDim.y*(blockIdx.x + gridDim.x*blockIdx.y) + sh;

blockDim.x*blockDim.y is the total number of threads per block, gridDim.x is size of the grid of blocks in the x direction.

However, I have increased the number of blocks; I can push up to <<<dim3(65535,65535,1),dim3(32,16,1)>>>, and that is more than enough :). But now I have a new problem :P: the memory allocated on the device cannot be more than 2.7 GB. How do we solve this other problem? Is there a way to take advantage of all 12 GB of my Tesla?

Your S2050 has 4 GPUs which do not know about each other. For example, card 0 cannot access resources allocated on card 1. You can write a multi-GPU program with each GPU doing 1/4 of the total calculations. The kernel calls are non-blocking, so you could structure the code like this:

1. set card 0 as current

2. call kernel with the first part of the work

3. set card 1 as current

4. call kernel with the second part of the work

and so on. Because the kernel calls are non-blocking, after you make call 2) the control is given back to the CPU (even while GPU 0 is still working), and it continues to lines 3), 4), and so on. At the end, when you collect the data, you synchronize the devices.
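A sketch of that pattern with the CUDA runtime API (the kernel name `sumVectPart`, the per-device pointer arrays, and the launch sizes are placeholders, not code from this thread):

```cuda
// Split the work across the 4 GPUs of the S2050. cudaSetDevice selects the
// current card, and kernel launches return immediately, so all 4 GPUs
// compute their quarter of the work in parallel.
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);              // 4 on an S2050

unsigned long int chunk = N2 / deviceCount;
dim3 grid(2048, 32), block(32, 16);

for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);                        // steps 1, 3, ...: select card
    unsigned long int offset = dev * chunk;
    // steps 2, 4, ...: launch this card's part of the work (asynchronous)
    sumVectPart<<<grid, block>>>(Vett_dev[dev], VettRis_dev[dev],
                                 N, offset, chunk);
}

// At the end, synchronize each device before collecting the results.
for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
}
```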

We are in the first situation then. We must divide the work, but I do not know how. This function sums the elements of the vector without repetition, so, to make you understand:

VetRis[0]=Vett[0]+Vett[1];

VetRis[1]=Vett[0]+Vett[2];

VetRis[i]=Vett[0]+Vett[N-1];

VetRis[i+1]=Vett[1]+Vett[2];

VetRis[i+2]=Vett[1]+Vett[3];

VetRis[i+3]=Vett[1]+Vett[4];

VetRis[j]=Vett[1]+Vett[N-1];

VetRis[j+1]=Vett[2]+Vett[3];

VetRis[j+2]=Vett[2]+Vett[4];

VetRis[j+3]=Vett[2]+Vett[5];

understood? Please help me, it is for my thesis :(

Nothing? Can't even you find a solution? :(

Hello,

I am sorry I cannot help more, I am a beginner myself. In order to solve your problem you need to analyze the algorithm in detail and identify whether you can split the work into 4 independent parts of about 3 GB each, so that you can split the memory in 4. This depends on the details of your problem, and you know them better than me.

If the splitting is possible, then you can allocate the vectors and use all 4 cards.

perhaps possible but hard

for 7 numbers you have 21 pairs, so 21 threads -> 21*4 bytes + 7*4*2 bytes

for 10000 numbers you have 49995000 pairs, so 49995000 threads -> 49995000*4 + 10000*4*2 = 200 MB
for 20000 numbers = 800 MB

what is the maximum number you have managed so far, and how big do you want to go?

Yes, the situation is almost this, except that each thread sums structs of floats, defined as:

```
typedef struct {

    float Ene;   // Energy

    float x;

    float y;

    float z;

} QVECT;
```

so each thread sums one QVECT struct (four floats = 16 bytes). I managed to get up to 2.6 GB (approximately 12000 or 15000 numbers, I don't remember precisely). Unfortunately I must push further, because I must run an experiment for my thesis with big numbers :S Can you help me (again :D ehehehe)? :)

ok
for 10000 you can do only 9999 threads, with 10000/2 = 5000
for 10001 you can do only 10000 threads, with (10001-1)/2 = 5000
for 10002 you can do only 10001 threads, with 10002/2 = 5001
for 10003 you can do only 10000 threads, with (10003-1)/2 = 5001
etc.

after that you must do a for loop and make several calls, so as to use only the memory available

I will try to write this

sorry, but I don't understand. What do you mean by that?

If you are using CUDA 4.0 on a 64 bit system, the unified virtual address space (UVA) can simplify your task of splitting the data between cards a lot. With UVA, it does not matter for correctness on which card the data is stored, it only affects speed.

So if you can find a partition scheme where each card most of the time operates on data in its own memory, but sometime accesses data from different cards, your problem is solved.

Yes, I have CUDA 4.0 on a 64-bit system (my hardware is a Tesla S2050). How do I use this technique?

With `cudaDeviceEnablePeerAccess()`. Check section 3.2.6.4 (and up to section 3.2.7) of the Programming Guide.
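A minimal sketch of enabling peer access between card 0 and card 1 (CUDA 4.0 runtime API; error handling omitted):

```cuda
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1?

if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // flags argument must be 0
    // With UVA, a kernel running on device 0 may now dereference pointers
    // that were allocated with cudaMalloc on device 1 (slower, over PCIe).
}
```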