# Sum vectors

Hi everyone, I must sum two vectors and save the result in a third vector, and each thread must do only one sum. This is simple, but I don't know how to do it when the size of my vectors is larger than the maximum number of threads (in my case 33,553,920 threads). What do you do? Must I call the kernel function several times in a `for` loop? But then I don't know how to organize the data when I return to the host code :(. I hope for your reply, it is very important for my thesis. Thank you!

The number of possible threads is 1024 x 65535 x 65535 x 64 (for a compute capability 2.0 device), which is about 100,000 times more than the number you wrote. You can make each thread do more than one addition with a simple loop inside the kernel. Please post the code you tried that did not work; from your message I could not understand what the problem is.
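The "simple loop inside the kernel" is usually written as a grid-stride loop, so a fixed-size grid can cover a vector of any length. A minimal CUDA sketch of the idea (the kernel name and `float` element type are placeholders, not the poster's actual code):

```cuda
// Grid-stride loop: each thread handles its global index, then jumps
// ahead by the total number of launched threads until the end of the data.
__global__ void sumVectStrided(const float *a, const float *b, float *c,
                               unsigned long int n)
{
    unsigned long int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long int stride = gridDim.x * blockDim.x;

    for (unsigned long int i = tid; i < n; i += stride)
        c[i] = a[i] + b[i];
}
```

With this pattern the grid size no longer has to match the vector length; a launch such as `sumVectStrided<<<1024, 512>>>(a, b, c, n)` handles any `n`.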

The code is:

```
__global__ void sumVect(QVECT *Vett, QVECT *VettRis, unsigned long int N, unsigned long int N2){

    // Variable declarations on the device
    int tid, sh, linIdx, i, z, j;

    // Linear thread index within the block, then within the whole grid
    sh = threadIdx.x + blockDim.x * threadIdx.y;
    tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;

    // Algorithm to compute the right (i, j) indices from the thread id
    if (tid < N2){

        linIdx = N2 - tid;

        i = int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));

        z = (N + N - 1 - i) * i;

        j = tid - z/2 + 1 + i;

        if (i == j){
            i = i - 1;
            j = N - 1;
        }

        VettRis[tid] = Vett[i] + Vett[j];
    }
}
```

and call with:

sumVect<<<dim3(2048,32,1),dim3(32,16,1)>>>(QVect_Dev, QVect_Dev_Ris,N,coefBin);

I must use 2D blocks and 512 threads per block; this is the optimal number of threads to exploit 100% of the cores of the GPU of a Tesla S2050, as written in the "CUDA_Occupancy_Calculator.xls" document on the CUDA download page.

I see. I already replied on the other topic you opened.
Depending on the problem you could get better performance with lower occupancy. What is important is the total time to complete.

Two suggestions:

1. If you keep the blocks the same, use a loop inside the kernel to do more operations per thread, but this will result in a different number of registers and different occupancy.
2. You can also launch kernels on different parts of the vectors and use streams to submit them. This would require fewer modifications.

My global function must not have loops; it must have constant complexity. So your second case is the good one for me. But I don't know which changes to make :(

First you need to look at your problem and see whether you can divide the work into independent parts. For example, suppose you have a simple addition C[i]=A[i]+B[i]; and divide the work into 2 parts: you would have 2 calls, where the first call computes from 0 to N/2-1 and the second from N/2 to N-1. If the work is enough to fill the GPU you do not need streams; you can make the calls one after the other. How you divide it will depend on your specific problem.

Some suggestions:

Do not get hung up on maximum occupancy. Sometimes lower occupancy can give better performance. In practice you need to experiment a little to see which occupancy is better.

In order to reduce the probability of errors, in the kernel use tid = blockDim.xb * blockIdx.x + gridSize.x blockIdx.y + sh; instead of tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;

sorry, what is "b" in tid = blockDim.xb * blockIdx.x + gridSize.x blockIdx.y + sh;? And gridSize is not defined

Sorry for the typos. Here is corrected:

tid = blockDim.x*blockDim.y*(blockIdx.x + gridDim.x*blockIdx.y) + sh;

blockDim.x*blockDim.y is the total number of threads per block, gridDim.x is size of the grid of blocks in the x direction.

However, I have increased the number of blocks; I can push up to <<<dim3(65535,65535,1),dim3(32,16,1)>>>, and that is more than enough :). But now I have a new problem :P: the memory allocated on the device cannot be more than 2.7 GB. How do we solve this other problem? Is there a way to take advantage of all 12 GB of my Tesla?

Your S2050 has 4 GPUs which do not know about each other. For example, card 0 cannot access resources allocated on card 1. You can write a multi-GPU program with each GPU doing 1/4 of the total calculations. The kernel calls are non-blocking, so you could structure the code like this:

1. set card 0 as current

2. call kernel with the first part of the work

3. set card 1 as current

4. call kernel with the second part of the work

and so on. Because the kernel calls are non-blocking, after you make call 2) the control is given back to the CPU (even while GPU 0 is still working), and it continues to lines 3), 4), and so on. At the end, when you collect the data, you synchronize the devices.
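A sketch of that pattern with the CUDA runtime API (the kernel name `sumVectPart`, the per-device pointer arrays, and the launch sizes are placeholders, not code from this thread):

```cuda
// Split the work across the 4 GPUs of the S2050. cudaSetDevice selects the
// current card, and kernel launches return immediately, so all 4 GPUs
// compute their quarter of the work in parallel.
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);              // 4 on an S2050

unsigned long int chunk = N2 / deviceCount;
dim3 grid(2048, 32), block(32, 16);

for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);                        // steps 1, 3, ...: select card
    unsigned long int offset = dev * chunk;
    // steps 2, 4, ...: launch this card's part of the work (asynchronous)
    sumVectPart<<<grid, block>>>(Vett_dev[dev], VettRis_dev[dev],
                                 N, offset, chunk);
}

// At the end, synchronize each device before collecting the results.
for (int dev = 0; dev < deviceCount; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
}
```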

We are in the first situation then. We must divide the work, but I do not know how. This function sums the elements of the vector without repetition, so, to make you understand:

VetRis[0]=Vett[0]+Vett[1];

VetRis[1]=Vett[0]+Vett[2];

VetRis[i]=Vett[0]+Vett[N-1];

VetRis[i+1]=Vett[1]+Vett[2];

VetRis[i+2]=Vett[1]+Vett[3];

VetRis[i+3]=Vett[1]+Vett[4];

VetRis[j]=Vett[1]+Vett[N-1];

VetRis[j+1]=Vett[2]+Vett[3];

VetRis[j+2]=Vett[2]+Vett[4];

VetRis[j+3]=Vett[2]+Vett[5];

understood? Please help me, it is for my thesis :(

Nothing? Can't even you find a solution? :(

Hello,

I am sorry I cannot help more, I am a beginner myself. In order to solve your problem you need to analyze the algorithm in detail and identify whether you can split the work into 4 independent parts of about 3 GB each, so that you can split the memory in 4. This depends on the details of your problem, and you know them better than me.

If the splitting is possible, then you can allocate the vectors and use all 4 cards.

perhaps possible but hard

for 7 numbers you have 21 pairs, so 21 threads -> 21*4 bytes + 7*4*2 bytes

for 10000 numbers you have 49995000 pairs, so 49995000 threads -> 49995000*4 + 10000*4*2 = 200 MB
for 20000 numbers = 800 MB

what is the maximum number you have managed so far, and how big do you want to go?

Yes, the situation is almost this, except that each thread sums structs of floats, defined as:

```
typedef struct {

    float Ene;   // Energy

    float x;

    float y;

    float z;

} QVECT;
```

so each thread sums one QVECT struct (four floats = 16 bytes). I managed to get up to 2.6 GB (approximately 12000 or 15000 numbers, I don't remember precisely). Unfortunately I must push further, because I must run an experiment for my thesis with big numbers :S Can you help me (again :D ehehehe)? :)

ok
for 10000 you can do only 9999 threads, with 10000/2 = 5000
for 10001 you can do only 10000 threads, with (10001-1)/2 = 5000
for 10002 you can do only 10001 threads, with 10002/2 = 5001
for 10003 you can do only 10000 threads, with (10003-1)/2 = 5001
etc.

after that you must do a for loop and make several calls, so as to use only the memory available

I will try to write this

sorry, but I don't understand. What do you mean by that?

If you are using CUDA 4.0 on a 64 bit system, the unified virtual address space (UVA) can simplify your task of splitting the data between cards a lot. With UVA, it does not matter for correctness on which card the data is stored, it only affects speed.

So if you can find a partition scheme where each card most of the time operates on data in its own memory, but sometime accesses data from different cards, your problem is solved.

Yes, I have CUDA 4.0 on a 64-bit system (my hardware is a Tesla S2050). How do I use this technique?

With `cudaDeviceEnablePeerAccess()`. Check section 3.2.6.4 (and up to section 3.2.7) of the Programming Guide.
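A minimal sketch of enabling peer access between card 0 and card 1 (CUDA 4.0 runtime API; error handling omitted):

```cuda
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1?

if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // flags argument must be 0
    // With UVA, a kernel running on device 0 may now dereference pointers
    // that were allocated with cudaMalloc on device 1 (slower, over PCIe).
}
```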