What is the integer size of the registers allowed for EACH co-processor?

The physical size? I think it's 32 bits for both floating point and integers.

Thanks.

Do you know how many processors you can use in parallel?

This is explained in the technical specifications section of the programming guide.

On the GeForce 8800 GTX / Quadro FX 5600 there are 16 multiprocessors, each comprised of 8 processors, giving a total of 128 MAD units.

Sorry, but another quick question. Is the integer size 32 bits per processor or per multiprocessor?

Each processor is capable of operating on 32-bit floats or integers.

Since the processors are 32 bits, I should be able to do 4096 bits of arithmetic, right?

If I multiply two 32-bit numbers, can I get the 64-bit intermediate result or not?

Thanks for all the help.

The ALUs look like SIMD (and do run in lock step), but each has a separate register set. So think of it not as MMX-style wide registers but as the 4-vector processing in GPU shaders. That means there is no carry between ALUs, so the processors do not combine into a single 4096-bit unit. A 32-bit integer multiply will simply overflow if the product does not fit in 32 bits.
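For the 64-bit product question, here is a minimal sketch, assuming a CUDA toolkit that provides the `__umulhi` intrinsic. The plain `*` operator keeps only the low 32 bits of the product, and `__umulhi` returns the high 32 bits, so each thread can still recover the full result. The kernel name `mul_wide` and the test values are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread multiplies one pair of 32-bit values and stores the
// 64-bit product as two 32-bit halves. (Hypothetical kernel name.)
__global__ void mul_wide(const unsigned int* a, const unsigned int* b,
                         unsigned int* lo, unsigned int* hi)
{
    int i = threadIdx.x;
    lo[i] = a[i] * b[i];           // low 32 bits (the product wraps modulo 2^32)
    hi[i] = __umulhi(a[i], b[i]);  // high 32 bits of the full 64-bit product
}

int main()
{
    const int n = 4;  // illustrative size and values, not from the thread
    unsigned int ha[n] = {3u, 0xFFFFFFFFu, 65536u, 123456789u};
    unsigned int hb[n] = {5u, 2u, 65536u, 987654321u};
    unsigned int hlo[n], hhi[n];

    unsigned int *da, *db, *dlo, *dhi;
    size_t bytes = n * sizeof(unsigned int);
    cudaMalloc((void**)&da, bytes);
    cudaMalloc((void**)&db, bytes);
    cudaMalloc((void**)&dlo, bytes);
    cudaMalloc((void**)&dhi, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    mul_wide<<<1, n>>>(da, db, dlo, dhi);

    cudaMemcpy(hlo, dlo, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(hhi, dhi, bytes, cudaMemcpyDeviceToHost);

    // Reassemble the halves and compare against a 64-bit multiply on the host.
    for (int i = 0; i < n; ++i) {
        unsigned long long gpu = ((unsigned long long)hhi[i] << 32) | hlo[i];
        unsigned long long cpu = (unsigned long long)ha[i] * hb[i];
        printf("%u * %u = %llu (%s)\n", ha[i], hb[i], gpu,
               gpu == cpu ? "matches host" : "MISMATCH");
    }

    cudaFree(da); cudaFree(db); cudaFree(dlo); cudaFree(dhi);
    return 0;
}
```

Anything wider than 64 bits would have to be built in software from 32-bit limbs with explicit carry handling inside a single thread, since the hardware does not pass carries between processors.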

Peter

Thanks for all the help. I really appreciate it. I have another question.

if I have a vector

V=[v_1,v_2,…v_128]

and W=[w_1,w_2,…w_128]

I want the addition to be

Z=V+W (as vectors)

z_1 = w_1 + v_1 modulo P_1, where P_1 is a prime

similarly for z_2,…

Is this possible?

Yeah, this is easy. Upload the V, W and P vectors, then run a kernel with 128 threads:

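```
__global__ void test(float* V, float* W, float* P, float* result)
{
    // One thread per element; threadIdx.x selects the element.
    int i = threadIdx.x;
    // The % operator is not defined for float in C, so use fmodf.
    result[i] = fmodf(V[i] + W[i], P[i]);
}
```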

Then download the result. Note that it uses only 1 block, the number of threads is also pretty low, and they don't do much computation (you are memory bound), so the performance will be disappointing. You need more work for the GPU in order to amortize the upload/download overhead.
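The upload, launch and download steps could look roughly like the sketch below. It uses the standard runtime calls `cudaMalloc`, `cudaMemcpy` and `cudaFree`; the input values (and the constant modulus 7) are invented for illustration, and the kernel is repeated with `fmodf` so the file compiles on its own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 128  // one thread per element, single block

__global__ void test(const float* V, const float* W, const float* P, float* result)
{
    int i = threadIdx.x;
    result[i] = fmodf(V[i] + W[i], P[i]);  // % is not defined for float
}

int main()
{
    // Host-side inputs; the values are placeholders, not from the thread.
    float hV[N], hW[N], hP[N], hR[N];
    for (int i = 0; i < N; ++i) { hV[i] = (float)i; hW[i] = 2.0f * i; hP[i] = 7.0f; }

    // Allocate device buffers.
    float *dV, *dW, *dP, *dR;
    size_t bytes = N * sizeof(float);
    cudaMalloc((void**)&dV, bytes);
    cudaMalloc((void**)&dW, bytes);
    cudaMalloc((void**)&dP, bytes);
    cudaMalloc((void**)&dR, bytes);

    // Upload V, W and P.
    cudaMemcpy(dV, hV, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dW, hW, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dP, hP, bytes, cudaMemcpyHostToDevice);

    // Run 1 block of 128 threads.
    test<<<1, N>>>(dV, dW, dP, dR);

    // Download the result; this copy waits for the kernel to finish.
    cudaMemcpy(hR, dR, bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; ++i)
        printf("result[%d] = %f\n", i, hR[i]);

    cudaFree(dV); cudaFree(dW); cudaFree(dP); cudaFree(dR);
    return 0;
}
```

Four memory transfers surround a single arithmetic operation per element here, which is why the transfer overhead dominates at this problem size.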

Peter

I noticed that it is float. Can this be done using integer?

As to the number of threads, I assume they are parallel hardware? If so, can we do 256 parallel adds that are independent?

I need a large number of parallel add, subtract, and multiply operations modulo a prime, all in integer.

Thanks again for the help. This forum has been very helpful.

Yes, CUDA supports int natively.

See the CUDA manual for max limitations of resources. Currently, you can do 512 threads per block and 64k x 64k blocks.

Also, I would suggest you do loop unrolling, i.e. do the calculation for i, i+1, i+2, i+3, … i+n in one kernel call. That should let you handle array sizes that are far beyond what fits in memory :)
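One common way to have each thread handle several elements per kernel call is a strided loop, sketched below with `int` data. This is a slightly different technique from the consecutive i, i+1, … unrolling described above: each thread steps by the total thread count instead. The kernel name `mod_add`, the launch shape of 64 blocks of 256 threads, the array size and the prime 7919 are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread handles elements i, i + stride, i + 2*stride, ...
// Assumes 0 <= x[i], y[i] < p[i], so the sum cannot overflow an int.
__global__ void mod_add(const int* x, const int* y, const int* p, int* r, int n)
{
    int stride = blockDim.x * gridDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        r[i] = (x[i] + y[i]) % p[i];
}

int main()
{
    const int n = 1 << 20;  // illustrative size
    const size_t bytes = n * sizeof(int);

    // Host arrays x, y, p, r packed back to back in one allocation.
    int* h = (int*)malloc(4 * bytes);
    int *hx = h, *hy = h + n, *hp = h + 2 * n, *hr = h + 3 * n;
    for (int i = 0; i < n; ++i) {
        hp[i] = 7919;              // same prime everywhere, for simplicity
        hx[i] = i % 7919;
        hy[i] = (3 * i) % 7919;
    }

    // Matching device layout.
    int* d;
    cudaMalloc((void**)&d, 4 * bytes);
    int *dx = d, *dy = d + n, *dp = d + 2 * n, *dr = d + 3 * n;
    cudaMemcpy(d, h, 3 * bytes, cudaMemcpyHostToDevice);  // uploads x, y and p

    // 64 * 256 = 16384 threads cover 2^20 elements, 64 elements per thread.
    mod_add<<<64, 256>>>(dx, dy, dp, dr, n);

    cudaMemcpy(hr, dr, bytes, cudaMemcpyDeviceToHost);

    // Check every element against the same computation on the host.
    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (hr[i] != (hx[i] + hy[i]) % hp[i]);
    printf("%d mismatches out of %d elements\n", bad, n);

    cudaFree(d);
    free(h);
    return 0;
}
```

Because the loop bound is `n` rather than the grid size, the same launch configuration works for any array length that fits in device memory.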

Peter

Is it possible for all these operations to be independent?

Let R be the result array, X and Y the arrays of ints we want to work with, and P the array of primes.

for (i = 0; i < 512; i++) {

R[i] = (X[i] + Y[i]) % P[i];

}

Is it possible for this to be unrolled to 512 parallel threads?

Hi -

I think I may not have been asking specific enough questions. I am more curious about the capabilities of the card at the hardware level.

I am wondering what the maximum number of simultaneous add/subtract/multiply/divide operations is (at the hardware level), not necessarily how many threads, unless of course there is a 1:1 relationship between a thread and a hardware operation.

I also do not need the co-processors to communicate with one another - as long as the master processor is able to communicate with all the co-processors.

I want to allocate all variables on the GPU.

I want to simultaneously do the following as many times as the GPU has the hardware resources to allow me. My question is, how many times can I do the following simultaneously at the hardware level?

r[i] = (a[i] + b[i]) % p[i];

where

r = result int array

a = int array

b = int array

p = prime number int array

Thanks for all the help. These forums are great.

See chapter 5 of the programming guide, which covers hardware capabilities.

It basically depends on the resources the kernel needs (registers and shared memory), which determine how many thread blocks can execute simultaneously on one multiprocessor. Each block has at most 512 threads, and there are 16 multiprocessors.
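The limits can also be queried at run time rather than read from the guide. The sketch below uses `cudaGetDeviceProperties` and `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; the latter only exists in CUDA toolkits much newer than the one discussed in this thread, so treat it as an assumption about your setup. The kernel `mod_add` and the block size of 256 are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel whose resource usage is being queried.
__global__ void mod_add(const int* a, const int* b, const int* p, int* r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) r[i] = (a[i] + b[i]) % p[i];
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }

    // How many blocks of this kernel fit on one multiprocessor at once,
    // given its register and shared memory usage.
    const int blockSize = 256;  // assumed block size
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, mod_add, blockSize, 0);

    printf("multiprocessors:             %d\n", prop.multiProcessorCount);
    printf("max threads per block:       %d\n", prop.maxThreadsPerBlock);
    printf("registers per block:         %d\n", prop.regsPerBlock);
    printf("shared memory per block:     %zu bytes\n", prop.sharedMemPerBlock);
    printf("resident blocks per SM:      %d\n", blocksPerSM);
    printf("resident threads on device:  %d\n",
           blocksPerSM * blockSize * prop.multiProcessorCount);
    return 0;
}
```

Resident threads are not the same as operations executing on the same clock: the number of arithmetic units sets that figure (128 on the GeForce 8800 GTX, per the numbers earlier in the thread), and the extra resident threads are what the hardware switches between to hide memory latency.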

Peter