Integer size per coprocessor

What is the integer size of the registers allowed for EACH co-processor?

The physical size? I think it's 32 bits for both floating point and integers.

Thanks.

Do you know how many processors you can use in parallel?

This is explained in the technical specifications section of the programming guide.

On the GeForce 8800 GTX / Quadro FX 5600 there are 16 multiprocessors, each consisting of 8 processors, for a total of 128 MAD units.

Sorry, but another quick question. Is the integer size 32 bits per processor or per multiprocessor?

Each processor is capable of operating on 32-bit floats or integers.

Since the processors are 32 bits, I should be able to do 4096 bits of arithmetic, right?

If I multiply two 32-bit numbers, can I get the 64-bit intermediate result or not?

Thanks for all the help.

The ALUs look like SIMD (and do run in lock step), but each has a separate register set. So you have to think of it not like wide MMX registers but like the 4-vector processing in GPU shaders. That means there is no carry between ALUs; your integer multiply will simply overflow in that case.
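
For example (just a sketch; 64-bit integer types are supported in CUDA C, though they are emulated with several 32-bit instructions on this hardware):

__global__ void mul_demo(unsigned int a, unsigned int b,
                         unsigned int* low, unsigned long long* full)
{
    *low  = a * b;                        // 32-bit result: wraps on overflow, no carry out
    *full = (unsigned long long)a * b;    // full 64-bit product via a widening cast
}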

Peter

Thanks for all the help. I really appreciate it. I have another question.

If I have vectors
V = [v_1, v_2, …, v_128]
and W = [w_1, w_2, …, w_128],
I want the addition to be

Z = V + W (as vectors)

with z_1 = (v_1 + w_1) mod P_1, where P_1 is a prime,
and similarly for z_2, …

Is this possible?

Yeah, this is easy. Upload the V, W, and P vectors, then run a kernel with 128 threads:

__global__ void test(float* V, float* W, float* P, float* result)
{
    // '%' is not defined for float operands; use fmodf for the float remainder
    result[threadIdx.x] = fmodf(V[threadIdx.x] + W[threadIdx.x], P[threadIdx.x]);
}

Then download the result. Note that this uses only 1 block, the number of threads is also pretty low, and they don't do much computation (you are memory bound), so the performance will be disappointing. You need to give the GPU more work in order to amortize the upload/download overhead.
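
If you want to measure that overhead yourself, here is a rough sketch using CUDA events to time just the upload (variable names are placeholders):

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(dV, hV, 128 * sizeof(float), cudaMemcpyHostToDevice);  // upload V
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);   // upload time in milliseconds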

Peter

I noticed that it uses float. Can this be done using integers?

As to the number of threads, I assume they map to parallel hardware? If so, can we do 256 parallel adds that are independent?

I need a large number of parallel adds, subtracts, multiplies, and reductions modulo a prime, all in integer arithmetic.

Thanks again for the help. This forum has been very helpful.

Yes, CUDA supports int natively.

See the CUDA manual for the maximum resource limits. Currently, you can do 512 threads per block and a 64k x 64k grid of blocks.
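
For example, an integer version of the kernel above, written so it can use more than one block (a sketch; N and the block size of 256 are placeholders, and N is assumed to be a multiple of 256):

__global__ void addmod(int* V, int* W, int* P, int* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    result[i] = (V[i] + W[i]) % P[i];                // integer add followed by integer modulo
}

// Host side: one thread per element, e.g. N = 4096 elements
// addmod<<<N / 256, 256>>>(dV, dW, dP, dResult);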

Also, I would suggest you unroll a loop over several elements per thread, i.e. do the calculation for i, i+1, i+2, i+3, …, i+n in one kernel call. That lets you handle array sizes far beyond the per-launch thread limit. :)
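
A sketch of that idea (hypothetical names; each thread handles ELEMENTS_PER_THREAD consecutive elements instead of just one, with a bounds check so n need not be a multiple of the thread count):

#define ELEMENTS_PER_THREAD 4

__global__ void addmod_unrolled(int* V, int* W, int* P, int* result, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMENTS_PER_THREAD;
    for (int k = 0; k < ELEMENTS_PER_THREAD; ++k) {
        int i = base + k;
        if (i < n)                                    // guard against running past the array
            result[i] = (V[i] + W[i]) % P[i];
    }
}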

Peter

Is it possible for all these operations to be independent?

Let R be the result array, X and Y the arrays of ints we want to work with, and P the array of primes:

for (int i = 0; i < 512; i++) {
    R[i] = (X[i] + Y[i]) % P[i];
}

Is it possible for this to be unrolled to 512 parallel threads?

Hi -

I think I may not have been asking specific enough questions. I am more curious about the capabilities of the card at the hardware level.

I'm wondering what the maximum number of simultaneous add/subtract/multiply/divide operations is at the hardware level, not necessarily how many threads, unless of course there is a 1:1 relationship between a thread and a hardware operation.

I also do not need the co-processors to communicate with one another - as long as the master processor is able to communicate with all the co-processors.

I want to allocate all variables on the GPU.

I want to simultaneously do the following as many times as the GPU has the hardware resources to allow me. My question is, how many times can I do the following simultaneously at the hardware level?

r[i] = (a[i] + b[i]) % p[i];

where

r = result int array
a = int array
b = int array
p = prime number int array

Thanks for all the help. These forums are great.

See the programming guide chapter 5 about hardware capabilities.

It basically depends on the shared resources needed by the kernel (registers and shared memory), which determine how many thread blocks can execute simultaneously on one multiprocessor. Each block has 512 threads max, and there are 16 multiprocessors.
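
For the 512-element loop from earlier, a single block of 512 threads already covers every iteration; a minimal sketch (names are placeholders):

__global__ void addmod512(int* X, int* Y, int* P, int* R)
{
    int i = threadIdx.x;                 // one thread per loop iteration
    R[i] = (X[i] + Y[i]) % P[i];
}

// Host side: one block of 512 threads
// addmod512<<<1, 512>>>(dX, dY, dP, dR);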

Peter