Array handling between threads: question about addressing an array

Hi all. I’ve been lurking around quite a bit and reading as much literature on CUDA as I can get my hands on, but the learning curve is rather daunting, not because of the syntax, but because of programming for parallel operations in general. Here’s my question:

I’m writing some code to handle really large integers. I can express a really large integer as an array of 32-bit unsigned integers. Let’s say I want to add A + B and put the result in C.

(Assume all memory has been declared, initialized, etc.)

The algorithm for adding each element is:

A[i] + B[i] = C[i], carry[i]

(i.e. if A[i] + B[i] is larger than C[i] can hold, carry[i] is set)

C[i+1] = C[i+1] + carry[i]
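
For reference, the sequential version I have in mind looks roughly like this on the CPU (just a sketch; bigN stands for the number of 32-bit words):

    // sequential reference version; bigN is the number of 32-bit words
    unsigned int carry = 0;
    for (int i = 0; i < bigN; i++) {
        unsigned long long t = (unsigned long long)A[i] + B[i] + carry;  // 64-bit temp so the carry-out is visible
        C[i] = (unsigned int)t;           // low 32 bits are the result word
        carry = (unsigned int)(t >> 32);  // bit 32 is the carry into the next word
    }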

I realize the magic that makes this run in parallel is letting i = blockIdx.x*blockDim.x + threadIdx.x

The first question:
Is it OK for a thread, which works on its element “i”, to address “i+1”? I’m assuming I’ll have to sync threads before they do that (some will carry, some won’t, so the time spent on each operation may differ).

Second question:
Considering A and B won’t change (immutable), what memory type would be fastest? Each will be several thousand elements, so it won’t all be able to run at once.

Third question:
I have an 8800 GT, which has 112 stream processors. How would I construct the kernel to run on as many threads as possible, given that the input is one-dimensional? (Or is that even possible; does it need to be two-dimensional?)

Sorry for the multiple questions, but I think if I get these answered, it will help me to understand how CUDA parallelism works in general. Thanks in advance!

Kale

There won’t be an addressing problem, but I think you might run into a few issues:

  1. __syncthreads() only synchronizes threads within a block, not all threads running on the GPU. Since you will have thousands of elements in your big integer, you will want to spread out the computation over multiple blocks.

  2. There can be up to N-1 carry steps required, right? I’m imagining 9999 + 1 (in decimal) where it will take 3 additional carry steps to bubble the carry bit through all the 9s.

With that in mind, I think the easiest approach would be to have a single add kernel:

__global__ void add(unsigned int *operandA, unsigned int *operandB, unsigned int *sum, unsigned int *carry)

which does the operation:

unsigned int tempA, tempB, tempsum;

tempA = operandA[i];
tempB = operandB[i];
tempsum = tempA + tempB;                   // unsigned add wraps around modulo 2^32 on overflow
sum[i] = tempsum;
carry[i+1] = tempsum < min(tempA, tempB);  // a wrapped sum is smaller than either operand, so this flags a carry

(I think that’s how you test for carry, right? There’s no direct carry-bit test I can find in the PTX documentation.) The index i is computed the way you described.

Then from the host, you can do the following (not using CUDA syntax here, just sketching out the idea):

add(A, B, C, carry)

for (int j=0; j < n - 1; j++)

    add(carry, C, C, carry)

That should perform the addition. I imagine this could be made faster with some more thinking.

Regarding the indexing question, there’s nothing wrong with a one-dimensional block and grid. That’s the configuration most people use, I would bet.
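
For example, something along these lines (just a sketch; the block size of 256 is an arbitrary choice, and n is the number of words):

    int blockSize = 256;                               // arbitrary; 256 threads per block is a common choice
    int numBlocks = (n + blockSize - 1) / blockSize;   // round up so every element gets a thread
    add<<<numBlocks, blockSize>>>(A, B, C, carry);

If n isn’t a multiple of the block size, you’d want an if (i < n) guard at the top of the kernel so the extra threads don’t write out of bounds. The hardware schedules the blocks across the multiprocessors for you, so you don’t need to map the launch onto the 112 stream processors by hand.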

Edit: Nuts! There’s a race condition here. You’ll need to write the carry bit back to the same index i, or you could step on the next thread over, which could be in a different block. More thinking required here…

I was thinking of making another array for the carry part, and yeah, I left out the code for the carry portion; I would implement it something like the way you listed. What I was thinking is to perhaps have each thread load the carry from the carry array into a local (register) variable:

  • A[i] + B[i] = C[i], carry[i]
  • temp_carry = carry[i-1]
  • carry[i-1] = 0
  • C[i] += temp_carry
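
In kernel code, I picture it roughly like this (just a sketch of the idea):

    // rough per-thread sketch of the idea above (ignoring word 0, which has no incoming carry)
    unsigned int temp_carry = carry[i - 1];  // grab the carry produced by the word below
    carry[i - 1] = 0;                        // clear it once it has been consumed
    C[i] += temp_carry;                      // fold the carry into this word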

And, since I let i = blockIdx.x*blockDim.x + threadIdx.x
wouldn’t addressing still work across different blocks?

Thanks seibert, I’m getting this slowly figured out!

The problem with this approach is that carry[i-1] is touched by two different threads: thread i-1 produces it, and thread i reads it and then overwrites it with zero. If those accesses happen in the wrong order (for example, if thread i-1 is in a different block, and therefore on another multiprocessor that is not necessarily in lockstep with the multiprocessor running thread i), the carry bit can be zeroed or read at the wrong time and get lost.

I think the solution here is to have two carry arrays. One to read from, and one to write out to. Then on each iteration through the carry propagation loop, swap the pointers, so the output carry from the previous iteration becomes the input carry to the next.
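
Host side, that would look roughly like this (just a sketch; carryA and carryB are placeholder names for the two carry buffers, each n+1 words and zeroed before the first launch, and n is assumed to be a multiple of the block size):

    unsigned int *carryA, *carryB;   // two carry buffers: read from one, write to the other
    // (allocate with cudaMalloc and zero with cudaMemset before the first launch)

    add<<<n / 256, 256>>>(A, B, C, carryA);            // first add: carries land in carryA

    for (int j = 0; j < n - 1; j++) {
        add<<<n / 256, 256>>>(carryA, C, C, carryB);   // read carries from carryA, write new ones to carryB
        unsigned int *tmp = carryA;                    // swap so this pass's output feeds the next pass
        carryA = carryB;
        carryB = tmp;
    }

This is the brute-force version; as I said above, it could surely be made faster, since most of the propagation passes will do nothing once the carries have died out.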