Hi all. I’ve been lurking for a while and reading as much CUDA literature as I can get my hands on, but the learning curve is rather daunting, not because of the syntax but because of programming for parallel operations in general. Here’s my question:
I’m writing some code to handle really large integers. I can express a really large integer as an array of 32-bit unsigned integers. Let’s say I want to add A + B and put the result in C.
(Assume all memory has been declared, initialized, etc.)
The algorithm for adding each element is:
A[i] + B[i] = C[i], carry[i]
(i.e. if A[i] + B[i] is larger than a 32-bit element of C can hold, the overflow goes into carry[i])
C[i+1] = C[i+1] + carry[i]
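To make the steps above concrete, here is a minimal serial sketch of what I’m trying to parallelize. It is host-only code, it assumes the least significant element is stored at index 0 (which is why the carry moves to i+1), and the name add_bigint_serial is just something I made up for illustration.

```cuda
#include <cstdint>
#include <cstddef>

// Serial reference: C = A + B over n 32-bit elements,
// least significant element first (assumption).
// Returns the carry out of the top element (0 or 1).
// add_bigint_serial is an illustrative name, not a library function.
inline uint32_t add_bigint_serial(const uint32_t* A, const uint32_t* B,
                                  uint32_t* C, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        // Widen to 64 bits so the sum of two elements plus a carry cannot overflow.
        uint64_t sum = (uint64_t)A[i] + B[i] + carry;
        C[i]  = (uint32_t)sum;          // low 32 bits stay in this element
        carry = (uint32_t)(sum >> 32);  // high bit moves on to element i+1
    }
    return carry;
}
```

Note that in the serial version each element has to wait for the carry from the one before it, which is exactly the dependency I’m unsure how to handle across threads.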
I realize the magic that makes this run in parallel is letting i = blockIdx.x*blockDim.x + threadIdx.x.
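Here is a rough sketch of how I picture that index being used. It only covers the first step (the per-element sum and its carry flag) and deliberately does not touch element i+1. The names add_limbs and launch_add_limbs and the block size of 128 are my own assumptions, not anything from the SDK.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// First step only: each thread adds one pair of elements and records whether
// that addition overflowed. No thread reads or writes element i+1 here.
__global__ void add_limbs(const uint32_t* A, const uint32_t* B,
                          uint32_t* C, uint32_t* carry, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                        // the last block may have idle threads
        uint32_t sum = A[i] + B[i];     // unsigned add wraps modulo 2^32
        C[i]     = sum;
        carry[i] = (sum < A[i]) ? 1u : 0u;  // a wrapped sum means a carry out
    }
}

// Hypothetical host helper: dA, dB, dC, dCarry are device pointers
// to n elements each.
void launch_add_limbs(const uint32_t* dA, const uint32_t* dB,
                      uint32_t* dC, uint32_t* dCarry, int n)
{
    const int threads = 128;                          // assumed block size
    const int blocks  = (n + threads - 1) / threads;  // round up to cover n
    add_limbs<<<blocks, threads>>>(dA, dB, dC, dCarry, n);
}
```

Folding carry[i] into C[i+1] is left out on purpose, since that is the part my questions are about.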
My questions:
1. Is it OK for a thread, which works on its element “i”, to address “i+1”? I’m assuming I’ll have to sync threads before they do that (some carry, some won’t, so the time spent during each operation may be different).
2. Considering A and B won’t change (immutable), what memory type would be fastest? Each will be several thousand elements, so not every element can be processed at the same time.
3. I have an 8800 GT, which has 112 stream processors. How would I construct the kernel to run on all possible threads, considering the input is one-dimensional? Is a one-dimensional layout possible, or does it need to be two-dimensional?
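For the one-dimensional case, this is the kind of thing I have been experimenting with: a fixed-size 1D grid where each thread strides through the array, so the array can be longer than the number of threads launched. Treat it as a sketch under my own assumptions. The names add_limbs_strided and launch_strided are made up, the grid and block sizes are left as parameters because I don’t know what is best for my card, and it still only computes the per-element sum and carry flag.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// 1D kernel with a stride loop: thread t handles elements
// t, t + stride, t + 2*stride, ... so any n is covered by a fixed grid.
__global__ void add_limbs_strided(const uint32_t* A, const uint32_t* B,
                                  uint32_t* C, uint32_t* carry, int n)
{
    int stride = blockDim.x * gridDim.x;  // total threads in the 1D grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        uint32_t sum = A[i] + B[i];       // wraps modulo 2^32 on overflow
        C[i]     = sum;
        carry[i] = (sum < A[i]) ? 1u : 0u;
    }
}

// Hypothetical host helper; blocks and threads are tuning parameters
// (assumptions), and all pointers are device pointers to n elements.
void launch_strided(const uint32_t* dA, const uint32_t* dB,
                    uint32_t* dC, uint32_t* dCarry, int n,
                    int blocks, int threads)
{
    add_limbs_strided<<<blocks, threads>>>(dA, dB, dC, dCarry, n);
}
```

For example, launch_strided(dA, dB, dC, dCarry, n, 14, 128) would launch 14 blocks of 128 threads; both numbers are guesses on my part, not tuned values.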
Sorry for the multiple questions, but I think if I get these answered, it will help me to understand how CUDA parallelism works in general. Thanks in advance!