Will this work? From cpu to GTX280

If this is the wrong forum, please let me know.

We have just started working on this problem (new hardware is on order [gtx280/amd9950/gigabyte GA-MA790GP-DS4H]) and since the timeframe is tight we need to start designing before hardware is up and stable.

It is implemented already on a regular cpu but takes days and days running for larger datasets, even after a substantial amount of time optimizing it. I believe (hope? pray?) we can gain substantially (50x-100x?) by using the new hardware and cuda.

The problem goes like this: There are coordinates in 11-space (between about 100,000 to 2,000,000 sets of them depending…). Call this 2D coordinates array “c” and the number of coordinate sets in c “n”. There is an extreme number of simulation training runs (like hundreds of millions) on disk; call this 2D array “s.” Essentially, take a coordinate vector from s and transform it against each coordinate vector in c (the transform is similar to a modified distance formula). Then each resulting transform is compared against a standard and handled appropriately.

Our initial cuda approach is to store the complete array “c” into the 280. If some “c” is too big for 280 memory we will chunk it.

Now, run “n” number of threads so each thread produces one answer (transformed vector). Each thread receives, through function parameters, the 11 values from one “s” vector. Resulting “n” answers are copied back to the cpu. This paragraph is repeated with another coordinate vector from “s”.

In the meantime, another thread on a different cpu is doing the rest of the processing (which is not gpu-appropriate). We thought about switching the roles of “s” and “c” but aren’t convinced.

We know we will need to tweak block stuff. We thought about storing a chunk of “s” but didn’t understand a real advantage. Any comments on the overall approach? TIA.

let me see,

you will have a variable on the device d_c with all coordinate sets (n of them)
you can:

have kernel

calculate<<<ceil(n/NTHREADS), NTHREADS>>>(unsigned int n, data *d_C, s1 …, s11, d_output);

and call this kernel in a loop running over all your s’s

or you could have those s1 till s11 also be arrays (preferably NTHREADS big if global (and more importantly shared) memory allows)

Then you read in s1…s11 in shared memory in the beginning of the kernel (the first n_shared threads read in one element of each array)


if overall_index < n
have each thread read in it’s data from d_C into registers

#pragma unroll NTHREADS
for (s_index=0; s_index <N_THREADS ; s_index++)
calculate transformed vector (my_c, s1…s11[s_index],

end of your kernel.

The second version can give you some nice speedup if you can store more than a few values of s in shared memory, since you will only read in c once for all of those s’s.

Let’s say c takes 100 Mb. And let’s say you can store 16 values of s in shared memory. Than in the first case you read in total n100 Mb worth of values of c.
In the second case, you read in n
100/16 Mb worth of values of c.

So if you output as much as you read, you will have approximately:
2n100 Mb / 100Gb/s of runtime
n17/16100 Mb / 100 gb/s of runtime.

So you can get up to twice as fast in the second case. If your output is smaller than the c array, your gain can get bigger.

Denis, First, thanks for a quick reply.

We do realize we will have to get down and dirty with the hardware to figure out all the optimizations, but being new to simd and cuda we have to make sure we aren’t going down the wrong rat hole at the start.

Yes, i understand and like your 2nd idea. It will require a bit more code in the host to serve, and to manage the return results, but is not unmanageable. In most cases with the 1gb in the 280 we should have room for some fair number of test tuples (the “s” array). Unfortunately, each tuple in “c” will produce a result so that will probably be our size bottleneck.

Along those lines, we could also move some of the descrimination logic up to the gpu so after a run through the bunch of “s” we could __syncthreads and then have threadidx = 0 run through the results and copy the ones we’d like to keep (about half) to a results array and do a second amount of “s”. I’m just concerned about slow global memory access in a scheme like this (though it might be better than copying back to host and processing and would allow us to about double the number of “s” tuples processed before a copy becomes necessary). Do you think my global memory access concerns are unfounded for this scheme?

(If only i could squeeze a couple of c1060s in the budget.)

Thanks again!

C1060 is indeed very nice with its 4 Gb.

I also think you performance bottleneck will be the numberof_s * numberof_c write at the end, but that is data that always has to be written anyway. Ways of reducing this can be indeed doing some discrimination logic, but then you have to be sure you will have (e.g.) only maximum 50% of the values written. You cannot have one block write more than the number of values appointed to them.

In the end, you can already do some pre-calculation of performance boundaries:

  • see how much data you will be reading & writing. Together with the 140 Gb (let’s take 100 for your calcs) memory bandwidth it will tell you the minimum time your kernel will take.
  • see how much data you will have to transfer to & from the device. Together with your (mainboard dependent, so let’s say you will reach 4Gb/s) You can calculate how much time your transfer back and forth will take at least.

Combine these two and you will have an estimate of your upper boundary of performance. It all depends a bit if you can stay coalesced and such.

It looks by what you describe as not too complicated, if you don’t run into size issues. But still if it is your first project in CUDA, there is all this getting used to necessary also.

If you need want a 100x speedup, you’ll need to do this:

  1. Each thread computes more than one distance.

  2. The data (specifically, a small piece of c) will be held in shared mem

  3. Yes, work with more than one s at one time. You will need this to accomplish goal #2 effectively.

The general approach you’ll use is “blocking,” similar to the one used for other linear algebra routines. Consider this:

If you load 10 elements (meaning, 110 ints) of c into shared memory and only 1 element of s, you can perform 10 operations before loading the next 10 elements. If you hold 5 elements of s (also in smem), you can perform 50 operations before loading the next 10 elements of c. Memory bandwidth requirements, as you can see, fall 5 fold.

You’ll load elements into smem and perform computations cooperatively among the threads in a block. This is because the number of computations that can be done before reloading smem increases geometrically with smem’s size, while the speed with which you can solve them increases linearly with number of threads. Hence, you want threads to all share smem together.

Btw, a similar approach is applicable if c can’t all fit into GPU global memory. (Perhaps thinking about this case will help you understand the above one.)

However, this might all be for naught if PCIe bus bandwidth is the bottleneck, as Riedijk suggested. (Honestly, he probably suggested everything else too, but I didn’t study his post carefully enough to understand :">) You need to perform ~30 calculations per 4-byte result (if I understand your 11D distance calculation correctly). That’s a bit low, by an order of magnitude. (6GB/s PCIe2 bandwidth => 1.5 bn results/s => 45 GFLOPs, while with correct blocking your code i think can be as efficient as a matrix-multiply, ie ~500 GFLOPs.) Oh wellz I guess. Should still be killer performance, and probably more than your CPU can even handle.

This forum is totally excellent! I can’t wait to begin some of the dirty work. Makes me want to stop at the local CompUSA on the way to work to pick up an 80somethinghundred to begin playing with. Oh wait, CompUSA went out of business. I’ll find something somewhere.

Denis, thanks for your insight, and you’re right when you insinuated that getting my head around cuda might be the real bottleneck! :wacko: Good thing there’s another person assigned to help me.

Alex, Thanks for weighing in. From reading many posts across the board i have become “afraid” of using global memory, unless absolutely necessary. I think that, as Denis pointed out, I might be limited by bandwith in having to copy results back to the cpu. Thanks for doing the calculations and i will be happy if the cpu is overwhelmed. I also like your idea of having a kernel do more than a single set of calcs, and i will give that a try and see what happens when i double, triple, quad, etc. up on the calculations in a thread.

I’ll post back in a week or two to let y’all know how things are going.

Thanks again,