Is CUDA appropriate for this application? Trig on small-medium sized array(s) of 3D points.

Hey guys,

I think this problem is quite suited for acceleration with CUDA, but I’d like to get the opinion of a CUDA programmer before I set about learning how to do it. I’m using a 3D lattice-based model divided into ‘cells’, each consisting of ~100-1000 points (x, y, z). To start with, for any one cell I want to convert the 3D Cartesian points to spherical polar coordinates, double an angle, convert back to Cartesian coordinates, and sum these points as vectors; then I do the same thing with the resulting single point (except halving the angle this time). This involves a few trigonometric functions, which is why I’d like to offload the computation to a GPU. Single-precision floats are fine. I’d want to do this for many cells (up to tens of thousands), whose data may well exceed the device’s total memory, and at every time step of the model, of which there would be thousands at the very least. I should note that this problem is a small part of the model, but it would take a disproportionate amount of the computation.
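To make the per-point operation concrete, here’s roughly what I mean in plain C (purely illustrative; which angle actually gets doubled doesn’t matter for the question):

#include <math.h>

typedef struct { float x, y, z; } Point3;

/* Cartesian -> spherical polars, double the azimuthal angle, back to
   Cartesian. Illustrative sketch only. */
static Point3 double_angle(Point3 p)
{
    float r     = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
    float theta = (r > 0.0f) ? acosf(p.z / r) : 0.0f;  /* polar angle     */
    float phi   = atan2f(p.y, p.x);                    /* azimuthal angle */

    phi *= 2.0f;  /* the "double an angle" step */

    Point3 q;
    q.x = r * sinf(theta) * cosf(phi);
    q.y = r * sinf(theta) * sinf(phi);
    q.z = r * cosf(theta);
    return q;
}

The per-cell step is then just summing double_angle(p) over the cell’s points and applying the same kind of transform (with the angle halved) to that sum.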

Any comments on the worth of programming this in CUDA are greatly appreciated, and any comments on how to best go about doing it are even more so! Thanks in advance.

Dan.

This sounds ideal for CUDA… much more so than most problems!
It’s computationally intensive with local data. It uses transcendentals, which are much more efficient on the GPU than the CPU.
Your data is massive, but it’s chunked into thousands of cells, and those cells are small enough to fit into GPU shared memory.
Wow… absolutely ideal for CUDA!

The initial strategy would be pretty simple… just have one block handle one cell’s points. Loop over the points (giving each thread 3-10 points). Do a parallel reduction in shared memory to sum the results, then a quick one-thread transform of that result, and write it out.
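Something like this, as an untested sketch. I’ve assumed the cells’ points are packed contiguously in one big array with a cellStart[] index array (one extra entry at the end), and since each point is only read once I just accumulate in registers and use shared memory for the reduction rather than staging the raw points:

#include <cuda_runtime.h>
#include <math.h>

#define THREADS 128   // threads per block; must be a power of two here

// Per-point transform (same math as the sketch in the question).
__device__ float3 doubleAngle(float3 p)
{
    float r     = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
    float theta = (r > 0.0f) ? acosf(p.z / r) : 0.0f;
    float phi   = 2.0f * atan2f(p.y, p.x);
    return make_float3(r * sinf(theta) * cosf(phi),
                       r * sinf(theta) * sinf(phi),
                       r * cosf(theta));
}

// One block per cell: transform and sum the cell's points, then one thread
// applies the final "halve the angle" transform and writes the result.
__global__ void cellKernel(const float3 *points, const int *cellStart,
                           float3 *result)
{
    __shared__ float3 partial[THREADS];

    int begin = cellStart[blockIdx.x];
    int end   = cellStart[blockIdx.x + 1];

    // Each thread transforms and accumulates a strided subset of the cell.
    float3 sum = make_float3(0.0f, 0.0f, 0.0f);
    for (int i = begin + threadIdx.x; i < end; i += THREADS) {
        float3 q = doubleAngle(points[i]);
        sum.x += q.x;  sum.y += q.y;  sum.z += q.z;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory reduction down to partial[0].
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            partial[threadIdx.x].x += partial[threadIdx.x + s].x;
            partial[threadIdx.x].y += partial[threadIdx.x + s].y;
            partial[threadIdx.x].z += partial[threadIdx.x + s].z;
        }
        __syncthreads();
    }

    // Final one-thread transform of the summed vector, with the angle halved.
    if (threadIdx.x == 0) {
        float3 v    = partial[0];
        float r     = sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
        float theta = (r > 0.0f) ? acosf(v.z / r) : 0.0f;
        float phi   = 0.5f * atan2f(v.y, v.x);
        result[blockIdx.x] = make_float3(r * sinf(theta) * cosf(phi),
                                         r * sinf(theta) * sinf(phi),
                                         r * cosf(theta));
    }
}

Launch it with one block per cell, e.g. cellKernel<<<numCells, THREADS>>>(d_points, d_cellStart, d_result).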

The complication would come if you really do have more than a gigabyte of data. In that case you could use multiple kernel launches, or perhaps just use zero-copy memory, which is a fancy way of leaving the data on the host. Since you read each point only once, this may work pretty well.
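Here’s roughly what the zero-copy route looks like on the host side (assuming the cellKernel and THREADS from the sketch above; you’d want to check the canMapHostMemory device property first, since not every card supports mapped host memory):

// Rough host-side sketch of the zero-copy route.
void runWithZeroCopy(int numPoints, int numCells,
                     const int *d_cellStart, float3 *d_result)
{
    float3 *h_points, *d_points;
    size_t  bytes = (size_t)numPoints * sizeof(float3);

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede other CUDA calls
    cudaHostAlloc((void **)&h_points, bytes, cudaHostAllocMapped);

    /* ... fill h_points on the CPU ... */

    // Get a device-side alias for the same host allocation.
    cudaHostGetDevicePointer((void **)&d_points, h_points, 0);

    // The kernel streams the points straight from host memory over PCIe.
    cellKernel<<<numCells, THREADS>>>(d_points, d_cellStart, d_result);
    cudaThreadSynchronize();

    cudaFreeHost(h_points);
}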

What about using a Quadro FX 5800 or a Tesla C1060 for 4GB of memory?

Thanks guys. I doubt I’ll be buying a newer GPU as my PC came with two Quadro NVS 290 cards, and if I need to perform the operation on more cells than they can cope with in one kernel, the penalty isn’t going to be a major concern.

SPWorley: Good to have my suspicions confirmed, and thanks for the pointers. I’ve got a fair bit of reading to do in order to get this up and running, but it should be worthwhile.