Which card to get?

Good evening!

I write computer vision software and algorithms for image analysis. All my software is Linux-based, running on 64-bit CentOS 5.4/5.5 with a middle-aged 2.6.18 kernel and some third-party hardware. Everything is deployed on a Dell T5500 workstation with single or dual Xeon processors, a minimum of 12 GB of RAM, and a PCIe bus. I want to move to CUDA, but I have a short list of requirements (which may grow later) and a few questions.

My main goal right now is to cut the time of certain math operations from ~400 microseconds to as fast as I can possibly get them. (When you do the same operation 360,000 times, 400 microseconds starts to add up.)
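For scale, here is the arithmetic behind "starts to add up". The 1-second total used on the right is only an illustrative target, not a figure from the original requirements:

```latex
360{,}000 \times 400\,\mu\mathrm{s} = 144{,}000{,}000\,\mu\mathrm{s} = 144\,\mathrm{s}
\qquad
\frac{1\,\mathrm{s}}{360{,}000} \approx 2.8\,\mu\mathrm{s}\ \text{per operation}
```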

Requirements:

  1. Everything must run in real time; this runs in a state machine, as close to an RTOS as a standard distro will get.
  2. Must have C++ support; this is critical (gcc-4.1).
  3. Java support would be nice, but is not critical (Java 1.6).
  4. Need the ability to do math operations in the CUDA environment, such as multiplying matrices as large as 4,920 x 3,264 using standard data types (char, double, float, int).
  5. Built-in image manipulation such as binarization, dilation, erosion, rotation, etc. would be nice to have.
  6. Advanced image processing such as edge detection and scaling would be really, really nice.
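Whatever ends up providing these operations on the GPU, a plain CPU reference is handy for checking its output against. The sketch below sticks to C++98 so it should build with gcc-4.1. It assumes an 8-bit, row-major grayscale image, and the names `binarize` and `morph3x3` are hypothetical helpers, not part of CUDA or any library. It covers only binarization and 3x3 erosion/dilation, not rotation, edge detection or scaling.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference routines (not from any library) for an 8-bit,
// row-major grayscale image. Output pixels are either 0 or 255.

// Binarize: pixels strictly above `threshold` become 255, all others 0.
std::vector<unsigned char> binarize(const std::vector<unsigned char>& in,
                                    unsigned char threshold)
{
    std::vector<unsigned char> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (in[i] > threshold) ? 255 : 0;
    return out;
}

// 3x3 erosion (erode == true) or dilation (erode == false) of a binary image.
// Assumes in.size() == width * height. Pixels outside the image are skipped,
// i.e. the 3x3 neighbourhood is clipped at the borders.
std::vector<unsigned char> morph3x3(const std::vector<unsigned char>& in,
                                    int width, int height, bool erode)
{
    std::vector<unsigned char> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            bool any_set = false;  // dilation: any neighbour set
            bool all_set = true;   // erosion: every neighbour set
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int nx = x + dx;
                    int ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                        continue;
                    bool set = in[ny * width + nx] != 0;
                    any_set = any_set || set;
                    all_set = all_set && set;
                }
            }
            bool result = erode ? all_set : any_set;
            out[y * width + x] = result ? 255 : 0;
        }
    }
    return out;
}
```

For example, thresholding a frame with `binarize` and then calling `morph3x3(..., true)` followed by `morph3x3(..., false)` gives a morphological opening, which removes isolated specks while keeping larger blobs.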

So, knowing the requirements, my questions are:

  1. Which NVIDIA CUDA card should I be looking at for the best results, and with how much memory?
  2. Which kernel version would be optimal for this environment?
  3. Which CUDA development environment would be most efficient?
  4. Which compilers/versions are required to make the thing work, and which compilers/versions would be best?
  5. Is there any other third-party debugging/profiling software, besides gdb, valgrind, Electric Fence and Purify, that I should look at?
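For question 1, a back-of-the-envelope estimate of the dense storage one multiply needs helps bound the card's memory. This sketch assumes a true matrix product C = A * B with A at 4,920 x 3,264 and, hypothetically, B at 3,264 x 4,920 (the second operand's shape isn't stated above); `gemm_bytes` is a made-up helper, not a CUDA or BLAS call.

```cpp
// Hypothetical helper: rough memory estimate for C = A * B, where A is m x k,
// B is k x n and C is m x n, all stored densely with `elem_size` bytes per
// element. It ignores any scratch space a BLAS library or the driver may need.
unsigned long long gemm_bytes(unsigned long long m, unsigned long long k,
                              unsigned long long n, unsigned long long elem_size)
{
    unsigned long long a = m * k;  // elements in A
    unsigned long long b = k * n;  // elements in B
    unsigned long long c = m * n;  // elements in C
    return (a + b + c) * elem_size;
}
```

With float elements, `gemm_bytes(4920, 3264, 4920, sizeof(float))` comes to 225,296,640 bytes (about 215 MiB); with double it is 450,593,280 bytes (about 430 MiB). A single 4,920 x 3,264 float matrix on its own is about 61 MiB. The card needs some headroom beyond these figures for the CUDA context, any library workspace, and keeping several frames resident at once.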

I’d like to get a card on order next week and get started, so any insights would be greatly appreciated!

-brian