CUDA + user scripting (e.g. Lua)

You cannot (currently, at least) get the same performance out of OpenCL on NVIDIA hardware as you can using methods based on the Driver API. If your Driver API code is only getting the same performance as OpenCL on NVIDIA hardware, then you do not have a good implementation.

I am (still) slightly vague on your requirements: do you want the users of your system to be able to specify both the source for the GPU kernels and how they are configured together, or just how they are configured together? If the latter, then you can use nvcc to convert .cu source code to .ptx files (on your system) and just distribute the .ptx files; no nvcc is needed on their systems.
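For what it's worth, here is a minimal, untested sketch of that deployment model using the CUDA Driver API (the scale.ptx/scale names are hypothetical, and cuLaunchKernel is the CUDA 4.0+ entry point; older toolkits used cuParamSet*/cuLaunchGrid instead):

// developer machine only:  nvcc -ptx -o scale.ptx scale.cu
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // load the pre-built PTX; the driver JIT-compiles it for whatever GPU is installed
    CUmodule mod;
    cuModuleLoad(&mod, "scale.ptx");
    // declare the kernel extern "C" in the .cu file so the name is not mangled
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "scale");

    int n = 256;
    CUdeviceptr d_data;
    cuMemAlloc(&d_data, n * sizeof(float));

    void* args[] = { &d_data, &n };
    cuLaunchKernel(fn, 1, 1, 1, n, 1, 1, 0, 0, args, 0);  // grid dims, block dims, smem, stream
    cuCtxSynchronize();

    cuMemFree(d_data);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}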

Thanks to everyone, this whole discussion has been a real eye opener. I'm learning new acronyms and jargon by the minute. Thanks to profquail and kappa for their ideas/products. I am potentially willing to pay if something can do what I want more easily and faster than the free alternatives. As far as I can see now, I have four options open to me:

1: Driver API w/ CUDA

2: OpenCL w/ JIT

3: Kappa / psilambda.com

4: GPU.NET

I would like to filter these down if possible, based on execution speed, ease of use / code maintenance, compiler size (less than a 50 megabyte distributable if possible!), and maybe even compilation time if there's a fixed overhead of more than half a second. Fermi support would be very useful too, as double precision is important for my app.

And the JIT compilation of OpenCL code wouldn't have any speed penalty compared to normal compilation? That would be good.
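From what I understand of that path (a rough, untested sketch; error checking omitted, and the kernel string is a placeholder), the source is handed to the driver as plain text and compiled at runtime:

#include <CL/cl.h>

// kernel source arrives as a string at runtime - this is the JIT part
const char* src =
    "__kernel void scale(__global float* d, float f) {"
    "    d[get_global_id(0)] *= f;"
    "}";

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, 0);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, 0);
    cl_context ctx = clCreateContext(0, 1, &device, 0, 0, 0);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, 0);

    // the driver's built-in compiler builds the program for the installed GPU
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, 0, 0);
    clBuildProgram(prog, 1, &device, 0, 0, 0);
    cl_kernel kernel = clCreateKernel(prog, "scale", 0);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 256 * sizeof(float), 0, 0);
    float factor = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);

    size_t global = 256;
    clEnqueueNDRangeKernel(queue, kernel, 1, 0, &global, 0, 0, 0, 0);
    clFinish(queue);

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}

If I understand correctly, the compile cost is paid once per run (or once ever, if you cache the binary via clGetProgramInfo with CL_PROGRAM_BINARIES and reload it with clCreateProgramWithBinary), so it is a startup cost rather than a kernel execution cost.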

If anyone could chime in on this issue, that would be appreciated.

This seems to be the most common opinion, though I have sometimes heard strong contrary opinions too.

Just the kernel source, in fact (though the program has to work on data from outside through the kernel's parameters, of course). Sorry I didn't answer sooner - I thought you were mainly speaking to seibert before.

When the OpenCL drivers came out there were lots of complaints about speed. Since then, we've seen a few driver updates, AMD has shipped their Stream SDK, and everyone has updated to the OpenCL 1.1 standard. Previously I had taken a wait-and-see approach to OpenCL, so I'm only just now starting to port some of my CUDA test pieces to OpenCL. I don't have any comparisons yet, though. I would believe that the need to compile in the driver might limit the complexity of the compiler and result in slower GPU machine code. Additionally, the abstraction of the OpenCL model might also carry an extra performance penalty. That said, I'd take a 20% performance hit if it simplified deployment.

I did the simple experiment of moving gcc out of the way and then directly invoking nvcc. It confirmed what is stated (somewhere) in the NVIDIA documentation: nvcc uses the host compiler (gcc in this case) for preprocessing.

If you give the ‘-dryrun’ argument to nvcc, it will show you what it is doing. If you do not need the preprocessing, you could probably figure out how to run without the host compiler:

nvcc -dryrun -I. -I/usr/local/cuda/include -O3 -o matrixMul_kernel.ptx -ptx matrixMul_kernel.cu
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=64
#$ TOP=/usr/local/cuda/bin/..
#$ INCLUDES="-I/usr/local/cuda/bin/../include" "-I/usr/local/cuda/bin/../include/cudart"
#$ LIBRARIES=  "-L/usr/local/cuda/bin/../lib64" -lcudart
#$ CUDAFE_FLAGS=
#$ OPENCC_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -E -x c++ "-I/usr/local/cuda/bin/../include" "-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDACC__ -C  -O3 -I"." -I"/usr/local/cuda/include" -include "cuda_runtime.h" -m64 -o "/tmp/tmpxft_00001dca_00000000-4_matrixMul_kernel.cpp4.ii" "matrixMul_kernel.cu"
#$ cudafe++ --m64 --gnu_version=40404 --parse_templates  --gen_c_file_name "/tmp/tmpxft_00001dca_00000000-1_matrixMul_kernel.cudafe1.cpp" --stub_file_name "/tmp/tmpxft_00001dca_00000000-1_matrixMul_kernel.cudafe1.stub.c" "/tmp/tmpxft_00001dca_00000000-4_matrixMul_kernel.cpp4.ii"
#$ gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS   -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS  "-I/usr/local/cuda/bin/../include" "-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDACC__ -C  -O3 -I"." -I"/usr/local/cuda/include" -include "cuda_runtime.h" -m64 -o "/tmp/tmpxft_00001dca_00000000-6_matrixMul_kernel.cpp1.ii" "matrixMul_kernel.cu"
#$ cudafe --m64 --gnu_version=40404 -tused --no_remove_unneeded_entities  --gen_c_file_name "/tmp/tmpxft_00001dca_00000000-1_matrixMul_kernel.cudafe1.c" --stub_file_name "/tmp/tmpxft_00001dca_00000000-1_matrixMul_kernel.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00001dca_00000000-1_matrixMul_kernel.cudafe1.gpu" --include_file_name "/tmp/tmpxft_00001dca_00000000-3_matrixMul_kernel.fatbin.c" "/tmp/tmpxft_00001dca_00000000-6_matrixMul_kernel.cpp1.ii"
#$ gcc -D__CUDA_ARCH__=100 -E -x c -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS   -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS  "-I/usr/local/cuda/bin/../include" "-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDACC__ -C  -O3 -D__CUDA_FTZ -I"." -I"/usr/local/cuda/include" -m64 -o "/tmp/tmpxft_00001dca_00000000-7_matrixMul_kernel.cpp2.i" "/tmp/tmpxft_00001dca_00000000-1_matrixMul_kernel.cudafe1.gpu"
#$ cudafe --m64 --gnu_version=40404 --c  --gen_c_file_name "/tmp/tmpxft_00001dca_00000000-8_matrixMul_kernel.cudafe2.c" --stub_file_name "/tmp/tmpxft_00001dca_00000000-8_matrixMul_kernel.cudafe2.stub.c" --gen_device_file_name "/tmp/tmpxft_00001dca_00000000-8_matrixMul_kernel.cudafe2.gpu" --include_file_name "/tmp/tmpxft_00001dca_00000000-3_matrixMul_kernel.fatbin.c" "/tmp/tmpxft_00001dca_00000000-7_matrixMul_kernel.cpp2.i"
#$ gcc -D__CUDA_ARCH__=100 -E -x c -DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_13_DOUBLE_INTRINSICS   -DCUDA_FLOAT_MATH_FUNCTIONS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS  "-I/usr/local/cuda/bin/../include" "-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDABE__  -O3 -D__CUDA_FTZ -I"." -I"/usr/local/cuda/include" -m64 -o "/tmp/tmpxft_00001dca_00000000-9_matrixMul_kernel.cpp3.i" "/tmp/tmpxft_00001dca_00000000-8_matrixMul_kernel.cudafe2.gpu"
#$ nvopencc  -TARG:compute_10 -m64 -CG:ftz=1 -CG:prec_div=0 -CG:prec_sqrt=0  "/tmp/tmpxft_00001dca_00000000-5_matrixMul_kernel" "/tmp/tmpxft_00001dca_00000000-9_matrixMul_kernel.cpp3.i"  -o "matrixMul_kernel.ptx"

The current version of the OpenCL API does not allow use of CUDA streams; it only has synchronization events. The (performance) problem with the OpenCL/synchronization approach is that it does not allow independent, parallel execution of memory copies and concurrent kernels without frequently stopping execution on synchronization events. With CUDA streams, the memory copies and kernel executions within a stream execute in the right order without stopping for synchronization events. This means that memory copies and kernel executions in different streams can be (and are) concurrent and overlapping. In other words, you are not bottlenecked on synchronization events but on the natural limits of transfer bandwidth and kernel GPU execution speed.
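To make that concrete, a rough sketch using the CUDA runtime API (untested; the process kernel and chunk sizes are placeholders, and the host buffer must be pinned for the copies to actually overlap):

#include <cuda_runtime.h>

__global__ void process(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, CHUNK = N / 4;
    float *h, *d;
    cudaMallocHost((void**)&h, N * sizeof(float));  // pinned host memory: required for copy/kernel overlap
    cudaMalloc((void**)&d, N * sizeof(float));

    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i) cudaStreamCreate(&streams[i]);

    // within a stream, copy -> kernel -> copy run in order with no host sync;
    // across streams, the hardware is free to overlap transfers and kernels
    for (int i = 0; i < 4; ++i) {
        float* hp = h + i * CHUNK;
        float* dp = d + i * CHUNK;
        cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        process<<<(CHUNK + 255) / 256, 256, 0, streams[i]>>>(dp, CHUNK);
        cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();  // one synchronization at the very end, not per chunk
                              // (cudaThreadSynchronize in older toolkits)

    for (int i = 0; i < 4; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}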

A point to note is that many kernels do not use the full resources of a GPU and so can benefit from concurrent kernel execution on Fermi GPUs. Even kernels that do fully utilize the GPU at the height of their execution usually have stages where they are distributing or gathering computation across the CUDA grid and/or executing reduction stages. At those stages they are still running but not fully utilizing GPU resources, and so they too can benefit from concurrent kernel execution.

The great new feature of Fermi GPUs (according to NVIDIA prior to the release of CUDA 3.0) was concurrent kernel execution. The Kappa framework defaults to concurrent kernel execution (on Fermi GPUs) and defaults to overlapping memory copies if stream IDs are assigned. I have yet to hear of anyone besides the Kappa framework stating that they support concurrent kernel execution (on Fermi GPUs) without synchronization barriers, and I have also not heard anyone state that they default to using concurrent kernel execution (again, on Fermi). Note that I explicitly include NVIDIA in the previous statements. The reason that Kappa has this and others don't/can't is that the Kappa framework has a data flow (producer/consumer) scheduler to ensure proper execution and data transfer without using synchronization that stops execution flow.

I honestly do not view other approaches as truly embracing parallel execution; it seems to me that they are just implementing spots of parallel execution with the main paradigm and control structure still being serial. At some point, if you are embedding this in a (currently) serial program, you do have to switch back to the serial paradigm (issue synchronization to stop execution), but why take the performance hit any sooner than you have to?

Okay, if no one objects then, I'll be testing out all four approaches. This is going to be a tough job, but I'll break it down by writing a simple program for the CPU first of all, then porting that to normal non-RTCG GPU code with each approach, and then finally tackling my goal of writing RTCG code with each approach. Maybe I'll test out two kernels too - one concentrating on floating-point math and the other not. Hopefully this plan is good.

To make my task that bit easier, if anyone could provide some links (hopefully containing sample code) for each of the four approaches, I would be grateful. I would like to get on with this daunting comparison by myself as much as possible, so I’ll be reading up on GPGPU programming generally, because I bet it’s far from trivial even for normal non-RTCG code.

Do all four approaches support use of doubles/Fermi? I only have an NVIDIA GeForce 9500 GT GPU currently, so I would like to buy a quiet Fermi-class GPU as well to see how each approach fares with it (maybe the GTX 580 is quietest?).

CUDA doubles will work for sure with GTX 200 and Fermi GPUs. I believe that double precision is an extension in OpenCL that NVIDIA has supported in its drivers for a year. I have no experience with Kappa or GPU.NET, but if they are based on CUDA, then I would hope doubles also work on capable hardware.
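To be concrete (a hedged sketch): with nvcc you have to compile for an architecture that actually has double units, and in OpenCL the kernel source has to enable the extension with '#pragma OPENCL EXTENSION cl_khr_fp64 : enable'. On the CUDA side, for example:

// nvcc -arch=sm_13 axpy.cu   (GTX 200 class)
// nvcc -arch=sm_20 axpy.cu   (Fermi)
// without at least -arch=sm_13, nvcc demotes doubles to floats (with a warning)
__global__ void axpy(double a, const double* x, double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}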

Kappa fully supports all of the features of the CUDA 3.2 API, including doubles and Fermi. Look at the Kappa Quick Start Guide (online on the website) for whichever platform you will be working on for examples. The User Guide includes some more in-depth usage and examples. The Quick Start Guide shows calling CUDA and CPU kernels. It also shows displaying the compiled attributes of a kernel (register usage, etc.; note that these are more precise for the particular GPU than the output from nvcc is). There are also some .Net examples in the KappaCUDANet distribution file if you want to work with .Net. I do not include many CUDA kernel examples (matrix multiply and a few odds and ends) since the CUDA SDK and books are a better source for CUDA kernel programming than I have yet managed to gather and organize.
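Kappa exposes those attributes through its own API; under the CUDA Driver API the equivalent per-GPU query looks roughly like this (module/kernel names hypothetical, error checking omitted):

#include <cuda.h>
#include <cstdio>

// assumes an initialized context and a loaded module, as in the PTX example earlier
void print_kernel_attributes(CUmodule mod, const char* name) {
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, name);

    int regs, smem, lmem, maxThreads;
    cuFuncGetAttribute(&regs,       CU_FUNC_ATTRIBUTE_NUM_REGS,              fn);
    cuFuncGetAttribute(&smem,       CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES,     fn);
    cuFuncGetAttribute(&lmem,       CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,      fn);
    cuFuncGetAttribute(&maxThreads, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, fn);

    printf("%s: %d registers, %d bytes smem, %d bytes lmem, max %d threads/block\n",
           name, regs, smem, lmem, maxThreads);
}

These numbers come from the driver for the specific installed device, which is why they can be more precise than nvcc's compile-time report.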

As long as you are running on a system with an NVIDIA CUDA GPU, Kappa supports using only CPU kernels if you want. Based on the approach you stated, you might use the TestModule shared library example from the Kappa Quick Start Guide as a template to get started. The TestModule is just a (CPU/C++) shared library (DLL) with CMake project files (I settled on CMake to support easy cross-platform projects). Make a copy of that project and start modifying it to create an initial shared library for testing. Then, at least as far as Kappa is concerned, once you've ported it to CUDA, just make slight changes to module loading and kernel calling and you've switched from CPU to GPU (assuming that your argument signatures are the same).
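For illustration only (a hypothetical sketch, not the actual TestModule code), the CPU side of that pattern can be as simple as an exported C function in the shared library; the CUDA port would keep the same argument signature so that only the module loading and kernel-call code changes:

// kernels.cpp - compiled into a shared library (.so/DLL)
extern "C" void scale_cpu(float* data, int n, float factor) {
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}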

At least with Kappa, you will definitely prefer a Fermi-class GPU. You could do your initial development with the GPU you have, but you will want to do your testing with a Fermi GPU (Kappa will automatically take advantage of the Fermi without you changing your code). Everything I have seen about the GTX 580 would indicate it is what you want to buy for development and testing in terms of power consumption, noise, and performance. If you are worried enough about data integrity to have a server-class host with ECC memory, you may want a C2050 or C2070 with their matching ECC memory. Also, if you are mainly double-precision or bandwidth limited, then, again, the C2050 or C2070 with their higher double throughput and bandwidth are worth considering.

I do not see for sure what platform you are on (Macintosh/Windows/Linux); here is the link for the Kappa Quick Start Guide for Windows, and you can browse to the other platform links if you need. The Quick Start Guides have sections on getting the examples (and getting everything installed in the first place).
