I am trying to decide whether the Tesla C870 is for me.
My applications do not use floating-point operations, but rather integer and logic operations, including shifts, as well as many loads and stores. Will the cores on the card be able to execute such programs efficiently if these are the characteristics of the individual threads?
Can C++ programs be compiled efficiently for execution on the C870?
Do you have tutorials about porting single-threaded applications
written in C++ to execute efficiently on the C870?
About the only integer-type operations that are slow in CUDA are multiply (which can be fast if you don’t multiply numbers larger than 24 bits), divide, and modulus. Loads and stores are very efficient if you can coalesce them and/or have spatial locality in your reads and use textures.
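For example, here is a minimal sketch (the function and names are just illustrative, not from this thread) of the fast 24-bit multiply: the __mul24 intrinsic multiplies the low 24 bits of its operands, which is cheaper than a full 32-bit multiply on G80-class hardware like the C870.

    __device__ int scale(int x, int factor)
    {
        // __mul24 multiplies the low 24 bits of its operands; the result is
        // only correct when both operands fit in 24 bits.
        return __mul24(x, factor);
    }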
You can use C++ in the host program as much as you like. The actual CUDA kernels must be written in C.
In order to be efficient on the GPU, you must be able to decompose your problem into 10,000+ threads, each of which works independently of all the others.
What do you mean by “CUDA kernels”? Are these the threads that will be executed on the C870?
Also, if there are 128 cores and 1.5 GB of memory on the C870, then each core will have about 10-12 MB, is that correct? Or could the cores somehow use some of the memory on the computer’s motherboard?
How would one orchestrate the communication between the threads when they are executed on the Tesla C870?
And how could one manage the memory space on the C870, such that some of it is used for data shared by all the threads, and some is allocated per thread for the temporary data computed by that thread?
If the 1.5 GB on the C870 should be used up, could the excess data (computed by the threads on the C870) be spilled onto the memory on the main motherboard?
Please read the introductory chapters in the CUDA programming guide. They explain everything about the threading model and the memory model. If you have specific questions about things mentioned in the guide, we’ll be happy to answer them.
But, just to answer a few of your concerns here:
The threading model is very different from the CPU’s. “Memory per thread” doesn’t mean much, because in CUDA kernels you break the problem down so far that each thread only deals with a few bytes.
A “kernel” is one function. When you execute it on the GPU, the same function is executed in 10,000+ threads (however many you ask for), with each thread having a separate index. So each thread performs the same operations, just on different data, based on that index.
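As a rough sketch (a made-up kernel, not taken from the programming guide), that looks like this: each thread computes its own index from its block and thread IDs and uses it to pick the element it works on.

    __global__ void add_one(int *data, int n)
    {
        // Each thread derives a unique global index from its block and
        // thread IDs, then performs the same operation on its own element.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1;
    }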
There is no need to orchestrate communication between threads yourself, and from the host side a kernel call looks just like a function call.
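For instance, launching the hypothetical kernel above from the host is an ordinary-looking call plus the launch configuration (this sketch omits the cudaMalloc/cudaMemcpy calls that would put d_data on the card, and all error checking):

    int n = 1 << 20;                                   // e.g. one million elements
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    add_one<<<blocks, threadsPerBlock>>>(d_data, n);   // d_data is device memory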