New to Tesla/CUDA: just a few questions.

Hi,

I’m an experienced C/C++ programmer working mostly with HPC systems. We are very interested in the Tesla/CUDA environment. A few questions:

Our application is a 64-bit app running on XP64 using Trolltech’s Qt. It’d be nice if we could offload the images we are processing to the C870. Can someone point me to a document aimed at people transitioning from the desktop world to the CUDA programming environment? Basically, I want to know how my thinking or software design will be different with CUDA & Tesla. Ideally I want to keep our software as C++/OO as possible; I’m aware that there’s no C++ compiler for this platform yet.

Thanks, potentially a very cool product!

The CUDA programming guide is a good place to start. You can get it here.

http://developer.nvidia.com/object/cuda.html

Then download the SDK. The sample code is a great learning resource.

Also go through the UIUC course on parallel programming, with a special focus on CUDA.

http://courses.ece.uiuc.edu/ece498/al1/Arc...7/Syllabus.html

There are simple ways to integrate C++ and CUDA. The SDK sample code shows you how to do this.

Just want to second sphyraena’s suggestion… the UIUC lectures are fantastic!

Some C++ “stuff” seems to be supported, such as declaring variables in places other than the beginning of blocks. Sometimes it seems necessary to declare your functions with the following style of declaration:

extern "C" void test() {
}

Best is probably to isolate the processing that you want to do on the GPU into a separate function in a separate file. Strip out the fancy C++ stuff, leaving more or less C code. Then make a CUDA version of this file. You will want the functions that run on the GPU to be fairly short to minimize register usage. If needed, consider breaking your function up into multiple kernels if your problem allows this. The absolute fastest kernels I have written are always transformation-type kernels, i.e. the destination value is some function of a single (or a few neighboring) source values. I’ve had really great luck with image analysis… the calculations are fast, and typically it is possible to map the result into OpenGL, avoiding the overhead of transferring the results. The SDK has at least one example of this (postprocessGL is the example name, I think).
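To make that concrete, here is a minimal sketch of a transformation-type kernel. The name and the gain/offset operation are made up for illustration, and error checking is omitted:

__global__ void transformKernel(const float *src, float *dst, int n,
                                float gain, float offset)
{
    // Each thread computes one destination value from one source value.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = gain * src[i] + offset;
}

// Host-side launch, one thread per element:
// transformKernel<<<(n + 255) / 256, 256>>>(d_src, d_dst, n, 2.0f, 0.5f);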

Thanks for all the suggestions. I’m going to try to keep the core of our processing code as C-like functions, so that we can easily port to CUDA for prototype. I also will start digging into those lectures to learn more about CUDA.

I have a few more questions:

  1. Does anyone know if Nvidia plans to release a CUDA C++ compiler?

  2. Architecture design: So basically our algorithms work on one image frame at a time. The frame size varies from 5 MB all the way up to 1.2GB. The reason we had to go 64-bit was that we needed to queue these frames up in DRAM while waiting for our processors to finish processing each frame.

One way I thought about using the Tesla card is to keep the same queuing on the PC side, but for each image frame we want to process, DMA (or otherwise transfer) the frame to the Tesla’s 1.5GB memory and tell the GPU to start processing it. Then, after it’s done, transfer it back to main DDR for output or further processing.

Does this sound like a typical use of the Tesla card/server? If so, does anyone know off the top of their head what API is used to move data from main memory to Tesla’s memory?

I thought this would be a good architecture since I can pretty much keep all of my harness code in OO/C++ and implement the algorithms that run on Tesla as a bunch of C-like functions.

Thanks

Also, this post is just for comparison’s sake.

Another product we were looking into before encountering Tesla was IBM’s Cell Processor.

JFYI & comparison: Mercury has a nice solution for this (Mercury Systems).

But we felt like our algorithm couldn’t optimally make full use of it, since each of the Cell’s 8 SPEs only had, I believe, 256KB of local store. That is too small for our data size, and we didn’t want to store everything in the XDR memory.

I don’t think Tesla has this limitation, i.e. a developer gets 1.5GB of memory attached to the main GPU core (?)

My approach was to keep everything in C++ as much as possible, including CUDA memory allocation and DMA calls. I switch to C only to run kernels. So far, that’s worked pretty well for me.
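For what it’s worth, the split looks roughly like this (names are made up, error checking is omitted; the kernel itself lives in a .cu file behind an extern "C" launcher):

// Declared in the .cu file compiled by nvcc; the only C-style entry point.
extern "C" void launchProcessFrame(const float *d_src, float *d_dst, int n);

// Ordinary C++ compiled by the host compiler.
#include <cuda_runtime.h>

class FrameProcessor {
public:
    explicit FrameProcessor(int n) : n_(n), d_src_(0), d_dst_(0) {
        cudaMalloc((void **)&d_src_, n * sizeof(float));
        cudaMalloc((void **)&d_dst_, n * sizeof(float));
    }
    ~FrameProcessor() {
        cudaFree(d_src_);
        cudaFree(d_dst_);
    }
    void process(const float *host_in, float *host_out) {
        cudaMemcpy(d_src_, host_in, n_ * sizeof(float), cudaMemcpyHostToDevice);
        launchProcessFrame(d_src_, d_dst_, n_);   // switch to C only here
        cudaMemcpy(host_out, d_dst_, n_ * sizeof(float), cudaMemcpyDeviceToHost);
    }
private:
    int    n_;
    float *d_src_;
    float *d_dst_;
};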

FYI, nvcc supports some useful C++ features even though it isn’t a full C++ compiler. Templates are one example. In the SDK’s separable convolution example, templates were used very elegantly to implement loop unrolling. Very useful trick.
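The idea is that the template parameter is a compile-time constant, so the loop bounds are known and the compiler can fully unroll the loop. A simplified sketch of the trick (not the SDK code itself):

template <int RADIUS>
__global__ void boxFilter1D(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < RADIUS || i >= n - RADIUS)
        return;
    float sum = 0.0f;
    // Trip count is known at compile time, so this loop gets unrolled.
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += src[i + k];
    dst[i] = sum / (2 * RADIUS + 1);
}

// Instantiate for the radius you need:
// boxFilter1D<4><<<blocks, threads>>>(d_src, d_dst, n);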

For starters, look for cudaMemcpy() in the programming guide. Look at the SDK’s bandwidth example for how to use it.
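The basic round trip looks roughly like this (names are made up, error checking omitted):

#include <cuda_runtime.h>

void processOneFrame(const unsigned char *h_frame, unsigned char *h_result,
                     size_t frameBytes)
{
    unsigned char *d_frame = 0;
    cudaMalloc((void **)&d_frame, frameBytes);

    cudaMemcpy(d_frame, h_frame, frameBytes, cudaMemcpyHostToDevice);
    // ... launch your kernels on d_frame here ...
    cudaMemcpy(h_result, d_frame, frameBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_frame);
}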

Your thinking sounds okay, but I think the devil is in the details and you’ll need to dig deeper to figure out how to use Tesla efficiently. Use pinned memory for best DMA performance. DMA performance also depends on the motherboard, RAM, etc. used with the Tesla unit, so choose carefully.
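Pinned host memory just means allocating the host-side buffer with cudaMallocHost() instead of malloc(), e.g. (frameBytes being whatever your frame size is):

unsigned char *h_frame = 0;
cudaMallocHost((void **)&h_frame, frameBytes);   // page-locked, DMA-friendly
// ... fill h_frame and cudaMemcpy it to the device as usual ...
cudaFreeHost(h_frame);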

Finally, remember that Tesla is based on the G80, which does not support overlapping of DMA and GPU processing. Since you have large images, you should be aware of this when estimating the running time of your algorithms.

Several things to keep in mind with a comparison like that…

  1. The 1.5GB of memory is shared among global, constant and local memory. If your image is 1.5GB (as you said in your previous post), it will likely not fit into global memory all at once.

  2. It isn’t fair to compare the Cell’s 256KB of local store with the 1.5GB of GPU memory. The local store is more like a cache, similar to a texture cache of the GPU. It’s probably fairer to compare XDR vs GPU memory.

  3. There are a lot more differences between the Cell and the GPU. For example, how well do your algorithms map to each architecture? Being able to answer this question may provide you with more insight about your choice than simply comparing memory size alone.

PS: it’s nice to compare architectures, but I hope we won’t start a flame war on this forum. :)

This is what I am doing too, and it works very well. You can expose a fully object-oriented interface to the user of the API, who doesn’t need to know anything about what is being done internally on the GPU.

As for a full-fledged C++ compiler allowing classes in kernels… I doubt it will become available. 1) There is no way to dynamically allocate memory on the device. 2) Kernels are for pretty low-level processing. One of the most important things you need to do to get good performance on the GPU is to coalesce memory reads. C++ classes, copy constructors, and assignment operators do not map very well onto this concept. You need ABSOLUTE control over which memory locations are accessed, and in what order, to get good performance out of the GPU.
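To illustrate what coalescing means in practice (a sketch for illustration, not code from the programming guide): consecutive threads should touch consecutive addresses.

// Coalesced: thread i touches element i, so a half-warp reads one
// contiguous, aligned segment of memory.
__global__ void copyCoalesced(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Uncoalesced: a large stride scatters each half-warp's accesses,
// which is dramatically slower on the G80.
__global__ void copyStrided(const float *src, float *dst, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        dst[i * stride] = src[i * stride];
}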

Your usage pattern (copy to GPU, perform computations, copy back to CPU) is normal, but don’t expect drastic performance improvements. First, remember Amdahl’s law: after you’ve made one piece of your code extremely fast, the running time is limited by the other processing on the CPU. Second is the memory transfer issue. Keep in mind the overhead of copying back and forth to the GPU. With pinned memory, you can get up to about 3.1 GB/s. If the processing on the GPU doesn’t take very long, you may see the PCIe communication time dominating the running time.
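As a back-of-the-envelope check using the numbers in this thread: at ~3.1 GB/s, a 1.2 GB frame takes roughly 0.4 s to move each way, so close to 0.8 s per frame in PCIe traffic alone, while a 5 MB frame is under 2 ms each way. Whether the transfer dominates depends entirely on how much work the kernels do per frame.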