Putting the GPU at work

Ok.
I’ve just read all the posts in this topic today. Why hasn’t NVIDIA answered? I think we are all on the same side; we all want to improve performance. Why isn’t NVIDIA helping here by confirming or correcting what has been said?

Osiris> Indeed, I’m convinced all of this could interest a lot of people. Your comments are quite useful.

I’m not quite sure how to respond to this thread, but I’ll try.

To the original poster - I agree with mfatica that we really need more information about your code in order to help you optimize it. You’re not going to get good performance on any GPU with only 1 or 2 threads. And tree searching problems like checkers are particularly difficult on a data-parallel machine like the GPU.
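To illustrate the point about thread counts, here is a minimal sketch (the kernel name and parameters are illustrative, not from the original poster's code): a trivial element-wise kernel launched with thousands of threads, which is the regime where a GPU performs well, contrasted with a single-thread launch.

```cuda
// Hypothetical example: each thread processes one array element.
__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Host-side launch: 256 threads per block, enough blocks to cover n.
//   scaleArray<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
//
// By contrast, a launch like
//   scaleArray<<<1, 1>>>(d_data, 2.0f, n);
// runs the whole job on a single thread, leaving almost the entire
// chip idle and no way to hide memory latency.
```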

Regarding global memory latency: fundamentally, global memory reads are slow because they are uncached. GPUs typically try to cover this latency by running many threads.

This is why we recommend having at least two thread blocks per multiprocessor. With enough blocks per multiprocessor, some blocks will be idle, waiting for loaded data to return from memory, while one or more other blocks execute instructions on previously loaded data.
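A sketch of what that looks like in practice (kernel and sizes are assumptions for illustration, not from this thread): a memory-bound kernel launched with a grid much larger than the number of multiprocessors, so several blocks can be resident on each one.

```cuda
// Illustrative memory-bound kernel: one uncached global read per input.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// With 128 threads per block and many more blocks than multiprocessors,
// multiple blocks can be resident on each multiprocessor (subject to
// register and shared-memory limits): while one block stalls on its
// global loads, another can issue instructions, hiding the latency.
//   int threads = 128;
//   int blocks  = (n + threads - 1) / threads;
//   saxpy<<<blocks, threads>>>(2.0f, d_x, d_y, n);
```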

The memory system is complicated, but as far as I’m aware the guidelines in the programming guide are correct.

We’re also working on profiling tools that should make optimizing these kinds of things easier.

If you have specific questions please post them in another thread.