Advice about a program

Hi. I have a new question about histograms. I run my program on a 1024*768 greyscale picture.

Each of the 768 threads analyses 768 pixels of the picture and builds its own histogram in a specific place in global memory.

In one case, we store the data in texture memory, and in the other, we store it in global memory.

I run the program on an 8800 GTS (320 MB) and on an 8800 GTX, and we obtain these results:

The times are in ms.

I notice that on the GTS the behaviour changes after 96 threads (the execution becomes slower). I expected to see the same thing on the GTX after 128 threads, but there is nothing.

How can we explain this difference? (I am hoping for a detailed explanation.)

And I don’t understand why the difference is so large with 2 threads!

I have tried this on different GTX and GTS cards, and I always see this difference :(.

No help?

I’m afraid I don’t even understand the question. Yes, you get different execution times depending on block size. Memory access patterns, read-after-write dependencies, warp occupancy, shared memory usage, and a host of other things go into this. It is impossible to predict the best performing block size without running benchmarks as you have. Just choose the best performing block size and you are done.

Okay.

Can you just explain in detail why occupancy and memory access patterns influence my execution time? I have read about these two things, but I have trouble with English and I don’t think I understood them. I just need a short explanation of these two notions.

Sorry for being a bother :(.

Because occupancy tells you how many threads are available to a multiprocessor. The more threads, the more chance you have to hide memory latency with calculations.
And if you access memory in a non-coalesced way, each access costs more memory transactions, so the data takes longer to arrive than with coalesced access.
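To make the coalescing point a bit more concrete, here is a rough sketch in Python of how one might count memory transactions for a half-warp (16 threads) on G80-class cards. It is a simplified model, not the exact hardware rule: it assumes each thread reads one 32-bit word, and that the reads are merged into a single transaction only when thread k reads word k of a 64-byte-aligned segment. The function name and the stride of 768 words (each thread walking its own chunk, as in the histogram described above) are assumptions for illustration.

```python
HALF_WARP = 16    # threads whose accesses are grouped together (assumed G80 behaviour)
WORD_BYTES = 4    # assumption: each thread reads one 32-bit word


def transactions_per_half_warp(base_addr, stride_words):
    """Count transactions for one half-warp under a simplified rule:
    thread k must read word k of a 64-byte-aligned segment, otherwise
    every thread pays for its own transaction."""
    addrs = [base_addr + k * stride_words * WORD_BYTES for k in range(HALF_WARP)]
    segment_bytes = HALF_WARP * WORD_BYTES
    aligned = addrs[0] % segment_bytes == 0
    sequential = all(a == addrs[0] + k * WORD_BYTES for k, a in enumerate(addrs))
    return 1 if (aligned and sequential) else HALF_WARP


# Neighbouring threads read neighbouring words.
coalesced = transactions_per_half_warp(base_addr=0, stride_words=1)
# Hypothetical layout: each thread starts 768 words after the previous one.
strided = transactions_per_half_warp(base_addr=0, stride_words=768)
print(coalesced, strided)
```

Under this model the strided layout needs sixteen times as many transactions per half-warp as the coalesced one, which is one way to read "takes longer to arrive".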

What basically happens:

Thread 1 starts to run.
Thread 1 needs data from memory that has not yet arrived.
Thread 2 starts to run.
Thread 2 needs data from memory that has not yet arrived.
Now if you have low occupancy, you may need to switch back to thread 1, but if that thread has not yet received its data, your processor waits and does nothing.

But if you have high occupancy, thread 3 starts to run instead.

Now if your memory reads are coalesced, you might already have received the data for thread 1, so low occupancy might not matter, because thread 1 is ready to run.
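The walk-through above can be tried out with a small toy simulation. This is only a sketch under made-up assumptions, not a model of the real scheduler: there is one processor, each thread alternates between waiting `latency` cycles for memory and computing for `compute` cycles, and the numbers 20 and 4 are arbitrary. The function name `busy_fraction` is hypothetical. On the real hardware the unit being switched is a warp rather than a single thread.

```python
def busy_fraction(num_threads, latency=20, compute=4, cycles=10_000):
    """Fraction of cycles in which the processor has a thread to run.
    Each thread waits `latency` cycles for data, computes for `compute`
    cycles, then issues its next memory request."""
    ready_at = [latency] * num_threads    # cycle at which each thread's data arrives
    work_left = [compute] * num_threads   # compute cycles left before the next request
    busy = 0
    for now in range(cycles):
        # Fixed-priority pick: the first thread whose data has arrived.
        runnable = [t for t in range(num_threads) if ready_at[t] <= now]
        if not runnable:
            continue                      # nobody is ready: the processor idles
        t = runnable[0]
        busy += 1
        work_left[t] -= 1
        if work_left[t] == 0:             # done computing: issue the next load
            work_left[t] = compute
            ready_at[t] = now + 1 + latency
    return busy / cycles


for n in (1, 2, 3, 6, 12):
    print(n, "threads:", round(busy_fraction(n), 3))
```

With these numbers a single thread keeps the processor busy roughly one cycle in six, while six or more threads keep it busy almost all the time. Lowering `latency` (the effect of faster, coalesced reads) moves that break-even point to fewer threads.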

I hope this helps (and I hope this is correct, but that’s how I understand it)

Basically, my point is that all the interactions are so complicated that you can’t know how they will work together to change performance without actually running your kernel and measuring the performance.

If you want to learn more about occupancy, try out the occupancy calculator: http://forums.nvidia.com/index.php?showtopic=31279
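As a rough idea of the arithmetic behind occupancy, here is a minimal sketch in Python. It only looks at the block size and assumes the G80 limits of 768 resident threads (24 warps of 32) and 8 resident blocks per multiprocessor. The real calculator also accounts for registers and shared memory per block, which this sketch ignores, and the function name is hypothetical.

```python
WARP_SIZE = 32
MAX_THREADS_PER_SM = 768   # assumed G80 limit on resident threads per multiprocessor
MAX_BLOCKS_PER_SM = 8      # assumed G80 limit on resident blocks per multiprocessor
MAX_WARPS_PER_SM = MAX_THREADS_PER_SM // WARP_SIZE


def occupancy(threads_per_block):
    """Resident warps divided by the maximum, ignoring registers and shared memory."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)   # round up to whole warps
    blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    return blocks * warps_per_block / MAX_WARPS_PER_SM


for size in (2, 32, 64, 96, 128, 256, 512):
    print(size, "threads per block:", round(occupancy(size), 2))
```

In this simplified view, a block of 2 threads still occupies a whole warp, so at most 8 of the 24 warp slots are filled and 30 of the 32 lanes in each warp do nothing, which may be part of why tiny blocks are so slow. Blocks of 96 or 128 threads can both fill all 24 slots.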

Thanks DenisR. I understand now :).