Advice about a program

Marsxema · December 6, 2007, 7:57pm

Hi. I have a new question about histograms. I run my program on a 1024*768 grayscale picture.

Each of the 768 threads analyses 768 pixels of the picture and writes its own histogram to a specific place in global memory.
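A minimal CPU-side sketch of this scheme, written in Python rather than CUDA so that it runs anywhere. It is not the poster's kernel. The assumptions (none of them stated in the post) are: 256 grey levels, each thread handles one contiguous run of pixels, and thread t owns the slice [t*256, (t+1)*256) of one flat array that stands in for global memory. The names `NUM_BINS` and `per_thread_histograms` are illustrative.

```python
NUM_BINS = 256  # assumption: 8-bit grey levels


def per_thread_histograms(pixels, num_threads):
    """Model of the scheme: thread t scans its own chunk of pixels and
    counts them in its own NUM_BINS-wide slice of one flat array."""
    if len(pixels) % num_threads != 0:
        raise ValueError("pixel count must divide evenly among threads")
    chunk = len(pixels) // num_threads
    global_mem = [0] * (num_threads * NUM_BINS)  # stand-in for global memory
    for t in range(num_threads):      # on the GPU these iterations run in parallel
        base = t * NUM_BINS           # start of this thread's private slice
        for p in pixels[t * chunk:(t + 1) * chunk]:
            global_mem[base + p] += 1
    return global_mem
```

Because every thread writes only to its own slice, no two threads ever update the same counter, so this layout needs no synchronisation between threads.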

In one case, we store the data in texture memory, and in the other, we store it in global memory.

I ran the program on an 8800 GTS (320 MB) and on an 8800 GTX, and we obtained these results:

Blocks-threads   Texture (GTS)   Global (GTS)   Texture (GTX)   Global (GTX)
2-384            6.2             7.7            4.1             4.8
4-192            4.1             4.8            4.1             4.8
6-128            3.4             3.9            3.4             3.9
8-96             3.3             3.9            3.3             3.7
12-64            3.2             3.9            3.3             3.7
16-48            3.3             3.9            3.4             3.7
24-32            3.2             3.8            3.4             3.7
32-24            3.3             3.9            3.5             3.8
48-16            3.3             4.0            3.7             3.9
64-12            3.3             4.0            3.8             4.0
96-8             3.3             4.1            3.9             4.1
128-6            3.9             4.6            4.0             4.2
192-4            3.9             4.7            3.9             4.2
384-2            5.2             5.9            4.5             4.8
768-1            8.1             8.8            6.7             7.0

All times are in ms.
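One way to read the two ends of the table is to count, for each launch shape, how many multiprocessors can have work at all and how full each 32-thread warp is. The sketch below does only that counting and does not predict times. The warp size of 32 and the multiprocessor counts (12 on the 8800 GTS, 16 on the 8800 GTX) are hardware facts for these cards; the function and key names are illustrative.

```python
import math

WARP_SIZE = 32  # threads per warp
# Multiprocessor counts of the two cards used in the post.
NUM_SMS = {"8800 GTS": 12, "8800 GTX": 16}


def describe_launch(blocks, threads_per_block, num_sms):
    """Count warps, warp-lane usage and busy multiprocessors for one launch."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Fraction of warp lanes that hold a real thread (1.0 means no wasted lanes).
    lane_utilization = threads_per_block / (warps_per_block * WARP_SIZE)
    # A block runs on a single multiprocessor, so at most `blocks` can be busy.
    busy_sms = min(blocks, num_sms)
    return {
        "total_threads": blocks * threads_per_block,
        "warps_per_block": warps_per_block,
        "lane_utilization": lane_utilization,
        "busy_sms": busy_sms,
    }
```

For example, `describe_launch(2, 384, NUM_SMS["8800 GTX"])` reports only 2 busy multiprocessors out of 16, while `describe_launch(768, 1, NUM_SMS["8800 GTS"])` reports a lane utilization of 1/32, since each warp carries a single thread. Both extremes leave most of the hardware idle, which is consistent with the slow first and last rows.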

I notice that on the GTS there is a change after 96 threads (the run becomes slower). I thought I would see the same thing on the GTX after 128 threads, but there is nothing.

How can we explain this difference? (I am hoping for detailed information.)

And I don’t understand why the difference is so large with 2 threads!

Marsxema · December 7, 2007, 7:31pm

I have tried on different GTX and GTS cards, and I always see this difference :(.

No help?

MisterAnderson42 · December 8, 2007, 2:27am

I’m afraid I don’t even understand the question. Yes, you get different execution times depending on block size. Memory access patterns, read-after-write dependencies, warp occupancy, shared memory usage, and a host of other things go into this. It is impossible to predict the best performing block size without performing benchmarks as you have. Just choose the best performing block size and you are done.
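As a small illustration of "just choose the best", the sketch below stores the measurements from the table above, keyed by number of blocks, and returns the fastest configuration(s) for a given column. The data are the poster's; the names `COLUMNS`, `TIMES_MS` and `best_block_counts` are made up for this example.

```python
# Measured times in ms from the table above, keyed by number of blocks.
COLUMNS = ("texture GTS", "global GTS", "texture GTX", "global GTX")
TIMES_MS = {
    2: (6.2, 7.7, 4.1, 4.8),
    4: (4.1, 4.8, 4.1, 4.8),
    6: (3.4, 3.9, 3.4, 3.9),
    8: (3.3, 3.9, 3.3, 3.7),
    12: (3.2, 3.9, 3.3, 3.7),
    16: (3.3, 3.9, 3.4, 3.7),
    24: (3.2, 3.8, 3.4, 3.7),
    32: (3.3, 3.9, 3.5, 3.8),
    48: (3.3, 4.0, 3.7, 3.9),
    64: (3.3, 4.0, 3.8, 4.0),
    96: (3.3, 4.1, 3.9, 4.1),
    128: (3.9, 4.6, 4.0, 4.2),
    192: (3.9, 4.7, 3.9, 4.2),
    384: (5.2, 5.9, 4.5, 4.8),
    768: (8.1, 8.8, 6.7, 7.0),
}


def best_block_counts(times, column):
    """Return (best time, sorted list of block counts that achieve it)."""
    idx = COLUMNS.index(column)
    best = min(row[idx] for row in times.values())
    return best, sorted(b for b, row in times.items() if row[idx] == best)
```

Several configurations tie at the minimum in most columns, so on these numbers any block count from roughly 8 to 24 is about equally good.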

Marsxema · December 8, 2007, 2:55pm

Okay.

Can you just explain in detail why occupancy and memory access patterns influence my execution time? I have read about these two things, but I have trouble with English and I think I have not understood them. I just need a short explanation of these two notions.

Sorry for being a bother :(.

DenisR · December 8, 2007, 4:36pm

Because occupancy tells you how many threads are resident on a multiprocessor. The more threads, the better your chance of hiding memory latency with calculations.
And if you access memory in a non-coalesced way, the accesses are split into more memory transactions, so threads wait longer than with coalesced access.
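To make "coalesced" a little more concrete, here is a simplified check modelled on the rule for the first CUDA GPUs (the generation of the 8800 cards): the 16 threads of a half-warp coalesce when thread k accesses word base + k and base is aligned to 16 words. Real hardware has further conditions, so treat this as a sketch. The function name and the 256-bin layout in the usage note are assumptions.

```python
HALF_WARP = 16  # threads whose accesses are grouped together on these GPUs


def is_coalesced(word_indices):
    """Simplified rule: thread k of a half-warp must access word base + k,
    where base is a multiple of HALF_WARP."""
    if len(word_indices) != HALF_WARP:
        raise ValueError("expected one word index per thread of a half-warp")
    base = word_indices[0]
    if base % HALF_WARP != 0:
        return False  # first access is not aligned
    return all(w == base + k for k, w in enumerate(word_indices))
```

Consecutive threads touching consecutive words, `is_coalesced(list(range(32, 48)))`, pass the check. If every thread owns a private 256-bin histogram and all 16 threads update bin 5, the word indices are `[t * 256 + 5 for t in range(16)]`, which are 256 words apart and fail it.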

What basically happens:

thread 1 starts to run.
thread 1 needs data from memory that has not yet arrived.
thread 2 starts to run.
thread 2 needs data from memory that has not yet arrived.
Now if you have low occupancy, you may need to switch back to thread 1, but if that thread has not yet received its data, your processor waits and does nothing.

But if you have high occupancy, thread 3 starts to run.

Now if you read memory in a coalesced way, the data for thread 1 might already have arrived, so low occupancy might not matter, because thread 1 is ready to run.
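The sequence above can be played out in a toy simulation. This is only a model: one processor, free switching between threads, and every thread repeating "compute for a while, then wait for memory". The cycle counts used in the usage note are invented for illustration and are not measurements of any GPU; the function name is hypothetical.

```python
def simulate_utilization(num_threads, compute_cycles, latency_cycles,
                         total_cycles=10_000):
    """Fraction of cycles in which the processor had a ready thread to run."""
    ready_at = [0] * num_threads                # cycle when each thread's data arrives
    remaining = [compute_cycles] * num_threads  # work left before the next memory request
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:            # run the first thread that is ready
                remaining[t] -= 1
                busy += 1
                if remaining[t] == 0:           # thread now requests memory and stalls
                    remaining[t] = compute_cycles
                    ready_at[t] = cycle + 1 + latency_cycles
                break                           # only one thread runs per cycle
        # if no thread was ready, the processor idles for this cycle
    return busy / total_cycles
```

With 10 cycles of work per 90 cycles of latency, `simulate_utilization(1, 10, 90)` gives 0.1 and two threads give 0.2, while ten threads keep the processor busy every cycle. Shortening the latency, which is roughly the effect of coalesced reads in this picture, lets fewer threads reach the same utilization.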

I hope this helps (and I hope this is correct, but that’s how I understand it)

MisterAnderson42 · December 8, 2007, 5:19pm

Basically, my point is that all the interactions are so complicated that you can’t know how they will work together to change performance without actually running your kernel and measuring the performance.

If you want to learn more about occupancy, try out the occupancy calculator: The Official NVIDIA Forums | NVIDIA

Marsxema · December 8, 2007, 6:43pm

Thanks DenisR. I understand now :).
