From low end GPUs to high end GPUs Moving from 9600GT to Tesla T10 provides no improvement, why ?

I guess this is one of the most popular topics on this forum: “I have code running on my GPU, how do I make it run faster ?”.

So here I am, with my piece of code, and I want to make it run faster. I have tweaked the code for two weeks now, tested at least 10 different configurations, and my current code runs ~2.5 times faster than the initial version.
I use templates, pointer increments, shared memory, fully coalesced memory access, streams, and some unrolling. I have tried every trick I could find out there, but this is my first large CUDA code (more than 1k lines in ~10 files), so I’m still in the learning curve.

To make this post less boring my question comes with a twist: my code is running on a 9600GT GPU, doing the whole processing in 0.15 seconds. When I make it run on a Tesla T10 it runs in 0.13 seconds. How comes ?
The Tesla GPU has ~2.5 times the computing power of the 9600GT GPU, yet the speed improvement is almost none.

Looking into the profiler output things become more interesting. 55% of the total computation time of my code is dedicated to one single function called 8 times, the rest is spread in multiple other functiones.
I have attached to this post screenshots of the profiler output of this particular function when running on each GPU model.

As you can see, on the 9600GT the kernel runtime is bound by computing resources, 80% of the kernel is spent computing, this seems reasonable to me. When moving to the Tesla I would then expect a considerable speed up, since it provides more computing power.
However when running on the Tesla T10 the relation changes and the kernel becomes memory bounded. What took less than 5% of the time on the 9600GT takes 60% on the Tesla T10.
Even worse: as written the code does as many reads than writes on the mentioned kernel. Yet on the Tesla T10 the profiler reports 5% of time reading, but 60% of the time writting.

So my question is: how can this be ? I am using the profiler incorrectly ? what can be happening that would explain these numbers ?

Thank you very much for your answers and hints. If you need any more details or specific benchmarks, I will provide them.


When you say “the whole processing” takes 0.13 or 0.15 seconds, how exactly do you measure that for this program?

is that one critical function very complex? Maybe you can share some source code

Doesn’t your second profiler chart tell you something important?
Why are there so many 64 byte writes, and are those really dominating your runtime?

Typically programs are dominated by read bandwidth and/or latency, sometimes computation, but rarely writes.

Here is a peek into the mentioned core function.

The function is not very complex, it has one main for loop, and inside it does sums, multiplications, min operations and one if.

Other than that, at each iteration it will do one read in global memory, one write in global memory, and some read and writes in shared memory.

All and all the function uses only 12 registers.

Comments and performance hints most welcome.

[codebox]typedef struct


int offset_x, offset_y;

int delta_x, delta_y;

} slice_t;

/// we assume that input_volume and output_volume have the same size

template<typename PixelType, typename PixelCostFunction>

global void forwards_or_backwards_pass_kernel_parallel_disparities(

const cost_volume_kernel_data_t input_volume,

cost_volume_kernel_data_t output_volume,

const slice_t slice,

const parameters_t parameters)


// x is used as the z index, y is used as the y index, z is used as the x index

const int initial_x = (__mul24(blockIdx.x, blockDim.z) + threadIdx.z) + slice.offset_x;

const int initial_y = (__mul24(blockIdx.y, blockDim.y) + threadIdx.y) + slice.offset_y;

const int current_disparity = threadIdx.x;

const int &image_width = get_volume_width(input_volume);

const int &image_height = get_volume_height(input_volume);

const int &num_disparities = get_volume_disparities(input_volume);

// we check that we are inside the image boundaries

// since accessing KernelData outside boundaries is not ok

// no need to check left_y since  grid.y = image_height

if(initial_x < image_width and initial_y < image_height and current_disparity < num_disparities)


const unsigned char slice_id = threadIdx.z;

shared cost_value_t d_costs[2][MAX_D];

    __shared__ cost_value_t temporary_d_costs[2][MAX_D];

cost_value_t *previous_pixel_disparities_costs = d_costs[slice_id];

    cost_value_t *t_min_costs = temporary_d_costs[slice_id];

{ // first pixel of the slice, we simply copy the costs

const cost_value_t current_pixel_cost = at_xyz(input_volume, initial_x, initial_y, current_disparity);

        previous_pixel_disparities_costs[current_disparity] = current_pixel_cost;

        t_min_costs[current_disparity] = current_pixel_cost;

        at_xyz(output_volume, initial_x, initial_y, current_disparity) = current_pixel_cost;


__syncthreads(); // all disparities of previous pixel have been processed

{ // rest of the slice

const PixelCostFunction pixel_distance;

typedef typename textures_type::texture_type texture_t;

        texture_t left_texture = get_left_texture<texture_t>();

PixelType previous_pixel_value = tex2D(left_texture, initial_x, initial_y);

const int &disparity = current_disparity;

        const int previous_disparity = max(disparity-1, 0);

        const int next_disparity = min(disparity+1, num_disparities-1);

// we iterate over the slice, starting from the second pixel


                x = initial_x + slice.delta_x, y = initial_y + slice.delta_y;

                (x < image_width) and (y < image_height) and (x >= 0) and (y >= 0);

                x += slice.delta_x, y += slice.delta_y)


            // find_min_cost is similar to the CUDA reduction sample, but does b = min(a,b) 

            // will modify the values in t_min_costs

            const cost_value_t previous_pixel_min_disparity_cost = 

                find_min_cost<MAX_D, false>(t_min_costs, current_disparity, num_disparities);

const cost_value_t term0 = at_xyz(input_volume, x, y, disparity);

            const cost_value_t &term1 = previous_pixel_disparities_costs[disparity];

            const cost_value_t &term2 = previous_pixel_disparities_costs[previous_disparity];

            const cost_value_t &term3 = previous_pixel_disparities_costs[next_disparity];

const PixelType current_pixel_value = tex2D(left_texture, x, y);

            const int term4 = pixel_distance(current_pixel_value, previous_pixel_value);

            previous_pixel_value = current_pixel_value;

const cost_value_t current_pixel_cost =

                /*sum, adds, multiplications, mins and one if using the different terms*/;

at_xyz(output_volume, x, y, disparity) = current_pixel_cost;

            previous_pixel_disparities_costs[disparity] = current_pixel_cost;

            t_min_costs[current_disparity] = current_pixel_cost;

__syncthreads(); // all disparities of previous pixel have been processed

} // end of “for each pixel in the slice, starting from the second one”

} // end “rest of slice”

} // end of “if initial voxel coordinates not inside cost volume”




I use the cutGetTimerValue (and related) functions, taking the time between “right before launching the computation” and right after finishing.

I then run the application a few times and look at the minimal timing value obtained.

The timing numbers obtained is consistent with the values reported by the profiler (0.9 and 0.11 seconds on each GPU).

That is exactly the point of my question. I run the same code in two GPUs, yet the behavior pointed out by the profiler is different.

Even worse, as you can check on the code posted on this thread the kernel does as many reads than writes to the global memory.

I see no reason why I should see such difference in read versus write times.

All the memory access is reported by the profiler as coalesced, and every write is done in the same coordinates than each write (but in a different data volume), so read and writes should have the same coalescence properties anyway.

This weird behavior is the reason for my post. Something seems to be wrong… but what is it ?

assuming you comment out this line:

at_xyz(output_volume, x, y, disparity) = current_pixel_cost;

the compiler will still have to evaluate current_pixel_cost because you’re also writing current_pixel_cost to

shared memory.

Will the run time of this function be dramatically different then, when you cut out the global memory writes?


I have seen this exact behavior for one of my kernels as well, although I move from a 9600GT to a 260GTX. The exact same thing! And I truely do not use that many global writes. (I have appr. 1,000,000 writes, and the profiler states many more)

Basically, I think the devil is in the detail. As far as I read, the profiler counts the number of instructions / reads / whatnot you do, not the time spend at each one. That way, you can compare the numbers to what you estimate your method to do, like do you really need to read that many from global memory. If not, then move it to shared etc.

Another issue is that I feel the cards count any write action, including register and shared memory, as a global write. Has anybody else seen this? Even if I only do in the order of 1000 global writes, it still comes out very high.

Did you profile on the 9600GT or on the Tesla? Memory access on the Tesla will always be reported as coalesced.

Also what are the values of delta_x and delta_y you are calling the kernel with?

Indeed commenting that line provides almost no change in the timing in either GPUs, this support the idea that “the profiler is lying”.

In any case I cannot “cut out” the global memory writes, since I do exactly one global memory per output value (the strict minimum).

To check up I disable all writes to global memory, the timing is essentially the same and the GPU profiler output too (but the program output is different).

How comes the profiler is “lying” about the gst 64b writes ?

This would explain the profiler output I see when I disable all global memory writes, however this seem inconsistent with the Nvidia profiler documentation, since gst stands for “global memory store transactions”, I would expect shared memory operations stand in “local store” operations.

Is this an error in the documentation or there is “something pushing my shared memory array into global memory” ?

That isn’t my experience and I don’t believe it is true. A global write is precisely what the name says - a store to global memory. What you need to bear in mind with the profiler is that it only instruments a couple of SM to record profile counters, and then it simply interpolates those results to the size of the card. If your blocks are at all heterogeneous, or you don’t launch a nice round multiple of the number of SM on the card, then the interpolation from a given SM to the whole card won’t be a very good approximation of what really happened.

This has been my fear as well, but my performance would plummet if I did, since I use a += on my shared memory. And I don’t see that.

I use the same code on a 9600GT card and see very few global writes. The number is much more consistent with what I expect than the one on my 260GTX card. It basically jumps from being neglible to being dominant on the graph.

Avidday: All my blocks are actually excessively homogenious. I only have 27 kernels, so the number of blocks are not a multiple of that, but would that matter if each block should execute in the same fashion? Each block loads the same amount of data, performs calculations in a loop with the same number for each thread in all the blocks, and stores the same amount of data. The instructions are the same, and only differ on the value of the data loaded. So it should not be due to the statistical nature of the profiling, especially since it does not show on the 9600GT profiler. Or would the 260GT card launch a part of the kernel that is not done one the 9600GT for some unknown reason?

I profile by launching an identical kernel first, and then running my “profiled” kernel 35 times afterwards.

Is there any way of looking into the generated code and manually counting the number of writes to global memory? Basically, the difference would be how the 9600GT and the 260GT card executes the code, is this expected? I would find it very odd.

Well that would make sense, but then:


If the profiler provides counts and not timing (it does say “profiler count plot” after all), then how it comes that the count changes between two GPUs ? How comes that the “gld coalesced” (in 9600GT) is not equivalent to the sum of “gst + gld” (in Tesla) ?

How comes that the shared memory access becomes global memory access (see code)?

The threads are all homogeneous all doing the same operations, there is only one if in the form c= (b)? d:e; so this part should be ok.

The profiler instruments two SM on the G90 and three SM on the GT200, and the way some of the counters work (especially the memory ones, see next response) is different. It would be normal for some deviation between a G80/G90 and GT200 running the same code.

By definition. In the GT80/90, there were exactly two possible memory access patterns “coalesced” (a half warp of requests serviced in a single transaction) and “not coalesced” (a half warp of requests serviced by 16 transactions). In the GT200 and Fermi, the memory controller is more flexible and the idea of coalescing doesn’t exist anymore. Many cases which would be fully serialized in a G80 can be serviced with two transactions in the GT200. Which is why there are now separate request and transaction counters in the profiler.

To the best of my knowledge It doesn’t. Under any circumstances.

I understand that some differences can be observed, but here looking at the attached plots we are talking about major changes from 5% to 60%.

Also if you look at the code you will notice that read and writes are made with the exact same pattern (same location but in different data volumes). On the Tesla profiler output, as you indicate the number of request is separated from the number of transactions. The number of read and write requests is identical (as expected) yet the number of write operations is twelve times the number of read operations.

How can this be ?

Thank you very much all of you for answer, I hope you can keep helping me until this mystery is solved. Apparently I’m not the only one with it, so any hint, test to be done, idea is most welcome.



I profiled on both, see the attached profiler output.

Indeed in Tesla the coalesced aspect is not reported (because it does not matter), but since it is the same code, the data access pattern is the same.

delta_x and delta_y are 0, 1 or -1 and only one of them is non zero. Accordingly offset_x and offset_y are either 0 or a large number related to the size of the data volume.

I don’t know if this is just a mistake in the documentation, but if true, it would explain a factor of four in gst_64b vs. gld_64b

Apart from that, how is [font=“Courier New”]at_xyz(output_volume, x, y, disparity)[/font] defined - is [font=“Courier New”]disparity[/font] an additive constant?

Ohh indeed, good catch. I cannot possibly imagine why they do not rescale these values when presenting them to the user… (for me this looks like a documented bug). So that explains a 4x factor, and explains why the 9600GT would look at least 4x different than the Tesla T10 plots.

The profiler plot (see first post) shows a 12x factor of difference, so a 3x factor still unexplained.

But still, given that the number of requests is the same (in the profiler plot) I would think that this part of the system is working fine, both GPU do execute the same number of operations (as expects).

Since the profiler shows (scaled) counts and not timing inside the kernel, actually I have no idea of what is the slowest part in the current kernel.

So given this I will come back to my initial question: why do you think that I see almost no improvement between the 9600GT and the Tesla T10 in an application where the core function (55% of the time) is computation bounded ?

I’m using cuda templates, at_xyz access a voxel in a 3d data volume as follows

[codebox]#define at_xyz(kernel_data, x,y,z) \

([(__mul24(y, kernel_data.stride[1]) + __mul24(x, kernel_data.stride[0]) + z)])[/codebox]