Optimizing memory-bound kernel (memory dependency around 95% in NVVP)

I have a piece of code that, according to the NVIDIA Visual Profiler (NVVP), is memory bound, and so far the only improvement I have managed is passing some arguments as constants.
If you copy/paste and compile the following code, NVVP shows that both kernels are memory bound and reach only around 89% occupancy, even though the launch configurations should fully saturate the device. SMs 1 through 7 sit at around 88-90% utilization while the others are closer to 100%.
Error checking was omitted for easier reading, but cuda-memcheck reports no errors for any array length I use.

#include <iostream>

// Fill the array with ascending values using a grid-stride loop
__global__ void init_array(float *array, size_t len)
    {
    for(size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < len; idx += gridDim.x * blockDim.x)
        array[idx] = (float)idx;
    }

// Scale each input element and write the result to the output array
__global__ void transform_array(float *in, float *out, const float scale_factor, size_t len)
    {
    for(size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < len; idx += gridDim.x * blockDim.x)
        out[idx] = in[idx] * scale_factor;
    }

int main(void)
    {
    float *array_in, *array_out;
    size_t length = 100000000;
    const unsigned short block_Size = 256, grid_Size = 200;
    const float factor = 0.5;

    // Allocate and initialize memory
    cudaMallocManaged(&array_in, length * sizeof(float));
    cudaMallocManaged(&array_out, length * sizeof(float));
    cudaMemset(array_in, 0, length * sizeof(float));
    cudaMemset(array_out, 0, length * sizeof(float));
    cudaDeviceSynchronize();

    // Fill the input array
    init_array <<< grid_Size, block_Size >>> (array_in, length);
    cudaDeviceSynchronize();

    // Transform input and write to output array
    transform_array <<< grid_Size, block_Size >>> (array_in, array_out, factor, length);
    cudaDeviceSynchronize();

    cudaFree(array_in);
    cudaFree(array_out);
    return 0;
    }

The first kernel just initializes the input array with some numbers using a grid-stride loop, and the second kernel writes the product of each input element and a scaling factor (which I normally calculate with other functions, but here it is just an arbitrary value) to the output array, using the same grid-stride loop. Essentially, both do a lot of work in global memory.

How do you normally get rid of/alleviate this bottleneck?

You won’t eliminate the memory bottleneck for memory-bound code. The operations you are doing here are so trivial that they are going to be memory bound.

There is likely very little you can do to make them run substantially faster. At this point, if you want to improve things, you are in the realm of what I call “ninja methods”: tuning the kernel size (e.g. the number of blocks, which is easy to change with your grid-stride loop) to the number of SMs in your device to minimize the tail effect, checking whether wider vector loads improve things (slightly), and so on.

Ninja methods are referred to here:

[url]http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf[/url]

These methods in my experience don’t usually provide more than a few percent improvement.
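To make the first two concrete, here is a minimal sketch of both ideas, assuming the array length is a multiple of 4 (the kernel name transform_array_vec4 and the 8-blocks-per-SM factor are arbitrary choices of mine, not anything prescribed):

// Sketch: float4 loads/stores, with the grid sized from the SM count
// Assumes len is a multiple of 4; otherwise a scalar tail loop is needed
__global__ void transform_array_vec4(const float4 *in, float4 *out, const float scale_factor, size_t len4)
    {
    for(size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < len4; idx += gridDim.x * blockDim.x)
        {
        float4 v = in[idx]; // one 16-byte load instead of four 4-byte loads
        v.x *= scale_factor;
        v.y *= scale_factor;
        v.z *= scale_factor;
        v.w *= scale_factor;
        out[idx] = v;       // one 16-byte store
        }
    }

// Host side: make the block count a whole multiple of the SM count
// to minimize the tail effect
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
const unsigned grid_size = prop.multiProcessorCount * 8;
transform_array_vec4 <<< grid_size, 256 >>> ((const float4 *)array_in, (float4 *)array_out, factor, length / 4);

As said, expect at most a few percent from either of these.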

At a higher level of abstraction, programmers who have multiple operations like this to do will sometimes seek to fuse operations. This means combining multiple kernel calls so that more work is done in a single kernel call. The objective is to do as much work as possible per load and store operation in global memory. Your two operations could be trivially fused into a single kernel, for example. Fusing also saves the launch overhead of the kernel calls it eliminates - another ninja topic (usually).
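For illustration, here is a minimal sketch of that trivial fusion for your two kernels (the name init_and_transform is mine): the initialization value is produced in a register, so the transform no longer has to load it back from global memory, saving one global read per element:

// Fused version of init_array and transform_array: one launch instead
// of two, and the freshly computed value is reused from a register
__global__ void init_and_transform(float *in, float *out, const float scale_factor, size_t len)
    {
    for(size_t idx = blockDim.x * blockIdx.x + threadIdx.x; idx < len; idx += gridDim.x * blockDim.x)
        {
        const float value = (float)idx; // computed in a register
        in[idx] = value;                // keep the input array, as before
        out[idx] = value * scale_factor;
        }
    }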

In any event, these trivial memory-bound kernels are “fully optimized” when the kernel runs at the rate of memory bandwidth. For example, determine the total number of loads and stores done by a kernel, in bytes, and divide by the kernel execution time, in seconds. This bytes/sec number is then compared to a proxy measurement of peak achievable bandwidth (e.g. such as the device-to-device memory bandwidth reported by bandwidthTest sample code). When your kernel is running at that rate, it probably cannot be optimized further. You are done, excepting higher-level “meta” work like algorithm redesign or fusing of operations/kernels.
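For instance, a minimal sketch of that measurement for the transform_array kernel above, timed with CUDA events (the event plumbing here is my own addition); the kernel loads 4 bytes and stores 4 bytes per element, so it moves 8 * length bytes in total:

// Sketch: time the kernel with CUDA events and compute achieved bandwidth
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
transform_array <<< grid_Size, block_Size >>> (array_in, array_out, factor, length);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// 4 bytes loaded + 4 bytes stored per element
const double gigabytes = (double)length * 8.0 / 1e9;
std::cout << gigabytes / (ms / 1e3) << " GB/s" << std::endl;

Compare that figure against the device-to-device number from bandwidthTest; when they match, the kernel is as fast as it is going to get.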

Thanks for these clarifications, txbob. So I guess it is what it is. And the Titan V with that ridiculously wide memory bus is probably laughing at this workload.
But while I was reading that document and trying some of the ninja techniques, like increasing the grid size to raise occupancy (with 200 blocks it was at 89%, with 1000 it goes to 98%) and shaving some milliseconds here and there, I accidentally clicked the wrong kernel to profile and discovered that the array reduction we worked on some weeks ago actually has some branch divergence, doesn’t it?

It is exactly these last lines:

if (tid == 0) // only the first thread of each block takes this branch
    array_out[blockIdx.x] = sdata[0];

Only one thread per block executes it, so I don’t think it is all that harmful, is it? NVVP shows the divergence increasing as the grid size increases.
There is probably an expression in English for this: aiming at one thing and hitting another…

Many, many kernels will have some divergence. Your grid-stride loops, for example, are prone to some small divergence as well (in the last iteration, threads that fail the idx < len test sit idle while the rest of their warp finishes). These sorts of things are usually insignificant from a performance perspective.