Thread read pattern causing memory throttle and dramatic fall in throughput on Xavier

I have been running some kernel tests on the Jetson Xavier platform and comparing against the Jetson TX2 and the GTX 1050. Across a large number of kernels the Xavier performs far closer to the GTX 1050 than to the TX2, but I have discovered that for one particular configuration one of the kernels hits a memory throttle and throughput drops dramatically compared to the other platforms.

The kernel is proprietary, so I cannot post it, but I’ll try to explain the pattern and also create some equivalent code that I can post which exhibits the behaviour.

The situation arises when groups of threads within a warp are each reading aligned, coalesced memory, but the different groups are reading from different locations.

One example is with 8 threads in a group, but it also occurs with 16 threads grouped together:

thread   read/thread   memory addr
0-7      8 bytes       addr_a
8-15     8 bytes       addr_b
16-23    8 bytes       addr_c
24-31    8 bytes       addr_d

Each addr_x is 64-byte aligned, and on each iteration of the kernel's inner loop we advance forward through each of the addresses. So in this example each group of 8 threads reads 64 bytes coalesced.

The problem arises specifically when there are 8 blocks running in parallel on each SM. I can also reproduce it when reading 4 bytes per thread, but with 16 blocks running in parallel on each SM.
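To make the pattern concrete, here is a minimal sketch of the read side only; the kernel name, the bases array and the launch shape are illustrative and not taken from the real kernel:

// Sketch only: four groups of 8 threads in a warp each read 64 contiguous
// bytes (one uint2, i.e. 8 bytes, per thread), but the groups start from
// unrelated 64-byte-aligned base addresses and advance 64 bytes per iteration.
__global__ void grouped_read_sketch(const uint2* const* bases, uint2* out, int iters)
{
    const int groupSize = 8;
    const int group = threadIdx.x / groupSize;   // which group within the warp
    const int lane  = threadIdx.x % groupSize;   // position within the group

    const uint2* src = bases[group] + lane;      // per-group base address
    uint2 acc = make_uint2(0, 0);

    for (int i = 0; i < iters; ++i)
    {
        uint2 v = *src;                          // coalesced within the group only
        acc.x += v.x;
        acc.y += v.y;
        src += groupSize;                        // advance by 64 bytes
    }

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}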

Are there any known memory issues on Xavier that cause memory throttling? Any help much appreciated.

Hi,

Would you mind sharing a simple example to demonstrate this?
(It doesn’t need to include your private kernel.)

We want to check this issue in depth and forward it to our internal team.

Thanks.

Hi,

Unfortunately, as I strip code out of the kernel to create something simple that I can post, the problem disappears. The kernel operates on Xavier as expected under certain configurations (with a larger input block size, which leads to larger groups of threads operating on coalesced memory), but performance degrades drastically as the number of parallel groups of coalesced reads into the memory interface increases (the group reads are not coalesced with respect to one another).

I assume this is the cause of the “memory throttle” stall reason seen in nvprof. The problem, though, appears to be that the mechanism handling the memory throttle stalls the processor for far longer than necessary. The result is that when I halve the input block size, instead of the throughput doubling as it does on the TX2 and GTX 1050, it is in fact 1/8 of what I expect.

Hi,

Sometimes a bad access pattern leads to poor performance.
Do you think this is a bug, or just a limitation of writing CUDA kernels for Xavier?

Here are some Xavier resources for your reference first:
[url]https://devblogs.nvidia.com/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/[/url]
[url]https://docs.nvidia.com/cuda/volta-tuning-guide/index.html[/url]

Thanks.

Hi - there’s nothing wrong with the access pattern. As per the original post, this pattern (and kernel) shows no problems on 1050 and TX2 - the issue is specific to Xavier.

There is one correction to the original post, though. Each addr_x was pitch-aligned (512 bytes) when the problem was seen. (I’d wrongly assumed 64-byte alignment, but the memory is pitch-allocated.) By changing this so that each addr_x is 64-byte aligned, the issue disappears. It’s counter-intuitive, as the programming guide suggests that pitch-aligned memory gives higher performance.
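For clarity, here is a minimal sketch of the two allocation variants I’m comparing (rowBytes and rows are placeholders rather than the real buffer sizes):

const size_t rowBytes = 32;            // bytes read per group per row (placeholder)
const size_t rows     = 1120 * 512;    // number of rows (placeholder)

// Variant A: pitch-linear allocation; the runtime rounds the pitch up (512 bytes here).
// This is the layout where the slowdown appears.
uint2* bufPitched;
size_t pitch;
cudaMallocPitch((void**)&bufPitched, &pitch, rowBytes, rows);

// Variant B: plain linear allocation with each row manually rounded up to 64 bytes.
// With this layout the issue disappears.
uint2* bufLinear;
const size_t rowStride = (rowBytes + 63) & ~size_t(63);
cudaMalloc((void**)&bufLinear, rowStride * rows);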

My current thought is that there’s a bug in the caches on Xavier, and too many outstanding memory requests create a bottleneck that doesn’t get resolved correctly.

Hi,

The 1050 and TX2 are from a different GPU generation than Xavier.
You can check the Volta tuning guide to see whether everything matches your expectations:
[url]https://docs.nvidia.com/cuda/volta-compatibility-guide/index.html[/url]

If you already follow all the recommended rules in the above document, we may still need a sample to reproduce this internally.
Would you mind trying to make a sample again? This will help us check the issue in detail.

Thanks.

Hi AastaLLL

Here’s some standalone code that exhibits a large slowdown for certain memory patterns on Xavier. Please let me know what you think.

Thanks,
Simon

#include <cuda_runtime.h>
#include <cuda_profiler_api.h>
#include <stdio.h>

__global__ void move8(
    const uint2* dInput,
    const unsigned int inputPitch,
    uint2* dOutput,
    const unsigned int outputPitch,
    const unsigned int width,
    const unsigned int threadGroup
)
{
    uint2 ip;

    int2 x;
    x.x = 0;
    x.y = (blockDim.x / threadGroup) * blockIdx.x + (threadIdx.x / threadGroup);

    // calculate position in input buffer
    dInput += (x.y * width + x.x) * inputPitch + (threadIdx.x % threadGroup);

    // calculate position in output buffer
    dOutput += (x.y * width + x.x) * outputPitch + (threadIdx.x % threadGroup);

    uint2 dCurr = *dInput;

    *dOutput = dCurr;
    dOutput += outputPitch;

    for (int i = 0; i < width-1; ++i)
    {
        ip = *dInput;
        dInput += inputPitch;

        // add some sort of dependency... 
        dCurr.x += ip.x;
        dCurr.y += ip.y;

        *dOutput = dCurr;
        dOutput += outputPitch;
    }
}

void move_chunk(
    uint2* inputBufferRight, uint2* outputBufferRight,
    uint2* inputBufferLeft, uint2* outputBufferLeft,
    unsigned int inputPitch, unsigned int outputPitch,
    cudaStream_t streamRight, cudaStream_t streamLeft, 
    unsigned int width, unsigned int height, unsigned int depth)
{
    const unsigned int linesPerBlock = 8;
    constexpr unsigned int kernelAtom = 8;   // sizeof(uint2): pitches and the group size are converted to uint2 elements below

    dim3 gridDim = {height / linesPerBlock};
    dim3 blockDim = {(depth * linesPerBlock) / kernelAtom};

    for (unsigned int index = 0; index < 10; ++index)
    {
        move8<<<gridDim, blockDim, 0, streamRight>>>(
            inputBufferRight,
            inputPitch / kernelAtom,
            outputBufferRight,
            outputPitch / kernelAtom,
            width,
            depth / kernelAtom
            );

       move8<<<gridDim, blockDim, 0, streamLeft>>>(
            inputBufferLeft,
            inputPitch / kernelAtom,
            outputBufferLeft,
            outputPitch / kernelAtom,
            width, 
            depth / kernelAtom
            );
    }
}

int main(int argc, char** argv)
{
    int leastPriority, greatestPriority;

    cudaStream_t left, right;

    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    cudaStreamCreateWithPriority(&right, cudaStreamDefault, leastPriority);
    cudaStreamCreateWithPriority(&left, cudaStreamDefault, leastPriority);

    uint2 *inputRight, *outputRight, *inputLeft, *outputLeft;

    const size_t depth = 32;   // kernel runs ++faster if depth is 256
    const size_t height = 512;
    const size_t width = 1120;

    size_t inputPitch, outputPitch;
    // pitch-linear input: row width is 'depth' bytes, so the runtime rounds the pitch up (512 bytes here)
    cudaMallocPitch((void**)&inputRight, &inputPitch, depth, width * height);
    cudaMalloc(&inputLeft, inputPitch * width * height);

    cudaMalloc(&outputRight, width * height * depth * 2);
    cudaMalloc(&outputLeft, width * height * depth * 2);

    outputPitch = depth * 2;   // kernel runs ++faster without the factor 2

    move_chunk(
        inputRight, outputRight,
        inputLeft, outputLeft, 
        inputPitch, outputPitch,
        right, left, 
        width, height, depth);

    cudaDeviceSynchronize();

    cudaFree(inputRight);
    cudaFree(inputLeft);
    cudaFree(outputRight);
    cudaFree(outputLeft);

    cudaProfilerStop();
}
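In case it’s useful, this is roughly how I’m building and profiling it on the Xavier (compute capability 7.2); the file name and flags are just my local setup, so adjust as needed:

nvcc -O3 -arch=sm_72 -o move8_repro move8_repro.cu
sudo nvprof ./move8_repro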

Hi,

Thanks a lot for this sample.
It really helps us bring this issue to our internal team.

We will update you once we get a response.
Thanks.

Hi,

It looks like you are accessing the output buffer with a large interval.
A larger interval (outputPitch) may cause frequent cache misses and seriously decrease performance.

Do you think this causes your issue?
Thanks.

That doesn’t explain why a depth of 128 or 256 is much faster.

Nor does it explain why the depth-32 code runs fast (as expected) on the 1050, the TX2 (embedded like Xavier), the 1080 Ti and the 2080 Ti.

Hi,

The key point is that the cache sizes of the TX2 and the desktop GPUs are different.

This issue is about Xavier’s performance being bad, not the TX2’s. The TX2, 1050, 1080 and 2080 are all OK. The TX2 has a similar architecture to Xavier and a smaller cache, yet it does not suffer this dramatic drop in performance.

Do you have an answer as to why Xavier’s performance is a fraction of the other platforms’ in this scenario? Is there a hardware bug in the memory interface? Thanks again.

Just guessing here, but Xavier needs to arbitrate between many more bus masters (8 CPU cores, dedicated deep-learning accelerators, multiple camera input processors, etc.), and it may be that the memory controller had to reduce the available write-buffer depth or something in that area. The desktop cards only really worry about two bus masters (PCIe and the GPU itself), and the TX2 has fewer devices hanging off the memory bus, so it may not run into the same congestion.

Note: I’m not saying the bandwidth is stolen by those devices; I’m saying it’s quadratically harder to coordinate more devices wanting to talk to the same memory, and thus you have to make up for that complexity in hardware by dropping something somewhere else. No idea if this is actually the cause, but it’s a theory that fits the observation.

Also, it’s unlikely NVIDIA will answer the question you asked in a public forum in any meaningful way. They may perhaps answer if you’re deep in NDA and mutual collaboration agreement territory, already having bought millions of parts, and having built up a good personal relationship with actual architects on the team. Maybe.
My prediction is that the public answer will be no better than “Xavier performs better with a narrower spread across output addresses because of hardware design.” (But I’ve been wrong before.)