Memory copy by two CUDA kernels - why speed differs?

Can anyone help me understand the performance difference between the memCopy2dA and memCopy2dB kernels?

They are supposed to copy 2D data of size xLen,yLen from one place to another, but they use different strategies:

  • when memCopy2dA is used, blocks/threads cover the whole 2D space, since each thread in this kernel is supposed to copy only one data point
  • when memCopy2dB is used, blocks/threads are created only for one whole X row, and then each thread loops over the Y direction to copy all the data.

According to the profiler (nvvp), in both cases the global memory access pattern efficiency is 100%, and the X dimension is big enough to saturate the device for the “B” kernel (Titan X, 24 SMs). Unfortunately the “B” kernel is slower; on my machine the result is:

GB/s: 270.715
GB/s: 224.405

Additional question: Is it even possible to get close to the theoretical memory bandwidth limit, which is 336.48 GB/s (3505 MHz * 384 bits * 2 / 8)? At least my tests always show a maximum of around 271-272 GB/s.
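(For reference, one way to compute that theoretical figure programmatically — a small sketch, assuming a double-data-rate memory type so the factor of 2 applies; cudaDeviceProp reports the memory clock in kHz and the bus width in bits:)

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // memoryClockRate is in kHz, memoryBusWidth in bits; the factor 2 assumes DDR-type memory
    double peakGBs = 2.0 * (prop.memoryClockRate * 1e3) * (prop.memoryBusWidth / 8.0) / 1e9;
    std::cout << "Theoretical peak: " << peakGBs << " GB/s" << std::endl;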

Test code:

    #include <cuda_runtime.h>
    #include <cstdio>   // for printf in waitForCuda
    #include <device_launch_parameters.h>
    #include <iostream>
    #include <chrono>
    
    template<typename T>
    __global__ void memCopy2dA(T *in, T *out, size_t xLen, size_t yLen) {
        int xi = blockIdx.x * blockDim.x + threadIdx.x;
        int yi = blockIdx.y * blockDim.y + threadIdx.y;
        if (xi < xLen && yi < yLen) {
            out[yi * xLen + xi] = in[yi * xLen + xi];
        }
    }
    
    template<typename T>
    __global__ void memCopy2dB(T *in, T *out, size_t xLen, size_t yLen) {
        int xi = blockIdx.x * blockDim.x + threadIdx.x;
        if (xi < xLen) {
            size_t idx = xi;
            for (int y = 0; y < yLen; ++y) {
                out[idx] = in[idx];
                idx += xLen;
            }
        }
    }
    
    static void waitForCuda() {
        cudaDeviceSynchronize();
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(err));
    }
    
    int main() {
        typedef float T;
    
        size_t xLen = 24 * 32 * 64; //49152
        size_t yLen = 1024;
        size_t dataSize = xLen * yLen * sizeof(T);
    
        T *dInput;
        cudaMalloc(&dInput, dataSize);
        T *dOutput;
        cudaMalloc(&dOutput, dataSize);
    
        const int numOfRepetitions = 100;
        double gigabyte = 1000 * 1000 * 1000;
        {
            dim3 threadsPerBlock(64, 1);
            dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x,
                           (yLen + threadsPerBlock.y - 1) / threadsPerBlock.y);
    
            auto startTime = std::chrono::high_resolution_clock::now();
            for (int i = 0; i < numOfRepetitions; ++i) {
                memCopy2dA <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
                waitForCuda();
            }
            auto stopTime = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> elapsed = stopTime - startTime;
            std::cout << "GB/s: " << (2 * dataSize * numOfRepetitions) / elapsed.count() / gigabyte << std::endl;
        }
        {
            dim3 threadsPerBlock(64);
            dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x);
    
            auto startTime = std::chrono::high_resolution_clock::now();
            for (int i = 0; i < numOfRepetitions; ++i) {
                memCopy2dB <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
                waitForCuda();
            }
            auto stopTime = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> elapsed = stopTime - startTime;
            std::cout << "GB/s: " << ((2 * dataSize * numOfRepetitions) / elapsed.count()) / gigabyte << std::endl;
        }
    
        cudaFree(dInput);
        cudaFree(dOutput);
    
        return 0;
    }

compiled with:

nvcc -std=c++11 memTest.cu -o memTest

According to my understanding, both kernels do nicely aligned, coalesced memory reads/writes. In kernel “B”, each warp has 32 threads working on 4-byte (float) elements, so it reads/writes 128 bytes at a time. Also, the test data dimensions are “nice” — everything fits perfectly without any idle threads. That is why I’m confused.

Just adding new measurements for a 1080Ti (theoretical memory bandwidth 5505 MHz * 352 bits * 2 / 8 = 484 GB/s):

GB/s: 377.17
GB/s: 263.11

Does anyone have any idea why the speed of the kernels differs?

Reading #1 for the 1080Ti is consistent with what I’d expect, and I believe the same is true for your other tests. If you look around other threads, the efficiency of the memory system on these devices is around 75-80% of the theoretical peak.
In the implementation of memCopy2dB, you have this loop:

for (int y = 0; y < yLen; ++y)
...

that has to be executed yLen times by every thread that passes the (xi < xLen) test. Inside this loop idx is incremented by xLen, so at that point idx is no longer derived directly from the unique thread ID. This means the threads are busy doing serial work (something CPUs are far better at) and are not really saturating the memory bus like the other implementation.

The speed difference is because you are launching 1024× more threads in the first kernel. You can get performance improvements by making the total number of blocks C times the minimum amount required for maximum occupancy, where C could be 2, 3, 4, … maybe even 10 or more.
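One way to express that block-count heuristic in code — just a sketch: cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks of a given size can be resident per SM for a kernel, and the oversubscription factor C is something you would tune experimentally:

    int blockSize = 64;
    int blocksPerSm = 0;
    // how many 64-thread blocks of this kernel can be resident on one SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, memCopy2dB<float>, blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int C = 4;  // oversubscription factor: try 2, 3, 4, ... and measure
    dim3 numBlocks(C * blocksPerSm * prop.multiProcessorCount);
    dim3 threadsPerBlock(blockSize);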

For argument’s sake, if you change the second kernel to…

template<typename T>
__global__ void memCopy2dB(T *in, T *out, size_t xLen, size_t yLen) {
    // grid-stride loop over the flattened 2D array
    for (int xi = blockIdx.x * blockDim.x + threadIdx.x; xi < xLen * yLen; xi += blockDim.x * gridDim.x * gridDim.y) {
        out[xi] = in[xi];
    }
}

And if you launch with…

dim3 threadsPerBlock(64);
dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (yLen + threadsPerBlock.y - 1) / threadsPerBlock.y);

auto startTime = std::chrono::high_resolution_clock::now();
for (int i = 0; i < numOfRepetitions; ++i) {
    memCopy2dB <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
    waitForCuda();
}

I get the following (pretty reliably):

C:\Work>test
GB/s: 219.969
GB/s: 243.172

C:\Work>test
GB/s: 217.932
GB/s: 240.509

C:\Work>test
GB/s: 247.209
GB/s: 261.957

C:\Work>test
GB/s: 246.687
GB/s: 257.378

C:\Work>test
GB/s: 216.718
GB/s: 241.063

C:\Work>test
GB/s: 245.693
GB/s: 260.294

also, regarding ++i vs i++ in the repetition loops: the prefix form makes no difference to the iteration count here — the loop body runs numOfRepetitions times either way

Maximum achievable memory throughput is about 80% of theoretical, and that holds for both CPU and GPU memory subsystems at this time. Comparing your code with a 1D copy on a Quadro P2000 shows that you are about at the limit:

C:\Users\Norbert\My Programs>zcopy
zcopy: operating on vectors of 100000000 double2s (= 1.600e+009 bytes)
zcopy: using 128 threads per block, 65520 blocks
zcopy: mintime = 26.601 msec  throughput = 120.30 GB/sec

C:\Users\Norbert\My Programs>copy2d
GB/s: 116.869
GB/s: 112.015

Dear all, thank you for your input.

I have run some more tests and I think that I have an answer, or rather a guess, since I cannot find literature to confirm it. It all comes down to adding one line with a __syncthreads() call.

Here is my answer, copy/pasted from the same question I posted on Stack Overflow:
There is a lot of information about how important coalesced memory operations are at the warp level, when all threads in a warp should access the same aligned segments of memory.
But it seems that synchronizing the warps in a block makes coalescing possible at the inter-warp level as well, probably making better use of the memory bus width on different GPUs ← this is just my “explanation” of the problem, since I could not find any literature on it.

I would be very happy if anybody could confirm the above and/or point me to some good information on how memory is read by multiple warps/SMs. We should remember that the memory bus is much wider than the 32-byte aligned L2 segments (it can be as wide as 4096 bits in the case of monsters like the V100).

Also, to reach the maximum possible memory throughput we need to saturate all SMs equally with the maximum possible number of active blocks/threads.

Here is my final benchmark (interested readers should focus on the memCopy2dB and memCopy2dBnotSynchronized kernels).

#include <cuda_runtime.h>
#include <cstdio>   // for printf in waitForCuda
#include <device_launch_parameters.h>
#include <iostream>
#include <chrono>
#include <memory>
#include <vector>

template<typename T>
__global__ void memCopy1dKernel(T *in, T *out, size_t len) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < len) {
        out[idx] = in[idx];
    }
}

template<typename T>
__global__ void memCopy2dA(const T *in, T *out, size_t xLen, size_t yLen) {
    size_t xi = blockIdx.x * blockDim.x + threadIdx.x;
    size_t yi = blockIdx.y * blockDim.y + threadIdx.y;
    if (xi < xLen && yi < yLen) {
        out[yi * xLen + xi] = in[yi * xLen + xi];
    }
}

template<typename T>
__global__ void memCopy2dB(const T *in, T *out, size_t xLen, size_t yLen) {
    size_t xi = blockIdx.x * blockDim.x + threadIdx.x;
    if (xi < xLen) {
        for (size_t idx = xi; idx < yLen * xLen; idx += xLen) {
            __syncthreads(); // synchronization isn't needed for correctness, but it gives a super speedup!
            out[idx] = in[idx];
        }
    }
}

template<typename T>
__global__ void memCopy2dBnotSynchronized(const T *in, T *out, size_t xLen, size_t yLen) {
    size_t xi = blockIdx.x * blockDim.x + threadIdx.x;
    if (xi < xLen) {
        for (size_t idx = xi; idx < yLen * xLen; idx += xLen) {
            out[idx] = in[idx];
        }
    }
}

static void waitForCuda() {
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(err));
}

class ScopedBandwidth {
    const std::string name;
    const std::chrono::high_resolution_clock::time_point startTime;
    const size_t dataSize;
    const int numOfRepetitions;
    static constexpr double gigabyte = 1000 * 1000 * 1000;

public:
    ScopedBandwidth(const std::string &name, size_t dataSize, size_t numOfRepetitions) : name(name), startTime(std::chrono::high_resolution_clock::now()), dataSize(dataSize), numOfRepetitions(numOfRepetitions) {}
    ~ScopedBandwidth() {
        auto stopTime = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = stopTime - startTime;
        std::cout << name << " GB/s: " << (2 * dataSize * numOfRepetitions) / elapsed.count() / gigabyte << std::endl;
    }
};


template <typename T>
void warmUp(int numOfRepetitions, int numOfThreads, size_t xLen, size_t yLen, T *dInput, T *dOutput) {
    const size_t numOfElements = xLen * yLen;

    dim3 threadsPerBlock(numOfThreads);
    dim3 numBlocks((numOfElements + threadsPerBlock.x - 1) / threadsPerBlock.x);

    {
        for (int i = 0; i < numOfRepetitions; ++i) {
            memCopy1dKernel <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, numOfElements);
        }
        waitForCuda();
    }
}

template <typename T>
void test1D(int numOfRepetitions, int numOfThreads, size_t xLen, size_t yLen, T *dInput, T *dOutput) {
    const size_t numOfElements = xLen * yLen;
    const size_t dataSize = numOfElements * sizeof(T);

    dim3 threadsPerBlock(numOfThreads);
    dim3 numBlocks((numOfElements + threadsPerBlock.x - 1) / threadsPerBlock.x);

    {
        ScopedBandwidth sb("test1D", dataSize, numOfRepetitions);
        for (int i = 0; i < numOfRepetitions; ++i) {
            memCopy1dKernel <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, numOfElements);
        }
        waitForCuda();
    }
}

template <typename T>
void test2DA(int numOfRepetitions, int numOfThreads, size_t xLen, size_t yLen, T *dInput, T *dOutput) {
    const size_t numOfElements = xLen * yLen;
    const size_t dataSize = numOfElements * sizeof(T);

    dim3 threadsPerBlock(numOfThreads, 1);
    dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (yLen + threadsPerBlock.y - 1) / threadsPerBlock.y);

    {
        ScopedBandwidth sb("test2DA", dataSize, numOfRepetitions);
        for (int i = 0; i < numOfRepetitions; ++i) {
            memCopy2dA <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
        }
        waitForCuda();
    }
}

template <typename T>
void test2DB(int numOfRepetitions, int numOfThreads, size_t xLen, size_t yLen, T *dInput, T *dOutput) {
    const size_t numOfElements = xLen * yLen;
    const size_t dataSize = numOfElements * sizeof(T);

    dim3 threadsPerBlock(numOfThreads);
    dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x);

    {
        ScopedBandwidth sb("test2DB", dataSize, numOfRepetitions);
        for (int i = 0; i < numOfRepetitions; ++i) {
            memCopy2dB <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
        }
        waitForCuda();
    }
}

template <typename T>
void test2DBnotSynchronized(int numOfRepetitions, int numOfThreads, size_t xLen, size_t yLen, T *dInput, T *dOutput) {
    const size_t numOfElements = xLen * yLen;
    const size_t dataSize = numOfElements * sizeof(T);

    dim3 threadsPerBlock(numOfThreads);
    dim3 numBlocks((xLen + threadsPerBlock.x - 1) / threadsPerBlock.x);

    {
        ScopedBandwidth sb("test2DBnotSynchronized", dataSize, numOfRepetitions);
        for (int i = 0; i < numOfRepetitions; ++i) {
            memCopy2dBnotSynchronized <<< numBlocks, threadsPerBlock >>> (dInput, dOutput, xLen, yLen);
        }
        waitForCuda();
    }
}

template <typename T>
void setMemory(T *dInput, T *dOutput, size_t dataSize) {
    cudaMemset(dInput, 0, dataSize);
    cudaMemset(dOutput, 11, dataSize);
}

template <typename T>
void checkMemory(T *dOutput, size_t numOfElements) {
    std::unique_ptr<T[]> hMemory(new T[numOfElements]);
    cudaMemcpy(hMemory.get(), dOutput, numOfElements * sizeof(T), cudaMemcpyDeviceToHost);
    size_t cnt = 0;
    for (size_t i = 0; i < numOfElements; ++i) {
        if (hMemory[i] != 0) { // check if memory was zeroed
            cnt++;
        }
    }

    if (cnt > 0) {
        std::cout << "Memory not copied properly! Found " << numOfElements << "/" << cnt << " errors." << std::endl;
        exit(1);
    }
}

int main() {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);
    const int smCount = deviceProp.multiProcessorCount;
    std::cout << "Number of SMs = " << smCount << std::endl;
    std::cout << "Size of global mem = " << static_cast<float>(deviceProp.totalGlobalMem / 1048576.0f) << "MB" << std::endl;

    using T=float;
    const size_t yLen = 1024;
    // calculate coefficient to get for both buffers about half of memory in GPU
    const size_t halfMemCoef = deviceProp.totalGlobalMem / (yLen * 32 * 64 * smCount * sizeof(T)) / 2 / 2;
    // aim for max efficiency: #SM * 32 (max active blocks per SM) * 64 threads (2048 max active threads / 32 active blocks)
    const size_t xLen = smCount * 32 * 64 * halfMemCoef;
    const size_t numOfElements = xLen * yLen;
    const size_t dataSize = numOfElements * sizeof(T);
    std::cout << "Used DataSize = " << dataSize / 1024 / 1024 << "MB\n";

    T *dInput;
    cudaMalloc(&dInput, dataSize);
    T *dOutput;
    cudaMalloc(&dOutput, dataSize);

    const int numOfRepetitions = 50;

    warmUp(numOfRepetitions, 64, xLen, yLen, dInput, dOutput);

    std::vector<int> numThreadsPool = {32, 64, 128, 256};

    for (int numOfThreads : numThreadsPool) {
        std::cout << "-------- numOfThreads in block = " << numOfThreads << std::endl;

        setMemory(dInput, dOutput, dataSize);
        test1D(numOfRepetitions, numOfThreads, xLen, yLen, dInput, dOutput);
        checkMemory(dOutput, numOfElements);

        setMemory(dInput, dOutput, dataSize);
        test2DA(numOfRepetitions, numOfThreads, xLen, yLen, dInput, dOutput);
        checkMemory(dOutput, numOfElements);

        setMemory(dInput, dOutput, dataSize);
        test2DB(numOfRepetitions, numOfThreads, xLen, yLen, dInput, dOutput);
        checkMemory(dOutput, numOfElements);

        setMemory(dInput, dOutput, dataSize);
        test2DBnotSynchronized(numOfRepetitions, numOfThreads, xLen, yLen, dInput, dOutput);
        checkMemory(dOutput, numOfElements);
    }

    cudaFree(dInput);
    cudaFree(dOutput);

    return 0;
}

With the above benchmark, these are my results from a 1080Ti:

Number of SMs = 28
Size of global mem = 11177.2MB
Used DataSize = 2688MB
-------- numOfThreads in block = 32
test1D GB/s: 206.252
test2DA GB/s: 197.154
test2DB GB/s: 282.854
test2DBnotSynchronized GB/s: 292.736
-------- numOfThreads in block = 64
test1D GB/s: 358.422
test2DA GB/s: 369.371
test2DB GB/s: 349.66
test2DBnotSynchronized GB/s: 299.613
-------- numOfThreads in block = 128
test1D GB/s: 369.34
test2DA GB/s: 369.274
test2DB GB/s: 358.829
test2DBnotSynchronized GB/s: 310.661
-------- numOfThreads in block = 256
test1D GB/s: 369.489
test2DA GB/s: 369.395
test2DB GB/s: 359.54
test2DBnotSynchronized GB/s: 321.461

As we can see, the synchronized kernel memCopy2dB is indeed faster than the non-synchronized version, even though we lose some time on the synchronization itself.

Your guess is as good as anybody’s, I think. The efficiency of a DRAM interface will depend on the DRAM technology and the specifics of the address stream presented to it. The former would presumably be influenced by the choice of DDR4 / GDDR5 / GDDR5X / GDDR6 / HBM2 (internal row/column organization, number of open pages, number of internal channels, read/write turnaround latency). The latter depends on the details of the GPU’s internal load/store paths, including width, latency, buffering, re-sorting, and interaction with scheduling mechanisms. NVIDIA does not make the details of any of that available.

So there are a lot of variables in play, and I doubt that a good model can be formulated across GPU models and generations without knowing a lot of the internals; even then it might be difficult. In a former life I was responsible for the memory-subsystem portion of a CPU model, and in memory-intensive benchmarks I had trouble limiting error to 10% despite modelling many of the internal mechanisms of both CPU and DRAM.

Using heuristics-based tuning for a particular GPU is probably the best way to go if you need to squeeze the last few percent of copy performance out of a particular platform. The first rule of copy performance on any platform: Don’t copy!

Sure, I agree with you. Anyway, sometimes (always?) you need to get as much from your GPU as possible, and then you need good approaches.
For me, copying memory is the best test to check what you can get from a given GPU in practical use. Any real algorithm will probably put some more logic between the reads and writes, so instead of just copying we have some more sophisticated data processing ;-)

Here are some more tests from different GPUs. This ‘synchronization technique’ seems to work consistently, and on my laptop’s GPU it (surprisingly) gives the best results:

GT 750M (on laptop):

Number of SMs = 2
Size of global mem = 2047.56MB
DataSize = 496MB
-------- numOfThreads in block = 32
test1D GB/s: 11.9956
test2DA GB/s: 10.8648
test2DB GB/s: 16.0146
test2DBnotSynchronized GB/s: 16.4039
-------- numOfThreads in block = 64
test1D GB/s: 21.8001
test2DA GB/s: 19.5055
test2DB GB/s: 26.623
test2DBnotSynchronized GB/s: 28.3895
-------- numOfThreads in block = 128
test1D GB/s: 38.3179
test2DA GB/s: 33.1063
test2DB GB/s: 45.3577
test2DBnotSynchronized GB/s: 37.7094
-------- numOfThreads in block = 256
test1D GB/s: 37.1336
test2DA GB/s: 30.7438
test2DB GB/s: 44.3298
test2DBnotSynchronized GB/s: 38.9371

‘regular’ 1080:

Number of SMs = 20
Size of global mem = 8119.56MB
Used DataSize = 1920MB
-------- numOfThreads in block = 32
test1D GB/s: 195.639
test2DA GB/s: 194.296
test2DB GB/s: 172.84
test2DBnotSynchronized GB/s: 175.09
-------- numOfThreads in block = 64
test1D GB/s: 237.32
test2DA GB/s: 237.333
test2DB GB/s: 219.705
test2DBnotSynchronized GB/s: 186.942
-------- numOfThreads in block = 128
test1D GB/s: 236.89
test2DA GB/s: 236.934
test2DB GB/s: 225.896
test2DBnotSynchronized GB/s: 193.252
-------- numOfThreads in block = 256
test1D GB/s: 236.537
test2DA GB/s: 236.603
test2DB GB/s: 225.727
test2DBnotSynchronized GB/s: 197.74

You found a useful heuristic for current hardware. Caveat: Heuristics tend to change over time. For example, at one time, it was crucial for GPU copy performance to make each data access in the code as large as possible. Today performance differences based on access width are minimal, thanks to different hardware designs.
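For instance, the old “make each access as wide as possible” advice would have looked roughly like this hypothetical float4-based copy (a sketch only, assuming the element count is a multiple of 4 and the pointers are 16-byte aligned):

    // Each thread moves 16 bytes per iteration via float4 instead of 4 bytes via float.
    __global__ void memCopyVec4(const float4 *in, float4 *out, size_t len4) {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < len4;
             i += (size_t)blockDim.x * gridDim.x) {
            out[i] = in[i];
        }
    }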

For many practical cases of high-performance processing, data movement is now the performance limiter, and a major power burner to boot. Over the last 15 years, execution unit throughput has grown much faster than memory throughput, and execution units are very efficient compared to moving data around if you look at the pJ burned. There are physical reasons for that.

I sometimes see code like the following: process vector A into vector B, process vector B into vector C, etc. This style probably dates back to the days of vector processors, but it is not the way to best utilize a GPU. It is usually better to slice the vectors and take each slice through all stages of processing in one go: more compute, less data movement. Of course, in some instances one needs composability, and arranging the processing by data layers may be the way to go in those isolated instances.
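As a hypothetical illustration of that fusion idea (the stage functions here are placeholders, not anything from this thread):

    // Unfused style: the intermediate vector B makes a round trip through global memory.
    //   stage1<<<grid, block>>>(A, B);   // B[i] = f(A[i])
    //   stage2<<<grid, block>>>(B, C);   // C[i] = g(B[i])
    // Fused style: the intermediate value stays in a register.
    __global__ void fusedStages(const float *A, float *C, size_t n) {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += (size_t)blockDim.x * gridDim.x) {
            float b = A[i] * 2.0f;   // stage 1 (placeholder for f)
            C[i]    = b + 1.0f;      // stage 2 (placeholder for g)
        }
    }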

I’m completely fine with that. I need a solution that works now and probably in the near future.
It is obvious that the next generation(s) of CPUs and GPUs will require different kinds of optimizations — discovering them is a kind of (never-ending) pleasure :-)
However, it would be really helpful if NVIDIA and other companies revealed a little more information about the hardware, since then we could find solutions to problems based on understanding instead of … guessing.