Partition Camping on newer architectures (Long Scoreboard?)

Hi!

I’ve been finding bits and pieces on “Partition Camping” in threads dating back to 09/10, but I haven’t seen anything new on it. Would someone be able to expand on the state-of-play with respect to how it works on the newer architectures (V100s, say), as well as the current performance costs of it, and the best way to overcome it when writing kernels

Perhaps unrelated (perhaps not): in profiling a couple of my kernels which read exclusively through global memory, I’ve noticed a warp stall which pops up repeatedly - “Long Scoreboard” - which happens to become substantially more pronounced when adding numbers of blocks to the kernel (whilst keeping the occupancy the same). Is this error actually telling me that I may be suffering from Partition Camping?

Thanks in advance.

1 Like

Partition camping issues went the way of the dodo after Fermi, if I recall correctly. If memory serves, this was accomplished by strategically scrambling address bits in the memory controllers. This makes worst case scenarios like partition camping extremely unlikely, but I would think for any such scrambling scheme one could find a carefully constructed setup that defeats it (just as one can trigger cache or TLB thrashing).

From a software perspective, there is nothing to do here. The memory hierarchy of modern GPUs (say, Maxwell and later) is sophisticated and doesn’t need all the hand-holding required for early GPUs. Early GPUs had minimal hardware expended on compute. Now that GPU computing is in the mainstream and a major revenue generator for NVIDIA, GPU architectures are designed with the needs of GPU computing very much a main consideration. This would in particular apply to a chip like the V100, which drives the fastest supercomputers in the world.

I have no idea what “long scoreboard” refers to. From my time building processors I would assume it has something to do with scheduling inside the processor, nothing related to memory directly. What (if anything) does the profiler documentation say about this stall?

Discussions about performance properties of code typically require more than vague descriptions, and ideally access to the code itself. Can you post the (cut-down version of) code in question?

Absolutely… though it’s written to test the DGX-2 / HGX-2 - so I doubt many people out there have the hardware capable of debugging it! :(

The code is trying to find a faster way to run sparse matrix multiplies (in some scenarios) instead of csrmm / cscmm, by instead defaulting to ELLPACK (using padding for non-conforming matrices), which benefits hugely from coalesced memory reads - as each ‘row’ of non zeroes is a multiple of 32

#include <string>
#include <iostream>
#include <chrono>
#include <stdio.h>
#include <stdexcept>
#include <random>
#include <iterator>
#include <fstream>
#include <cublas_v2.h>

#ifndef CUDA
#define CUDA(expr) do {                             \
        cudaError_t _e = (expr);                    \
if (_e == cudaSuccess) break;               \
        char errstr[128];                           \
        snprintf(errstr, 128, \
        "%s(%d) CUDA Error(%d)\n", \
        __FILE__, __LINE__, _e);           \
        throw std::runtime_error(errstr);                       \
} while (0)
#endif

#define CUSPARSE(expr) do {                         \
        cusparseStatus_t _e = (expr);               \
if (_e == CUSPARSE_STATUS_SUCCESS) break;   \
        char errstr[128];                           \
        snprintf(errstr, 128, \
        "%s(%d) CUSPARSE Error(%d)\n", \
        __FILE__, __LINE__, _e);           \
        throw std::runtime_error(errstr);                       \
} while (0)

#define CUBLAS(call) do {                               \
        cublasStatus_t _e = (call);                     \
if (_e == CUBLAS_STATUS_SUCCESS) break;         \
        char errstr[128];                           \
        snprintf(errstr, 128, \
        "%s(%d) CUBLAS Error(%d)\n", \
        __FILE__, __LINE__, _e);           \
        throw std::runtime_error(errstr);                       \
} while (0)

#define DOM_FILE (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : strrchr(__FILE__, '\') ? strrchr(__FILE__, '\') + 1 : __FILE__)

#define MSG(msg,...) do {printf("%s(%d):" msg "\n",DOM_FILE,__LINE__,##__VA_ARGS__);} while(0)

#define NUM_SM 84
#define BLOCKS_PER_SM 16
#define BLOCKSIZE 128

//__launch_bounds__(BLOCKSIZE, BLOCKS_PER_SM)
__global__ void ELLgemm_multi_gpu(float ** results, int* index, float * vals, float * dense, int * maps, int * lengths, unsigned int nnz, unsigned int y)
{
        const int tidx = threadIdx.x;
        const int gpu = threadIdx.y; //this tells us what gpu to operate on

        const int off1 = blockIdx.x;
        const int off2 = gridDim.x;

        float *result = results[gpu];
        int length = lengths[gpu];
        int map = maps[gpu];

        index += nnz * (off1+map) + tidx;
        vals += nnz * (off1+map) + tidx;
        result += y * off1 + tidx;
        dense += tidx;

        //if (blockIdx.x > 0) return;

        for (int read = off1; read<length; read += off2)
        {
                for (int b = 0; b<y; b += 32)
                {
                        float calc = 0.0f;
                        for (int a = 0; a<nnz; a += 32)
                        {
                                int index_reg = index[a];
                                int vals_reg = vals[a];
                                for (int i = 0; i<32; i++)
                                {
                                        int idx_use = __shfl_sync(0xFFFFFFFF, index_reg, i);
                                        if (idx_use >= 0)
                                        {
                                                float vals_use = __shfl_sync(0xFFFFFFFF, vals_reg, i);
                                                float dense_use = dense[y * idx_use + b];
                                                calc += dense_use * vals_use;
                                        }
                                        //else if (tidx==0) printf("Error zero index, index is %d, gpu : %d, read : %d, b : %d, a is %d\n",index_reg, gpu,read,b,a);
                                }
                                //__syncwarp();
                        }
                        atomicAdd(result + b, calc);
                }
                index += nnz * off2;
                vals += nnz * off2;
                result += y * off2;
        }
}

//__launch_bounds__(BLOCKSIZE, 8)
__global__ void ELLgemmT_multi_gpu(float * result, int* index, float * vals, float  ** __restrict__ denses, int * __restrict__ maps, int * __restrict__ lengths, unsigned int nnz, unsigned int y) //, unsigned int nnz, unsigned int y
{
        const int offz = blockIdx.z;
        const int gdz = gridDim.z;

        const int gpu = threadIdx.y; //this tells us what gpu to operate on

        const float *dense = denses[gpu]; //gets the pointer to the dense array on the right gpu
        int length = lengths[gpu]; //get the length for said array
        int map = maps[gpu]; //gets the offset for array

        //increment ELL to the corresponding chunk
        {
                int offset = 32 * blockIdx.x + threadIdx.x + (map + offz)*nnz;
                index += offset;
                vals += offset;
        }

        dense += 32 * blockIdx.y + threadIdx.x + offz*y;
        result += 32 * blockIdx.y + threadIdx.x;

for (int i = offz; i<length; i += gdz)
        {
                //Idx_shared[tid] = index[0];
                //Val_shared[tid] = vals[0];
                float dense_reg = dense[0];
                //__syncwarp();
                int idx_reg = index[0];
                int val_reg = vals[0];
                for (int j = 0; j<32; j++)
                {
                        //int write = Idx_shared[shared_offset + j];
                        int write = __shfl_sync(0xFFFFFFFF, idx_reg, j);
                        //if (write >= 0) atomicAdd(result + y * write, dense_reg * Val_shared[shared_offset + j]);
                        if (write >= 0) atomicAdd(result + y * write, dense_reg * __shfl_sync(0xFFFFFFFF, val_reg, j));
                }
                //__syncwarp();
                index += gdz*nnz;
                vals += gdz*nnz;
                dense += gdz*y;
        }
}

void ELLgemmT(float *result, int *index, float *vals, float **denses, int *maps, int *lengths, unsigned int nnz, unsigned int y, int ngpu, int totalblocks)
{
        int blocksx = nnz / 32;
        int blocksy = y / 32;
        dim3 threadsPerBlock(32, ngpu);
        dim3 numblocks(blocksx, blocksy, totalblocks / (blocksx*blocksy));
        //dim3 numblocks(blocksx, blocksy, 2 * BLOCKS_PER_SM * NUM_SM  / (blocksx*blocksy));
        //int blocksreq = NUM_SM * 64 / ngpu;
        //dim3 numblocks(blocksx, blocksy, blocksreq / blocksx / blocksy);
        ELLgemmT_multi_gpu << <numblocks, threadsPerBlock >> >(result, index, vals, denses, maps, lengths, nnz, y);
}

void ELLgemmT(float *result, int *index, float *vals, float **denses, int *maps, int *lengths, unsigned int nnz, unsigned int y, int ngpu)
{
        int blocksx = nnz / 32;
        int blocksy = y / 32;
        dim3 threadsPerBlock(32, ngpu);
        dim3 numblocks(blocksx, blocksy, 2 * BLOCKS_PER_SM * NUM_SM / (blocksx*blocksy));
        //int blocksreq = NUM_SM * 64 / ngpu;
        //dim3 numblocks(blocksx, blocksy, blocksreq / blocksx / blocksy);
        ELLgemmT_multi_gpu << <numblocks, threadsPerBlock >> >(result, index, vals, denses, maps, lengths, nnz, y);
}

void ELLgemm(float **results, int *index, float *vals, float *dense, int *maps, int *lengths, unsigned int nnz, unsigned int y, int ngpu, int totalblocks)
{
        dim3 threadsPerBlock(32, ngpu);
        int numblocks = totalblocks;
        //int blocksreq = NUM_SM * 64 / ngpu;
        //int numblocks = blocksreq;
        ELLgemm_multi_gpu << <numblocks, threadsPerBlock >> >(results, index, vals, dense, maps, lengths, nnz, y);
}

void ELLgemm(float **results, int *index, float *vals, float *dense, int *maps, int *lengths, unsigned int nnz, unsigned int y, int ngpu)
{
        dim3 threadsPerBlock(32, ngpu);
        int numblocks = BLOCKS_PER_SM * NUM_SM;
        //int blocksreq = NUM_SM * 64 / ngpu;
        //int numblocks = blocksreq;
        ELLgemm_multi_gpu << <numblocks, threadsPerBlock >> >(results, index, vals, dense, maps, lengths, nnz, y);
}

class Timer
{
public:
        Timer() : beg_(clock_::now()) {}
        void start() { beg_ = clock_::now(); }
        double stop() const {
                return std::chrono::duration_cast<second_>
                        (clock_::now() - beg_).count();
        }

private:
        typedef std::chrono::high_resolution_clock clock_;
        typedef std::chrono::duration<double, std::ratio<1> > second_;
        std::chrono::time_point<clock_> beg_;
};

int main(int argc, char* argv[])
{
        try
        {
                if(argc!=2){
                        throw;
                }

                int ngpu = atoi(argv[1]);

Timer t;

int rows_per_gpu = 375000;//6e6 / ngpu;
                int cols_per_gpu = 625000;//1e7 / ngpu;
                int cols_per_gpu1 = 3125000;//5e7 / ngpu;
                int nnz = 96;
                int batch = 32;

                int *map;
                int *elems;
                float **out_dvc;
                float **out2_dvc;

                CUDA(cudaMallocHost((void**)&map, ngpu * sizeof(int)));
                CUDA(cudaMallocHost((void**)&elems, ngpu * sizeof(int)));
                CUDA(cudaMallocHost((void**)&out_dvc,ngpu * sizeof(float**)));
                CUDA(cudaMallocHost((void**)&out2_dvc, ngpu * sizeof(float**)));

                int **colIdx_dvc = (int **)malloc(ngpu*sizeof(int**));
                int **colIdx1_dvc = (int **)malloc(ngpu*sizeof(int**));
                float **Val_dvc = (float **)malloc(ngpu * sizeof(float**));
                float **data_dvc = (float **)malloc(ngpu * sizeof(float**));

                bool runp2pmode = true;

                for (int i = 0; i < ngpu; i++)
                {
                        CUDA(cudaSetDevice(i));
                        map[i] = rows_per_gpu * i;
                        elems[i] = rows_per_gpu;
                        for (int j = 0; j < ngpu; j++)
                        {
                                if (i != j)
                                {
                                        int access;
                                        CUDA(cudaDeviceCanAccessPeer(&access, j, i));
                                        runp2pmode = runp2pmode & (bool)access;
                                        if (access)
                                        {
                                                CUDA(cudaDeviceEnablePeerAccess(j, 0));
                                                MSG("Enabled P2P %d ->%d", i, j, access);
                                        }
                                }
                        }

                        int *colIdx = (int *)malloc(ngpu * rows_per_gpu * nnz * sizeof(int));
                        int *colIdx1 = (int *)malloc(ngpu * rows_per_gpu * nnz * sizeof(int));
                        float *Val = (float *)malloc(ngpu * rows_per_gpu * nnz * sizeof(float));
                        float *data = (float *)malloc(cols_per_gpu1 * batch * sizeof(float));
                        float *out = (float *)malloc(rows_per_gpu * batch * sizeof(float));

                        //int *Idx_Count = (int *)malloc(cols_per_gpu * sizeof(int));
                        //for (int j = 0; j < cols_per_gpu; j++) Idx_Count[j] = 0;

                        for (int j = 0; j < ngpu * rows_per_gpu * nnz; j++) {
                                Val[j] = 1; // ((float)std::rand()) / RAND_MAX;
                                colIdx[j] = (((long long int)j*(long long int)(cols_per_gpu - 1)) % (long long int) cols_per_gpu);
                                colIdx1[j] = (((long long int)j*(long long int)(cols_per_gpu1 - 1)) % (long long int) cols_per_gpu1);

                        }

                        for (int j = 0; j < cols_per_gpu1 * batch; j++)
                        {
                                data[j] = 1.0f;
                        }

                        for (int j = 0; j < rows_per_gpu * batch; j++) out[j] = 0.0f;

                        CUDA(cudaMalloc((void **)&colIdx_dvc[i], ngpu * rows_per_gpu * nnz * sizeof(int)));
                        CUDA(cudaMalloc((void **)&colIdx1_dvc[i], ngpu * rows_per_gpu * nnz * sizeof(int)));
                        CUDA(cudaMalloc((void **)&Val_dvc[i], ngpu * rows_per_gpu * nnz * sizeof(float)));
                        CUDA(cudaMalloc((void **)&data_dvc[i], cols_per_gpu1 * batch * sizeof(float)));
                        CUDA(cudaMalloc((void **)&out_dvc[i], rows_per_gpu * batch * sizeof(float)));

                        CUDA(cudaMemcpy(colIdx_dvc[i], colIdx, ngpu * rows_per_gpu * nnz * sizeof(int), cudaMemcpyDefault));
                        CUDA(cudaMemcpy(colIdx1_dvc[i], colIdx1, ngpu * rows_per_gpu * nnz * sizeof(int), cudaMemcpyDefault));
                        CUDA(cudaMemcpy(Val_dvc[i], Val, ngpu * rows_per_gpu * nnz * sizeof(float), cudaMemcpyDefault));
                        CUDA(cudaMemcpy(data_dvc[i], data, cols_per_gpu1 * batch * sizeof(float), cudaMemcpyDefault));
                        CUDA(cudaMemcpy(out_dvc[i], out, rows_per_gpu * batch * sizeof(float), cudaMemcpyDefault));

                        free(out);
                        free(data);
                        free(Val);
                        free(colIdx);
                        free(colIdx1);

                        std::cout << "Gpu " << i << " allocated" << std::endl;

                }

                for (int i = 0; i < ngpu; i++)
                {
                        CUDA(cudaSetDevice(i));
                        CUDA(cudaDeviceSynchronize());

                }

                //int pfac=19;
                //int blockcnt = (int)ceil(100 * pow(1.5f, (double)pfac));

                MSG("Peer To Peer mode enabled? %s", runp2pmode ? "true" : "false");
                CUDA(cudaSetDevice(0));
                CUDA(cudaDeviceSynchronize());

                for (int pfac=0; pfac<20; pfac++)
{
int blockcnt = (int)ceil(100 * pow(1.5f, (double)pfac));
                t.start();
                ELLgemm(out_dvc, colIdx_dvc[0], Val_dvc[0], data_dvc[0], map, elems, nnz, batch, ngpu, blockcnt);
                CUDA(cudaPeekAtLastError());
                CUDA(cudaDeviceSynchronize());
                MSG("%d blocks, Emul time Taken per iteration %f", blockcnt, t.stop());
}

                float *out_hst = (float *)malloc(rows_per_gpu*batch * sizeof(float));
                std::ofstream os1("./out_dvc1.csv", std::ios::binary | std::ios::out);
                for (int i = 0; i < ngpu; i++)
                {
                        CUDA(cudaSetDevice(i));
                        CUDA(cudaMemcpy(out_hst, out_dvc[i], rows_per_gpu*batch * sizeof(float), cudaMemcpyDefault));
                        os1.write(reinterpret_cast<const char*>(out_hst), std::streamsize(rows_per_gpu*batch * sizeof(float)));
                }

                CUDA(cudaSetDevice(0));
                CUDA(cudaDeviceSynchronize());
for (int pfac=0; pfac<20; pfac++)
{
int blockcnt = (int)ceil(100 * pow(1.5f, (double)pfac));
                t.start();
                ELLgemm(out_dvc, colIdx1_dvc[0], Val_dvc[0], data_dvc[0], map, elems, nnz, batch, ngpu, blockcnt);
                CUDA(cudaPeekAtLastError());
                CUDA(cudaDeviceSynchronize());
                MSG("%d blocks, Emul time Taken per iteration %f", blockcnt, t.stop());
}
                std::ofstream os2("./out_dvc2.csv", std::ios::binary | std::ios::out);
                for (int i = 0; i < ngpu; i++)
                {
                        CUDA(cudaSetDevice(i));
                        CUDA(cudaMemcpy(out_hst, out_dvc[i], rows_per_gpu*batch * sizeof(float), cudaMemcpyDefault));
                        os2.write(reinterpret_cast<const char*>(out_hst), std::streamsize(rows_per_gpu*batch * sizeof(float)));
                }
                free(out_hst);

        }
        catch (std::exception &e)
        {
                {
                        std::cerr << " exception caught: " << e.what() << '\n';
                }
        }
        catch (...)
        {
                //MSG("Unknown Error");
        }
        //std::cin.get();
}

Here’s the output that’s interesting to me:

c++11_cuda.cu(393):100 blocks, Emul time Taken per iteration 0.010653
c++11_cuda.cu(393):150 blocks, Emul time Taken per iteration 0.007467
c++11_cuda.cu(393):225 blocks, Emul time Taken per iteration 0.005421
c++11_cuda.cu(393):338 blocks, Emul time Taken per iteration 0.006916
c++11_cuda.cu(393):507 blocks, Emul time Taken per iteration 0.005345
c++11_cuda.cu(393):760 blocks, Emul time Taken per iteration 0.005964
c++11_cuda.cu(393):1140 blocks, Emul time Taken per iteration 0.005792
c++11_cuda.cu(393):1709 blocks, Emul time Taken per iteration 0.006914
c++11_cuda.cu(393):2563 blocks, Emul time Taken per iteration 0.008924
c++11_cuda.cu(393):3845 blocks, Emul time Taken per iteration 0.011919
c++11_cuda.cu(393):5767 blocks, Emul time Taken per iteration 0.018326
c++11_cuda.cu(393):8650 blocks, Emul time Taken per iteration 0.025692
c++11_cuda.cu(393):12975 blocks, Emul time Taken per iteration 0.030417
c++11_cuda.cu(393):19462 blocks, Emul time Taken per iteration 0.038933
c++11_cuda.cu(393):29193 blocks, Emul time Taken per iteration 0.035674
c++11_cuda.cu(393):43790 blocks, Emul time Taken per iteration 0.034968
c++11_cuda.cu(393):65685 blocks, Emul time Taken per iteration 0.034947
c++11_cuda.cu(393):98527 blocks, Emul time Taken per iteration 0.035048

c++11_cuda.cu(393):147790 blocks, Emul time Taken per iteration 0.035199
c++11_cuda.cu(393):221684 blocks, Emul time Taken per iteration 0.035203
EmulT running
c++11_cuda.cu(427):100 blocks, EmulT time Taken per iteration 0.004580
c++11_cuda.cu(427):150 blocks, EmulT time Taken per iteration 0.004593
c++11_cuda.cu(427):225 blocks, EmulT time Taken per iteration 0.004386
c++11_cuda.cu(427):338 blocks, EmulT time Taken per iteration 0.005068
c++11_cuda.cu(427):507 blocks, EmulT time Taken per iteration 0.004087
c++11_cuda.cu(427):760 blocks, EmulT time Taken per iteration 0.004080
c++11_cuda.cu(427):1140 blocks, EmulT time Taken per iteration 0.003910
c++11_cuda.cu(427):1709 blocks, EmulT time Taken per iteration 0.003682
c++11_cuda.cu(427):2563 blocks, EmulT time Taken per iteration 0.003617
c++11_cuda.cu(427):3845 blocks, EmulT time Taken per iteration 0.003621
c++11_cuda.cu(427):5767 blocks, EmulT time Taken per iteration 0.003459
c++11_cuda.cu(427):8650 blocks, EmulT time Taken per iteration 0.003520
c++11_cuda.cu(427):12975 blocks, EmulT time Taken per iteration 0.003530
c++11_cuda.cu(427):19462 blocks, EmulT time Taken per iteration 0.003699
c++11_cuda.cu(427):29193 blocks, EmulT time Taken per iteration 0.003726
c++11_cuda.cu(427):43790 blocks, EmulT time Taken per iteration 0.003838
c++11_cuda.cu(427):65685 blocks, EmulT time Taken per iteration 0.003877
c++11_cuda.cu(427):98527 blocks, EmulT time Taken per iteration 0.003920
c++11_cuda.cu(427):147790 blocks, EmulT time Taken per iteration 0.003996
c++11_cuda.cu(427):221684 blocks, EmulT time Taken per iteration 0.004095
Old Emul running
c++11_cuda.cu(510):Old Emul time taken per iteration 0.014299
Old EmulT running
c++11_cuda.cu(566):Old EmulT time taken per iteration 0.012525
Finished.

I’ll see if I can get a single gpu version to replicate same - it’s designed for 16 gpus

That certainly applies to me. Lucky you. Maybe someone else here has access to such a platform.

On a first perusal of the code nothing jumps out at me as questionable from a performance perspective. Can you document, for the benefit of others, how exactly you compile this code (CUDA version, exact nvcc invocation)?

How much performance do you think you are losing to the “Long Scoreboard” stall?

I’ll get back to you with those details, I’m writing this up from memory I’m afraid.

On an aside, this article was written only in 2015 and still seems to indicate similar issues (unless I’ve failed to understand the problem):

That article refers to an issue very much particular / isolated to the GTX 970, which caused a portion of its memory to be connected with much less performance than the bulk of it. NVIDIA caught a lot of flak for this (basically they stood accused of using chicken wire and duct tape to construct a Franken-GPU to maximize profits), and this has helped to ensure there have not been any repeats. Or at least I have not read of a comparable case since then.

The issue affecting the GTX 970 was very different from the old partition camping issues on < sm_30. Implementation artifact rather than architectural shortcoming.

If the “long scoreboard” stalls turn out to be connected to waiting for memory, I would suggest looking into occupancy and also the grid dimensioning. The basic idea is that memory latency is overcome best if there are lots of blocks and lots of threads in flight.

While the following is a rule of thumb based on old GPUs, it might still have some utility on newer hardware, but I have had no exposure to the Volta architecture: For memory-intensive applications, size the grid such that total number of blocks >= 20 * #SMs * concurrent blocks per SM

NVIDIA has a recent developer blog post available that mentions the “long scoreboard” stall; it may be worth reading:
https://devblogs.nvidia.com/the-peak-performance-analysis-method-for-optimizing-any-gpu-workload/

Interesting.
I’m not explicitly doing any TEX memory ops. All the kernel is doing is a whole ton of coalesced global mem reads, and coalesced(?) global mem atomic writes. The key difference between this and, say, other similar sparse matrix ops, is that the kernel is actively writing across NVSWITCH to separate gpus’ output arrays. Perhaps NVSWITCH writes trigger the long scoreboard delays?

In saying all that, using NVSWITCH over writing to local arrays and then reduce/broadcasting at the end of the kernel, is still about 50% faster, so I’m hardly complaining!!

I wasn’t trying to imply that you are using texture operations. There are exceedingly few places where the “long scoreboard” stall is mentioned at all, and the blog entry I linked was one of the few relevant places I could find. I did not read the entire blog entry, but the quoted sentence seems to indicate that the “long scoreboard” has something to do with memory accesses, probably memory latency.

In the CUDA execution model, memory latency is supposed to be covered by running lots of active threads. Maybe the use of atomics is playing into the stall, and as far as the use case requires atomics, there is nothing to be done? I don’t know.

Sorry - I got ahead of myself, I thought that’s what TEX was (texture ops) from the devblogs quote. I’ll keep digging on my side. I feel like the nsight compute tool is still so young that it hasn’t been well explored by the community - perhaps docs need to be fleshed out a little more :)

Thanks for your help!

WARP STALL REASONS

A warp is launched to a warp scheduler (SM sub-partition). Once allocated to the scheduler the warp is in an active state (actively scheduled). On each cycle a warp is considered eligible to issue an instruction if all dependencies of the instruction are met. This includes the warp having fetched and decoded the instruction, the pipeline for the instruction is available, and any dependencies have been met. If any of these conditions are false then the warp is stalled and the warp will reports its highest priority stall reason. So on each cycle the PM system will report active_warp number of stall reasons. If there are 1 or more eligible warps the scheduler will pick the highest priority warp and issue 1-2 instructions. The selected warp reports the warp reason selected. All other eligible warps report the reason not_selected.

One of the most common reasons is long_scoreboard. This reason is reported by warps that have an instruction that is dependent on the completion of an issued memory load (texture, local, or global). A dependency on slow math, shuffle, or shared memory will report short_scoreboard.

The warp stall reasons can be collected at either a kernel level or at the instruction level using the program counter sampler (GM20x and above).

EXAMPLE LONG SCOREBOARD

The image of Nsight Compute Source View shows the CUDA vectorAdd sample modified to add 2^30 elements. The sample is run on a Quadro GV100.

In the diagram I have reduced the view to several of the warp stall reasons. The FADD instruction requires registers R2 and R4 which are the results of LDG instructions. The FADD cannot be executed until the LDG write-back R2 and R4. The warps are stalled on long-scoreboard (column stall_long_sb). The Sampling Data (All) shows that warps are stalling >99% of the time on FADD instruction all due to long scoreboard.

When analyzing a kernel the source view should not be used until the high level bottlenecks of the kernel are understood. The first item is to determine if the kernel is compute bound (SM) or memory bound. The first section shows the kernel is memory bound. NOTE: Most GPUs cannot exceed 80% SOL FB (frame buffer).

The rules system in Nsight Compute (still evolving) will direct the user through the various sections.

TIME PER ITERATION

Performing a quick analysis on your output and scanning your code you can calculate

  1. blocks per SM
  2. waves
  3. graph time per iteration

The code assumes block size is 4 warps and maximum blocks per SM is 16. Without an analysis trace I’m assuming this is valid and the code does not exceed the registers/thread limit.

For a GV100 (84 SMs) all launches < 1344 will be executed in one wave of block launches. Exceeding this will result in multiple waves of launches. Executing a kernel with a small integer number of waves and a small fractional portion (3.1) will likely result in a bad tail effect. Given homogeneous work per block launching a fraction of a wave of blocks will under utilize the GPU.

Inspecting the graph there are 3 areas that I colored yellow, orange, an red. The yellow (left) section is when the GPU has less then 1 wave. Data set is small and memory system TLBs, MMUs, and caches are likely efficient. The orange section (middle) shows an increase in iteration time per block. In this region two things occur: (1) the data set on a SM increases and is more varied, and (2) the cache efficiency likely drops. The red (right) section shows a roofline that is likely a combination of (1) cache efficiency resulting in a stable access latency, (b) data set variation resulting in stable MMU performance, and © majority of accesses going to the frame buffer (vs. L1/L2 cache).

Without running several of the iterations through the program it is hard to say what is varying as you increase from 1709 to ~13000.

I would recommend running on a single GPU and profiling a number of points on the line.

The profilers do not currently expose TLB and MMU counters. I would focus on L1, L2, and DRAM throughput and hit rate.

STRATEGY 1: On Maxwell - Turing architecture misses to L1 are completed in order. On Volta and Turing hits are returned out of order. This means that as latency increases all warps on the SM have a higher memory latency. In addition more warps per SM will potentially thrash the TLBs and MMU. Try to (1) reduce the data set size per SM and (2) reduce the occupancy in the SM to improve L1 and L2 cache hit rates and reduce the memory latency.

STRATEGY 2: Use persistent threads to control data locality per SM. The default block launch order is breadth first across SMs which is not always ideal for TLBs and MMU. This is a very advanced strategy so I would not recommend this unless you are trying to get every last piece of performance out of the system and the profiler shows you are not hitting above 70% FB/DRAM SOL.

1 Like

Thanks so much for that in depth analysis!
I’ve been trying, but am having trouble replicating it on 1 gpu, as the same performance degradation does not seem to appear. I’ll keep trying to replicate it and let you know if I find something :)

Cheers.

One followup question:
The way I had structured the kernel above is to have a block Dimension of (32, ngpu) - i.e. blockDim.y is the number of gpus - and each 32 thread warp in a block will be reading a part of the distributed ELL matrix (all aligned reads, mind you) from a different gpu over nvswitch.

Does this kind of fragmented memory access for a block normally result in a performance hit? Would it be better to let block 0 read from gpu 0, the second block read from gpu 1, and so on?

Of course it’s possible to test these things (and I intend to when I get access to the code again), but I’d rather try and understand from first principles what’s going on exactly, rather than brute forcing a solution with trial and error.

Thanks

I have not worked on a DGX2 system so I am not familiar with the performance issues related to different access patterns over NVLINK. NVPROF has slow sampling of NVLINK throughput counters. It would be interesting to see if the application is saturating all NVLINK connections. You may also be able to look at L1 throughput. If this is close to theoretical maximum NVLINK read throughput (?150 GiB/s on DGX2) then you are likely B/W limited. If this is much lower then you are likely memory latency limited.