Extremely low occupancy on GPU [Nsight / 1080 Ti], need help!!!

Hi guys! I’m testing an almost empty kernel:

__global__ void processIIR(float* samples, IIRBandFactors* bandFactors, IIRBandData* bandData, float* afc)
{
	int blockId = blockIdx.x;
	int threadId = threadIdx.x;

	int bandId = blockId*IIR_THREAD_NUM + threadId;
	__shared__ float sampleBlock[128];
}

on my single GeForce 1080 Ti card (the only card in my computer), and, under what would seem to be ideal conditions, I get very low occupancy, only 3.42%; see the image:

Hmm, I thought that perhaps this figure is per SM, and the whole GPU has 28 SMs (1080 Ti), ok:
3.42% * 28 ≈ 95.8%… (a few percent left over for drawing graphics). Am I right?

Ok, let’s try to load it to 100%. I take: 28 SMs * 32 blocks (64 threads per block / 2 warps per block) = 896 blocks. I run the test and get almost the same low occupancy, just a little higher. Why?

You have not shown us all of your code. What is the grid/block configuration with which you call this kernel? Since the compiler optimizes away dead code, your kernel is in fact empty (a null kernel) if you did a release build. I could imagine that profiling a bunch of empty kernels is not going to give accurate results, because most threads will be done by the time the profiler has had a chance to pull the relevant statistics counters from the hardware. Try something like a copy kernel for your experiments that copies a few hundred MB between locations in global memory.

The launch configurations can be seen at the top right of the screenshots.
First screenshot: Grid(32,1,1), BlockDim(64,1,1)
Second: Grid(896,1,1), BlockDim(64,1,1)

Hmm, yes, you’re right, it’s incorrect to discuss the problem using empty kernels. Ok, I switched to debug mode and ran my full code:

__global__ void processIIR(float* samples, IIRBandFactors* bandFactors, IIRBandData* bandData, float* afc)
{
	int blockId = blockIdx.x;
	int threadId = threadIdx.x;

	int bandId = blockId*IIR_THREAD_NUM + threadId;	

	__shared__ float sampleBlock[IIR_SAMPLE_BUFFER_SIZE];
	for (int i = 0; i < IIR_THREADS_SHARED_COPY_SIZE; i++)
	{
		int idx = threadId*IIR_THREADS_SHARED_COPY_SIZE + i;
		sampleBlock[idx] = samples[idx];
	}
	__syncthreads();

	IIRBandFactors bf = bandFactors[bandId];
	IIRBandData bda[IIR_CASCADES_NUM];
	for (int i = 0; i < IIR_CASCADES_NUM; i++)
	{
		int idx = bandId*IIR_CASCADES_NUM + i;
		bda[i] = bandData[idx];
	}

	for (int s = 0; s < IIR_SAMPLE_BUFFER_SIZE; s++)
	{
		float sample = sampleBlock[s];	
		
		float sampler[IIR_CASCADES_NUM + 1];
		sampler[0] = sample;

		for (int c = 0; c < IIR_CASCADES_NUM; c++)
		{
			float result = bf.fFC[0] * sampler[c] + bda[c].dd0;
			bda[c].dd0 = bf.fFC[1] * sampler[c] - bf.fFC[3] * result + bda[c].dd1;
			bda[c].dd1 = bf.fFC[2] * sampler[c] - bf.fFC[4] * result;
			sampler[c + 1] = result;
		}

		int oid = bandId*IIR_SAMPLE_BUFFER_SIZE + s;
		afc[oid] = sampler[IIR_CASCADES_NUM];		
	}

	for (int i = 0; i < IIR_CASCADES_NUM; i++)
	{
		int idx = bandId*IIR_CASCADES_NUM + i;
		bandData[idx] = bda[i];
	}
}

Host run command:

processIIR<<<IIR_BLOCK_NUM, IIR_THREAD_NUM>>>(cuda_iir_sampleBuffer, cuda_iir_factors, cuda_iir_data, cuda_iir_afc);

Defines:

#define IIR_SAMPLE_BUFFER_SIZE 1024

#define IIR_CASCADES_NUM 4
#define IIR_BANDS_NUM 1024

#define IIR_BLOCK_NUM 16
#define IIR_THREAD_NUM 64
#define IIR_THREADS_SHARED_COPY_SIZE 16

Structs:

struct IIRBandFactors
{
	float fFC[5];
};

struct IIRBandData
{
	float dd0 = 0.0f;
	float dd1 = 0.0f;
};
  • Should I somehow align them in memory, or, if the structs contain only floats, will everything be aligned well automatically? :-)

This is an Infinite Impulse Response (IIR) filter in audio processing software. In the kernel, “float* samples” is just a pack of 1024 samples (float values), each of which must be processed by a separate filter band (int bandId = blockId*IIR_THREAD_NUM + threadId;). That is, each band works independently on the entire set of samples.

Well, I switched back to Release (it makes no sense to measure the timing of the full code in debug :-) ) and got the following results on my single GeForce 1080 Ti:

So, what’s wrong with occupancy? (or with me :-) )

Thank you for any help!

That is 1024 threads total. Your high-end GPU has significantly more CUDA cores (3,584) than you have threads. You would want to target 128 to 256 threads per block, and hundreds of blocks. The goal is to have 50,000+ threads running to fill up the machine. There is a reason they call this massively parallel programming.

As for your screenshots: They are unreadable at my screen size and screen resolution, which is why I have been ignoring them.

If you want to acquaint yourself with the profiler, try any of the sample apps that ship with CUDA, or try this small copy test app:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>   // memset
#include <math.h>     // fabs, log

#define ZCOPY_THREADS  128
#define ZCOPY_DEFLEN   100000000
#define ZCOPY_ITER     10           // as in STREAM benchmark

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

__global__ void zcopy (const double2 * __restrict__ src, 
                       double2 * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}    

struct zcopyOpts {
    int len;
};

static int processArgs (int argc, char *argv[], struct zcopyOpts *opts)
{
    int error = 0;
    memset (opts, 0, sizeof(*opts));
    while (argc) {
        if (*argv[0] == '-') {
            switch (*(argv[0]+1)) {
            case 'n':
                opts->len = atol(argv[0]+2);
                break;
            default:
                fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                error++;
                break;
            }
        }
        argc--;
        argv++;
    }
    return error;
}

int main (int argc, char *argv[])
{
    double start, stop, elapsed, mintime;
    double2 *d_a, *d_b;
    int errors;
    struct zcopyOpts opts;

    errors = processArgs (argc, argv, &opts);
    if (errors) {
        return EXIT_FAILURE;
    }
    opts.len = (opts.len) ? opts.len : ZCOPY_DEFLEN;

    /* Allocate memory on device */
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
    
    /* Initialize device memory */
    CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
    CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

    /* Compute execution configuration */
    dim3 dimBlock(ZCOPY_THREADS);
    int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;
    dim3 dimGrid(threadBlocks);
    
    printf ("zcopy: operating on vectors of %d double2s (= %.3e bytes)\n", 
            opts.len, (double)sizeof(d_a[0]) * opts.len);
    printf ("zcopy: using %d threads per block, %d blocks\n", 
            dimBlock.x, dimGrid.x);

    mintime = fabs(log(0.0));
    for (int k = 0; k < ZCOPY_ITER; k++) {
        start = second();
        zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
        CHECK_LAUNCH_ERROR();
        stop = second();
        elapsed = stop - start;
        if (elapsed < mintime) mintime = elapsed;
    }
    printf ("zcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
            1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

    CUDA_SAFE_CALL (cudaFree(d_a));
    CUDA_SAFE_CALL (cudaFree(d_b));

    return EXIT_SUCCESS;
}

Well, they are normal resolution; you can zoom in with Ctrl+MouseWheel (Google Chrome) :-)

Yes, your test code shows a good workload on the GPU, so the problem is not with my video card, which is very good. It’s just that in my case it’s very difficult to create thousands of independent threads, whereas in 3D graphics shaders this is often the case. Okay, thanks, I’ll think it over.

[result for your test code]