Internal Profiling error - insufficient kernel bounds data

I have a implementation of the Discrete Fourier Transform to calculate the maximum of each frequency bin of a 1024 PT DFT. (Yes, I know use the cuFFT, but this is for experimentation). It is called with 8 blocks and 1024 threads per block, each block processes 1/8th of the input time series (a simple tone) which in this case is 512 FFTs. It runs fine, albeit slow. But it is causing errors on two different GPUs when we try to run an individual kernel analysis
== 21652 == Error Internal Profiling Error OR
= 7880 == Error Internal Profiling Error

and says “insufficient kernel bounds data. The data needed to calculate compute, memory, and latency bounds for the kernel could not be collected”

I hope this is way to post code …

#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <helper_cuda.h>
#include <helper_functions.h>  
#include <cuda_profiler_api.h>

#define PI 3.14159265358979

__device__ float	D_PI = PI;

typedef struct {
	short	x;
	short	y;
} ComplexShort;

void  generateCmplxShortTone(float scaleFactor, ComplexShort *data, int count, float freq, float sampleFreq)
{
	float twoPI = 2 * PI;
	float step = (twoPI *freq) / sampleFreq;
	float phase;
	for (int n = 0; n < count; n++)  {
		phase = n*step;
		phase = fmod(phase, twoPI);
		data[n].x = cos(phase)*scaleFactor;
		data[n].y = sin(phase)*scaleFactor;
	}
	return;
}

// Each thread in this kernel will compute the maximum magnitude squared (power) of a 
// single frequency bin of a 1024 PT DFT.  It must be called with 1024 threads per block and 8 blocks.
// Each block will process 512 FFTs, so the d_data structures must be 512 * 8 * 1024.
// The results are stored in d_peakPickDFT which must be 8 * 1024 
__global__ void DFTPeakPick(ComplexShort *d_Data, float *d_peakPickDFT, int fftSize) {

	__shared__ ComplexShort timeSeries[1024];

	register float sumReals;
	register float sumImags;

	// Same as K
	int fftBin = threadIdx.x;
	int numFFTsToProcess = 512; // (Data Size / fftSize) / GridDim.x
	int numFFTsProcessed = 0;

	register float max = 0;
	register float newMagSquaredVal;

	int currentDataIndex = blockIdx.x * fftSize * numFFTsToProcess;

	timeSeries[fftBin].x = d_Data[currentDataIndex + fftBin].x;
	timeSeries[fftBin].y = d_Data[currentDataIndex + fftBin].y;

	// Wait for all threads to load a data sample
	__syncthreads();

	register float realCoef = cos(fftBin * 2 * D_PI / fftSize);
	register float imagCoef = -1 * sin(fftBin * 2 * D_PI / fftSize);

	while (numFFTsProcessed < numFFTsToProcess) {

		register float nextRealCoef = realCoef;
		register float tempNextRealCoef;
		register float nextImagCoef = imagCoef;

		sumReals = timeSeries[0].x;
		sumImags = timeSeries[0].y;

		for (int n = 1; n < fftSize; n++) {

			sumReals += nextRealCoef * timeSeries[n].x - nextImagCoef * timeSeries[n].y;
			sumImags += nextRealCoef * timeSeries[n].y + nextImagCoef * timeSeries[n].x;

			tempNextRealCoef = nextRealCoef * nextRealCoef - nextImagCoef * nextImagCoef;
			nextImagCoef = nextRealCoef * nextImagCoef * 2; // + imagCoef * realCoef;
			nextRealCoef = tempNextRealCoef;

		}

		newMagSquaredVal = sumReals * sumReals + sumImags * sumImags;

		if (newMagSquaredVal > max) {
			max = newMagSquaredVal;
		}


		//// Wait for everyone to finish before loading another FFT
		currentDataIndex += fftSize;
		__syncthreads();
		timeSeries[fftBin].x = d_Data[currentDataIndex + fftBin].x;
		timeSeries[fftBin].y = d_Data[currentDataIndex + fftBin].y;
		__syncthreads();

		numFFTsProcessed++;

	}

	// 8 Peak picked DFT that must be decimated further
	d_peakPickDFT[blockIdx.x*1024 + fftBin] = max;

}


int main(int argc, char **argv)
{

	int	fftSize = 1024;  // same as threads per block
	int numFFTsPerBatch = 4096;
	int numBlocks = 8;  // 
	int dataLength = fftSize * numFFTsPerBatch; 
	
	ComplexShort *hostComplexShortBuffer; // Tone
	ComplexShort *d_ComplexShortBuffer;   // Kernel input
	float *d_peakPickDataBuffer;          // Kernel output

	cudaHostAlloc(&hostComplexShortBuffer, dataLength * sizeof(ComplexShort), cudaHostAllocWriteCombined);
	cudaMalloc((void**)&d_ComplexShortBuffer, dataLength * sizeof(ComplexShort));
	cudaMalloc((void**)&d_peakPickDataBuffer, numBlocks * fftSize * sizeof(float));

    // Fill host complex short buffer with a tone
	float freq = 100.0;
	float sampleFreq = 1024.0;
	float scaleFactor = 10000;
	generateCmplxShortTone(scaleFactor, hostComplexShortBuffer, dataLength, freq, sampleFreq);

	// Do we need this if we're runnning NVIDIA profiler?
	cudaProfilerStart();
	
	// Copy data from host to device
	cudaMemcpyAsync(d_ComplexShortBuffer, hostComplexShortBuffer, dataLength*sizeof(ComplexShort), cudaMemcpyHostToDevice);
	
	// This will fully occupy NVIDIA Qaudro K3100M (4 MPs, 2048 thread per MP)
	DFTPeakPick << <  8, 1024, 0  >> >
		(d_ComplexShortBuffer,d_peakPickDataBuffer,fftSize);

	// Copy data from device to host (tbd)

	cudaFreeHost(hostComplexShortBuffer);
	cudaFree((void*)d_ComplexShortBuffer);
	cudaFree((void*)d_peakPickDataBuffer);

    cudaDeviceReset();

	cudaDeviceSynchronize();
	
	cudaProfilerStop();
	
}

I left off full error …

==10384== Error: Internal profiling error 3105:999. OR
==21653== Error: Internal profiling error 3105:999.

which profiler? what platform? Which GPU? which CUDA version? How have you compiled the code?

This is the output I get when profiling your code using nvprof on linux on a cc2.0 GPU:

$ nvcc -I/usr/local/cuda/samples/common/inc t1142.cu -o t1142
$ nvprof --print-gpu-trace ./t1142
==12277== NVPROF is profiling process 12277, command: ./t1142
==12277== Profiling application: ./t1142
==12277== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
1.21125s  2.7976ms                    -               -         -         -         -  16.000MB  5.5852GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy HtoD]
1.21405s  280.55ms              (8 1 1)      (1024 1 1)        22  4.0000KB        0B         -           -  Quadro 5000 (0)         1         7  DFTPeakPick(ComplexShort*, float*, int) [186]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

Although I can’t say it’s the source of your observation, I would re-work the last few lines of your code as follows:

cudaDeviceSynchronize();
	
	cudaProfilerStop();
	cudaFreeHost(hostComplexShortBuffer);
	cudaFree((void*)d_ComplexShortBuffer);
	cudaFree((void*)d_peakPickDataBuffer);

        cudaDeviceReset();

NVIDIA Visual Profiler V7.5 when I chose Perform Kernel Analysis
Linux with Quardro M4000
Windows with Quadro K3100M

What do you mean: How have you compiled the code? I’m executing it …so yes of course.

Command line works fine for me as well … it’s the detailed kernel analysis we want.

The profiler depends on various hardware counters to capture event and metric data. These counters operate in real time/full speed and have finite limits in terms of the number of events they can capture. If you (i.e. your kernel code) exceed these limits, you will have a counter overflow. The profiler will detect this condition but cannot fix it. You can witness this particular condition in your case by selecting the “Details” tab and scrolling - you will see some parameters that are marked “Overflow” in red.

The solution in general is to decrease the “scope” of your kernel. This can be accomplished in a variety of ways, such as reducing the number of blocks. But your kernel is already launching a relatively small number of blocks. However your kernel code loops:

int numFFTsToProcess = 512; // (Data Size / fftSize) / GridDim.x
        ...
	while (numFFTsProcessed < numFFTsToProcess) {

If you reduce this loop count to, say, 4, instead of 512, I think you’ll have better results. I haven’t studied your code but my guess is that this won’t change the general character of your code behavior, so the profiling results should still be useful. If you need a specific metric (such as flop count) that would actually be impacted, then you should attempt to capture that metric separately, although you may also run into the same limit, requiring you again to reduce the loop count and scale up the reported metric, if necessary.

A few other suggestions:

  1. rework the code as I describe in my previous comment.
  2. Deselect “concurrent kernel profiling” and “unified memory profiling”. Your code does not use either of these features. These are checkboxes in the wizard that runs after you select File…New Session

A side-benefit of the above reduction in scope is that you also get profiling results a lot quicker.

Try this … I believe it’s what the Visual Profiler is doing under the covers … throws the error, resets the display driver …

nvprof --analysis-metrics -o analysis.nvprof --print-gpu-trace executablename.exe

==10780== NVPROF is profiling process 10780, command: executablename.exe
==10780== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
==10780== Replaying kernel “DFTPeakPick(ComplexShort*, float*, int)” (0 of 52)…==10780== Error: Internal profiling error 3105:999.

The reset of the display driver is occurring on windows because your kernel is taking too long to execute. If you reduce the kernel execution time as I have indicated above, it will fix that issue as well.

Otherwise you’ll need to modify the wddm timeout on windows. Which has been discussed extensively such as here:

https://devtalk.nvidia.com/default/topic/459869/cuda-programming-and-performance/-quot-display-driver-stopped-responding-and-has-recovered-quot-wddm-timeout-detection-and-recovery-/

But if all you did was modify the wddm timeout, it still would not fix the overflow issue I described previously.

Thanks Txbob. The wddm timeout/too many ffts was indeed the problem. I don’t see anything in red in the details tab