No kernels were profiled warning/problem

Greetings,

I’m trying to profile my application on a DGX box, on the 3rd (counting from 0) V100 it contains. When running, I get the warning “No kernels were profiled”. Any ideas what’s going on? This is with CUDA 10.0 and Ubuntu with the 4.4.0 kernel. I’m fairly sure the related .cu file was compiled with -G, but I’m under the impression that the kernel is profilable (at a high level) either way. The command and response follow the signature.

Thanks!
David

myuser@dgx-test:~/r/my_dir$ /usr/local/NVIDIA-Nsight-Compute-2019.4/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli --devices 3 --export "/home/myuser/r/my_dir/nsight_compute_prof1" --force-overwrite --target-processes all --kernel-regex my_kernel_name_copy_pasted --kernel-regex-base function --launch-skip-before-match 0 --section ComputeWorkloadAnalysis --section InstructionStats --section LaunchStats --section MemoryWorkloadAnalysis --section MemoryWorkloadAnalysis_Chart --section MemoryWorkloadAnalysis_Tables --section Occupancy --section SchedulerStats --section SourceCounters --section SpeedOfLight --section WarpStateStats --sampling-interval auto --sampling-max-passes 5 --sampling-buffer-size 33554432 --nvtx --profile-from-start 1 --clock-control base --apply-rules "/home/myuser/r/my_dir/my_binary" arg1 arg2 arg3
<output indicating the process is running>
==PROF== Connected to process 3442
my_particular_test: PASS (latency XXX things/s)
1 test passed
==PROF== Disconnected from process 3442
==WARNING== No kernels were profiled

My assumption would be that the combination of filters you are using causes no kernels in your application to match. Either the kernel “my_kernel_name_copy_pasted” is not running on device 3, or the name simply doesn’t match.

My suggestion would be to start with a simpler command line, since most of the parameters you are passing match the defaults anyway and likely aren’t necessary in your case. Start with

myuser@dgx-test:~/r/my_dir$ /usr/local/NVIDIA-Nsight-Compute-2019.4/target/linux-desktop-glibc_2_11_3-x64/nv-nsight-cu-cli --devices 3 --export "/home/myuser/r/my_dir/nsight_compute_prof1" --force-overwrite --target-processes all --section SpeedOfLight --apply-rules "/home/myuser/r/my_dir/my_binary" arg1 arg2 arg3

and see if that matches your kernel. If that works, check the kernel name that is shown and try adding back the kernel name filter.

--kernel-regex my_kernel_name_copy_pasted

Thanks for the idea @felix_dt!

As background, the way I generated that long command was by using the Nsight Compute GUI. BTW, it’s pretty nice that the GUI actually shows me the command it’s about to run, though I had to modify it manually to restrict execution to device 3.

Even without the kernel restriction, I still receive the warning of “no kernels were profiled”.

Since posting this, I’ve learned that sometimes Nsight/nvvp/nvprof struggles to profile on any device other than the default device. I don’t really have access to device 0 on this machine, but does that sound like it could be part of the issue? My guess is that the Nsight developers tend to develop for “device 0” and so are less likely to have completely tested the non-0 devices.

BTW, here’s my environment, and I’m able to confirm via nvidia-smi that the program is indeed running on Device 3.

myuser@dgx-test:~/$ env | grep "DEVICE"
CUDA_VISIBLE_DEVICES=3

I think the problem here might be the combination of CUDA_VISIBLE_DEVICES together with --devices.

Using the environment variable, you are instructing the CUDA driver that there should be only one device visible to CUDA applications (device 3 in your system), which will be made available to CUDA as the first device, i.e. device 0.

Using the --devices 3 option, you are instructing Nsight Compute to restrict profiling to the fourth device (the one with ID 3), but there aren’t four devices anymore at that point.

If you really only want your application to run on the device selected by CUDA_VISIBLE_DEVICES=3, there is no need for the --devices option (or it can be set to 0). If you want your application to run on all devices, but only profile on the device with ID 3, remove the CUDA_VISIBLE_DEVICES env variable and keep the --devices 3 option.
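For example (illustrative only, with the tool path shortened and most of your options dropped), either of these combinations is consistent:

CUDA_VISIBLE_DEVICES=3 nv-nsight-cu-cli --target-processes all --section SpeedOfLight "/home/myuser/r/my_dir/my_binary" arg1 arg2 arg3

nv-nsight-cu-cli --devices 3 --target-processes all --section SpeedOfLight "/home/myuser/r/my_dir/my_binary" arg1 arg2 arg3

(the second one with CUDA_VISIBLE_DEVICES unset)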

Ah ha, this is good insight to have, as it would be difficult for me to track this down on my own. I’m giving it a whirl at the moment, but there’s an unrelated issue that’s keeping me from running the test. I’ll give it another try in the morning and report back. Thanks!

@felix_dt, I was able to run the test just now. The test itself passes, but the profiler is having issues:

my_kernel: ==ERROR== Error: ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

This is what motivated me to use the --devices 3 option.

I don’t have sudo access on this machine. I’ve never needed sudo to profile in the past, so I’m surprised I need it. However, that link states that it’s a relatively new requirement from the 418.43 driver, and a lot of my experience is with a driver that’s slightly older than that. I guess I’ll see if it makes sense for me to gain sudo access…

Seeing this error message now is actually a good thing, as it implies that the profiler is now finding a kernel (on your physical device 3, CUDA device 0). As you found, it’s a new requirement from the driver, and you will need to work with your machine owner to get access again using one of the options listed on the page:

  • run the profiler as root/sudo
  • temporarily load the kernel module with NVreg_RestrictProfilingToAdminUsers=0
  • permanently enable profiling for non-admin users with a file in /etc/modprobe.d (a minimal example follows below)
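For the last option, a minimal sketch of such a config file (the file name here is my own choice; a reboot or driver module reload is needed afterwards, see the linked page for the exact steps):

# /etc/modprobe.d/nvidia-profiling.conf (hypothetical file name)
options nvidia NVreg_RestrictProfilingToAdminUsers=0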

Hi @felix_dt

I encounter “==WARNING== No kernels were profiled” since I updated my CUDA version to 11.5 - is it possible that I removed something important when I purged the earlier driver to be able to update?

The CUDA program that I’m trying to profile is just a simple add, but using the following command I get that warning:

$ sudo nv-nsight-cu-cli --device 1 --target-process all --section SpeedOfLight a.out

I am testing on a DGX Station A100 machine.

Thanks

Is it possible that I removed when I purged the earlier Driver to be able to update?

I don’t fully understand what you mean by this. Are you saying you updated your display driver, and are now concerned that the CUDA toolkit used to build the app doesn’t work with that newer driver? Which driver version did you update to, and which Nsight Compute version are you using?

The warning indicates that Nsight Compute didn’t find any kernels to profile, given the specified flags. This may be because the app in fact does not launch any CUDA kernels, or because the filtering flags cause all of them to be ignored. You are passing --device 1. This tells Nsight Compute to only profile kernels on the 2nd device (numbering starts at 0, similar to CUDA_VISIBLE_DEVICES). Hence, unless the app launches kernels on at least two devices, it won’t find anything. I suggest removing this flag and checking if that solves your problem. If not, note that --devices refers to the list of devices used by the app, numbered sequentially. This is on top of any potentially set CUDA_VISIBLE_DEVICES environment variable, which may be used to filter the devices that are visible to CUDA in the first place.
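For example (a minimal sketch, assuming your binary is still a.out), the command without the device filter would be:

sudo nv-nsight-cu-cli --target-processes all --section SpeedOfLight a.out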


@felix_dt
Thanks for your response.
As I checked it again, ‘nvcc’ seems to ignore the kernel part. It does not enter the kernel; just the C++ part is compiled and running!!

CUDA version 11.5
Driver Version 495.29.05

CUDA_VISIBLE_DEVICES=1

Ok, sounds like you can hopefully fix the problem in your build command then.

CUDA_VISIBLE_DEVICES=1

Just for completeness, this value would be “incompatible” with ncu --devices 1.
Starting from the two (or more) devices in your machine:
(0) (1)
CUDA_VISIBLE_DEVICES=1 makes only the second one visible to CUDA, and it becomes index 0:
(1) → (0)
--devices 1 then tells ncu to profile only kernels on the 2nd GPU (the one with index 1), but after this remapping no device with that index exists anymore, so nothing is profiled.


As I checked the running code, it appears you were right in your first reply that no kernel runs! Indeed, nvcc does not seem to be working; it just does not compile the kernel! I compiled it and tried to run it, but got no results. It is just like running the code without its GPU kernels?! Does the compiler have problems?

If you have the code available, you can share it here for us to have a look. It’s not possible to debug this generically. Some things to try:

  • Pay attention to compiler errors/warnings
  • Make sure compilation succeeds and re-builds your binary
  • Make sure you execute the right binary
  • Add a printf or similar in your code where you expect it to launch your kernel to see if that path is executed.

This is the code

#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  printf("thread %d", threadIdx.x);
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  printf("Entering kernel!");
  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  printf("Out of Kernel");
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

and the result

Entering kernel!Out of KernelMax error: 2.000000

Compiled with “nvcc simpleKernel.cu -o out” command.

I don’t think there is any error in your code, but the output suggests that there is an error when running it on your system. I recommend checking the return codes of all CUDA API calls to ensure they return cudaSuccess. You can use a macro like the one below for convenience. You can also run your binary through cuda-gdb, which is part of the CUDA toolkit.

#define RUNTIME_API_CALL(apiFuncCall)                                   \
  do {                                                                  \
    cudaError_t _status = apiFuncCall;                                  \
    if (_status != cudaSuccess) {                                       \
      fprintf(stderr, "%s:%d: error: function %s failed with error %s (%d).\n", \
              __FILE__, __LINE__, #apiFuncCall,                         \
              cudaGetErrorString(_status), _status);                    \
    }                                                                   \
  } while (0)
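
Applied to your code, it could look roughly like this (an untested sketch; checking the launch itself additionally needs cudaGetLastError and a synchronization):

RUNTIME_API_CALL(cudaMalloc(&d_x, N*sizeof(float)));
RUNTIME_API_CALL(cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice));

saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
RUNTIME_API_CALL(cudaGetLastError());        // reports kernel launch/configuration errors
RUNTIME_API_CALL(cudaDeviceSynchronize());   // reports errors that occur while the kernel runs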

@felix_dt
The error that pops up while entering the kernel is “CUDA Error: initialization error”.

@felix_dt
Thanks a lot for your help.
The problem was solved after I reinstalled the driver and the CUDA toolkit.
