Illegal instruction (error 715) with H100

I’m encountering error 715, an illegal instruction, in a function that only calls __syncthreads() and various warp shuffle functions. The code compiles fine under CUDA 12.2 with the 535.183 driver, using code=sm_90a/arch=compute_90a (or 90 for both), but errors at runtime on the first call to a warp shuffle.

The error occurs on the first call in a device function (a peak detector, FWIW) to __shfl_down_sync(), specifically:

...
float mag = <input value>;
float tempMag = __shfl_down_sync(0xFFFFFFFF, mag, 16);

I also have an L40S, and the code compiles AND runs fine using code=sm_89/arch=compute_89.

UPDATE:

I figured it out. The line above was NOT the offending instruction, though it was one that followed after I reduced to a single warp. IOW, inside if (warp == 0) { … } — that is, it appears warp shuffle functions can’t be issued to a single warp.

This makes sense in hindsight given the sync nature, but I’m curious why this worked with previous architectures. Originally, the code was written with the non-sync versions, but I updated it once the deprecation notice came out.
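For reference, the update in question was a mechanical one along these lines (a sketch, not the actual kernel):

```cuda
// Old form (deprecated as of CUDA 9.0):
//   float tempMag = __shfl_down(mag, 16);
// Sync variant, adding a mask naming the participating lanes:
float tempMag = __shfl_down_sync(0xFFFFFFFF, mag, 16);
```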

__device__ int peakDetect(float mag, float position) {
    // 256 threads
    int lane = threadIdx.x % 32; // 32 lanes
    int warp = threadIdx.x >> 5; // 8 warps
    … // other logic to reduce the peak into warp 0
    if (warp == 0) {
        float tempMag = __shfl_down_sync(0xFF, mag, 4);
        …
    }
}

While you may not believe it, this shuffle function generates an illegal instruction error with 90/90a (not with 89), whereas shuffles issued to all warps do not.

Let me try again.

It should be possible to issue a shuffle op to a single warp. There are no restrictions in this area.

If you are wondering what is going on, folks here may be able to provide better help if you can provide a short but complete reproducer of the problem.

No, it’s not. It fails, issuing an illegal instruction error, and I have an even simpler “test case” for which it does not work either (a thread reduction sum).

Every instance in which every warp in a thread block gets the same shuffle instruction (there are many before I issue the first “if (warp == 0)” conditional) works fine on every architecture. However, as soon as I reduce to one warp with the conditional (the remaining warps stop at a __syncthreads() call), the code fails. It did not do this prior to the 90/90a architecture.

Telling me I don’t know what is happening with very simple code does not help me. In fact, it’s a pretty arrogant approach. For that matter, you could write a simple test case yourself and prove me wrong.

And yes, I saw your previous reply which was even more arrogant. And wrong.

In case anyone was wondering (if there are any code errors, they are due to hand-jamming from my development system which is not attached to the Internet, i.e., no cut and pasting)…

template<int radix> inline __device__ float floatSum(float val, float *sharedVal);
// sharedVal has at least 8 floats, though it only needs log2(blockDim)
template<> inline __device__ float floatSum<32>(float val, float *sharedVal) {
    val += __shfl_down_sync(0xFFFFFFFF, val, 16);
    val += __shfl_down_sync(0xFFFF, val, 8);
    val += __shfl_down_sync(0xFF, val, 4);
    val += __shfl_down_sync(0x3, val, 2);
    return __shfl_sync(0xFFFFFFFF, val, 0);
}

template<> inline __device__ float floatSum<256>(float val, float *sharedVal) {
    int warp = threadIdx.x >> 5; // 8 warps
    int lane = threadIdx.x % 32; // 32 lanes
    val = floatSum<32>(val, sharedVal);
    if (lane == 0)
        sharedVal[warp] = val;
    __syncthreads();
    if (warp == 0) {
        if (lane < 8)
            val = sharedVal[lane];
        else
            val = 0.0f;
        val += __shfl_down_sync(0xFF, val, 4); // <<=== illegal instruction
        // compute-sanitizer fails on this line as well, and reports illegal instruction
        val += __shfl_down_sync(0xF, val, 2);
        val += __shfl_down_sync(0x3, val, 1);
    }
    __syncthreads();
    return __shfl_sync(0xFFFFFFFF, val, 0);
}

I changed the code to the following (which is probably just as fast):

// sharedVal has at least 8 floats
template<> inline __device__ float floatSum<256>(float val, float *sharedVal) {
    int warp = threadIdx.x >> 5; // 8 warps
    int lane = threadIdx.x % 32; // 32 lanes
    val = floatSum<32>(val, sharedVal);
    if (lane == 0)
        sharedVal[warp] = val;
    __syncthreads();
    if (threadIdx.x == 0) {
        val = sharedVal[0];
        for (int ii = 1; ii < 8; ii++)
            val += sharedVal[ii];
    }
    __syncthreads();
    return __shfl_sync(0xFFFFFFFF, val, 0);
}

Sorry to have been arrogant.

I doubt it’s a coding error. In my opinion, an “illegal instruction” error should be pretty rare. The compiler is in control of what instructions get generated, so for many coding scenarios it shouldn’t be possible to run into an illegal instruction error (except via a compiler defect). Yes, there might be particular situations, for example a function call jump table that goes awry, where an illegal instruction error could arise, but otherwise I’d say it’s rare.

Based on that, if you’re running into one, one thing that seems possible to me is a compiler defect.

I won’t have access to an H100 machine until next week sometime, but will put together a simple test as time permits.

You are not using the correct mask for shuffle. CUDA C++ Programming Guide

Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

It should work with mask 0xFFFFFFFF
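Applied to the code in question, that would look something like this (a sketch of the warp-0 branch only, reusing the variable names from above):

```cuda
if (warp == 0) {
    val = (lane < 8) ? sharedVal[lane] : 0.0f;
    // All 32 lanes of warp 0 execute each shuffle with the full mask;
    // lanes 8-31 just contribute zeros, which don't affect the sum.
    val += __shfl_down_sync(0xFFFFFFFF, val, 4);
    val += __shfl_down_sync(0xFFFFFFFF, val, 2);
    val += __shfl_down_sync(0xFFFFFFFF, val, 1);
}
```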

It’s a thread reduction: the mask is correct for summing the values in threads 0-7. You only use a mask of 0xFFFFFFFF if you’re including all 32 threads.

The code works fine on all previous architectures. I’m testing side-by-side with an L40S, and it’s been in place since the sync versions first appeared.

Thank you, and please do run this.

As noted, I’m using CUDA 12.2 with the 535 driver on Ubuntu 22.04. I have both an H100 and an L40S in the box, and we’re comparing the two for our future development efforts. FWIW, the L40S is winning for what I do, but we also have some AI development that might be better suited for the H100.

I admit the language is confusing in that paragraph.

The first part:

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware.

Seemingly contradicts what you quoted:

Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.

This also seems contrary to the working code above (floatSum<32>) that does not have a problem.

Curious.

OK, I tried it using striker159’s suggestion, and it does work.

Which brings up several questions:

  1. Why does it not matter in my floatSum<32>() function?
  2. What is the point of the mask if all 32 threads in a warp need to participate?
  3. Why doesn’t it matter with previous architectures?

I think I know the answer to 2 of these 3 questions:

  1. There’s an inherent synchronization in the calls, and since all threads in the block get the instruction, they can all synchronize properly.
  2. The mask works as I previously expected when all threads in a block get the instruction.
  3. No idea.

Thanks for the help.

With undefined behavior, “it seems to work” does not imply “it is correct”. There is no point in guessing why it seems to work sometimes, on some architectures.

Except that it doesn’t generate an illegal instruction on hardware prior to 9.0; it’s not just “some architectures.” And in all of these cases, even 9.0, it compiles without error. Something changed with 9.0.

And that STILL doesn’t address the contradictory language in the very paragraph you and I both cited. I’m not convinced it’s supposed to behave as you are interpreting. The mask is generally pointless if it does.

UPDATE: it is worth noting that the language did change from 11.8 to 12.0.

As additional documentation, there is also the description of the shfl ptx instruction. PTX ISA 8.5

There it is stated:

The behavior of shfl.sync is undefined if the executing thread is not in the membermask

I do not have more to add regarding ambiguous documentation.

If the documentation appears to be unclear, there is always the possibility to open a bug ticket asking for clarification and / or for suggesting changes. How to report a bug
To report a compiler defect, you will typically be asked to provide a self-contained reproducer which can be compiled and executed.

Practically speaking, I personally would not bother writing my own parallel reduction unless it is for educational purposes. Parallel reduction is a well-studied algorithm and is provided by CUB, which is part of the CUDA toolkit.
It has warp-wide, block-wide, and device-wide reductions. Documentation of the block-wide API can be found here:

https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockReduce.html#_CPPv4I0_i_20BlockReduceAlgorithm_i_i_iEN3cub11BlockReduceE
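For instance, a 256-thread block sum with cub::BlockReduce might look like this (a sketch; the kernel and variable names are made up for illustration):

```cuda
#include <cub/block/block_reduce.cuh>

__global__ void blockSum(const float *in, float *out) {
    // Specialize BlockReduce for a 1D block of 256 threads
    using BlockReduce = cub::BlockReduce<float, 256>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    float val = in[blockIdx.x * blockDim.x + threadIdx.x];
    float sum = BlockReduce(temp_storage).Sum(val); // result valid in thread 0 only
    if (threadIdx.x == 0)
        out[blockIdx.x] = sum;
}
```

It handles the partial-warp and cross-warp details internally, which sidesteps the mask questions entirely.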

I wasn’t able to reproduce any problem (i.e. no report of illegal instruction; I didn’t study any of the produced data) based on the following code in CUDA 12.5.1, driver 560.28.03:

# cat t2.cu
#include <iostream>
template<int radix> inline __device__ float floatSum(float val, float *sharedVal);
// sharedVal has at least 8 floats, though it only needs log2(blockDim)
template<> inline __device__ float floatSum<32>(float val, float *sharedVal) {
        val += __shfl_down_sync(0xFFFFFFFF, val, 16);
        val += __shfl_down_sync(0xFFFF, val, 8);
        val += __shfl_down_sync(0xFF, val, 4);
        val += __shfl_down_sync(0x3, val, 2);
        return __shfl_sync(0xFFFFFFFF, val, 0);
}

template<> inline __device__ float  floatSum<256>(float val, float *sharedVal) {
        int warp = threadIdx.x >> 5; // 8 warps
        int lane = threadIdx.x % 32; // 32 lanes
        val = floatSum<32>(val, sharedVal);
        if (lane == 0)
                sharedVal[warp] = val;
        __syncthreads();
        if (warp == 0) {
                if (lane < 8)
                        val = sharedVal[lane];
                else
                        val = 0.0f;
                val += __shfl_down_sync(0xFF, val, 4); // <<=== illegal instruction
                // compute-sanitizer fails on this line as well, and reports illegal instruction
                val += __shfl_down_sync(0xF, val, 2);
                val += __shfl_down_sync(0x3, val, 1);
        }
        __syncthreads();
        return __shfl_sync(0xFFFFFFFF, val, 0);
}


__global__ void k(int *d){

  __shared__ float s[1024];
  int val = d[threadIdx.x];
  val += floatSum<256>(val, s);
  d[threadIdx.x] = val;
}

int main(){

  const int ds = 32;
  int *d;
  cudaMallocManaged(&d, sizeof(d[0])*ds);
  k<<<1,ds>>>(d);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) std::cout << "error: " << cudaGetErrorString(err) << std::endl;
}

# nvcc -arch=sm_90 -o t2 t2.cu
# compute-sanitizer ./t2
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
#
  • the code is slightly different than what is posted above, in that I added a float declaration to the template instantiation that needed it (and I added my own kernel and main functions)
  • I can probably try CUDA toolkit 12.2 at some point. I wouldn’t normally recommend anything prior to CUDA 12.3 on hopper due to a known issue, but you could possibly say that about any CUDA version, and that issue is unrelated to whatever is happening here.
  • I cannot downgrade the driver on this (shared) machine
  • I also tried another simpler test case, basically my kernel with just this line: val += __shfl_down_sync(0xFF, val, 4); rather than the device function call; that also did not report an illegal instruction
  • I also tried const int ds=256; in my code, no problem reports there, either.
  • I also tried -arch=sm_90a; no issue.

OK, thanks for this. Fortunately, I can upgrade both my toolkit and driver versions. I’ll report back when I have a chance to test.

Note that striker159’s suggestion also runs, though I haven’t checked at the data level; it doesn’t appear to be wrong, as far as I can tell.

Can you test with

if (threadIdx.x < 2)
   val += __shfl_down_sync(0x3, val, 1);

instead, to see whether “Each calling thread must have its own bit set” is the reason?