Invalid Device Ordinal

calvin.chang · August 12, 2024, 6:48pm

Problem
I was working through the DLI module “Asynchronous Streaming, and Visual Profiling for Accelerated Applications with CUDA C/C++” (exercise 06-stream-init) and completed everything without problems in the JupyterLab environment.

However, I wanted to make sure my system and CUDA installation were configured correctly, and I am finding that the same code prints “invalid device ordinal” yet still passes the check. Could someone help me debug this further?

When I add print statements, deviceId is 0, so I’m not sure why a device ordinal error is happening.

System
CUDA 12.3
Windows 11
VS 2022, v142

Error

Error: invalid device ordinal
Success! All values calculated correctly.

Code

#include <stdio.h>
#include <stdlib.h>  // exit()

__global__
void initWith(float num, float *a, int N)
{

  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    a[i] = num;
  }
}

__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    result[i] = a[i] + b[i];
  }
}

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main()
{
  int deviceId;
  int numberOfSMs;

  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  cudaMemPrefetchAsync(a, size, deviceId);
  cudaMemPrefetchAsync(b, size, deviceId);
  cudaMemPrefetchAsync(c, size, deviceId);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  threadsPerBlock = 256;
  numberOfBlocks = 32 * numberOfSMs;

  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  /*
   * Create 3 streams to initialize the 3 data vectors in parallel.
   */

  cudaStream_t stream1, stream2, stream3;
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);
  cudaStreamCreate(&stream3);

  /*
   * Give each `initWith` launch its own non-default stream.
   */

  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream1>>>(3, a, N);
  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream2>>>(4, b, N);
  initWith<<<numberOfBlocks, threadsPerBlock, 0, stream3>>>(0, c, N);

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  cudaMemPrefetchAsync(c, size, cudaCpuDeviceId);

  checkElementsAre(7, c, N);

  /*
   * Destroy streams when they are no longer needed.
   */

  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream2);
  cudaStreamDestroy(stream3);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}

Robert_Crovella · August 12, 2024, 7:31pm

Windows doesn’t permit prefetching. It doesn’t have the same unified memory model that you find on Linux.

So if you comment out the cudaMemPrefetchAsync() calls, the remainder should be fine.

You can read about it in the managed/unified memory section of the programming guide.

In a nutshell, on Windows, the UM system behaves differently. The general behavior is that all managed allocations are migrated to the device at the point of kernel launch and become accessible to host code again after a subsequent cudaDeviceSynchronize() call.
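
If you would rather keep the prefetch calls for platforms that support them, one option is to ask the device before prefetching. This is only a minimal sketch, not part of the course code: `supportsPrefetch` is a hypothetical helper name, and it assumes that the `cudaDevAttrConcurrentManagedAccess` attribute reports 0 on a Windows setup like yours.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original code): returns true only when
// the attribute query succeeds and the device reports concurrent managed
// access, the capability that prefetching managed memory depends on.
bool supportsPrefetch(int deviceId)
{
  int concurrentManagedAccess = 0;
  cudaError_t err = cudaDeviceGetAttribute(&concurrentManagedAccess,
                                           cudaDevAttrConcurrentManagedAccess,
                                           deviceId);
  return err == cudaSuccess && concurrentManagedAccess != 0;
}

// Sketch of use inside main(), with the CUDA 12.x prefetch signature
// that the posted code already uses:
//
//   if (supportsPrefetch(deviceId))
//   {
//     cudaMemPrefetchAsync(a, size, deviceId);
//     cudaMemPrefetchAsync(b, size, deviceId);
//     cudaMemPrefetchAsync(c, size, deviceId);
//   }
```

With a guard like this the prefetches are simply skipped where they are not supported, and the rest of the program runs unchanged.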

I would also suggest adding full CUDA error checking so that you can see where the errors are “first visible”, i.e. which call reports them first.
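
As a rough illustration of what that could look like, here is a small sketch. The names `checkCuda` and `CHECK_CUDA` are made up for this example and are not part of the CUDA toolkit. In the posted code only the kernel launch and cudaDeviceSynchronize() are checked, and cudaGetLastError() returns the most recent error from any earlier runtime call, which is likely why the message appears even though the kernels themselves run correctly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the failing call with its file and line,
// and returns true when the call succeeded, false otherwise.
inline bool checkCuda(cudaError_t err, const char *call,
                      const char *file, int line)
{
  if (err == cudaSuccess) return true;
  fprintf(stderr, "CUDA error at %s:%d: %s failed: %s\n",
          file, line, call, cudaGetErrorString(err));
  return false;
}

// Wrap every runtime API call so each failure is reported where it occurs.
#define CHECK_CUDA(call) checkCuda((call), #call, __FILE__, __LINE__)

// Sketch of use with the calls from the posted code:
//
//   CHECK_CUDA(cudaGetDevice(&deviceId));
//   CHECK_CUDA(cudaMallocManaged(&a, size));
//   CHECK_CUDA(cudaMemPrefetchAsync(a, size, deviceId));
//   addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);
//   CHECK_CUDA(cudaGetLastError());        // launch errors
//   CHECK_CUDA(cudaDeviceSynchronize());   // errors during execution
```

Checked this way, the report should point at the first failing call rather than at whatever check happens to run later.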
