How do I replace CUDA kernels with single-threaded CPU functions?

Hello,
I’m trying to debug my CUDA program, but doing it as-is looks very impractical to me: the code is multithreaded, so you have to select warps, and you need two debuggers if you want to debug the CPU and GPU at the same time. I also searched for a single-threaded debugging mode, but it doesn’t seem to exist, so I decided to modify the functions in the debug configuration by adding #ifdef _DEBUG lines.
Like this:

#ifndef _DEBUG
__global__
#endif
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

And at the launch site:

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
#ifdef _DEBUG
        gridDim.x = numBlocks;
        blockDim.x = blockSize;
        for (threadIdx.x = 0; threadIdx.x < blockSize; threadIdx.x++)
            for (blockIdx.x = 0; blockIdx.x < numBlocks; blockIdx.x++)
                add(N, x, y);
#else
        add<<<numBlocks, blockSize>>>(N, x, y);
        cudaDeviceSynchronize();
#endif // _DEBUG

It could work, but the compiler complains about gridDim.x, blockDim.x, threadIdx.x and blockIdx.x not being assignable.

I tried that :

#ifdef _DEBUG

    #define __global__
    uint3 threadIdx;
    uint3 blockIdx;
    uint3 blockDim;
    uint3 gridDim;

#else

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

#endif

But it doesn’t like this either, giving “a declaration is incompatible with ‘const uint3 threadIdx’” — the CUDA toolchain already declares the built-ins as const, so my redeclaration conflicts with it. My C++ knowledge isn’t extensive enough to find a solution.
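For reference, one way to make this approach compile (a sketch, not from the thread; add, launch_add and the stand-in types are illustrative names): in the debug build, skip the CUDA headers entirely and declare your own mutable stand-ins for the built-ins, so the serial launch loop is allowed to assign to them:

```cpp
// Sketch of a CPU-only emulation layer, assuming a _DEBUG build in which
// the CUDA headers are NOT included (every name below is our own stand-in,
// not the real CUDA declaration).
struct uint3 { unsigned x, y, z; };  // minimal stand-in for CUDA's uint3
#define __global__                   // kernels become plain CPU functions

// Mutable stand-ins for the (normally read-only) built-in variables.
static uint3 threadIdx, blockIdx, blockDim, gridDim;

// Note: the kernel uses full grid-stride indexing here; with only
// threadIdx.x / blockDim.x, sweeping over several blocks would add
// each element more than once.
__global__ void add(int n, float *x, float *y)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Serial "launch" that visits every (block, thread) pair exactly once.
void launch_add(int numBlocks, int blockSize, int n, float *x, float *y)
{
    gridDim.x  = (unsigned)numBlocks;
    blockDim.x = (unsigned)blockSize;
    for (blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++)
        for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++)
            add(n, x, y);
}
```

The catch is that the stand-ins must completely replace the CUDA declarations in the debug build; if device_launch_parameters.h is still included, you get exactly the “incompatible with const uint3 threadIdx” conflict described above.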

I could also change all my kernels like this:

void add(int n, float *x, float *y)
{
  #ifdef _DEBUG
     int index=mythreadidxx;
     int stride=myblockdimx;
  #else
     int index = threadIdx.x;
     int stride = blockDim.x;
  #endif

  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

But that would make every kernel heavier.
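A possible middle ground (a sketch, not from the thread; add_body and add_debug are hypothetical names): pass index and stride as ordinary parameters, so the kernel body contains no built-ins and no #ifdef at all, and only a thin launch wrapper differs between builds:

```cpp
// Sketch: the kernel body as a plain function, callable from CPU or GPU.
// Only the CPU debug side is shown here.
void add_body(int n, float *x, float *y, int index, int stride)
{
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// CPU debug "launch": one call per simulated thread, serialized.
void add_debug(int numBlocks, int blockSize, int n, float *x, float *y)
{
    int total = numBlocks * blockSize;          // total simulated threads
    for (int t = 0; t < total; t++)
        add_body(n, x, y, t, total);            // grid-stride indexing
}
```

In a CUDA build you would mark add_body as __host__ __device__ and call it from a one-line __global__ wrapper that computes index and stride from the built-ins.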

Thank you in advance.

Use a grid-stride loop and launch with one block per grid and one thread per block. See https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/ for details. That will make the CUDA kernel run serially.
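To make that concrete, here is the grid-stride pattern from the linked post applied to the add kernel above (a sketch): the index is computed from both blockIdx and threadIdx, and the stride spans the whole grid, so the kernel gives the same result for any launch configuration, including a fully serial one.

```cuda
// Grid-stride variant: correct for ANY launch configuration,
// including <<<1, 1>>>, which runs the whole loop on a single thread.
__global__ void add(int n, float *x, float *y)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Normal launch:          add<<<numBlocks, blockSize>>>(N, x, y);
// Serial debug launch:    add<<<1, 1>>>(N, x, y);   // same results
```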

Do you mean setting <<<1, 1>>>? If so, I can’t, because I designed my code to work with the data dimensions.

grid stride loop:

https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/