Unspecified Launch Failure Memory Read

I get an unspecified launch failure when I try to read global memory at a calculated index
and write the value to a calculated position in a shared memory block.

__global__ void filterPoints(int *inputData, const int width, const int height, const int treshold) {

extern __shared__ int tempValues[];

int xIndex = blockIdx.x * TILE_W + threadIdx.x - SOBEL_RADIUS;
int yIndex = blockIdx.y * TILE_H + threadIdx.y - SOBEL_RADIUS;

// check borders of image and clamp to edge
int xIn = max(0, xIndex);
xIn = min(xIndex, width - 1);
int yIn = max(yIndex, 0);
yIn = min(yIndex, height - 1);

const short tempIndex = threadIdx.y * blockDim.x + threadIdx.x;

// copy to shared memory
tempValues[tempIndex] = inputData[yIn * width + xIn];
__syncthreads();

}

If I comment out the line that writes to shared memory, I get no error. My thread-block size is 16x16 and my tile size is 14x14. My image is 384 wide and 288 high. Any ideas?
I looked at inputData and it was copied to the GPU correctly.

Maybe you haven’t properly allocated shared memory in your kernel launch.

It’s difficult to say what is wrong with any certainty without complete code.

You can use the methodology described here to debug a kernel execution error:

cuda - Unspecified launch failure on Memcpy - Stack Overflow

It’s very strange, because it works perfectly in other test cases, but I will try memcheck.

I call my kernel like this:

filterPoints<<<gridNewSize, blockSize, BLOCK_H * BLOCK_W * sizeof(int)>>>(d_inputData, width, height, threshold);

BLOCK_H and BLOCK_W are both 16.
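
To check the shared memory suggestion above, here is a small host-side sketch in plain C++ (not device code). It compares the number of ints requested in the launch with the largest tempIndex the kernel can compute. The helper name maxTempIndex is made up for illustration, and the block dimensions passed to it are assumptions taken from the posts.

```cpp
#include <cstddef>

constexpr int BLOCK_W = 16;
constexpr int BLOCK_H = 16;

// Dynamic shared memory requested in the kernel launch, in bytes.
constexpr std::size_t sharedBytes = BLOCK_H * BLOCK_W * sizeof(int);

// Largest value of threadIdx.y * blockDim.x + threadIdx.x for a given block shape.
// (Hypothetical helper, not part of the original code.)
constexpr int maxTempIndex(int blockDimX, int blockDimY) {
    return (blockDimY - 1) * blockDimX + (blockDimX - 1);
}

// The launch reserves room for 256 ints ...
static_assert(sharedBytes / sizeof(int) == 256, "16 * 16 ints requested");
// ... and a 16x16 block writes indices 0..255, so the allocation is large enough.
static_assert(maxTempIndex(16, 16) == 255, "last slot used by a 16x16 block");
```

Under these assumptions the write to tempValues stays inside the allocation, which suggests looking at the global memory read on the same line instead.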

I suggest providing complete code. It should only require about 10-15 more lines beyond the kernel code you have already shown.

Or provide the following:

  • how you allocate d_inputData
  • definition of TILE_H, TILE_W and SOBEL_RADIUS
  • definition of blockSize and gridNewSize
  • definition of width and height
  • whether this is on Windows or Linux

If any of the above depend on other constants or variables, provide the definitions of those as well.

int width: 384 / int height: 288

d_inputData:

const size_t sizeOfInput = width * height * sizeof(int);
cudaMalloc((void**)&d_inputData, sizeOfInput);
cudaMemcpy(d_inputData, inputData, sizeOfInput, cudaMemcpyHostToDevice);

definition of the constants

#define SOBEL_RADIUS 1
#define TILE_W 14
#define TILE_H 14
#define BLOCK_W (TILE_W + 2 * SOBEL_RADIUS)
#define BLOCK_H (TILE_H + 2 * SOBEL_RADIUS)

blockSize and gridNewSize

const dim3 blockSize(8, 8);
const int gridNewX = (width + TILE_W - 1) / TILE_W;
const int gridNewY = (height + TILE_H - 1) / TILE_H;
const dim3 gridNewSize(gridNewX, gridNewY);
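
As a rough check of these numbers, the following host-side C++ sketch (not device code) works out the grid size and the largest unclamped index the kernel would compute. The helper names gridDimFor and maxIndex are made up for illustration. The launch code above uses blockSize(8, 8) while the first post mentions 16x16, so both block widths are evaluated.

```cpp
constexpr int SOBEL_RADIUS = 1;
constexpr int TILE_W = 14;
constexpr int TILE_H = 14;

// Number of blocks needed to cover imageSize pixels with tiles of the given size
// (same rounding-up division as gridNewX / gridNewY).
constexpr int gridDimFor(int imageSize, int tile) {
    return (imageSize + tile - 1) / tile;
}

// Smallest unclamped index: block 0, thread 0.
constexpr int minIndex() {
    return 0 * TILE_W + 0 - SOBEL_RADIUS;
}

// Largest unclamped index: last block, last thread in that dimension.
constexpr int maxIndex(int imageSize, int tile, int blockDim) {
    return (gridDimFor(imageSize, tile) - 1) * tile + (blockDim - 1) - SOBEL_RADIUS;
}

static_assert(gridDimFor(384, TILE_W) == 28, "gridNewX for width 384");
static_assert(gridDimFor(288, TILE_H) == 21, "gridNewY for height 288");
static_assert(minIndex() == -1, "first thread starts one pixel left of the image");
static_assert(maxIndex(384, TILE_W, 8) == 384, "8-wide block: one past width - 1");
static_assert(maxIndex(384, TILE_W, 16) == 392, "16-wide block: well past width - 1");
```

Either way the unclamped x index runs from -1 to beyond width - 1 = 383, so the kernel depends on the clamp working at both ends.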

It runs on Windows. Thanks a lot for your help.

You have an error in these lines:

int xIn = max(0, xIndex);
xIn = min(xIndex, width - 1);
int yIn = max(yIndex, 0);
yIn = min(yIndex, height - 1);

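The min calls read xIndex and yIndex again, so the result of max is discarded and a negative index passes through unclamped. It should be: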

int xIn = max(0, xIndex);
xIn = min(xIn, width - 1);   // note change
int yIn = max(yIndex, 0);
yIn = min(yIn, height - 1);  // note change
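
To see the difference without a GPU, here is a minimal host-side sketch in plain C++ that runs both versions of the clamp on the index produced by thread 0 of block 0. The function names clampAsPosted and clampFixed are made up for illustration, and std::max / std::min stand in for the device-side max / min.

```cpp
#include <algorithm>

// Clamp as posted in the question: the second step reads `index` again,
// so the value produced by std::max is overwritten.
constexpr int clampAsPosted(int index, int size) {
    int in = std::max(0, index);
    in = std::min(index, size - 1);
    return in;
}

// Clamp as corrected: the second step continues from the already-clamped value.
constexpr int clampFixed(int index, int size) {
    int in = std::max(0, index);
    in = std::min(in, size - 1);
    return in;
}

// Block 0, thread 0 gives xIndex = 0 * 14 + 0 - 1 = -1.
static_assert(clampAsPosted(-1, 384) == -1, "posted version lets -1 through");
static_assert(clampFixed(-1, 384) == 0, "fixed version clamps to the left edge");

// Both versions handle the upper edge, which is why the bug is easy to miss.
static_assert(clampAsPosted(384, 384) == 383, "upper edge, posted version");
static_assert(clampFixed(384, 384) == 383, "upper edge, fixed version");
```

With the posted version, the first thread computes yIn * width + xIn = -1 * 384 + -1 = -385, which is a read before the start of inputData and a plausible cause of the launch failure.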

Thanks a lot