I get an unspecified launch failure when I try to read global memory at a calculated index and write the value to a calculated position in the shared memory block.
__global__ void filterPoints(int *inputData, const int width, const int height, const int threshold) {
    extern __shared__ int tempValues[];
    // global coordinates, shifted by the filter radius to cover the halo
    int xIndex = blockIdx.x * TILE_W + threadIdx.x - SOBEL_RADIUS;
    int yIndex = blockIdx.y * TILE_H + threadIdx.y - SOBEL_RADIUS;
    // check borders of image and clamp to edge
    int xIn = max(0, xIndex);
    xIn = min(xIndex, width - 1);
    int yIn = max(yIndex, 0);
    yIn = min(yIndex, height - 1);
    // linear index of this thread within the shared-memory tile
    const short tempIndex = threadIdx.y * blockDim.x + threadIdx.x;
    // copy to shared memory
    tempValues[tempIndex] = inputData[yIn * width + xIn];
    __syncthreads();
}
If I comment out the line that writes to shared memory, I get no error. My thread block size is 16x16 and my tile size is 14x14; my image is 384 pixels wide and 288 pixels high. Any ideas?
I looked at inputData and it was correctly copied to the GPU.
Maybe you haven't properly allocated shared memory in your kernel launch.
It's difficult to say what is wrong with any certainty without complete code.
You can use the methodology described here to debug a kernel execution error:
cuda - Unspecified launch failure on Memcpy - Stack Overflow
It's strange, because it works perfectly in other test cases, but I will try memcheck.
I call my kernel like this:
filterPoints<<<gridNewSize, blockSize, BLOCK_H * BLOCK_W * sizeof(int)>>>(d_inputData, width, height, threshold);
BLOCK_H and BLOCK_W are the same: 16
I suggest providing complete code. It should only require about 10-15 more lines beyond the kernel code you have already shown.
Or provide the following:
- how you allocate d_inputData
- definitions of TILE_H, TILE_W, and SOBEL_RADIUS
- definitions of blockSize and gridNewSize
- definitions of width and height
- whether this is on Windows or Linux

If any of the above depend on other constants or variables, provide those definitions as well.
width and height:
int width = 384; int height = 288;
d_inputData:
const size_t sizeOfInput = width * height * sizeof(int);
cudaMalloc((void**)&d_inputData, sizeOfInput);
cudaMemcpy(d_inputData, inputData, sizeOfInput, cudaMemcpyHostToDevice);
definitions of the constants:
#define SOBEL_RADIUS 1
#define TILE_W 14
#define TILE_H 14
#define BLOCK_W (TILE_W + 2 * SOBEL_RADIUS)
#define BLOCK_H (TILE_H + 2 * SOBEL_RADIUS)
blockSize and gridNewSize:
const dim3 blockSize(8, 8);
const int gridNewX = (width + TILE_W - 1) / TILE_W;
const int gridNewY = (height + TILE_H - 1) / TILE_H;
const dim3 gridNewSize(gridNewX, gridNewY);
It runs on Windows. Thanks a lot for your help.
You have an error in these lines:
int xIn = max(0, xIndex);
xIn = min(xIndex, width - 1);
int yIn = max(yIndex, 0);
yIn = min(yIndex, height - 1);
It should be:
int xIn = max(0, xIndex);
xIn = min(xIn, width - 1); // note change
int yIn = max(yIndex, 0);
yIn = min(yIn, height - 1); // note change