Kernel is not launching

/* I'm trying to solve a convolution problem, but the kernel is not launching in this program */

#include<stdio.h>
#include<stdlib.h>
#include<cuda.h>
#include<math.h>
__global__ void convo_kernel(int *d_input, int *d_output, int n);
int main()
{
int i,n = 16;
int *h_input = (int *)malloc(n*sizeof(int));
int *h_output = (int *)malloc(n*sizeof(int));
int *d_input, *d_output;
cudaMalloc((void **)&d_input, n*sizeof(int));
cudaMalloc((void **)&d_output, n*sizeof(int));

// printf("\n Input Array: ");
for(i = 0; i < n; i++)
{
h_input[i] = 1;
//printf(" %d ", h_input[i]);
}

cudaMemcpy(d_input, h_input, n*sizeof(int), cudaMemcpyHostToDevice);
int Block = 4;
int Threads = 4;
convo_kernel<<<Block,Threads>>>(d_input, d_output, n);
return(0);
}
__global__ void convo_kernel(int *d_input, int *d_output, int n)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
printf("\n i = %d",i);
}

Kernel launches are asynchronous with respect to the host thread.

Therefore after this kernel launch:

convo_kernel<<<Block,Threads>>>(d_input, d_output, n);

The host thread continues on and begins processing the next line of code:

return(0);

which terminates the program. The host and the OS then begin application shutdown before the kernel gets a chance to execute, so its printf output never appears.

One possible fix is to add a synchronization function after the kernel launch:

convo_kernel<<<Block,Threads>>>(d_input, d_output, n);
cudaDeviceSynchronize();
return(0);

This is not normally needed, because a typical program has a cudaMemcpy() or another synchronizing operation after the kernel launch, which waits for the kernel to finish.
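To illustrate the point about cudaMemcpy(), here is a minimal sketch in which a blocking device-to-host copy follows the kernel launch and no explicit cudaDeviceSynchronize() is used. The kernel copy_kernel is a hypothetical stand-in that only copies input to output; it is not the convolution from the question, and error checking is omitted for brevity.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (an assumption for illustration):
// copies input to output so the host has a result to read back.
__global__ void copy_kernel(const int *d_input, int *d_output, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_output[i] = d_input[i];
}

int main(void)
{
    int n = 16;
    size_t bytes = n * sizeof(int);
    int *h_input  = (int *)malloc(bytes);
    int *h_output = (int *)malloc(bytes);
    int *d_input, *d_output;

    cudaMalloc((void **)&d_input, bytes);
    cudaMalloc((void **)&d_output, bytes);

    for (int i = 0; i < n; i++)
        h_input[i] = 1;

    cudaMemcpy(d_input, h_input, bytes, cudaMemcpyHostToDevice);
    copy_kernel<<<4, 4>>>(d_input, d_output, n);

    // Blocking copy issued to the same (default) stream as the kernel:
    // it waits for the kernel to finish before copying the results.
    cudaMemcpy(h_output, d_output, bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; i++)
        printf(" %d", h_output[i]);
    printf("\n");

    cudaFree(d_input);
    cudaFree(d_output);
    free(h_input);
    free(h_output);
    return 0;
}
```

Because the copy back to the host cannot start until the kernel has completed, every element of h_output should hold the value the kernel wrote. A kernel whose only effect is device-side printf has no such copy after it, which is why the explicit synchronization is needed in the original program.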

If you continue to have trouble, I suggest running your program with cuda-memcheck and adding proper CUDA error checking to your code.
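One common way to add error checking is to wrap every runtime call in a macro and to check the kernel launch separately. The sketch below is one possible form, not the only one: the macro name CUDA_CHECK is an assumption made for illustration and is not part of the CUDA API, and cudaMemset is used here only to give d_input defined contents.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name is an assumption, not a CUDA API):
// prints the error string and location, then exits, if a call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void convo_kernel(int *d_input, int *d_output, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    printf("\n i = %d", i);
}

int main(void)
{
    int n = 16;
    int *d_input, *d_output;

    CUDA_CHECK(cudaMalloc((void **)&d_input, n * sizeof(int)));
    CUDA_CHECK(cudaMalloc((void **)&d_output, n * sizeof(int)));
    CUDA_CHECK(cudaMemset(d_input, 0, n * sizeof(int)));

    convo_kernel<<<4, 4>>>(d_input, d_output, n);
    CUDA_CHECK(cudaGetLastError());       // reports errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_input));
    CUDA_CHECK(cudaFree(d_output));
    return 0;
}
```

The two checks after the launch serve different purposes: cudaGetLastError() catches problems such as an invalid launch configuration, while the return value of cudaDeviceSynchronize() surfaces failures that occur during execution. To use cuda-memcheck, run it with the executable as its argument, for example `cuda-memcheck ./a.out`; on recent CUDA toolkits the compute-sanitizer tool takes its place.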