Hey Everyone,
I’m trying to run this kernel. Everything works until the total size of the inputs exceeds 16 kB, at which point the kernel fails. I thought 16 kB was the limit for shared memory, not device memory. With arr_size = 496 everything runs fine, but if I increase arr_size to 512 I get an “invalid argument” error. Does anyone have any ideas?
Thanks,
Joe
unsigned int arr_size=496;
unsigned int mem_size = sizeof( float) * arr_size;
float* h_x = (float*) malloc( mem_size);
float* d_x;
//initialize d_x
cutilSafeCall( cudaMalloc( (void**) &d_x, mem_size));
cutilSafeCall( cudaMemcpy( d_x, h_x, mem_size,cudaMemcpyHostToDevice) );
float* d_ox;
cutilSafeCall( cudaMalloc( (void**) &d_ox, mem_size));
float* h_ox = (float*) malloc( mem_size);
The same allocation code is repeated for _y, _z, _theta, and _phi (inputs) and for _oy and _oz (outputs).
dim3 cylgrid(arr_size/16,1,1);
dim3 cylthreads(16,1,1);
cylkernel<<<cylgrid,cylthreads,mem_size*8>>>(d_x,d_y,d_z,d_theta,d_phi,d_ox,d_oy,d_oz);
cutilCheckMsg("Kernel execution failed");
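One thing worth checking in the launch above: the third value inside <<<...>>> is the number of bytes of dynamically allocated shared memory per block, so mem_size*8 is a shared memory request, not a device memory one. The sketch below is plain host-side C that only does that arithmetic. The helper names are hypothetical, and the 16384-byte limit is an assumption (the per-block shared memory size on compute capability 1.x devices), not something taken from the code in this post.

```c
#include <stddef.h>

/* Assumed per-block shared memory limit: 16 KB, as on compute capability 1.x GPUs. */
#define ASSUMED_SHARED_LIMIT_BYTES 16384u

/* Bytes requested by the third launch parameter, mem_size*8,
   where mem_size = sizeof(float) * arr_size. */
static size_t dynamic_shared_bytes(unsigned int arr_size)
{
    return sizeof(float) * (size_t)arr_size * 8u;
}

/* Bytes used by the kernel's static arrays sx, sy, sz (16 floats each). */
static size_t static_shared_bytes(void)
{
    return 3u * 16u * sizeof(float);
}

/* Returns 1 if dynamic + static shared memory fits under the assumed limit, else 0. */
static int fits_in_shared(unsigned int arr_size)
{
    return dynamic_shared_bytes(arr_size) + static_shared_bytes()
           <= ASSUMED_SHARED_LIMIT_BYTES;
}
```

Under these assumptions, fits_in_shared(496) returns 1 (15872 + 192 bytes) and fits_in_shared(512) returns 0 (16384 + 192 bytes), which lines up with where the launch starts failing.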
__global__ void cylkernel( float* g_ix,float* g_iy,float* g_iz,float* g_itheta,float* g_iphi,float* g_ox,float* g_oy,float* g_oz)
{
#include "template.h"
const unsigned int tid = threadIdx.x;
const unsigned int bid = blockIdx.x;
__shared__ float sx[16],sy[16],sz[16];
sx[tid]=g_ix[bid*16+tid];
sy[tid]=g_iy[bid*16+tid];
sz[tid]=g_iz[bid*16+tid];
__syncthreads();
sx[tid]=10.0;
sy[tid]=10.0;
sz[tid]=10.0;
g_ox[bid*16+tid]=sx[tid];
g_oy[bid*16+tid]=sy[tid];
g_oz[bid*16+tid]=sz[tid];
__syncthreads();
}