Hi, I am just trying to get my head around CUDA and have run into a problem trying to store arrays in shared memory. My current code results in a segmentation fault, and I don't know why.
I am trying to compute the NMI of a set of variables, where I have the following (allocation sketch after the list):
xd = the dataset (a matrix stored as a single array of ints), size = nrow*ncol, on device
ncatd = the number of categories for each variable in xd (an array of ints), size = ncol, on device
entropies = the final destination for the computed entropies (an array of floats), size = ncol, on device
nrow = number of rows in xd
ncol = number of columns in xd
maxCat = the maximum table size, in number of entries (the largest category count)
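For completeness, the device buffers are allocated and filled roughly like this (a sketch; x_h and ncat_h are placeholder names for the host-side arrays):
[codebox]
int *xd, *ncatd;
float *entropies;
cudaMalloc((void**)&xd, nrow * ncol * sizeof(int));
cudaMalloc((void**)&ncatd, ncol * sizeof(int));
cudaMalloc((void**)&entropies, ncol * sizeof(float));
// copy the host data across (x_h / ncat_h are just placeholder names)
cudaMemcpy(xd, x_h, nrow * ncol * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(ncatd, ncat_h, ncol * sizeof(int), cudaMemcpyHostToDevice);
[/codebox]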
To compute the NMI I need to construct a frequency table of the categories within each column x[i]. I am trying to build these tables in shared memory, where each thread computes the entropy for one column x[i] and builds one table. I want each table to be stored at a unique shared memory location, something like:
shared_memory = [table1, table2, table3, ...], where each table slot is the maximum table size.
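Concretely, the layout I am aiming for is one table slot per thread inside a block's dynamic shared memory, a sketch (tblsize here stands for maxCat, the per-table entry count):
[codebox]
extern __shared__ int tdata[]; // sized at launch: (threads per block) * tblsize * sizeof(int)
// each thread of a block takes its own tblsize-entry slot
int *table = &tdata[threadIdx.x * tblsize];
[/codebox]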
As I am only allowed 16 KB of shared memory, I first determine the maximum table size (maxCat), then work out the maximum number of threads I can use, and then the number of blocks.
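For example, with maxCat = 100 categories, each table takes 100*sizeof(int) = 400 bytes, so at most floor(16384/400) = 40 tables (and hence threads) fit in shared memory; blocksize = floor(sqrt(40)) = 6, giving 6*6 = 36 threads per block, 100*36*4 = 14400 bytes of shared memory per block, and xBlocks = ceil(ncol/36) blocks.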
This implementation seems to work for a small number of columns (ncol), say < 100, but as the number of columns increases I get memory issues that result in a segmentation fault, and I don't know why.
My current understanding, after only one week of CUDA programming, is that __syncthreads() ensures all memory accesses by the threads of a block have completed before execution continues. However, I have had trouble applying __syncthreads() here, because all the computation is done inside a conditional and not all threads reach that point. Also, since each table is stored at a different shared memory address, I am not sure __syncthreads() is even required. If __syncthreads() is the answer, where should it go?
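To illustrate, this is the pattern I am unsure about, a barrier inside the guard (sketch):
[codebox]
if (bx < ncol) {
    // ... build and read this thread's table ...
    __syncthreads(); // threads with bx >= ncol never reach this barrier
}
[/codebox]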
What have I done wrong here?
[codebox]
maxThreadNumber = floor((16384)/maxTableSize); // max number of tables that fit in 16 KB (maxTableSize = maxCat*sizeof(int) bytes)
maxBlockSize = floor(sqrt(maxThreadNumber)); // the block is blocksize*blocksize threads, one table each
blocksize = 16; // default: 16*16 = 256 threads per block
if (maxBlockSize < 16) blocksize = maxBlockSize; // if we can't do 16*16 threads per block, reduce the block size
// then the shared memory = (table size)*(number of threads)*sizeof(int)
sharedMemorySize = maxCat*blocksize*blocksize*sizeof(int);
// and then set the blocks
xBlocks = ncol/(blocksize*blocksize) + (ncol % (blocksize*blocksize) == 0 ? 0 : 1);
dim3 entGrid(xBlocks);
dim3 entBlock(blocksize*blocksize);
// and finally call the kernel
entropy_kernel<<<entGrid, entBlock, sharedMemorySize>>>(xd,ncatd,nrow,ncol,maxCat,entropies);

__global__ void entropy_kernel(int *x, int *ncat, int nrow, int ncol, int tblsize, float *entropies) {
int bx;
int n,m;
int *table;
float p=0,ent = 0;
extern __shared__ int tdata[];
// Matrix column index
bx = blockIdx.x * blockDim.x + threadIdx.x; // global thread index = current column of x
// don't do anything if outside the range of the data
if (bx < ncol) {
// define table to be a unique part of shared memory
table = &tdata[bx * tblsize];
// initialize all table values to 0
for (n = 0;n < ncat[bx];n = n + 1) table[n] = 0;
// create the table counts
for (n = 0;n < nrow;n = n + 1) {
m = x[nrow*bx + n];
table[m] = table[m] + 1;
}
// compute the entropy of this column
ent = 0;
for (n = 0;n < ncat[bx];n = n + 1) {
p = (float)(table[n])/(float)(nrow);
if (p > 0) ent = ent - p*log2f(p); // log2f: single-precision log2 in device code
}
// store
entropies[bx] = ent;
}
}[/codebox]
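For reference, a minimal way to surface launch/kernel errors on the host would be something like this sketch (runtime API; the printf handling is just illustrative):
[codebox]
// right after the kernel launch:
cudaError_t err = cudaGetLastError(); // catches launch failures (e.g. bad config, too much shared memory)
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize(); // catches errors raised while the kernel runs
if (err != cudaSuccess) printf("kernel error: %s\n", cudaGetErrorString(err));
[/codebox]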