Shared Memory issues

Hi, I am just trying to get my head around CUDA and have run into a problem in trying to store arrays in shared memory. My current code results in a segmentation fault and i don’t know why.

I am trying to compute the NMI of a set of variables where i have the following variables

xd = the dataset (a matrix stored as a single array of ints) size = nrow*ncol - on device

ncatd = the number of categories for each variable in xd (stored as an array of ints) size = ncol - on device

entropies = the final destination for the computed entropy (stored as an array of floats) size = ncol - on device

nrow = number of rows in xd

ncol = number of cols in xd

maxCat = the maximum table size in number of bytes

To compute the NMI i need to construct a frequency table of the categories within each x[i]. I am trying to build this table in shared memory where each thread computes the NMI for x[i] and build 1 table. I want each table to be stored in a unique shared memory location, something like:

shared_memory = [table1,table2,table3 …] where each table location is the maximum table size.

As i am only allowed 16kb of shared memory i first determine the maximum table size (maxCat)

then work out the maximum number of threads i can use and then the number of blocks.

This implementation seems to work for small number of columns (ncol), say < 100. But as the number of columns increases i get memory issues which results in a segmentation fault. I don’t know why.

My current understanding, after CUDA programming for 1 week only is that __syncthreads() ensures that all memory accesses have been completed before progressing into the code. However i have had trouble with applying __syncthreads here for some reason as all computations are done within a conditional and not all threads reach this point. Also as each table in shared memory is stored in a different address, i am not sure if __syncthreads is required. If __syncthreads is the answer where should it go?

What have i done wrong here?


maxThreadNumber = floor((16384)/maxTableSize); // maximum number of tables that can be stored in 16 Kb.

maxBlockSize = floor(sqrt(maxThreadNumber)); // maximum number of blocksize

if (maxBlockSize < 16) blocksize = maxBlockSize; // if can’t do 16*16 threads per block, then reduce the block size

// then the shared memory = (table size)*(number of threads)*sizeof(int)

sharedMemorySize = maxCatblocksizeblocksize*sizeof(int);

// and then set the blocks

xBlocks = ncol/(blocksize * blocksize) + (ncol % blocksize*blocksize == 0?0:1);

dim3 entGrid(xBlocks);

dim3 entBlock(blocksize*blocksize);

// and finally call the kernel

entropy_kernel<<<entGrid, entBlock, sharedMemorySize>>>(xd,ncatd,nrow,ncol,maxCat,entropies);

global void entropy_kernel(int *x,int *ncat,int nrow,int ncol,int tblsize,float *entropies) {

int bx;

int n,m;

int *table;

float p=0,ent = 0;

    extern __shared__ int tdata[];

// Matrix column index

bx = blockIdx.x * blockDim.x + threadIdx.x; // current row of A

    //don't do anything is outside of the range of the data

if (bx < ncol) {

           // define table to be a unique part of shared memory

	table = &tdata[bx * tblsize];

// initialize all table values to 0

	for (n = 0;n < ncat[bx];n = n + 1) table[n] = 0;

// create the table counts

	for (n = 0;n < nrow;n = n + 1) {

		m = x[nrow*bx + n];

		table[m] = table[m] + 1;


// compute the NMI

	ent = 0;

	for (n = 0;n < ncat[bx];n = n + 1) {

		p = (float)(table[n])/(float)(nrow);

		if (p > 0) ent = ent - p*log2(p);


           // store

	entropies[bx] = ent;



You do indeed have a maximum of 16 KB of shared memory. In practice it may be a few hundred bytes less to hold kernel arguments and thread IDs… so watch out for that.

But your real problem is you’re mixing up shared memory size allocs… you’re allocating based on one size assumption of 1 byte but accessing based on ints of 4 bytes.

Print out the value of sharedMemorySize, you’ll see you’re requesting much more than 16K, especially since your block size compute is based on bytes, but you have a sizeof(int) multiplier in there when you actually declare the shared memory use.

So your kernel crashes because sometimes the launch may still succeed but you access shared memory way beyond the bounds you told CUDA you were going to use (that multiplier thing again!)

So put an assert on that shared memory size, check for kernel error codes whenever you launch, and check your array indexing in your kernel to only access what you said you needed.