Hi, I am just trying to get my head around CUDA and have run into a problem trying to store arrays in shared memory. My current code results in a segmentation fault and I don't know why.

I am trying to compute the NMI of a set of variables, where I have the following:

xd = the dataset (a matrix stored as a single array of ints) size = nrow*ncol - on device

ncatd = the number of categories for each variable in xd (stored as an array of ints) size = ncol - on device

entropies = the final destination for the computed entropy (stored as an array of floats) size = ncol - on device

nrow = number of rows in xd

ncol = number of cols in xd

maxCat = the maximum table size in number of entries (ints, not bytes)

To compute the NMI I need to construct a frequency table of the categories within each x[i]. I am trying to build these tables in shared memory, where each thread computes the NMI for x[i] and builds one table. I want each table to be stored at a unique shared memory location, something like:

shared_memory = [table1,table2,table3 …] where each table location is the maximum table size.

As I am only allowed 16 KB of shared memory, I first determine the maximum table size (maxCat), then work out the maximum number of threads I can use, and then the number of blocks.

This implementation seems to work for a small number of columns (ncol), say < 100, but as the number of columns increases I get memory issues that result in a segmentation fault, and I don't know why.

My current understanding, after only one week of CUDA programming, is that __syncthreads() ensures all of a block's memory accesses have completed before execution continues. However, I have had trouble applying __syncthreads() here, because all the computation happens inside a conditional and not all threads reach that point. Also, since each table is stored at a different shared memory address, I am not sure __syncthreads() is required at all. If __syncthreads() is the answer, where should it go?
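To show what I mean, here is a stripped-down sketch (a made-up kernel, not my real one) of where I think the barrier would have to sit: every thread reaches __syncthreads(), and only the divergent work is inside the guard. It also shows each thread slicing its own piece of the dynamic shared array by its block-local index:

```
// Made-up kernel, for illustration only.
__global__ void sync_sketch(const int *x, int *out, int n, int tblsize) {
    extern __shared__ int tdata[];
    // each thread's slice of shared memory, indexed by threadIdx.x
    // (shared memory is per-block, so a block-local index is used)
    int *myTable = &tdata[threadIdx.x * tblsize];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) myTable[0] = x[i];   // divergent work: no barrier in here
    __syncthreads();                // reached by ALL threads of the block
    if (i < n) out[i] = myTable[0]; // more guarded work after the barrier
}
```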

What have i done wrong here?

[codebox]

maxThreadNumber = floor((16384)/maxTableSize); // maximum number of tables that can be stored in 16 KB.

maxBlockSize = floor(sqrt(maxThreadNumber)); // maximum block edge length (blocks are square)

blocksize = 16; // default to 16*16 threads per block

if (maxBlockSize < 16) blocksize = maxBlockSize; // if we can't do 16*16 threads per block, reduce the block size

// then the shared memory = (table size)*(number of threads)*sizeof(int)

sharedMemorySize = maxCat*blocksize*blocksize*sizeof(int);

// and then set the blocks

xBlocks = ncol/(blocksize * blocksize) + (ncol % (blocksize*blocksize) == 0 ? 0 : 1);

dim3 entGrid(xBlocks);

dim3 entBlock(blocksize*blocksize);

// and finally call the kernel

entropy_kernel<<<entGrid, entBlock, sharedMemorySize>>>(xd,ncatd,nrow,ncol,maxCat,entropies);

```
__global__ void entropy_kernel(int *x, int *ncat, int nrow, int ncol, int tblsize, float *entropies) {
    int bx;
    int n, m;
    int *table;
    float p = 0, ent = 0;
    extern __shared__ int tdata[];
    // Matrix column index
    bx = blockIdx.x * blockDim.x + threadIdx.x; // current column of x
    // don't do anything if outside the range of the data
    if (bx < ncol) {
        // define table to be a unique part of shared memory
        table = &tdata[bx * tblsize];
        // initialize all table values to 0
        for (n = 0; n < ncat[bx]; n = n + 1) table[n] = 0;
        // create the table counts
        for (n = 0; n < nrow; n = n + 1) {
            m = x[nrow*bx + n];
            table[m] = table[m] + 1;
        }
        // compute the entropy
        ent = 0;
        for (n = 0; n < ncat[bx]; n = n + 1) {
            p = (float)(table[n]) / (float)(nrow);
            if (p > 0) ent = ent - p * log2(p);
        }
        // store
        entropies[bx] = ent;
    }
}
```

[/codebox]