CUDA by Example: questions about some of the examples

Hello all,

I finished reading Cuda By Example recently and there are a couple of things that I need clarification on.

  1. One of the examples allocates shared memory for the block inside the kernel:

__global__ void histo_kernel( unsigned char *buffer, long size, unsigned int *histo ) {
    __shared__ unsigned int temp[256];
    // some other code
}

Why do we define the shared memory there, inside the kernel? Isn’t the shared memory then being allocated multiple times, once for each thread in a particular block?

  2. The Heat 2D example from chapter 7 takes a really long time for each render: somewhere between 350 ms and 500 ms, depending on the arch I’m using (1.0, 1.3, 2.0). Is this time normal for this example?

Okay, this next one isn’t a CUDA by Example question, but I’m trying to use CURAND to generate some random numbers on the device. However, the numbers are not completely random: I get a lot of repeated values. For example, the output looks something like this: {5642, 12314, 8469, 12314, 8964302, 5642, 96341}

I’m using the same major parts of the code that are outlined in CURAND_Library.pdf.

Thanks for your time

No. I think that the memory is allocated only once for each block.

Shared memory is by definition always allocated per block.
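
To make that concrete, here is a minimal sketch (not the book's histogram code) in which every thread of a block increments the same `__shared__` counter. The names `count_threads_per_block` and `shared_is_per_block` are made up for illustration, and the sketch assumes a device that supports atomics on shared memory (compute capability 1.2 or higher). If the variable were allocated once per thread, each block would report 1; because there is one copy per block, each block should report its own thread count.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block counts its own threads in shared memory.
__global__ void count_threads_per_block(unsigned int *out)
{
    // Declared inside the kernel, but only one copy exists per block.
    __shared__ unsigned int counter;

    if (threadIdx.x == 0) counter = 0;  // one thread initialises the block's copy
    __syncthreads();                     // make the initial value visible block-wide

    atomicAdd(&counter, 1u);             // every thread updates the same variable
    __syncthreads();                     // wait until all increments are done

    if (threadIdx.x == 0) out[blockIdx.x] = counter;
}

// Hypothetical host helper: returns true if every block reported exactly
// `threads` increments, i.e. the counter was shared by the whole block.
bool shared_is_per_block(int blocks, int threads)
{
    unsigned int *d_out = NULL;
    size_t bytes = blocks * sizeof(unsigned int);
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) return false;

    count_threads_per_block<<<blocks, threads>>>(d_out);

    std::vector<unsigned int> h_out(blocks);
    bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int b = 0; ok && b < blocks; ++b)
        ok = (h_out[b] == (unsigned int)threads);

    cudaFree(d_out);
    return ok;
}
```

Calling `shared_is_per_block(4, 64)` from `main` should return true on a working device. Note that blocks never see each other's copy of `counter`.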

Can you show the code you use to set up and call CURAND? Results like the ones you are seeing are sometimes caused by a failure to initialize the curandState struct before using it.
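
For comparison, here is a minimal sketch of the usual device-API pattern, assuming one `curandState` per thread. The kernel and helper names (`init_states`, `draw`, `count_changed_between_launches`) are hypothetical, not taken from the CURAND guide. Two things commonly lead to repeated values: skipping `curand_init`, and never copying the advanced state back to global memory, in which case the next launch starts from the same state and reproduces the same numbers.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Same seed for everyone, a distinct subsequence per thread.
__global__ void init_states(curandState *states, unsigned long long seed, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) curand_init(seed, id, 0, &states[id]);
}

__global__ void draw(curandState *states, unsigned int *out, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    curandState local = states[id];  // work on a local copy
    out[id] = curand(&local);
    states[id] = local;              // write back so the next launch continues the sequence
}

// Hypothetical host helper: draws once per thread in two separate launches and
// returns how many threads got a different value the second time (-1 on error).
int count_changed_between_launches(int blocks, int threads, unsigned long long seed)
{
    int n = blocks * threads;
    curandState *d_states = NULL;
    unsigned int *d_out = NULL;
    if (cudaMalloc((void **)&d_states, n * sizeof(curandState)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_out, n * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_states);
        return -1;
    }

    std::vector<unsigned int> first(n), second(n);
    init_states<<<blocks, threads>>>(d_states, seed, n);
    draw<<<blocks, threads>>>(d_states, d_out, n);
    cudaMemcpy(first.data(), d_out, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    draw<<<blocks, threads>>>(d_states, d_out, n);
    cudaError_t err = cudaMemcpy(second.data(), d_out, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);

    int changed = 0;
    for (int i = 0; i < n; ++i) changed += (first[i] != second[i]);

    cudaFree(d_out);
    cudaFree(d_states);
    return err == cudaSuccess ? changed : -1;
}
```

With the write-back in place, nearly every thread should get a new value on the second launch. Removing the `states[id] = local;` line should make the helper return 0, because each launch then replays the same draw.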

This is the basic outline for curand in my code:

#include <ctime>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define Population_Size 100
#define Dim 20

struct Chrome {
    int fitness[Dim];
};

// One RNG state per thread: same seed, distinct subsequence (id) per thread.
__global__ void SeedSet(curandState *state, unsigned long seed)
{
    int id = threadIdx.x + blockIdx.x * Dim;
    curand_init(seed, id, 0, &state[id]);
}

__global__ void Fitnesseval(Chrome *dPop, curandState *state)
{
    curandState localState = state[Dim * blockIdx.x + threadIdx.x]; // setting state for RNG
    dPop[blockIdx.x].fitness[threadIdx.x] = curand(&localState);
}

int main(void)
{
    Chrome hPop[Population_Size];
    Chrome *dPop;
    cudaMalloc((void **)&dPop, sizeof(Chrome) * Population_Size);

    curandState *devStates;
    cudaMalloc((void **)&devStates, Population_Size * Dim * sizeof(curandState));

    // One block per population member, one thread per fitness entry.
    SeedSet<<<Population_Size, Dim>>>(devStates, time(NULL));
    Fitnesseval<<<Population_Size, Dim>>>(dPop, devStates);

    cudaMemcpy(hPop, dPop, sizeof(Chrome) * Population_Size, cudaMemcpyDeviceToHost);

    cudaFree(devStates);
    cudaFree(dPop);
    return 0;
}

Edit: __________________________

Now that I have run this little example, it works just fine. I should go take a look at my original code; I must have screwed up somewhere.

Thanks anyways!