Cuda By Example Questions about some of the examples

spwanasin · November 9, 2011, 10:50pm

Hello all,

I finished reading Cuda By Example recently and there are a couple of things that I need clarification on.

One of the examples declares shared memory for the block inside the kernel:

__global__ void histo_kernel( unsigned char *buffer, long size, unsigned int *histo )
{
    __shared__ unsigned int temp[256];
    // ... some other code ...
}

Why do we define the shared memory inside the kernel? Isn’t the shared memory then being allocated multiple times, once for each thread in a particular block?

The 2D heat transfer example from chapter 7 takes a really long time for each render: somewhere between 350 ms and 500 ms, depending on the architecture I’m compiling for (1.0, 1.3, 2.0). Is this time normal for this example?

Okay, this next one isn’t a Cuda By Example question, but I’m trying to use CURAND to generate some random numbers on the device. However, the numbers are not completely random: I get a lot of repeated values. For example, the output looks something like this: {5642, 12314, 8469, 12314, 8964302, 5642, 96341}

I’m using the same major parts of the code that are outlined in CURAND_Library.pdf.

Thanks for your time

pasoleatis · November 10, 2011, 2:38pm

No. I think that the memory is allocated only once for each block.

seibert · November 10, 2011, 3:08pm

Shared memory is by definition always allocated per block.
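
To illustrate the point, here is a minimal sketch of a block-level histogram along the lines of the snippet in the question. It is not the book's code: the kernel body, the host helper `histogram_total`, and the assumption of 256 threads per block are all filled in for illustration. The `__shared__` declaration appears in code that every thread executes, but it names a single array per block that all threads in that block share.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// temp[] exists once per block; every thread in the block sees the same copy.
__global__ void histo_kernel(const unsigned char *buffer, long size,
                             unsigned int *histo)
{
    __shared__ unsigned int temp[256];   // one array per block, not per thread

    // Assumption: blockDim.x == 256, so each thread clears exactly one bin.
    temp[threadIdx.x] = 0;
    __syncthreads();

    // Grid-stride loop: all threads of the block accumulate into the shared bins.
    long i = threadIdx.x + (long)blockIdx.x * blockDim.x;
    long stride = (long)blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&temp[buffer[i]], 1u);
        i += stride;
    }
    __syncthreads();

    // Each thread merges one bin of the block's partial result into global memory.
    atomicAdd(&histo[threadIdx.x], temp[threadIdx.x]);
}

// Hypothetical host helper: builds a test buffer, runs the kernel,
// and returns the sum of all bins (which should equal size).
long histogram_total(long size, int blocks)
{
    assert(size > 0 && blocks > 0);
    unsigned char *h_buf = new unsigned char[size];
    for (long i = 0; i < size; ++i) h_buf[i] = (unsigned char)(i % 256);

    unsigned char *d_buf;
    unsigned int *d_histo;
    cudaMalloc(&d_buf, size);
    cudaMalloc(&d_histo, 256 * sizeof(unsigned int));
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, 256 * sizeof(unsigned int));

    histo_kernel<<<blocks, 256>>>(d_buf, size, d_histo);

    unsigned int h_histo[256];
    cudaMemcpy(h_histo, d_histo, sizeof(h_histo), cudaMemcpyDeviceToHost);

    long total = 0;
    for (int b = 0; b < 256; ++b) total += h_histo[b];

    cudaFree(d_buf);
    cudaFree(d_histo);
    delete[] h_buf;
    return total;
}
```

If each thread had its own private `temp`, the final merge would only ever see that thread's own counts in one bin and the total would not add up; calling `histogram_total(size, blocks)` and comparing the result with `size` is a quick way to check that the bins are shared across the block.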

Can you show the code you use to set up and call CURAND? Results like the ones you are seeing are sometimes caused by a failure to initialize the curandState struct before using it.
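
For reference, the initialization in question can be sketched roughly as follows. This is only an assumed setup, not the original poster's code: the kernel names `setup_states` and `draw` and the host helper `count_duplicates` are hypothetical. Each thread calls `curand_init` with the same seed but its own sequence number, and the generating kernel copies the state back so a later launch continues the sequence instead of replaying it.

```cuda
#include <algorithm>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Initialize one state per thread: same seed, distinct sequence number.
__global__ void setup_states(curandState *state, unsigned long long seed, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n)
        curand_init(seed, id, 0, &state[id]);
}

// Draw one 32-bit value per thread from its own state.
__global__ void draw(curandState *state, unsigned int *out, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id >= n) return;
    curandState local = state[id];   // work on a register copy
    out[id] = curand(&local);
    state[id] = local;               // write back so the next launch advances
}

// Hypothetical host helper: returns how many of n draws repeat an earlier value.
int count_duplicates(int n, unsigned long long seed)
{
    assert(n > 0);
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    curandState *d_states;
    unsigned int *d_out;
    cudaMalloc(&d_states, n * sizeof(curandState));
    cudaMalloc(&d_out, n * sizeof(unsigned int));

    setup_states<<<blocks, threads>>>(d_states, seed, n);
    draw<<<blocks, threads>>>(d_states, d_out, n);

    std::vector<unsigned int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_states);
    cudaFree(d_out);

    // Sort and count adjacent equal values.
    std::sort(h_out.begin(), h_out.end());
    int dups = 0;
    for (int i = 1; i < n; ++i)
        if (h_out[i] == h_out[i - 1]) ++dups;
    return dups;
}
```

With a few thousand 32-bit draws, repeated values should be very rare, so a large count from something like `count_duplicates(2000, 1234ULL)` would point to states that were never initialized or threads that share a state.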

spwanasin · November 10, 2011, 10:54pm

This is the basic outline for curand in my code:

#include <ctime>
#include <curand_kernel.h>

#define Population_Size 100
#define Dim 20

struct Chrome {
    int fitness[Dim];
};

// One RNG state per thread: same seed, distinct sequence number.
__global__ void SeedSet(curandState *state, unsigned long seed)
{
    int id = threadIdx.x + blockIdx.x * Dim;
    curand_init(seed, id, 0, &state[id]);
}

__global__ void Fitnesseval(Chrome *dPop, curandState *state)
{
    curandState localState = state[Dim * blockIdx.x + threadIdx.x]; // setting state for RNG
    dPop[blockIdx.x].fitness[threadIdx.x] = curand(&localState);
}

int main(void)
{
    Chrome hPop[Population_Size];
    Chrome *dPop;
    cudaMalloc((void **)&dPop, sizeof(Chrome) * Population_Size);

    curandState *devStates;
    cudaMalloc((void **)&devStates, Population_Size * Dim * sizeof(curandState));

    // One block per population member, one thread per dimension.
    SeedSet<<<Population_Size, Dim>>>(devStates, time(NULL));
    Fitnesseval<<<Population_Size, Dim>>>(dPop, devStates);

    cudaMemcpy(hPop, dPop, sizeof(Chrome) * Population_Size, cudaMemcpyDeviceToHost);

    cudaFree(devStates);
    cudaFree(dPop);
    return 0;
}

Edit:

Now that I have run this little example, it works just fine. I should go take a look at my original code; I must have screwed up somewhere.

Thanks anyways!
