I recently started using CUDA and I am having some trouble using shared memory. I hope someone here can help me figure out the problems in my code. The following code tries to calculate the time-averaged potential, which is saved in avg_potential; the array named potential holds the values calculated in each iteration. The code compiles and links, but does not run successfully.
```
// define number of threads and blocks
void avgCenterLinePotential( int arraysize, float *avgpotential, float *potential, int iteration )
{
    int numThreads = 256;                     // arraysize >= 256
    int numBlocks  = arraysize / numThreads;  // assuming arraysize % numThreads == 0
    averageCenterLinePotential<<< numBlocks, numThreads >>>( arraysize, avgpotential, potential, iteration );
    cutilCheckMsg("avgCenterLinePotential kernel execution failed");
}
```
```
// kernel
__global__
void averageCenterLinePotential( int arraysize, float *avg_potential, float *potential, int iteration )
{
    int index = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    if ( index >= arraysize ) return;
    avg_potential[index] *= iteration;
    int sharedarraysize = 256;
    for ( int i = 0; i < intDivUp( arraysize, sharedarraysize ); i++ ) {
        __shared__ float sharedArray[256];
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            sharedArray[threadIdx.x] = potential[index];
        }
        __syncthreads();
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            avg_potential[index] += sharedArray[threadIdx.x];
        }
        __syncthreads();
    }
    avg_potential[index] /= iteration + 1;
}
```
First of all, please use the code environment; that makes it much more readable for the forum users.
I didn't look at your code in that much detail, but what caught my attention is that you declare the shared memory inside a loop. I'm not sure that works correctly. Normally you declare the needed shared memory once and then just use it.
So the code should look like this:
```
// kernel
__global__
void averageCenterLinePotential( int arraysize, float *avg_potential, float *potential, int iteration )
{
    __shared__ float sharedArray[256];
    int index = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    if ( index >= arraysize ) return;
    avg_potential[index] *= iteration;
    int sharedarraysize = 256;
    for ( int i = 0; i < intDivUp( arraysize, sharedarraysize ); i++ ) {
        //__shared__ float sharedArray[256];
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            sharedArray[threadIdx.x] = potential[index];
        }
        __syncthreads();
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            avg_potential[index] += sharedArray[threadIdx.x];
        }
        __syncthreads();
    }
    avg_potential[index] /= iteration + 1;
}
```
mtrng:
Thanks for reformatting the code to something readable.
Chrizh:
Your code looks a bit weird in that each thread reads the same element of potential over and over again, and in that the "average" is divided by iteration, not by the actual number of values summed up. As going through shared memory is of no use in this example anyway, I assume the code is a stripped-down version of the actual problematic code.
What are you trying to achieve, and what do you expect this code to do?
Thank you, chrizh. This is my first post here. Thank you for reformatting the code.
To Tera:
Thank you for taking the time to read my code. Basically, I want to calculate the average value of potential (which is an array, because it stores the potential at different coordinates of a Cartesian coordinate system) over the iterations, and save the averaged numbers in avg_potential, which is also an array. I worried that a sum accumulated over the iterations might exceed the maximum value of a float, so I only store the averaged numbers. At the beginning of each iteration, each averaged number is multiplied by the current iteration count (an integer) to recover the sum. Then the freshly calculated potential value is added to the sum, and the sum is divided by (current iteration + 1), so that the saved value is again the averaged potential.
Inside the kernel, each thread is supposed to calculate the averaged potential at a set of different xyz coordinates.
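In formula form, with n being the iteration counter and p_(n+1) the freshly computed potential at a given point, the per-element update I described is the standard incremental mean:

    avg_(n+1) = ( n * avg_n + p_(n+1) ) / ( n + 1 )

so the (potentially overflowing) running sum is never actually stored.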
Ah OK, now I understand the multiplication/division by iteration/(iteration+1). Note though that in CUDA, unlike on x86 CPUs (where intermediate results can be held in wider 80-bit x87 registers), variables in registers are exactly the same width as variables in memory. So dividing by iteration does not give you any extra protection against overflow; it just means that you are working with an average and not the sum.
As I understand your code, each iteration generates one new set of values potential for the potential, and updates avg_potential from its old value to extend the average over the new values as well. That, however, is already achieved by this simpler kernel:
```
// kernel
__global__
void averageCenterLinePotential( int arraysize, float *avg_potential, float *potential, int iteration )
{
    int index = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    if ( index >= arraysize ) return;
    avg_potential[index] = ( iteration * avg_potential[index] + potential[index] ) / ( iteration + 1 );
}
```
What I do not yet understand is what the for loop in your kernel is supposed to iterate over. What is i there?
Yes, you are right. The code you posted is a simple and good solution to my problem. However, I am trying to use shared memory instead of global memory to improve the calculation speed. Since each block can process 256 threads at the same time, I am trying to load 256 values from the potential array into shared memory at once, with each thread reading one value from global memory. i is actually the id of the 256-element block of data that each thread reads from global memory.
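To illustrate what I had in mind with the staging (a hypothetical sketch, just to show the indexing; this is not the kernel above): in chunk i, each of the 256 threads of a block copies one element of that chunk from global into shared memory.

```
// Minimal sketch of the intended staging pattern (illustration only):
// the global index is built from the chunk counter i, not from the
// thread's fixed per-element index.
__global__ void stageChunksSketch( int arraysize, const float *potential )
{
    __shared__ float sharedArray[256];
    const int sharedarraysize = 256;
    for ( int i = 0; i < ( arraysize + sharedarraysize - 1 ) / sharedarraysize; i++ ) {
        int g = i * sharedarraysize + threadIdx.x;   // this thread's element of chunk i
        if ( g < arraysize )
            sharedArray[threadIdx.x] = potential[g]; // one coalesced read per thread
        __syncthreads();
        // ... here every thread of the block can read all 256 staged values ...
        __syncthreads();
    }
}
```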
This problem cannot be accelerated by using shared memory. Provided the pointers passed to the kernel are suitably aligned (which they are if you obtained them with cudaMalloc()), the memory accesses of all threads in a warp are already coalesced into 128-byte wide memory transactions.
Shared memory is for fast communication between the threads of a block; it does not accelerate global memory access.
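To give an example of where shared memory does pay off: whenever the threads of a block need each other's values, e.g. when collapsing the 256 values handled by a block into one partial sum. A minimal sketch (hypothetical example, assuming blockDim.x == 256, a power of two, and an input size that is an exact multiple of it):

```
// Block-wide sum via shared memory: threads cooperate through the shared
// array, which is exactly the inter-thread communication shared memory is for.
__global__ void blockSum( const float *in, float *blockSums )
{
    __shared__ float s[256];
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = in[index];
    __syncthreads();
    // Tree reduction: each step halves the number of active threads.
    for ( int stride = blockDim.x / 2; stride > 0; stride >>= 1 ) {
        if ( threadIdx.x < stride )
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if ( threadIdx.x == 0 )
        blockSums[blockIdx.x] = s[0];   // one partial sum per block
}
```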