parallel scan

Hi there,

I read the paper “Parallel Prefix Sum (Scan) with CUDA” by Mark Harris.

I tried the up-sweep phase with an array of 32 elements and block size 8. The kernel is mostly the same as the example in the paper except that I used statically allocated shared memory. See the code below.

[codebox]#include <iostream>
#include <cutil.h>

using namespace std;

__global__ void sumkernel(float *Array_d, int n)
{
	__shared__ float temp[2];
	int Idx = threadIdx.x;
	int offset = 1;

	temp[2 * Idx] = Array_d[blockIdx.x * blockDim.x + 2 * Idx];
	temp[2 * Idx + 1] = Array_d[blockIdx.x * blockDim.x + 2 * Idx + 1];

	for(int d = 8 >> 1; d > 0; d >>= 1)
	{
		__syncthreads();
		if(Idx < d)
		{
			int ai = offset * (2 * Idx + 1) - 1;
			int bi = offset * (2 * Idx + 2) - 1;
			temp[bi] = temp[bi] + temp[ai];
		}
		offset = offset * 2;
	}
	__syncthreads();

	Array_d[blockIdx.x * blockDim.x + 2 * Idx] = temp[2 * Idx];
	Array_d[blockIdx.x * blockDim.x + 2 * Idx + 1] = temp[2 * Idx + 1];
}

int main()
{
	int i = 0;
	const int n = 32;
	float ArrayIn[n];
	float sum = 0.0f;

	for(i = 0; i < n; i++)
	{
		ArrayIn[i] = (float)i / n;
		sum = sum + ArrayIn[i];
		cout << ArrayIn[i] << " ";
	}
	cout << endl;
	cout << "sum is " << sum << endl;

	float Array_h[n];
	float *Array_d = NULL;
	CUDA_SAFE_CALL(cudaMalloc((void **)&Array_d, n * sizeof(float)));
	CUDA_SAFE_CALL(cudaMemcpy(Array_d, ArrayIn, n * sizeof(float), cudaMemcpyHostToDevice));

	int blocksize = 8;
	int n_blocks = n / blocksize;
	sumkernel <<< n_blocks, blocksize >>> (Array_d, n);

	CUDA_SAFE_CALL(cudaMemcpy(Array_h, Array_d, n * sizeof(float), cudaMemcpyDeviceToHost));
	for(i = 0; i < n; i++)
		cout << Array_h[i] << " ";
	cout << endl;
	cout << "final sum is " << Array_h[7] + Array_h[15] + Array_h[23] + Array_h[31] << endl;

	CUDA_SAFE_CALL(cudaFree(Array_d));
}[/codebox]

This program rarely works: most of the time, the last block produces wrong results.

I also tried an array of 125 elements with a block size of 25. It didn’t work either. For convenience, I’m posting the 125-element code here too.

[codebox]#include <iostream>
#include <cutil.h>

using namespace std;

__global__ void sumkernel(float *Array_d, int n)
{
	__shared__ float temp[25];
	int Idx = threadIdx.x;
	int offset = 1;

	//for(int i = 0; i < 16; i++)
	//	temp[16 * Idx + i] = Array_d[16 * Idx + i];
	temp[5 * Idx] = Array_d[blockIdx.x * blockDim.x + 5 * Idx];
	temp[5 * Idx + 1] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 1];
	temp[5 * Idx + 2] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 2];
	temp[5 * Idx + 3] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 3];
	temp[5 * Idx + 4] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 4];

	for(int d = 5; d > 0; d /= 5)
	{
		__syncthreads();
		if(Idx < d)
		{
			//for(int j = 0; j < 16; j++)
			//	temp[offset * (16 * Idx + 16) - 1] = temp[offset * (16 * Idx + 16) - 1] + temp[offset * (16 * Idx + j + 1) - 1];
			int ai = offset * (5 * Idx + 1) - 1;
			int bi = offset * (5 * Idx + 2) - 1;
			int ci = offset * (5 * Idx + 3) - 1;
			int di = offset * (5 * Idx + 4) - 1;
			int ei = offset * (5 * Idx + 5) - 1;
			temp[ei] = temp[ei] + temp[di] + temp[ci] + temp[bi] + temp[ai];
		}
		offset = offset * 5;
	}
	__syncthreads();

	//for(int i = 0; i < 16; i++)
	//	Array_d[16 * Idx + i] = temp[16 * Idx + i];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx] = temp[5 * Idx];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 1] = temp[5 * Idx + 1];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 2] = temp[5 * Idx + 2];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 3] = temp[5 * Idx + 3];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 4] = temp[5 * Idx + 4];
}

int main()
{
	const int n = 125;
	float ArrayIn[n];
	float sum = 0.0f;

	for(int i = 0; i < n; i++)
	{
		ArrayIn[i] = (float)i / n;
		sum = sum + ArrayIn[i];
	}
	cout << endl;
	cout << "sum is " << sum << endl;

	float Array_h[n];
	float *Array_d = NULL;
	CUDA_SAFE_CALL(cudaMalloc((void **)&Array_d, n * sizeof(float)));
	CUDA_SAFE_CALL(cudaMemcpy(Array_d, ArrayIn, n * sizeof(float), cudaMemcpyHostToDevice));

	int blocksize = 25;
	int n_blocks = n / blocksize;
	sumkernel <<< n_blocks, blocksize >>> (Array_d, n);

	CUDA_SAFE_CALL(cudaMemcpy(Array_h, Array_d, n * sizeof(float), cudaMemcpyDeviceToHost));
	for(int i = 0; i < n; i++)
		cout << Array_h[i] << " ";
	cout << endl;
	cout << "final sum is " << Array_h[24] + Array_h[49] + Array_h[74] + Array_h[99] + Array_h[124] << endl;

	CUDA_SAFE_CALL(cudaFree(Array_d));
}[/codebox]

I tried different sizes for “temp”; it didn’t help.

Can anybody help me solve this problem?

Also, when should I use “extern __shared__”, and how do I use it?

Any idea will be highly appreciated!!

Yuping

I think you misunderstand the declaration of shared memory arrays. Such an array is shared by the whole block. For example, in your first code snippet you create an array of size 2, but with a block size of 8 this

temp[2 * Idx]

means thread 7, for example, will read or write position 14. That is beyond the array bounds and might contain anything.

Declaring an array as extern is needed when the required amount of shared memory is only known when the kernel is launched, not at compile time. In that case you declare `extern __shared__ float temp[];` inside the kernel and pass the size in bytes as the third parameter of the launch configuration, e.g. `sumkernel<<<n_blocks, blocksize, nbytes>>>(Array_d, n);`.

I have not tried to understand your code in detail, but at first glance it does not look like a correct binary-tree scan/prefix sum. To get the prefix sum you need both the up-sweep and the down-sweep phase.
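For reference, here is a minimal sequential C++ sketch of the work-efficient (Blelloch) exclusive scan from the Harris paper, i.e. what one block should compute over its chunk. The function name `blelloch_scan` is mine; this is a CPU model of the algorithm, not CUDA code:

```cpp
#include <cassert>
#include <vector>

// Sequential model of the work-efficient (Blelloch) exclusive scan:
// an up-sweep (reduce) phase followed by a down-sweep phase.
// The array length n must be a power of two.
std::vector<float> blelloch_scan(std::vector<float> temp)
{
    const int n = (int)temp.size();

    // Up-sweep: build partial sums in a binary tree.
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i < n; i += 2 * stride)
            temp[i + 2 * stride - 1] += temp[i + stride - 1];

    // Down-sweep: clear the root, then push sums back down the tree,
    // swapping and accumulating at each node.
    temp[n - 1] = 0.0f;
    for (int stride = n / 2; stride >= 1; stride /= 2)
        for (int i = 0; i < n; i += 2 * stride) {
            float t = temp[i + stride - 1];
            temp[i + stride - 1] = temp[i + 2 * stride - 1];
            temp[i + 2 * stride - 1] += t;
        }
    return temp;
}
```

Scanning {1, 2, 3, 4, 5, 6, 7, 8} yields the exclusive prefix sums {0, 1, 3, 6, 10, 15, 21, 28}. In the CUDA version, each pass of the inner loops is done by the threads of a block in parallel, with a `__syncthreads()` between passes.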

I also tried temp sizes of 8 and 16; neither worked. I have to admit that I don’t clearly understand either shared memory or the parallel scan yet. I know shared memory is shared within a block, so I think there are no bank conflicts in the first program, but in the second program bank conflicts will happen in the loop.

Has anyone read the paper “Parallel Prefix Sum (Scan) with CUDA”? In the “Avoiding Bank Conflicts” section, the author says there can be up to 16-way bank conflicts when scanning a 512-element array. I don’t understand this: are the 512 elements in a single block or spread over 16 blocks? I cannot see how we could get 16-way bank conflicts either way.

By saying this, do you mean I must do both phases to get the sum? I thought I could just add up the last thread’s value from each block on the host side to get the total.

Thanks,

Yuping

If I have a shared array temp[8], does this mean there are 8 memory banks, so that when two threads in the same block access temp[0] and temp[8], a bank conflict occurs?
I know the shared memory of each block is split into 16 banks, but what is the actual number of banks when running a specific program?

Don’t worry about bank conflicts before knowing that what you do is actually correct :)

for(int d = 8 >> 1; d > 0; d >>= 1)
{
	__syncthreads();
	if(Idx < d)
	{
		int ai = offset * (2 * Idx + 1) - 1;
		int bi = offset * (2 * Idx + 2) - 1;
		temp[bi] = temp[bi] + temp[ai];
	}
	offset = offset * 2;
}

In this snippet, with a block size of 8, each thread handles two elements, so the initial loads temp[2 * Idx + 1] already go up to index 15; with temp declared at size 2 (or 8) you are far outside the array. And even with a size of 16, each block reads and writes Array_d starting at blockIdx.x * blockDim.x although it processes 2 * blockDim.x elements, so consecutive blocks overlap. The code looks very wrong. I suggest you start again. Maybe first read up a bit on the scan/prefix sum primitive. One doesn’t usually use it just to get the sum of an array (a parallel reduction would suffice in that case) but because one needs all elements of the prefix sum.

The code is very wrong: the shared memory array is too small. If you just use it as a temporary variable, do not use threadIdx.x as the index.

Physically, there are 16 banks, and shared memory accesses are issued in batches of 16 for each half-warp; thus there is no bank conflict when the 16 consecutive threads of a half-warp access 16 consecutive addresses in shared memory. So if a block accesses temp[0] and temp[8], but from threads 0 and 8 of the same half-warp, there is no bank conflict; otherwise, depending on the access pattern, the number of serialized accesses may vary.
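To make the bank arithmetic concrete, here is a small C++ helper (my own sketch, not code from the Programming Guide) that maps 32-bit word indices to the 16 banks of a pre-Fermi device and reports a half-warp’s worst-case conflict degree:

```cpp
#include <array>
#include <cassert>

// Shared memory has 16 banks, and consecutive 32-bit words map to
// consecutive banks: bank = word_index % 16.
// The conflict degree of a half-warp is the maximum number of its 16
// threads that hit the same bank; degree 1 means conflict-free.
int conflict_degree(const std::array<int, 16>& word_index)
{
    int count[16] = {0};
    for (int idx : word_index)
        ++count[idx % 16];
    int worst = 0;
    for (int c : count)
        if (c > worst) worst = c;
    return worst;
}
```

For example, threads accessing temp[2 * Idx] for Idx = 0..15 touch only the even banks, a 2-way conflict, while a stride-1 pattern gives degree 1. At the deeper levels of a tree scan the stride between threads grows to a multiple of 16, so all 16 threads of a half-warp hit the same bank; that is the 16-way case the paper describes.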

You’re declaring temp’s size as 25, but with a block size of 25 your Idx goes up to 24, so the loads temp[5 * Idx + 4] address up to temp[124]. How is that supposed to work?

Read the Programming Guide.

Thank you all for your replies!
As recommended, I used reduce6 from the reduction project in the SDK, and it worked.
I will also try the parallel scan for the other problem and will let you know.

Thanks,
Yuping