parallel scan

Hi there,

I read the paper “Parallel Prefix Sum (Scan) with CUDA” by Mark Harris.

I tried the up-sweep phase with an array of 32 elements and block size 8. The kernel is mostly the same as the example in the paper except that I used statically allocated shared memory. See the code below.

[codebox]#include <iostream>
#include <cutil.h>

using namespace std;

__global__ void sumkernel(float *Array_d, int n)
{
	__shared__ float temp[2];
	int Idx = threadIdx.x;
	int offset = 1;

	temp[2 * Idx] = Array_d[blockIdx.x * blockDim.x + 2 * Idx];
	temp[2 * Idx + 1] = Array_d[blockIdx.x * blockDim.x + 2 * Idx + 1];

	for(int d = 8 >> 1; d > 0; d >>= 1)
	{
		__syncthreads();
		if(Idx < d)
		{
			int ai = offset * (2 * Idx + 1) - 1;
			int bi = offset * (2 * Idx + 2) - 1;
			temp[bi] = temp[bi] + temp[ai];
		}
		offset = offset * 2;
	}
	__syncthreads();

	Array_d[blockIdx.x * blockDim.x + 2 * Idx] = temp[2 * Idx];
	Array_d[blockIdx.x * blockDim.x + 2 * Idx + 1] = temp[2 * Idx + 1];
}

int main()
{
	int i = 0;
	const int n = 32;
	float ArrayIn[n];
	float sum = 0.0f;

	for(i = 0; i < n; i++)
	{
		ArrayIn[i] = (float)i / n;
		sum = sum + ArrayIn[i];
		cout << ArrayIn[i] << " ";
	}
	cout << endl;
	cout << "sum is " << sum << endl;

	float Array_h[n];
	float *Array_d = NULL;
	CUDA_SAFE_CALL(cudaMalloc((void **)&Array_d, n * sizeof(float)));
	CUDA_SAFE_CALL(cudaMemcpy(Array_d, ArrayIn, n * sizeof(float), cudaMemcpyHostToDevice));

	int blocksize = 8;
	int n_blocks = n / blocksize;
	sumkernel <<< n_blocks, blocksize >>> (Array_d, n);

	CUDA_SAFE_CALL(cudaMemcpy(Array_h, Array_d, n * sizeof(float), cudaMemcpyDeviceToHost));
	for(i = 0; i < n; i++)
		cout << Array_h[i] << " ";
	cout << endl;
	cout << "final sum is " << Array_h[7] + Array_h[15] + Array_h[23] + Array_h[31] << endl;

	CUDA_SAFE_CALL(cudaFree(Array_d));
}[/codebox]

This program rarely works: most of the time, the last block produces wrong results.

I also tried an array of 125 elements with a block size of 25. It didn’t work either. For convenience, I’m posting the 125-element code here too.

[codebox]#include <iostream>
#include <cutil.h>

using namespace std;

__global__ void sumkernel(float *Array_d, int n)
{
	__shared__ float temp[25];
	int Idx = threadIdx.x;
	int offset = 1;

	//for(int i = 0; i < 16; i++)
	//	temp[16 * Idx + i] = Array_d[16 * Idx + i];
	temp[5 * Idx] = Array_d[blockIdx.x * blockDim.x + 5 * Idx];
	temp[5 * Idx + 1] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 1];
	temp[5 * Idx + 2] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 2];
	temp[5 * Idx + 3] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 3];
	temp[5 * Idx + 4] = Array_d[blockIdx.x * blockDim.x + 5 * Idx + 4];

	for(int d = 5; d > 0; d /= 5)
	{
		__syncthreads();
		if(Idx < d)
		{
			//for(int j = 0; j < 16; j++)
			//	temp[offset * (16 * Idx + 16) - 1] = temp[offset * (16 * Idx + 16) - 1] + temp[offset * (16 * Idx + j + 1) - 1];
			int ai = offset * (5 * Idx + 1) - 1;
			int bi = offset * (5 * Idx + 2) - 1;
			int ci = offset * (5 * Idx + 3) - 1;
			int di = offset * (5 * Idx + 4) - 1;
			int ei = offset * (5 * Idx + 5) - 1;
			temp[ei] = temp[ei] + temp[di] + temp[ci] + temp[bi] + temp[ai];
		}
		offset = offset * 5;
	}
	__syncthreads();

	//for(int i = 0; i < 16; i++)
	//	Array_d[16 * Idx + i] = temp[16 * Idx + i];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx] = temp[5 * Idx];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 1] = temp[5 * Idx + 1];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 2] = temp[5 * Idx + 2];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 3] = temp[5 * Idx + 3];
	Array_d[blockIdx.x * blockDim.x + 5 * Idx + 4] = temp[5 * Idx + 4];
}

int main()
{
	const int n = 125;
	float ArrayIn[n];
	float sum = 0.0f;

	for(int i = 0; i < n; i++)
	{
		ArrayIn[i] = (float)i / n;
		sum = sum + ArrayIn[i];
	}
	cout << endl;
	cout << "sum is " << sum << endl;

	float Array_h[n];
	float *Array_d = NULL;
	CUDA_SAFE_CALL(cudaMalloc((void **)&Array_d, n * sizeof(float)));
	CUDA_SAFE_CALL(cudaMemcpy(Array_d, ArrayIn, n * sizeof(float), cudaMemcpyHostToDevice));

	int blocksize = 25;
	int n_blocks = n / blocksize;
	sumkernel <<< n_blocks, blocksize >>> (Array_d, n);

	CUDA_SAFE_CALL(cudaMemcpy(Array_h, Array_d, n * sizeof(float), cudaMemcpyDeviceToHost));
	for(int i = 0; i < n; i++)
		cout << Array_h[i] << " ";
	cout << endl;
	cout << "final sum is " << Array_h[24] + Array_h[49] + Array_h[74] + Array_h[99] + Array_h[124] << endl;

	CUDA_SAFE_CALL(cudaFree(Array_d));
}[/codebox]

I tried different sizes for “temp”; it didn’t help.

Can anybody help me solve this problem?

Also, when should I use “extern __shared__”, and how do I use it?

Any idea will be highly appreciated!!

Yuping

I think you misunderstand the declaration of shared memory arrays. Such an array is shared by the whole block. For example, in your first code snippet you create an array of size 2, but with a block size of 8 this

temp[2 * Idx]

means thread 7, for example, will read or write position 14. That is beyond the array bounds and might contain anything.

Declaring an array as extern is needed when the required amount of shared memory is only known when the kernel is launched, not at compile time. In that case you declare `extern __shared__ float temp[];` inside the kernel and pass the size in bytes as the third parameter of the launch configuration, e.g. `sumkernel<<<n_blocks, blocksize, nbytes>>>(Array_d, n);`.

I have not tried to understand your code in detail, but at first glance it does not look like a correct binary-tree scan/prefix sum. To get the prefix sum you need both the up-sweep and the down-sweep phase.
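For reference, here is a minimal sequential C++ sketch of the work-efficient (Blelloch) exclusive scan from the Harris paper, i.e. what one block should compute over its chunk. The function name `blelloch_scan` is mine; this is a CPU model of the algorithm, not CUDA code:

```cpp
#include <cassert>
#include <vector>

// Sequential model of the work-efficient (Blelloch) exclusive scan:
// an up-sweep (reduce) phase followed by a down-sweep phase.
// The array length n must be a power of two.
std::vector<float> blelloch_scan(std::vector<float> temp)
{
    const int n = (int)temp.size();

    // Up-sweep: build partial sums in a binary tree.
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i < n; i += 2 * stride)
            temp[i + 2 * stride - 1] += temp[i + stride - 1];

    // Down-sweep: clear the root, then push sums back down the tree,
    // swapping and accumulating at each node.
    temp[n - 1] = 0.0f;
    for (int stride = n / 2; stride >= 1; stride /= 2)
        for (int i = 0; i < n; i += 2 * stride) {
            float t = temp[i + stride - 1];
            temp[i + stride - 1] = temp[i + 2 * stride - 1];
            temp[i + 2 * stride - 1] += t;
        }
    return temp;
}
```

Scanning {1, 2, 3, 4, 5, 6, 7, 8} yields the exclusive prefix sums {0, 1, 3, 6, 10, 15, 21, 28}. In the CUDA version, each pass of the inner loops is done by the threads of a block in parallel, with a `__syncthreads()` between passes.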

I also tried temp sizes of 8 and 16; neither worked. I have to admit that I don’t clearly understand either shared memory or the parallel scan yet. I know shared memory is shared within a block, so I think there are no bank conflicts in the first program, but in the second program bank conflicts will happen in the loop.

Has anyone read the paper “Parallel Prefix Sum (Scan) with CUDA”? In the “Avoiding Bank Conflicts” section, the author says there can be up to 16-way bank conflicts when scanning a 512-element array. I don’t understand this: are the 512 elements in a single block or spread over 16 blocks? I cannot see how we could get 16-way bank conflicts either way.

By saying this, do you mean I must do both phases to get the sum? I thought I could just add up the last thread’s value from each block on the host side to get the total.

Thanks,

Yuping

If I have a shared array temp[8], does this mean there are 8 memory banks, so that when two threads in the same block access temp[0] and temp[8], a bank conflict occurs?
I know the shared memory of each block is split into 16 banks, but what is the actual number of banks when running a specific program?

Don’t worry about bank conflicts before knowing that what you do is actually correct :)

for(int d = 8 >> 1; d > 0; d >>= 1)
{
	__syncthreads();
	if(Idx < d)
	{
		int ai = offset * (2 * Idx + 1) - 1;
		int bi = offset * (2 * Idx + 2) - 1;
		temp[bi] = temp[bi] + temp[ai];
	}
	offset = offset * 2;
}

In this snippet, with a block size of 8, each thread handles two elements, so the initial loads temp[2 * Idx + 1] already go up to index 15; with temp declared at size 2 (or 8) you are far outside the array. And even with a size of 16, each block reads and writes Array_d starting at blockIdx.x * blockDim.x although it processes 2 * blockDim.x elements, so consecutive blocks overlap. The code looks very wrong. I suggest you start again. Maybe first read up a bit on the scan/prefix sum primitive. One doesn’t usually use it just to get the sum of an array (a parallel reduction would suffice in that case) but because one needs all elements of the prefix sum.

The code is very wrong: the shared memory array is too small. If you just use it as a temporary variable, do not use threadIdx.x as the index.

Physically, there are 16 banks, and shared memory accesses are issued in batches of 16 for each half-warp; thus there is no bank conflict when the 16 consecutive threads of a half-warp access 16 consecutive addresses in shared memory. So if a block accesses temp[0] and temp[8], but from threads 0 and 8 of the same half-warp, there is no bank conflict; otherwise, depending on the access pattern, the number of serialized accesses may vary.
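To make the bank arithmetic concrete, here is a small C++ helper (my own sketch, not code from the Programming Guide) that maps 32-bit word indices to the 16 banks of a pre-Fermi device and reports a half-warp’s worst-case conflict degree:

```cpp
#include <array>
#include <cassert>

// Shared memory has 16 banks, and consecutive 32-bit words map to
// consecutive banks: bank = word_index % 16.
// The conflict degree of a half-warp is the maximum number of its 16
// threads that hit the same bank; degree 1 means conflict-free.
int conflict_degree(const std::array<int, 16>& word_index)
{
    int count[16] = {0};
    for (int idx : word_index)
        ++count[idx % 16];
    int worst = 0;
    for (int c : count)
        if (c > worst) worst = c;
    return worst;
}
```

For example, threads accessing temp[2 * Idx] for Idx = 0..15 touch only the even banks, a 2-way conflict, while a stride-1 pattern gives degree 1. At the deeper levels of a tree scan the stride between threads grows to a multiple of 16, so all 16 threads of a half-warp hit the same bank; that is the 16-way case the paper describes.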

You’re declaring temp’s size as 25, but with a block size of 25 your Idx goes up to 24, so the loads temp[5 * Idx + 4] address up to temp[124]. How is that supposed to work?

Read the Programming Guide.

Thank you all for your replies!
As recommended, I used reduce6 from the reduction project in the SDK, and it worked.
I will also try the parallel scan for the other problem and will let you know.

Thanks,
Yuping