Naive prefix sum algorithm from GPU gems not working

reikdas · February 2, 2020, 4:43pm

I am referring to the sample code given in Example 39-1 of "Chapter 39. Parallel Prefix Sum (Scan) with CUDA", which I have copied over.

This is my sample code to test it:

#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#include <iostream>

__global__
void scan(float *g_odata, float *g_idata, int n) {
	extern __shared__ float temp[]; // allocated on invocation
	int thid = threadIdx.x;
	int pout = 0, pin = 1;
	// load input into shared memory.
	// Exclusive scan: shift right by one and set first element to 0
	temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
	__syncthreads();
	for (int offset = 1; offset < n; offset *= 2)
	{
		pout = 1 - pout; // swap double buffer indices
		pin = 1 - pout;
		if (thid >= offset)
			temp[pout * n + thid] += temp[pin * n + thid - offset];
		else
			temp[pout * n + thid] = temp[pin * n + thid];
		__syncthreads();
	}
	g_odata[thid] = temp[pout * n + thid]; // write output
}

int main() {
	float output[6];
	float input[] = { 1, 2, 3, 4, 5 };
	float* d_output, * d_input;
	cudaMalloc((void**)&d_output, 6 * sizeof(float));
	cudaMalloc((void**)&d_input, 5 * sizeof(float));
	cudaMemcpy(d_input, input, 5 * sizeof(float), cudaMemcpyHostToDevice);
	scan<<<1, 6>>>(d_output, d_input, 5);
	cudaMemcpy(output, d_output, 6 * sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(d_output);
	cudaFree(d_input);
	for (int i = 0; i < 6; i++) {
		std::cout << output[i] << "\n";
	}
	return 0;
}

But this does not work. The output I get is:

-2.15633e-10
4.59037e-41
4.34403e-44
0
1.4013e-45
0

What is going wrong?

Robert_Crovella · February 2, 2020, 7:10pm

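There may be several problems. However, one is that you have provided no shared memory allocation in your kernel launch. The kernel declares extern __shared__ float temp[], so the size of that array in bytes has to be passed as the third kernel launch configuration parameter; without it, every access to temp is out of bounds.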
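As a rough sketch of what supplying the shared allocation could look like, the block below passes 2 * n * sizeof(float) as the third launch parameter, since the kernel indexes two buffers of n floats each. It makes two further assumptions that go beyond the reply above: the kernel is launched with exactly n threads (the original launches 6 threads with n = 5, which indexes past the first buffer), and the update line reads both operands from the input buffer rather than using +=, which is the issue discussed in the linked Stack Overflow question. The names scan_fixed and exclusive_scan_host are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Naive double-buffered exclusive scan for ONE block.
// Assumption: launched with exactly n threads and 2 * n floats of
// dynamic shared memory.
__global__ void scan_fixed(float *g_odata, const float *g_idata, int n) {
    extern __shared__ float temp[];  // sized by the third launch parameter
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Exclusive scan: shift right by one and set the first element to 0.
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;  // swap double buffer indices
        pin = 1 - pout;
        if (thid >= offset)
            // Both operands come from the input buffer (assumed fix for +=).
            temp[pout * n + thid] =
                temp[pin * n + thid] + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout * n + thid];
}

// Hypothetical host helper: scans n floats (n must fit in one block).
bool exclusive_scan_host(const float *in, float *out, int n) {
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaError_t err = cudaMemcpy(d_in, in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        size_t shmem = 2 * bytes;  // two buffers of n floats
        scan_fixed<<<1, n, shmem>>>(d_out, d_in, n);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Called with the input { 1, 2, 3, 4, 5 } and n = 5, this sketch should produce the exclusive scan 0, 1, 3, 6, 10. If the total (15) is also wanted as a sixth output, one option is to launch with n = 6, because the shifted load for the last thread then reads input element 4.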

Before asking for help here, my suggestion is to use proper CUDA error checking and run your code with cuda-memcheck. If you had done so, cuda-memcheck would have reported an error that would have at least focused your attention on shared memory access. Even if you don’t understand the error output, it is useful for others who may try to help you.
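One common way to do the error checking mentioned above is to wrap every runtime call in a small macro and to query the error state after each kernel launch. The following is only a minimal sketch: CUDA_CHECK, touch and run_checked are hypothetical names, and the kernel is a placeholder rather than the scan from the question.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper macro: abort with file and line on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel, used only to demonstrate the checking pattern.
__global__ void touch(float *p) { p[threadIdx.x] = 1.0f; }

int run_checked() {
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc((void **)&d, 32 * sizeof(float)));
    touch<<<1, 32>>>(d);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

The memory checker is run from the command line on the compiled executable, for example cuda-memcheck ./a.out (the executable name here is an assumption). On recent CUDA toolkits the equivalent tool is compute-sanitizer.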

This may also be of interest:

https://stackoverflow.com/questions/30832033/is-prefix-scan-cuda-sample-code-in-gpugems3-correct
