How to reverse a list in CUDA

Here is some of my code:

__int64 i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < kSize)
{
	da_reverse[i] = d_a[kSize - 1 - i];
}
I tried to output da_reverse, and the result turns out to be wrong: the values are all zeros.
The following is my entire code. Can you give me a hand?

__global__ void firls(__int64 freqSize, double* d_freq, __int64 amplitudeSize, double* d_amplitude, __int64 weightSize, double* d_weight, __int64 kSize, double* d_k, double* d_b, double* d_a, double* d_coefficients, double* da_reverse)
{
	__int64 i = threadIdx.x + blockDim.x * blockIdx.x;
	if (i < weightSize)     // weightSize = 2;
	{
		d_weight[i] = 1;
	}
	if (i < freqSize)   // freqSize = 4;
	{
		d_freq[i] = d_freq[i] * 0.5;
	}

	__int64 filterLength = (kSize - 1) * 2 + 1;
	__int64 Nodd = filterLength % 2;
	double b0 = 0;

	if (Nodd == 0)
	{
		if (i < kSize)
		{
			d_k[i] = i + 0.5;
			d_b[i] = 0;
		}
	}
	else
	{
		if (i < kSize)
		{
			d_k[i] = i;
			d_b[i] = 0;
		}
	}
	for (int j = 0; j < freqSize; j += 2)
	{
		double slope = (d_amplitude[j + 1] - d_amplitude[j]) / (d_freq[j + 1] - d_freq[j]);
		double b1 = d_amplitude[j] - slope * d_freq[j];
		if (Nodd == 1)
		{
			b0 += (b1 * (d_freq[j + 1] - d_freq[j])) +
				slope / 2.0 * (d_freq[j + 1] * d_freq[j + 1] - d_freq[j] * d_freq[j]) *
				abs(d_weight[(j + 1) / 2] * d_weight[(j + 1) / 2]);      //  wt = abs(sqrt(complex(weight)));
		}
		if (i < kSize)
		{
			d_b[i] += (slope / (4 * PI * PI) *
				(cos(2 * PI * d_k[i] * d_freq[j + 1]) - cos(2 * PI * d_k[i] * d_freq[j])) / (d_k[i] * d_k[i])) *
				abs(d_weight[(j + 1) / 2] * d_weight[(j + 1) / 2]);
			d_b[i] += (d_freq[j + 1] * (slope * d_freq[j + 1] + b1) * sinc(2 * d_k[i] * d_freq[j + 1]) -
				d_freq[j] * (slope * d_freq[j] + b1) * sinc(2 * d_k[i] * d_freq[j])) *
				abs(d_weight[(j + 1) / 2] * d_weight[(j + 1) / 2]);
		}
	}

	if (Nodd == 1)
	{
		d_b[0] = b0;
	}

	if (i < kSize)
	{
		d_a[i] = d_weight[0] * d_weight[0] * 4 * d_b[i];    // up to here the result of d_a is still correct
	}

	if (i < kSize)
	{
		da_reverse[i] = d_a[kSize - 1 - i];
		d_coefficients[i] = da_reverse[i];    // d_coefficients is the result I send back to the host.
	}
}

Here is the code in my main function:

__int64 length = 2 * n * maxFactor + 1;

	dim3 ThreadPerBlock(512);
	dim3 GridSize((length + ThreadPerBlock.x - 1) / ThreadPerBlock.x );

	double firlsFreq = 1.0 / 2.0 / (double)maxFactor;  
	double firlsFreqs[] = { 0.0, 2.0 * firlsFreq, 2.0 * firlsFreq, 1.0 }; 
	double* firlsFreqsV = (double*)malloc(4 * sizeof(double));
	for (__int64 i = 0; i < 4; ++i)
		firlsFreqsV[i] = firlsFreqs[i];              

	double firlsAmplitude[] = { 1.0, 1.0, 0.0, 0.0 };
	double* firlsAmplitudeV = (double*)malloc(4 * sizeof(double));
	for (__int64 i = 0; i < 4; ++i)
		firlsAmplitudeV[i] = firlsAmplitude[i];

	double* coefficients = (double*)malloc(length * sizeof(double));


	double* d_weight;
	double* d_k;
	double* d_b;
	double* d_a;
	double* d_freq;
	double* d_amplitude;
	double* d_coefficients;
	double* da_reverse;

	__int64 freqSize = 4;
	__int64 amplitudeSize = 4;
	__int64 weightSize = 2;
	__int64 kSize = length / 2 + 1;


	cudaMalloc((void**)&d_weight, weightSize * sizeof(double));
	cudaMalloc((void**)&d_k, kSize * sizeof(double));
	cudaMalloc((void**)&d_b, kSize * sizeof(double));
	cudaMalloc((void**)&d_a, kSize * sizeof(double));
	cudaMalloc((void**)&d_freq, freqSize * sizeof(double));
	cudaMalloc((void**)&d_amplitude, amplitudeSize * sizeof(double));
	cudaMalloc((void**)&d_coefficients, length * sizeof(double));
	cudaMalloc((void**)&da_reverse, kSize * sizeof(double));

	cudaMemcpy(d_freq, firlsFreqsV, freqSize * sizeof(double), cudaMemcpyHostToDevice);
	cudaMemcpy(d_amplitude, firlsAmplitudeV, amplitudeSize * sizeof(double), cudaMemcpyHostToDevice);

	firls<<<GridSize, ThreadPerBlock>>>(freqSize, d_freq, amplitudeSize, d_amplitude, weightSize, d_weight, kSize, d_k, d_b, d_a, d_coefficients, da_reverse);

	cudaMemcpy(coefficients, d_coefficients, length * sizeof(double), cudaMemcpyDeviceToHost);

	for (int i = 0; i < 1000; i++)
	{
		cout <<setprecision(20)<< coefficients[i] << "\t" <<  i << endl;
	}

It turns out the output is all zeros.

If you just need the job done, consider switching to the Thrust library. It offers vector-like containers for host and device data arrays, and it provides a reverse() algorithm that can run fully parallelized.

https://thrust.github.io/doc/namespacethrust_aea54064035cc729c455783d47ade6c3d.html
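For example, here is a minimal sketch of reversing a device array with Thrust (the array contents are just placeholder values):

#include <thrust/device_vector.h>
#include <thrust/reverse.h>
#include <thrust/copy.h>
#include <iostream>

int main()
{
	double h_a[] = { 1.0, 2.0, 3.0, 4.0, 5.0 };   // placeholder data

	// copy the host data into a device vector
	thrust::device_vector<double> d_a(h_a, h_a + 5);

	// reverse the device vector in place; Thrust runs this as a parallel kernel on the GPU
	thrust::reverse(d_a.begin(), d_a.end());

	// copy the result back to the host and print it
	thrust::copy(d_a.begin(), d_a.end(), h_a);
	for (int i = 0; i < 5; ++i)
		std::cout << h_a[i] << " ";
	std::cout << std::endl;
	return 0;
}

There is also thrust::reverse_copy() if you want to keep the original order and write the reversed sequence into a second array.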


But if I want to put the “reverse part” inside a __global__ function, is it OK to use the Thrust library there?

I found out that it can only run on the host, but it is still a good method. Many thanks, hh


When posting code on these forums, please format it correctly. A simple method to do that is as follows:

  • edit your post by clicking the pencil icon underneath it.
  • select the code
  • click the </> button at the top of the edit window
  • save your changes

Please do that now.

I also recommend using proper CUDA error checking (you can google for that) and also running your code with compute-sanitizer.
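For example, a minimal error-checking pattern could look like this (the CUDA_CHECK macro name is just an arbitrary placeholder):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// abort with a readable message if a CUDA runtime call returns an error
#define CUDA_CHECK(call)                                            \
	do {                                                            \
		cudaError_t err_ = (call);                                  \
		if (err_ != cudaSuccess) {                                  \
			fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",         \
				cudaGetErrorString(err_), __FILE__, __LINE__);      \
			exit(EXIT_FAILURE);                                     \
		}                                                           \
	} while (0)

// usage:
// CUDA_CHECK(cudaMalloc((void**)&d_a, kSize * sizeof(double)));
// firls<<<GridSize, ThreadPerBlock>>>(...);
// CUDA_CHECK(cudaGetLastError());       // reports launch/configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

Wrapping every runtime API call this way, plus checking cudaGetLastError() and cudaDeviceSynchronize() after each kernel launch, makes most problems show up immediately instead of as silently wrong output.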

Hello Robert, thanks for your advice. I will take care of it from now on. I also want to ask why, every time I run my CUDA code, I get a different answer that is sometimes right and sometimes wrong. I don't know what is wrong with my code. Here is my code:

__global__ void firls(__int64 freqSize, double* d_freq, __int64 amplitudeSize, double* d_amplitude, __int64 weightSize, double* d_weight, __int64 kSize, double* d_k, double* d_b, double* d_a, double* d_coefficients)
{
	__int64 i = threadIdx.x + blockDim.x * blockIdx.x;
	if (i < weightSize)     // weightSize = 2;
	{
		d_weight[i] = 1;
	}
	if (i < freqSize)   // freqSize = 4;
	{
		d_freq[i] = d_freq[i] * 0.5;
	}

	__int64 filterLength = (kSize - 1) * 2 + 1;
	__int64 Nodd = filterLength % 2;
	double b0 = 0;

	if (Nodd == 0)
	{
		if (i < kSize)
		{
			d_k[i] = i + 0.5;
			d_b[i] = 0;
		}
	}
	else
	{
		if (i < kSize)
		{
			d_k[i] = i + 1;
			d_b[i] = 0;
		}
		d_k[kSize - 1] = kSize - 1;
	}
	for (__int64 j = 0; j < freqSize; j += 2)
	{
		double slope = (d_amplitude[j + 1] - d_amplitude[j]) / (d_freq[j + 1] - d_freq[j]);
		double b1 = d_amplitude[j] - slope * d_freq[j];
		if (Nodd == 1)
		{
			b0 =b0 + (b1 * (d_freq[j + 1] - d_freq[j])) +
				slope / 2.0 * (d_freq[j + 1] * d_freq[j + 1] - d_freq[j] * d_freq[j]) *
				abs(d_weight[(j + 1) / 2] * d_weight[(j + 1) / 2]);      //  wt = abs(sqrt(complex(weight)));
		}
		if (i < kSize)
		{
			d_b[i] = d_b[i] + (slope / (4 * PI * PI) *
				(cos(2 * PI * d_k[i] * d_freq[j + 1]) - cos(2 * PI * d_k[i] * d_freq[j])) / (d_k[i] * d_k[i])) *
				abs(d_weight[(j + 1) / 2] * d_weight[(j + 1) / 2]);
			d_b[i] = d_b[i] + (d_freq[j + 1] * (slope * d_freq[j + 1] + b1) * sinc(2 * d_k[i] * d_freq[j + 1]) -
				d_freq[j] * (slope * d_freq[j] + b1) * sinc(2 * d_k[i] * d_freq[j])) *
				abs(d_weight[(j + 1) / 2] * d_weight[(j + 1) / 2]);
		}
	}


	if (Nodd == 1)
	{
		if (i > 0 && i < kSize)
		{
			d_b[i] = d_b[i - 1];
		}
		d_b[0] = b0;
	}
	
	if (i < kSize)
	{
		d_a[i] = d_weight[0] * d_weight[0] * 4 * d_b[i];
	}
}

I would appreciate it if you could help me out!

I think the problem is about the rules of CUDA rather than the logic of my code, like thread synchronization or thread indexing, but I don't know how to fix it.

Sorry, Robert, I may not have understood your point correctly. I am not a native speaker, so I may have misunderstood what you said. Actually, I rewrote the topic I posted (and changed the format of the code as you said), and I truly didn't know how I had offended you. So once again I send my apology; I hope you can accept it. If my format is still not right, can you give me an example link so I can learn how to ask a question properly on the forum? And if there is still something I did that annoyed you, please tell me directly; I will change it right away. I hope this issue will not affect our interactions later. Wish you a nice day.

As for the problem, I am now studying the online course and will try to resolve it myself.

if (i > 0 && i < kSize)
{
	d_b[i] = d_b[i - 1];
}

You need to make sure that thread i-1 stored its results to d_b before you can access them in thread i.


I used “__syncthreads();” above this part of the code to fix the problem. You are right, thank you, bro!

__syncthreads() is not correct in this context. Syncthreads only synchronizes threads within a block, but the first thread of block N needs to wait for the last thread of block N-1.

Yes, you are right. I forgot about this. But it's weird that I got the right results. If I want to do it right, can you give an example of how to fix this problem?


I’m not offended; this happens quite frequently on these forums. I’m used to it, and there is no need to apologize. Or if you prefer, apology accepted.

People are entitled to do more or less as they wish when trying to fix their code, including ignoring me. But if someone is going to ignore me, I won’t be able to help them. That seems self-evident, but it’s OK if you disagree; it’s my view. So when I see that happening, I stop responding. The only debatable topic is whether I should mention anything at that point. I’m of two minds about it, but I think the smarter move is just to be silent. Which is what I eventually tried to do, but not very well.

I am really happy that you are not annoyed. People like you are truly helpful and deserve to be admired. Anyway, if I did something wrong, you can tell me directly, and I will correct it right away. I really want to learn CUDA well, which may need your support sometimes. So thank you again, Robert.
I also wonder if there are updated videos for advanced CUDA courses, or other tutorials that would let me learn CUDA more deeply. Can you recommend some to me?

The CUDA training course I typically recommend is this one.


Split your kernel into two parts. Do the following in a second kernel that is launched after the first kernel in the same stream. Since the second kernel will not execute before the first one has finished, you can be sure that all values of d_b are set.

	if (Nodd == 1)
	{
		if (i > 0 && i < kSize)
		{
			d_b[i] = d_b[i - 1];
		}
		d_b[0] = b0;
	}
	
	if (i < kSize)
	{
		d_a[i] = d_weight[0] * d_weight[0] * 4 * d_b[i];
	}
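For example, a minimal sketch of what the second kernel could look like (the kernel names and parameter list are just placeholders, and it assumes b0 is made available to the second kernel, e.g. computed on the host or stored to device memory by the first kernel). To avoid a similar read/write race on d_b inside the second kernel itself, the shift is folded into the write of d_a instead of being done in place:

__global__ void firls_finalize(__int64 kSize, __int64 Nodd, double b0,
	double* d_weight, double* d_b, double* d_a)
{
	__int64 i = threadIdx.x + blockDim.x * blockIdx.x;
	if (i >= kSize)
		return;

	// d_b was completely written by the first kernel, so it is safe to read here.
	// Reading d_b and writing only d_a means no thread overwrites a value
	// another thread still needs.
	double bi;
	if (Nodd == 1)
		bi = (i == 0) ? b0 : d_b[i - 1];
	else
		bi = d_b[i];

	d_a[i] = d_weight[0] * d_weight[0] * 4 * bi;
}

// launched after the first kernel in the same (default) stream:
// firls_part1<<<GridSize, ThreadPerBlock>>>( /* ... */ );
// firls_finalize<<<GridSize, ThreadPerBlock>>>(kSize, Nodd, b0, d_weight, d_b, d_a);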

OK, I will give it a try, many thanks!
