Is it possible to do infinite loops in CUDA 5?

ChoCChoK · August 21, 2020, 7:53am

Hello everyone,

I’m using CUDA 5 in C++ with a GTX 950M on Windows 10.

I tried to create an infinite loop using the tricky “volatile” keyword but it seems it does not work at all.

Here is the code to reproduce the thing:

global void wait()
{
volatile int infini = 1;
while(infini);
}

int main(void)
{
wait<<<4, 2>>>();
}

The execution just stopped instantly.

Why doesn’t it work? I don’t understand. Is this trick not working on CUDA 5?

Just to be clear, it’s only testing purpose.

Thanks a lot,

ChoCChoK.

njuffa · August 21, 2020, 10:22am

As part of its optimizations, the compiler performs dead code elimination. Code that does not (at least potentially) modify global state has no user-visible effect at kernel termination time and is considered “dead”. To prevent that from happening, have the code compute some piece of data that is ultimately written out to global memory.

ChoCChoK · August 21, 2020, 10:32am

Thanks a lot for the answer! I tried this:

global void wait(double* val)
{
volatile int infini = 1;
while(infini)
{
(*val) = 1;
}
}

int main(void)
{
double *val, d_val;
val = (double)malloc(sizeof(double));
val[0] = 0;
cudaMalloc(&d_val, sizeof(double));
cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);
wait<<<4, 2>>>(d_val);
}

But it does not work either… What kind of operations (if possible, the least memory and speed consuming) in global memory would work in my case?

Thanks again,

ChoCChoK.

ChoCChoK · August 21, 2020, 10:43am

I even tried something like that:

__global__ void wait(double* val)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	volatile int infini = 1;
	while(infini)
	{
		(*val) = sqrt((double)i + clock());
	}
}

And it does not work either…

ChoCChoK.

njuffa · August 21, 2020, 11:23am

Another optimization: When a loop body is independent of the loop control, the entire loop can be replaced by zero or one instances of the loop body. The use of clock does not change that, because there is no dependency of clock on infini.

One way to retain the loop may be to accumulate a sum inside the loop, and write back the sum to global memory after the loop. Not sure what conclusions the compiler is allowed to draw from a readily recognized infinite loop. There may be logic that says: Since the loop is infinite, it will never write back the sum to global memory, therefore it will never change global state, therefore the same effect is achieved by doing nothing. If that hypothesis holds, you may need to hide the fact that the loop is infinite.

What are you ultimately trying to accomplish by constructing an infinite loop? Would an extremely long-running loop be sufficient?

ChoCChoK · August 21, 2020, 12:27pm

Thank you very much for your answer.

In fact, I’m creating a sort of “traffic policeman” class that handles accesses to memory, because I have a huge array that I can’t duplicate in the memory (300MB) and each thread (excepted the 0) calculates some values to put in the big array. Thus, the thread 0 is the “traffic policeman” handling function, that is constitued by a almost-infinite loop that gives priority to one thread or another based on their asking. Another almost-infinite loop is found in each thread, when the thread seeks to add values to the big array, waiting for the “traffic policeman” class to give him the priority.

I’m not sure I’m clear, please tell me if any further information would help.

Thanks again,

ChoCChoK.

FangQ · August 21, 2020, 3:51pm

please check out this thread and sample by @Robert_Crovella.

the idea is pretty similar, using a pinned memory to communicate between GPU and host, the difference is the above example we write in device, and read in host; in your case, you want your host to write and read form device. I suppose you should use cudaHostAllocWriteCombined | cudaHostAllocMapped when calling cudaHostAlloc

http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/group__CUDART__MEMORY_g217d441a73d9304c6f0ccc22ec307dba.html

ChoCChoK · August 21, 2020, 4:08pm

Thanks FangQ.

In fact, all my threads are on device side… I don’t handle this on host side. So it’s not necessary, does it?

ChoCChoK.

FangQ · August 21, 2020, 5:15pm

I thought you were trying to create an “indefinitely repeated loop”, until the host sends you a trigger to stop and return.

so you just want to let the kernel to loop to continue “infinitely” without a stopping criterion? I don’t really see a use-case for this, can you confirm if this is what you wanted?

ChoCChoK · August 21, 2020, 5:19pm

No, you’re right: it will be stopped, but not by the host, but by one of the GPU thread (let’s say the thread 0) which would execute a “handle” function of the policeman class. Would a practical sample of code help you understand? Please tell me if it is the case.

Thanks again,

ChoCChoK.

ChoCChoK · August 21, 2020, 5:37pm

Here is a complete example of “policeman” struct that does not work to me:

#include <stdio.h>
#include <iostream>
#include <iomanip>

#define NTH 2 // Number of threads
#define N   1<<5 // Number of total calculations

using namespace std;

struct policeman
{
	bool deja;
	int inThere[N]; // Is equal to 0 if the calculation (i) is not waiting in queue, 1 if it's waiting and 2 if it's his turn
	bool ends[N]; // Is equal to "true" if the calculation has ended
	int i;

	__host__ __device__ void init() // Initialization
	{
		deja = false;
		i = 0;
		for(int i = 1; i < N; i++)
			inThere[i] = 0;
		for(int i = 0; i < N; i++)
			ends[i] = false;
	}

	__host__ __device__ void end(int const& n) // Called when a thread has finished
	{
		ends[n] = true;
	}

	__host__ __device__ int const& getIT(int const& n) // Called to retrieve the value of inThere
	{
		return inThere[n];
	}

	__host__ __device__ void canI(int const& n) // Called for entering the queue
	{
		inThere[n] = 1;
	}

	__host__ __device__ void done(int const& n) // Called when calculation is done
	{
		inThere[n] = 0;
	}

	__host__ __device__ void handle() // Called by the thread 0, to handle accesses to memory
	{
		volatile int infinity = 1;
		while(infinity)
		{
			ends[0] = true;
			for(i = 1; i < N; i++) // If all calculations are done, finish this thread too
				if(!ends[i])
				{
					i = N;
					infinity = 0;
				}
			deja = false;
			for(i = 1; i < N; i++) // If someone already access to the memory, "deja" will be true
				if(inThere[i] == 2)
				{
					i = N;
					deja = true;
				}
			if(!deja) // If noone is accessing the memory
				for(i = 1; i < N; i++)
					if(inThere[i] == 1)
					{
						i = N;
						inThere[i] = 2; // Give the access to the memory to the first thread encountered
					}
		}
	}
};

__global__
void saxpy(int n, policeman *x, double *val)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i == 0) // If it's the thread 0 of the block 0, handle
	{
		x->handle();
	}
	else if (i < n && threadIdx.x != 0) // Else...
	{
		x->canI(i); // Ask for permission...
		volatile int infinity = 1;
		while(infinity) // ... until he gets it, and then...
			if(x->getIT(i) == 2)
				infinity = 0;
		(*val) += 1; // ... do the calculation and...
		x->done(i); // ... tell it!
	}
	if (i != 0 && i < n)
		x->end(i);
}

int main(void)
{
	policeman *x, *d_x;
	double *val, *d_val;

	clock_t start, end;

	val = (double*)malloc(sizeof(double));
	val[0] = 0;

	x = (policeman*)malloc(sizeof(policeman));
	x->init();
  
    start = clock();

	cudaMalloc((void**)&d_x, sizeof(policeman));
	cudaMalloc(&d_val, sizeof(double));

	cudaMemcpy(d_x, x, sizeof(policeman), cudaMemcpyHostToDevice);
	cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);

	saxpy<<<(N+NTH-1)/NTH, NTH>>>(N, d_x, d_val);

	cudaMemcpy(x, d_x, sizeof(policeman), cudaMemcpyDeviceToHost);
	cudaMemcpy(val, d_val, sizeof(double), cudaMemcpyDeviceToHost);

    end = clock();
  
    double time_taken = double(end - start) / double(CLOCKS_PER_SEC);
    cout << "Time taken by program is : " << fixed  
         << time_taken << setprecision(5);
    cout << " sec : " << val[0] << " " << x->i << endl;
	
	cudaFree(d_x);
	cudaFree(d_val);
	free(x);
	free(val);
}

For a dark reason that I don’t understand, the infinite loop from the “handle” function seems to work, but not the other one directly in the kernel (but I might be wrong).

Thanks a lot again,

ChoCChoK.

ChoCChoK · August 23, 2020, 1:22pm

Hello everyone.

To put this simpler and shorter to read, I re-wrote a little the code:

#include <stdio.h>
#include <iostream>
#include <iomanip>

#define NTH 2 // Number of threads
#define N   1<<5 // Number of total calculations

using namespace std;

struct policeman
{
	bool deja;
	int inThere[N]; // Is equal to 0 if the calculation (i) is not waiting in queue, 1 if it's waiting and 2 if it's his turn
	bool ends[N]; // Is equal to "true" if the calculation has ended
	int i;

	__host__ __device__ void init() // Initialization
	{
		deja = false;
		i = 0;
		for(int i = 1; i < N; i++)
			inThere[i] = 0;
		for(int i = 0; i < N; i++)
			ends[i] = false;
	}

	__host__ __device__ void end(int const& n) // Called when a thread has finished
	{
		ends[n] = true;
	}

	__host__ __device__ void handle() // Called by the thread 0, to handle accesses to memory
	{
		volatile int infinity = 1;
		while(infinity)
		{
			ends[0] = true;
			for(i = 1; i < N; i++) // If all calculations are done, finish this thread too
				if(!ends[i])
				{
					i = N;
					infinity = 0;
				}
			deja = false;
			for(i = 1; i < N; i++) // If someone already access to the memory, "deja" will be true
				if(inThere[i] == 2)
				{
					i = N;
					deja = true;
				}
			if(!deja) // If noone is accessing the memory
				for(i = 1; i < N; i++)
					if(inThere[i] == 1)
					{
						i = N;
						inThere[i] = 2; // Give the access to the memory to the first thread encountered
					}
		}
	}
};

__global__
void saxpy(int n, policeman *x, double *val)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i == 0) // If it's the thread 0 of the block 0, handle
	{
		x->handle();
	}
	else if (i < n && threadIdx.x != 0) // Else...
	{
		x->inThere[i] = 1; // Ask for permission...
		volatile int infinity = 1;
		while(infinity) // ... until he gets it, and then...
			if(x->inThere[i] == 2)
				infinity = 0;
		(*val) += 1; // ... do the calculation and...
		x->inThere[i] = 0; // ... tell it!
	}
	if (i != 0 && i < n)
		x->end(i);
}

int main(void)
{
	policeman *x, *d_x;
	double *val, *d_val;

	clock_t start, end;

	val = (double*)malloc(sizeof(double));
	val[0] = 0;

	x = (policeman*)malloc(sizeof(policeman));
	x->init();
  
    start = clock();

	cudaMalloc((void**)&d_x, sizeof(policeman));
	cudaMalloc(&d_val, sizeof(double));

	cudaMemcpy(d_x, x, sizeof(policeman), cudaMemcpyHostToDevice);
	cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);

	saxpy<<<(N+NTH-1)/NTH, NTH>>>(N, d_x, d_val);

	cudaMemcpy(x, d_x, sizeof(policeman), cudaMemcpyDeviceToHost);
	cudaMemcpy(val, d_val, sizeof(double), cudaMemcpyDeviceToHost);

    end = clock();
  
    double time_taken = double(end - start) / double(CLOCKS_PER_SEC);
    cout << "Time taken by program is : " << fixed  
         << time_taken << setprecision(5);
    cout << " sec : " << val[0] << " " << x->i << endl;
	
	cudaFree(d_x);
	cudaFree(d_val);
	free(x);
	free(val);
}

Please, feel free to test the above code (it’s pretty short) and to tell me if you ever have an idea to make it work…

Thanks a lot !

ChoCChoK.

ChoCChoK · August 30, 2020, 1:25pm

Do I try to do something unachievable? Any other way of handling memory addresses would be welcomed too…

Thanks again,

ChoCChoK.

ChoCChoK · September 8, 2020, 12:46pm

Up, please.

ChoCChoK.

Topic		Replies	Views
Infinite loop in CUDA kernel CUDA Programming and Performance	11	15963	October 25, 2010
this code resets my computer CUDA Programming and Performance	21	7193	March 29, 2008
GPU/CPU precision comparison and Kernel instructions question CUDA Programming and Performance	5	680	April 4, 2017
Correct usage of ldcg and stcg for inter-block communication CUDA Programming and Performance cuda , kernel , linux	9	1323	February 7, 2023
Lock causes deadloop bug report CUDA Programming and Performance	6	3852	September 27, 2007
Cuda program taking more time. CUDA Programming and Performance	15	7059	November 21, 2010
Cuda code performance CUDA Programming and Performance	14	3153	December 16, 2014
CUDA limit for loops..? too large number of iterations? CUDA Programming and Performance	28	27378	March 20, 2008
Problem with performance running parallel blocks CUDA Programming and Performance	3	582	March 26, 2018
[Beginner]: CUDA slower than serial implementation fill Operation on entire image CUDA Programming and Performance	18	13519	September 15, 2011

Is it possible to do infinite loops in CUDA 5?

Related topics