Is it possible to do infinite loops in CUDA 5?

Hello everyone,

I’m using CUDA 5 in C++ with a GTX 950M on Windows 10.

I tried to create an infinite loop using the tricky “volatile” keyword but it seems it does not work at all.

Here is the code to reproduce the thing:

global void wait()
{
volatile int infini = 1;
while(infini);
}

int main(void)
{
wait<<<4, 2>>>();
}

The execution just stopped instantly.

Why doesn’t it work? I don’t understand. Is this trick not working on CUDA 5?

Just to be clear, it’s only testing purpose.

Thanks a lot,

ChoCChoK.

As part of its optimizations, the compiler performs dead code elimination. Code that does not (at least potentially) modify global state has no user-visible effect at kernel termination time and is considered “dead”. To prevent that from happening, have the code compute some piece of data that is ultimately written out to global memory.

1 Like

Thanks a lot for the answer! I tried this:

global void wait(double* val)
{
volatile int infini = 1;
while(infini)
{
(*val) = 1;
}
}

int main(void)
{
double *val, d_val;
val = (double
)malloc(sizeof(double));
val[0] = 0;
cudaMalloc(&d_val, sizeof(double));
cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);
wait<<<4, 2>>>(d_val);
}

But it does not work either… What kind of operations (if possible, the least memory and speed consuming) in global memory would work in my case?

Thanks again,

ChoCChoK.

I even tried something like that:

__global__ void wait(double* val)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	volatile int infini = 1;
	while(infini)
	{
		(*val) = sqrt((double)i + clock());
	}
}

And it does not work either…

ChoCChoK.

Another optimization: When a loop body is independent of the loop control, the entire loop can be replaced by zero or one instances of the loop body. The use of clock does not change that, because there is no dependency of clock on infini.

One way to retain the loop may be to accumulate a sum inside the loop, and write back the sum to global memory after the loop. Not sure what conclusions the compiler is allowed to draw from a readily recognized infinite loop. There may be logic that says: Since the loop is infinite, it will never write back the sum to global memory, therefore it will never change global state, therefore the same effect is achieved by doing nothing. If that hypothesis holds, you may need to hide the fact that the loop is infinite.

What are you ultimately trying to accomplish by constructing an infinite loop? Would an extremely long-running loop be sufficient?

1 Like

Thank you very much for your answer.

In fact, I’m creating a sort of “traffic policeman” class that handles accesses to memory, because I have a huge array that I can’t duplicate in the memory (300MB) and each thread (excepted the 0) calculates some values to put in the big array. Thus, the thread 0 is the “traffic policeman” handling function, that is constitued by a almost-infinite loop that gives priority to one thread or another based on their asking. Another almost-infinite loop is found in each thread, when the thread seeks to add values to the big array, waiting for the “traffic policeman” class to give him the priority.

I’m not sure I’m clear, please tell me if any further information would help.

Thanks again,

ChoCChoK.

please check out this thread and sample by @Robert_Crovella.

the idea is pretty similar, using a pinned memory to communicate between GPU and host, the difference is the above example we write in device, and read in host; in your case, you want your host to write and read form device. I suppose you should use cudaHostAllocWriteCombined | cudaHostAllocMapped when calling cudaHostAlloc

http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/online/group__CUDART__MEMORY_g217d441a73d9304c6f0ccc22ec307dba.html

1 Like

Thanks FangQ.

In fact, all my threads are on device side… I don’t handle this on host side. So it’s not necessary, does it?

ChoCChoK.

I thought you were trying to create an “indefinitely repeated loop”, until the host sends you a trigger to stop and return.

so you just want to let the kernel to loop to continue “infinitely” without a stopping criterion? I don’t really see a use-case for this, can you confirm if this is what you wanted?

No, you’re right: it will be stopped, but not by the host, but by one of the GPU thread (let’s say the thread 0) which would execute a “handle” function of the policeman class. Would a practical sample of code help you understand? Please tell me if it is the case.

Thanks again,

ChoCChoK.

Here is a complete example of “policeman” struct that does not work to me:

#include <stdio.h>
#include <iostream>
#include <iomanip>

#define NTH 2 // Number of threads
#define N   1<<5 // Number of total calculations

using namespace std;

struct policeman
{
	bool deja;
	int inThere[N]; // Is equal to 0 if the calculation (i) is not waiting in queue, 1 if it's waiting and 2 if it's his turn
	bool ends[N]; // Is equal to "true" if the calculation has ended
	int i;

	__host__ __device__ void init() // Initialization
	{
		deja = false;
		i = 0;
		for(int i = 1; i < N; i++)
			inThere[i] = 0;
		for(int i = 0; i < N; i++)
			ends[i] = false;
	}

	__host__ __device__ void end(int const& n) // Called when a thread has finished
	{
		ends[n] = true;
	}

	__host__ __device__ int const& getIT(int const& n) // Called to retrieve the value of inThere
	{
		return inThere[n];
	}

	__host__ __device__ void canI(int const& n) // Called for entering the queue
	{
		inThere[n] = 1;
	}

	__host__ __device__ void done(int const& n) // Called when calculation is done
	{
		inThere[n] = 0;
	}

	__host__ __device__ void handle() // Called by the thread 0, to handle accesses to memory
	{
		volatile int infinity = 1;
		while(infinity)
		{
			ends[0] = true;
			for(i = 1; i < N; i++) // If all calculations are done, finish this thread too
				if(!ends[i])
				{
					i = N;
					infinity = 0;
				}
			deja = false;
			for(i = 1; i < N; i++) // If someone already access to the memory, "deja" will be true
				if(inThere[i] == 2)
				{
					i = N;
					deja = true;
				}
			if(!deja) // If noone is accessing the memory
				for(i = 1; i < N; i++)
					if(inThere[i] == 1)
					{
						i = N;
						inThere[i] = 2; // Give the access to the memory to the first thread encountered
					}
		}
	}
};

__global__
void saxpy(int n, policeman *x, double *val)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i == 0) // If it's the thread 0 of the block 0, handle
	{
		x->handle();
	}
	else if (i < n && threadIdx.x != 0) // Else...
	{
		x->canI(i); // Ask for permission...
		volatile int infinity = 1;
		while(infinity) // ... until he gets it, and then...
			if(x->getIT(i) == 2)
				infinity = 0;
		(*val) += 1; // ... do the calculation and...
		x->done(i); // ... tell it!
	}
	if (i != 0 && i < n)
		x->end(i);
}

int main(void)
{
	policeman *x, *d_x;
	double *val, *d_val;

	clock_t start, end;

	val = (double*)malloc(sizeof(double));
	val[0] = 0;

	x = (policeman*)malloc(sizeof(policeman));
	x->init();
  
    start = clock();

	cudaMalloc((void**)&d_x, sizeof(policeman));
	cudaMalloc(&d_val, sizeof(double));

	cudaMemcpy(d_x, x, sizeof(policeman), cudaMemcpyHostToDevice);
	cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);

	saxpy<<<(N+NTH-1)/NTH, NTH>>>(N, d_x, d_val);

	cudaMemcpy(x, d_x, sizeof(policeman), cudaMemcpyDeviceToHost);
	cudaMemcpy(val, d_val, sizeof(double), cudaMemcpyDeviceToHost);

    end = clock();
  
    double time_taken = double(end - start) / double(CLOCKS_PER_SEC);
    cout << "Time taken by program is : " << fixed  
         << time_taken << setprecision(5);
    cout << " sec : " << val[0] << " " << x->i << endl;
	
	cudaFree(d_x);
	cudaFree(d_val);
	free(x);
	free(val);
}

For a dark reason that I don’t understand, the infinite loop from the “handle” function seems to work, but not the other one directly in the kernel (but I might be wrong).

Thanks a lot again,

ChoCChoK.

Hello everyone.

To put this simpler and shorter to read, I re-wrote a little the code:

#include <stdio.h>
#include <iostream>
#include <iomanip>

#define NTH 2 // Number of threads
#define N   1<<5 // Number of total calculations

using namespace std;

struct policeman
{
	bool deja;
	int inThere[N]; // Is equal to 0 if the calculation (i) is not waiting in queue, 1 if it's waiting and 2 if it's his turn
	bool ends[N]; // Is equal to "true" if the calculation has ended
	int i;

	__host__ __device__ void init() // Initialization
	{
		deja = false;
		i = 0;
		for(int i = 1; i < N; i++)
			inThere[i] = 0;
		for(int i = 0; i < N; i++)
			ends[i] = false;
	}

	__host__ __device__ void end(int const& n) // Called when a thread has finished
	{
		ends[n] = true;
	}

	__host__ __device__ void handle() // Called by the thread 0, to handle accesses to memory
	{
		volatile int infinity = 1;
		while(infinity)
		{
			ends[0] = true;
			for(i = 1; i < N; i++) // If all calculations are done, finish this thread too
				if(!ends[i])
				{
					i = N;
					infinity = 0;
				}
			deja = false;
			for(i = 1; i < N; i++) // If someone already access to the memory, "deja" will be true
				if(inThere[i] == 2)
				{
					i = N;
					deja = true;
				}
			if(!deja) // If noone is accessing the memory
				for(i = 1; i < N; i++)
					if(inThere[i] == 1)
					{
						i = N;
						inThere[i] = 2; // Give the access to the memory to the first thread encountered
					}
		}
	}
};

__global__
void saxpy(int n, policeman *x, double *val)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i == 0) // If it's the thread 0 of the block 0, handle
	{
		x->handle();
	}
	else if (i < n && threadIdx.x != 0) // Else...
	{
		x->inThere[i] = 1; // Ask for permission...
		volatile int infinity = 1;
		while(infinity) // ... until he gets it, and then...
			if(x->inThere[i] == 2)
				infinity = 0;
		(*val) += 1; // ... do the calculation and...
		x->inThere[i] = 0; // ... tell it!
	}
	if (i != 0 && i < n)
		x->end(i);
}

int main(void)
{
	policeman *x, *d_x;
	double *val, *d_val;

	clock_t start, end;

	val = (double*)malloc(sizeof(double));
	val[0] = 0;

	x = (policeman*)malloc(sizeof(policeman));
	x->init();
  
    start = clock();

	cudaMalloc((void**)&d_x, sizeof(policeman));
	cudaMalloc(&d_val, sizeof(double));

	cudaMemcpy(d_x, x, sizeof(policeman), cudaMemcpyHostToDevice);
	cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);

	saxpy<<<(N+NTH-1)/NTH, NTH>>>(N, d_x, d_val);

	cudaMemcpy(x, d_x, sizeof(policeman), cudaMemcpyDeviceToHost);
	cudaMemcpy(val, d_val, sizeof(double), cudaMemcpyDeviceToHost);

    end = clock();
  
    double time_taken = double(end - start) / double(CLOCKS_PER_SEC);
    cout << "Time taken by program is : " << fixed  
         << time_taken << setprecision(5);
    cout << " sec : " << val[0] << " " << x->i << endl;
	
	cudaFree(d_x);
	cudaFree(d_val);
	free(x);
	free(val);
}

Please, feel free to test the above code (it’s pretty short) and to tell me if you ever have an idea to make it work…

Thanks a lot !

ChoCChoK.

Do I try to do something unachievable? Any other way of handling memory addresses would be welcomed too…

Thanks again,

ChoCChoK.

Up, please.

ChoCChoK.