As part of its optimizations, the compiler performs dead code elimination. Code that does not (at least potentially) modify global state has no user-visible effect at kernel termination time and is considered “dead”. To prevent that from happening, have the code compute some piece of data that is ultimately written out to global memory.
Another optimization: When a loop body is independent of the loop control, the entire loop can be replaced by zero or one instances of the loop body. The use of clock does not change that, because there is no dependency of clock on infini.
One way to retain the loop may be to accumulate a sum inside the loop, and write back the sum to global memory after the loop. Not sure what conclusions the compiler is allowed to draw from a readily recognized infinite loop. There may be logic that says: Since the loop is infinite, it will never write back the sum to global memory, therefore it will never change global state, therefore the same effect is achieved by doing nothing. If that hypothesis holds, you may need to hide the fact that the loop is infinite.
What are you ultimately trying to accomplish by constructing an infinite loop? Would an extremely long-running loop be sufficient?
In fact, I’m creating a sort of “traffic policeman” class that handles accesses to memory, because I have a huge array that I can’t duplicate in the memory (300MB) and each thread (excepted the 0) calculates some values to put in the big array. Thus, the thread 0 is the “traffic policeman” handling function, that is constitued by a almost-infinite loop that gives priority to one thread or another based on their asking. Another almost-infinite loop is found in each thread, when the thread seeks to add values to the big array, waiting for the “traffic policeman” class to give him the priority.
I’m not sure I’m clear, please tell me if any further information would help.
the idea is pretty similar, using a pinned memory to communicate between GPU and host, the difference is the above example we write in device, and read in host; in your case, you want your host to write and read form device. I suppose you should use cudaHostAllocWriteCombined | cudaHostAllocMapped when calling cudaHostAlloc
I thought you were trying to create an “indefinitely repeated loop”, until the host sends you a trigger to stop and return.
so you just want to let the kernel to loop to continue “infinitely” without a stopping criterion? I don’t really see a use-case for this, can you confirm if this is what you wanted?
No, you’re right: it will be stopped, but not by the host, but by one of the GPU thread (let’s say the thread 0) which would execute a “handle” function of the policeman class. Would a practical sample of code help you understand? Please tell me if it is the case.
Here is a complete example of “policeman” struct that does not work to me:
#include <stdio.h>
#include <iostream>
#include <iomanip>
#define NTH 2 // Number of threads
#define N 1<<5 // Number of total calculations
using namespace std;
struct policeman
{
bool deja;
int inThere[N]; // Is equal to 0 if the calculation (i) is not waiting in queue, 1 if it's waiting and 2 if it's his turn
bool ends[N]; // Is equal to "true" if the calculation has ended
int i;
__host__ __device__ void init() // Initialization
{
deja = false;
i = 0;
for(int i = 1; i < N; i++)
inThere[i] = 0;
for(int i = 0; i < N; i++)
ends[i] = false;
}
__host__ __device__ void end(int const& n) // Called when a thread has finished
{
ends[n] = true;
}
__host__ __device__ int const& getIT(int const& n) // Called to retrieve the value of inThere
{
return inThere[n];
}
__host__ __device__ void canI(int const& n) // Called for entering the queue
{
inThere[n] = 1;
}
__host__ __device__ void done(int const& n) // Called when calculation is done
{
inThere[n] = 0;
}
__host__ __device__ void handle() // Called by the thread 0, to handle accesses to memory
{
volatile int infinity = 1;
while(infinity)
{
ends[0] = true;
for(i = 1; i < N; i++) // If all calculations are done, finish this thread too
if(!ends[i])
{
i = N;
infinity = 0;
}
deja = false;
for(i = 1; i < N; i++) // If someone already access to the memory, "deja" will be true
if(inThere[i] == 2)
{
i = N;
deja = true;
}
if(!deja) // If noone is accessing the memory
for(i = 1; i < N; i++)
if(inThere[i] == 1)
{
i = N;
inThere[i] = 2; // Give the access to the memory to the first thread encountered
}
}
}
};
__global__
void saxpy(int n, policeman *x, double *val)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if(i == 0) // If it's the thread 0 of the block 0, handle
{
x->handle();
}
else if (i < n && threadIdx.x != 0) // Else...
{
x->canI(i); // Ask for permission...
volatile int infinity = 1;
while(infinity) // ... until he gets it, and then...
if(x->getIT(i) == 2)
infinity = 0;
(*val) += 1; // ... do the calculation and...
x->done(i); // ... tell it!
}
if (i != 0 && i < n)
x->end(i);
}
int main(void)
{
policeman *x, *d_x;
double *val, *d_val;
clock_t start, end;
val = (double*)malloc(sizeof(double));
val[0] = 0;
x = (policeman*)malloc(sizeof(policeman));
x->init();
start = clock();
cudaMalloc((void**)&d_x, sizeof(policeman));
cudaMalloc(&d_val, sizeof(double));
cudaMemcpy(d_x, x, sizeof(policeman), cudaMemcpyHostToDevice);
cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);
saxpy<<<(N+NTH-1)/NTH, NTH>>>(N, d_x, d_val);
cudaMemcpy(x, d_x, sizeof(policeman), cudaMemcpyDeviceToHost);
cudaMemcpy(val, d_val, sizeof(double), cudaMemcpyDeviceToHost);
end = clock();
double time_taken = double(end - start) / double(CLOCKS_PER_SEC);
cout << "Time taken by program is : " << fixed
<< time_taken << setprecision(5);
cout << " sec : " << val[0] << " " << x->i << endl;
cudaFree(d_x);
cudaFree(d_val);
free(x);
free(val);
}
For a dark reason that I don’t understand, the infinite loop from the “handle” function seems to work, but not the other one directly in the kernel (but I might be wrong).
To put this simpler and shorter to read, I re-wrote a little the code:
#include <stdio.h>
#include <iostream>
#include <iomanip>
#define NTH 2 // Number of threads
#define N 1<<5 // Number of total calculations
using namespace std;
struct policeman
{
bool deja;
int inThere[N]; // Is equal to 0 if the calculation (i) is not waiting in queue, 1 if it's waiting and 2 if it's his turn
bool ends[N]; // Is equal to "true" if the calculation has ended
int i;
__host__ __device__ void init() // Initialization
{
deja = false;
i = 0;
for(int i = 1; i < N; i++)
inThere[i] = 0;
for(int i = 0; i < N; i++)
ends[i] = false;
}
__host__ __device__ void end(int const& n) // Called when a thread has finished
{
ends[n] = true;
}
__host__ __device__ void handle() // Called by the thread 0, to handle accesses to memory
{
volatile int infinity = 1;
while(infinity)
{
ends[0] = true;
for(i = 1; i < N; i++) // If all calculations are done, finish this thread too
if(!ends[i])
{
i = N;
infinity = 0;
}
deja = false;
for(i = 1; i < N; i++) // If someone already access to the memory, "deja" will be true
if(inThere[i] == 2)
{
i = N;
deja = true;
}
if(!deja) // If noone is accessing the memory
for(i = 1; i < N; i++)
if(inThere[i] == 1)
{
i = N;
inThere[i] = 2; // Give the access to the memory to the first thread encountered
}
}
}
};
__global__
void saxpy(int n, policeman *x, double *val)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if(i == 0) // If it's the thread 0 of the block 0, handle
{
x->handle();
}
else if (i < n && threadIdx.x != 0) // Else...
{
x->inThere[i] = 1; // Ask for permission...
volatile int infinity = 1;
while(infinity) // ... until he gets it, and then...
if(x->inThere[i] == 2)
infinity = 0;
(*val) += 1; // ... do the calculation and...
x->inThere[i] = 0; // ... tell it!
}
if (i != 0 && i < n)
x->end(i);
}
int main(void)
{
policeman *x, *d_x;
double *val, *d_val;
clock_t start, end;
val = (double*)malloc(sizeof(double));
val[0] = 0;
x = (policeman*)malloc(sizeof(policeman));
x->init();
start = clock();
cudaMalloc((void**)&d_x, sizeof(policeman));
cudaMalloc(&d_val, sizeof(double));
cudaMemcpy(d_x, x, sizeof(policeman), cudaMemcpyHostToDevice);
cudaMemcpy(d_val, val, sizeof(double), cudaMemcpyHostToDevice);
saxpy<<<(N+NTH-1)/NTH, NTH>>>(N, d_x, d_val);
cudaMemcpy(x, d_x, sizeof(policeman), cudaMemcpyDeviceToHost);
cudaMemcpy(val, d_val, sizeof(double), cudaMemcpyDeviceToHost);
end = clock();
double time_taken = double(end - start) / double(CLOCKS_PER_SEC);
cout << "Time taken by program is : " << fixed
<< time_taken << setprecision(5);
cout << " sec : " << val[0] << " " << x->i << endl;
cudaFree(d_x);
cudaFree(d_val);
free(x);
free(val);
}
Please, feel free to test the above code (it’s pretty short) and to tell me if you ever have an idea to make it work…