Dear all, after days of debugging, I finally wrote my first real CUDA program. I achieved a 40x speedup relative to my hexacore X79 implementation. But I find that my GTX 470's 1.25 GB of VRAM is not enough for my application. Is it possible to reduce memory usage by sacrificing speed?
My kernel looks like this:
// a for loop to compute and store three vectors of m temporary floats
for (…) {
}
// an EM loop that uses the above m temporary values to optimize a vector of size 3
do {
} while (…);
// write the output to the output vector
Originally I implemented the code with three mallocs at the start and three frees at the end of the kernel, but it crashed. Then I tried allocating the three vectors of size m for all the threads in my host code. Since I have 160x160 = 25600 threads and m can be above 20000, I would need to allocate 3 x 20000 x 25600 x sizeof(float) = approx. 6 GB. I simply couldn't allocate that much VRAM from my host code.
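For reference, that host-side attempt looked roughly like this; the buffer names, the example value of m, and the per-thread slicing are just how I'd illustrate it, not my exact code:
// Rough sketch of the host-side bulk allocation (names are illustrative).
// Each of the 160x160 = 25600 threads gets its own m-float slice of each buffer.
const size_t nThreads = 160 * 160;
const size_t m        = 20000;                         // example value of m
const size_t perVec   = m * nThreads * sizeof(float);  // ~2 GB per buffer

float *d_tmp1, *d_tmp2, *d_tmp3;
cudaMalloc((void**)&d_tmp1, perVec);   // the three buffers together need ~6 GB
cudaMalloc((void**)&d_tmp2, perVec);
cudaMalloc((void**)&d_tmp3, perVec);

// Inside the kernel, thread tid would use d_tmp1 + (size_t)tid * m, and so on.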
If device-side malloc worked, each thread would only need 3 x 20000 x sizeof(float) = 240 KB at a time. I think I could then write a malloc loop in my device code that sleeps on a failed malloc call and lets the other threads wake up the sleeping threads when they finish. Is that possible? How do I implement this sleep-and-wake-up mechanism?
I added a cudaDeviceSetLimit call to enlarge the device malloc heap. Now malloc doesn't crash. I added
while ((x = (float*)malloc(m*sizeof(float))) == 0) ;
to the beginning and
free(x)
to the end.
It completes when m is small, but when m is large it seems to loop forever. Would it work if I added a sleep call inside the malloc loop? But what is the sleep function for device code?
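For reference, the heap-size call I mean is roughly this (the 512 MB figure is only an example, not the value I actually used):
// Must be called before the first kernel launch that uses device-side malloc.
// The size here is only an example value.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 512ull * 1024 * 1024);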
It is not possible to put threads to sleep on the GPU. The best option you have is to just exit the block and leave the work for later. You can either mark the work as not yet done and leave it for a second kernel invocation, or use atomic operations to distribute the work to the blocks that successfully allocated memory within a single kernel invocation. Something like this:
if ((x = (float*)malloc(m * sizeof(float))) != 0) {
    while ((n = atomicAdd(&workCounter, 1)) < noOfWorkItems) {
        // do work for work item n using memory x
    }
    free(x);
}
return;
And initialize workCounter to 0 from host code. This will automatically share the work between as many blocks as can successfully allocate memory.
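One way to do that initialization, assuming workCounter is declared as a __device__ global of unsigned int (my assumption, not spelled out above), is a minimal sketch like this:
// Device-side global work counter used by the loop above.
__device__ unsigned int workCounter;

// Host side: reset the counter before each kernel launch.
unsigned int zero = 0;
cudaMemcpyToSymbol(workCounter, &zero, sizeof(zero));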
Finally got this working. I needed to determine how many thread blocks to execute before launching the kernel; if I just let all the blocks race for malloc, the spinning never ends.
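This is roughly how I'd sketch the sizing of the launch; the heap-size figure and the variable names are made up for illustration, not the exact values from my code:
// Hypothetical sizing: how many blocks can all succeed at device-side malloc at once.
// heapBytes should match whatever was passed to cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).
const int    threadsPerBlock = 16 * 16;                      // 256 threads per block
const size_t m               = 20000;                        // example value of m
const size_t perThreadBytes  = 3 * m * sizeof(float);        // ~240 KB of scratch per thread
const size_t heapBytes       = 900ull * 1024 * 1024;         // example heap size only
const int    blocksPerLaunch = (int)(heapBytes / (perThreadBytes * threadsPerBlock));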
Theoretically, with unlimited memory, the runtime should go up linearly as I increase m, and I expected it to reach about 500 sec for my desired m. Under this new implementation, which executes only 14 16x16 blocks per kernel launch, I am able to finish in about 800 sec.
Time to save up to buy the 780 6GB, and thanks again for tera's help.