Dear all, after days of debugging, I finally wrote my first real CUDA program. I achieved a 40x speedup relative to my hexacore X79 implementation. But I find that my GTX 470's 1.25 GB of VRAM is not enough for my application. Is it possible to reduce memory usage by sacrificing speed?
My kernel looks like this:
// a for loop to compute and store three vectors of m temporary floats
for (…) {
}
// an EM loop that uses the above m temporary values to optimize a vector of size 3
do {
} while (…);
// write the output to the output vector
Originally I implemented the code with three mallocs at the start and three frees at the end of the kernel, but it crashed. Then I tried allocating the three vectors of size m for all the threads in my host code. Since I have 160x160 = 25600 threads and m can be above 20000, I would need to allocate 3 x 20000 x 25600 x sizeof(float) = approx. 6 GB. I simply couldn't allocate that much VRAM from my host code.
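For reference, that host-side attempt looked roughly like this; the buffer names, the example value of m, and the per-thread slicing are just how I'd illustrate it, not my exact code:
// Rough sketch of the host-side bulk allocation (names are illustrative).
// Each of the 160x160 = 25600 threads gets its own m-float slice of each buffer.
const size_t nThreads = 160 * 160;
const size_t m        = 20000;                         // example value of m
const size_t perVec   = m * nThreads * sizeof(float);  // ~2 GB per buffer

float *d_tmp1, *d_tmp2, *d_tmp3;
cudaMalloc((void**)&d_tmp1, perVec);   // the three buffers together need ~6 GB
cudaMalloc((void**)&d_tmp2, perVec);
cudaMalloc((void**)&d_tmp3, perVec);

// Inside the kernel, thread tid would use d_tmp1 + (size_t)tid * m, and so on.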
If device-side malloc worked, each thread would only need 3 x 20000 x sizeof(float) = 240 KB at a time. I think I could then write a malloc loop in my device code that sleeps on a failed malloc call and lets the other threads wake up the sleeping threads when they finish. Is that possible? How do I implement this sleep-and-wake-up mechanism?
I added a cudaDeviceSetLimit call to enlarge the device malloc heap. Now malloc doesn't crash. I added
while ((x = (float*)malloc(m*sizeof(float))) == 0) ;
to the beginning and
free(x)
to the end.
It completes when m is small, but when m is large it seems to loop forever. Would it work if I added a sleep call inside the malloc loop? But what is the sleep function for device code?
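For reference, the heap-size call I mean is roughly this (the 512 MB figure is only an example, not the value I actually used):
// Must be called before the first kernel launch that uses device-side malloc.
// The size here is only an example value.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 512ull * 1024 * 1024);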
It is not possible to put threads to sleep on the GPU. The best option you have is to just exit the block and leave the work for later. You can either mark the work as not yet done and leave it for a second kernel invocation, or use atomic operations to distribute the work to the blocks that successfully allocated memory within a single kernel invocation. Something like this:
if ((x = (float*)malloc(m * sizeof(float))) != 0) {
    while ((n = atomicAdd(&workCounter, 1)) < noOfWorkItems) {
        // do work for work item n using memory x
    }
    free(x);
}
return;
And initialize workCounter to 0 from host code. This will automatically share the work between as many blocks as can successfully allocate memory.
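One way to do that initialization, assuming workCounter is declared as a __device__ global of unsigned int (my assumption, not spelled out above), is a minimal sketch like this:
// Device-side global work counter used by the loop above.
__device__ unsigned int workCounter;

// Host side: reset the counter before each kernel launch.
unsigned int zero = 0;
cudaMemcpyToSymbol(workCounter, &zero, sizeof(zero));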
Finally got this working. I needed to determine how many thread blocks to execute before launching the kernel; if I just let all the blocks race for malloc, the spinning never ends.
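This is roughly how I'd sketch the sizing of the launch; the heap-size figure and the variable names are made up for illustration, not the exact values from my code:
// Hypothetical sizing: how many blocks can all succeed at device-side malloc at once.
// heapBytes should match whatever was passed to cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).
const int    threadsPerBlock = 16 * 16;                      // 256 threads per block
const size_t m               = 20000;                        // example value of m
const size_t perThreadBytes  = 3 * m * sizeof(float);        // ~240 KB of scratch per thread
const size_t heapBytes       = 900ull * 1024 * 1024;         // example heap size only
const int    blocksPerLaunch = (int)(heapBytes / (perThreadBytes * threadsPerBlock));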
Theoretically, with unlimited memory, the runtime should go up linearly as I increase m, and I expected it to reach about 500 sec for my desired m. Under this new implementation, which executes only 14 16x16 blocks per kernel launch, I am able to finish in about 800 sec.
Time to save up to buy the 780 6GB, and thanks again for tera's help.