'for' loop performance hacks?

Hello –

I've written an application to do pairwise comparisons of ~1000 arrays of floats. The performance in CUDA is too slow for my purposes (though I do see a 5-6x speed-up over a serial implementation), so I'm trying to get to the bottom of the performance lag. The kernel involves a 'for' loop over ~1000 elements (of the particular arrays it's currently dealing with). It seems to me that the hit in performance is occurring because of the loop, so I've created a test kernel to look at the effects of loop size, etc. on performance. The test kernel doesn't do anything non-trivial - I just wanted to see the effects of updating something in a loop and ultimately passing its result back to global memory.

My question is, given this sort of loop architecture, is there something I can do with memory management or otherwise to make it run faster? The kernel and my debugging steps are listed below, and I’ve added the actual program written in PyCuda (though the kernel could be easily decoupled and run from C/CUDA) at the very end of the post. Thanks so much!

    #define num_loops 1024
    #define MATRIX_COLS 1000
    #define num_threads_per_block 20

    __device__ float ReturnOut();
    __global__ void CKernel(float *c)
    {
        int tx = blockIdx.x * num_threads_per_block + threadIdx.x;
        int ty = blockIdx.y * num_threads_per_block + threadIdx.y;

        c[tx * MATRIX_COLS + ty] = ReturnOut();
        //c[tx * MATRIX_COLS + ty] = 1.1;
    }

    __device__ float ReturnOut() {
        float out = 0.0;
        for (int ii = 0; ii < num_loops; ++ii) {
            out += 1.5;
        }
        return out;
    }

The kernel was launched from PyCuda with the following launch configuration:

grid_size = (50, 50)
block_size = (20, 20, 1)

So I've found that when I execute this thing as-is it takes ~28 ms. When I comment out the second-to-last line and uncomment the last line (c[tx * MATRIX_COLS + ty] = 1.1;), it takes ~3 ms. It seems to me that in this case the compiler is performing some sort of optimization and not executing the loop at all. Interestingly, in that same scenario with the last line uncommented, if I declare "out" to be a __shared__ variable, the execution time goes back to ~28 ms. Whereas a variable defined without a specifier like __shared__ or __constant__ is by default stored in a register (presumably fast memory access), one declared as __shared__ float out should be stored in shared memory. So it seems that, if the compiler knows that out is a shared variable, it will execute the loop, despite the fact that my output array is being updated with a constant (1.1). But if it thinks out is a local variable, it won't bother executing the loop because the end result (c) doesn't depend on it. Additionally, if I allocate out outside of the kernel and pass it in (through global memory), things are slow.

One more thing: as expected, varying the number of loop iterations changes execution time in a predictable manner, so a smaller loop takes much less time, etc. I bring this up because I've also tried scenarios where the loop updates an array of a particular (large) size, and found that varying the number of elements updated inside the loop changes the processing time in the same way. That rules out the possibility that execution is slow at the point of updating c simply because it has to load a large array into memory or something like that. To put it another way, a large, preallocated array leads to more execution time the more often its elements have been updated; the cost isn't a constant that only depends on array size.

So this leads me to the question: is there any way to make this faster, or are loops inherently just really slow in CUDA?

Thanks,
Brian

PyCuda code:

import sys
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from numpy.core import all

kernel_code = """

    #define num_loops 1024
    #define MATRIX_COLS 1000
    #define num_threads_per_block 20

    __device__ float ReturnOut();
    __global__ void CKernel(float *c)
    {

        int tx = blockIdx.x * num_threads_per_block + threadIdx.x;
        int ty = blockIdx.y * num_threads_per_block + threadIdx.y;

        c[tx * MATRIX_COLS + ty] = ReturnOut();
        //c[tx * MATRIX_COLS + ty] = 1.1;
    }

    __device__ float ReturnOut() {

        float out = 0.0;
        for (int ii=0; ii<num_loops; ++ii) {
            out += 1.5;
        }
        return out;
    }

    """

def to_gpu(numpy_array):
    gpu_array = drv.mem_alloc(numpy_array.size * numpy_array.dtype.itemsize)
    drv.memcpy_htod(gpu_array, numpy_array)
    return gpu_array

if __name__ == '__main__':
    import sys
    import time
    import math

    if sys.platform == 'win32':
        now = time.clock
    else:
        now = time.time

    f_rows = 1024
    f_cols = 1000

    c = numpy.zeros((f_cols, f_cols)).astype(numpy.float32)
    c_gpu = to_gpu(c)
    #out = numpy.zeros((f_rows)).astype(numpy.float32)
    #out_gpu = to_gpu(out)

    mod = drv.SourceModule(kernel_code)
    func = mod.get_function("CKernel")

    grid_size = (50, 50)
    block_size = (20, 20, 1)
    start = now()
    func(c_gpu, grid=grid_size, block=block_size)
    drv.memcpy_dtoh(c, c_gpu)

    print 'time for analysis: ', now() - start

Hi,
First, I think it's a bit weird that in the first scenario the compiler didn't optimize the loop away - you merely do out += 1.5 num_loops times, which is known at compile time. According to the times you've measured (28 ms vs 3 ms), it seems the compiler did not optimize the loop out.
Second, how much time does a totally empty kernel take? Can you measure this? Without anything in it and without a call to another function.
Third - I don't think loops are specifically expensive in CUDA or on the GPU. However, you need to realize that a simple loop includes comparisons, branches and other stuff that takes time. One thing to do, especially in simple loops like the one you've posted, is to unroll them, for example as sketched below.
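Something along these lines (just an untested sketch, reusing the num_loops define from your kernel):

    // Sketch only: manual unrolling of the test loop (assumes num_loops is a
    // multiple of 4). Alternatively, putting "#pragma unroll" in front of the
    // original loop asks nvcc to do the unrolling for you.
    __device__ float ReturnOutUnrolled() {
        float out = 0.0f;
        for (int ii = 0; ii < num_loops; ii += 4) {
            out += 1.5f;   // four additions per trip through the loop ->
            out += 1.5f;   // a quarter of the counter updates and branches
            out += 1.5f;
            out += 1.5f;
        }
        return out;
    }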
Fourth - you'll need to run the profiler to see better what takes the time.

And finally, always recheck that the code you write doesn't get optimized out, which would give the kernel imaginary time results. I think it was one of the developers from NVIDIA at the recent conference in Texas who said that sometimes people get results which are not possible hardware-wise and believe it is the real case, while there were optimizations, caching and other stuff taking place :)

BTW - regarding your original pairwise-comparison problem, make sure you consult the SDK samples and the guys in this forum (who are smarter than myself :) ). A 5-6x speed-up seems a bit low to me.

Hope this helps,
eyal

Your memory access to the c array is non-coalesced: consecutive threads in a warp (consecutive threadIdx.x) write addresses that are MATRIX_COLS floats apart. Swapping the roles of tx and ty in the store index, as in the snippet below, would fix that. But that does not seem to be your problem if 3 msec is fast enough.
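A minimal sketch of what I mean (same kernel, only the store index changed; I'm ignoring the alignment of the 1000-wide rows here):

    // Sketch: threadIdx.x now indexes the contiguous dimension of c, so the
    // threads of a warp write consecutive floats -> coalesced store.
    __global__ void CKernelCoalesced(float *c)
    {
        int tx = blockIdx.x * num_threads_per_block + threadIdx.x;  // fastest-varying index
        int ty = blockIdx.y * num_threads_per_block + threadIdx.y;

        c[ty * MATRIX_COLS + tx] = ReturnOut();
    }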

For doing it differently: I would let a block do what a thread is doing now, and I would parallelize the loop over all the threads in a block. Each thread adds its 'part' to its own shared memory location and then you finish up with a reduction. That way you still have the uncoalesced accesses, but if your loop gets big (1024 may not be big enough to benefit from this strategy) you largely remove the for loop, which seems to be hurting you more than the uncoalesced accesses. Roughly like the sketch below.
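Very roughly, something like this (untested sketch; the block size of 256 and the one-block-per-output-element mapping are my own assumptions, and the += 1.5f stands in for your real per-iteration work):

    #define THREADS 256   // assumed block size for this sketch

    // Each block computes one element of c: the threads split the num_loops
    // additions among themselves, then a shared-memory reduction combines
    // the THREADS partial sums.
    __global__ void CKernelBlockPerElement(float *c)
    {
        __shared__ float partial[THREADS];

        int tid = threadIdx.x;
        float sum = 0.0f;

        // Each thread handles num_loops / THREADS iterations (4 here).
        for (int ii = tid; ii < num_loops; ii += THREADS)
            sum += 1.5f;                   // placeholder for the real work

        partial[tid] = sum;
        __syncthreads();

        // Standard tree reduction in shared memory.
        for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            c[blockIdx.y * MATRIX_COLS + blockIdx.x] = partial[0];
    }

You would launch it with grid = (1000, 1000) and block = (256, 1, 1), so there is one block per output element.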

Hi Eyal,

Thanks for your quick response! An empty kernel takes around 3ms, roughly the same amount of time it takes when the loop is in but is presumably optimized out (when my output matrix value is set to a hard-coded float like 1.1).

I ported the rest of the calling code over to C/CUDA in hopes of running the Visual Profiler, but when I try to run the Profiler I get the error: ./cudaprof: error while loading shared libraries: libSM.so.6: wrong ELF class: ELFCLASS64

So I’m working on fixing that.

Regards,

Brian

Hi Denis,

Thanks for your reply! Indeed I’ve thought about changing the kernel so that there wouldn’t be a for loop at all – I just wanted to make sure first that this time lag from the for loop is unavoidable. To do what you suggest with parallelizing the loop over threads might be a little tricky. There are 1024 elements in each of my arrays, and I think one can only spawn a max of 512 threads from each block. I could split each array over two blocks, and perhaps this is the way to go, but this seems clunky.

Best,

Brian

I don't know what hardware you are running this on, but from the times you are reporting I assume it's not one of the powerful GPUs. 3 ms for an empty kernel is a crazy overhead; I have a CG solver do 300 kernel calls (and quite a bit of math) in less than 6 ms. But in general this sort of thing would definitely be parallelized and not done serially. You can have each thread handle 2 elements and then you are good with 512 threads, or even 4 elements if 256 is better (you have to test these sorts of things out).

You should make sure that you are timing correctly. 3 milliseconds for an empty kernel is way too long - it should be around 10 microseconds. Maybe your Python wrapper is playing tricks on you?

I agree that you should partition the problem differently, but if you feel like keeping the current approach you could do some manual unrolling with the help of C++ templates. The Boost Preprocessor is a nice way to do this (see this thread). Something like the template sketch below, for instance.
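For the template route, a rough (untested) sketch of compile-time unrolling of your test loop could look like this:

    // Recursive template: Unroller<N>::add() expands at compile time into
    // N additions, leaving no loop counter or branch for those N steps.
    template <int N>
    struct Unroller {
        __device__ static float add(float x) {
            return Unroller<N - 1>::add(x + 1.5f);
        }
    };

    template <>
    struct Unroller<0> {
        __device__ static float add(float x) { return x; }
    };

    __device__ float ReturnOutTemplated() {
        float out = 0.0f;
        // Unroll in chunks of 16 so the template recursion depth stays small
        // (assumes num_loops is a multiple of 16).
        for (int ii = 0; ii < num_loops; ii += 16)
            out = Unroller<16>::add(out);
        return out;
    }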

Hi Fugl,

Thanks for the tip about the Boost Preprocessor – I’ll try implementing that.

I agree, 3 milliseconds seems like it’s way too long for an empty kernel, considering I have a Tesla C1060. If I run an empty kernel and don’t try to return anything from it to the CPU, indeed the execution time is ~60 microseconds – the 3 millisecond number included a call from the CPU to get a value back from the GPU. Even so, it shouldn’t take 3 milliseconds to retrieve a simple float array from the GPU, right? Maybe my thread architecture is causing a bottleneck (50x50 blocks, 20x20 threads)? I ported the calling code to C and attached it, in case you’d kindly like to take a glance.

Btw, I’m timing with:

    unsigned int timer = 0;
    cutilCheckError( cutCreateTimer( &timer));
    cutilCheckError( cutStartTimer( timer));

    CKernel<<< grid, threads >>>( d_idata1, d_idata2, d_odata);
    cutilSafeCall( cudaMemcpy( h_odata, d_odata, out_mem_size,
                               cudaMemcpyDeviceToHost) );

    cutilCheckError( cutStopTimer( timer));
    printf( "Processing time: %f (ms)\n", cutGetTimerValue( timer));

Regards,

Brian

c_kernel.cu.txt (715 Bytes)
c.cu.txt (2.4 KB)

Thanks, erdooom. I’ll try what you suggest with having each thread handle a smaller number of elements.

Regards,

Brian

A related question: say that in my "for" loop I would like to access ~2000 elements from a float array of ~10^6 elements that is passed over to the GPU (the elements to be accessed are specific to the thread). Every time I loop through these 2000 elements I'm reading from global memory (the array comes in as a kernel argument, like __global__ void CKernel(float *c)), which is slow. Is there a quicker way to do this, where somehow I can pull these elements quickly into local memory and access them from there? It seems like that act of copying would involve just as many calls to global memory…

Thanks!

Is the access to the array random? Are any of the elements of the array shared among a number of threads? Do you use any of the elements more than once?

If the access is not random, then you can make sure to read from it in a coalesced fashion; even if you need to rearrange the data in the array once, it might be worth it (depending on your problem). If some elements are shared, or you use them more than once in a kernel, then I would recommend loading them into shared memory, roughly as in the sketch below. In any case, I still think doing a for loop over 2000 elements is the wrong approach in CUDA: let each thread do a small part and then use a scan to collapse all the data. You can do it all in one kernel and it will probably get you the best results.
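For the shared-memory part, the usual pattern is something like this (just a sketch; TILE, the kernel name and the trivial summing are made up for the example):

    #define TILE 256   // elements staged per pass (made-up size for the sketch)

    __global__ void CKernelTiled(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE];

        float sum = 0.0f;

        // Walk over the input in chunks of TILE elements.
        for (int base = 0; base < n; base += TILE) {
            // The threads of the block cooperate to load one tile, with
            // consecutive threads reading consecutive addresses (coalesced),
            // instead of every thread fetching on its own.
            for (int i = threadIdx.x; i < TILE && base + i < n; i += blockDim.x)
                tile[i] = in[base + i];
            __syncthreads();

            // Every thread can now reuse the tile from fast shared memory
            // as many times as it likes.
            for (int i = 0; i < TILE && base + i < n; ++i)
                sum += tile[i];            // placeholder for the real work

            __syncthreads();               // before the tile is overwritten
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = sum;         // placeholder store
    }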

Cheers

Also, I don't know what your application is, but you can make your memory copies asynchronous (DMA) so that CUDA can start working on the next batch while copying the first (if you don't need the output from the first for the second); a sketch of that idea is below. It would also be interesting to see the time for the kernel execution alone. Also, your block size could be a bit problematic; try making it a power of 2 and then not using the remainder of the threads. And my last tip for today: allocate that array in page-locked memory.
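Roughly like this (untested sketch; the batch size, buffer names and kernel signature are made up, and the card has to support overlapping copies with kernels):

    // Sketch: page-locked host memory + two CUDA streams, so copying batch 1
    // can overlap with the kernel working on batch 0.
    const int    batch_elems = 1 << 20;                  // made-up batch size
    const size_t batch_bytes = batch_elems * sizeof(float);

    float *h_in;                                         // host input buffer (fill with your data)
    cudaMallocHost((void**)&h_in, 2 * batch_bytes);      // page-locked, instead of malloc()

    float *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc((void**)&d_in[i],  batch_bytes);
        cudaMalloc((void**)&d_out[i], batch_bytes);
        cudaStreamCreate(&stream[i]);
    }

    dim3 threads(256), grid(batch_elems / 256);          // made-up launch configuration
    for (int i = 0; i < 2; ++i) {
        // A copy and a launch issued into the same stream stay ordered,
        // but the two streams can overlap with each other.
        cudaMemcpyAsync(d_in[i], h_in + i * batch_elems, batch_bytes,
                        cudaMemcpyHostToDevice, stream[i]);
        CKernel<<< grid, threads, 0, stream[i] >>>(d_in[i], d_out[i]);   // your kernel here
    }
    cudaThreadSynchronize();    // wait for both streams before using the results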

I have a similar type of program. It has a massive for loop (each time, a new tile of memory is loaded to shared memory and all the threads perform calculations on that tile, very high arithmetic density). Erdoom and Riedijk suggested that a block handle what a thread is handling now. Why is this approach better (instruction overhead maybe? what if the for loop is unrolled?)?

When you split the load over "threads", the GPU has a better way of balancing the work. In the end the work on the GPU will look pretty similar, but it can let one part load data while another does some work. It will also help your program utilize future hardware, which will probably have more processing units. If your problem is computation-bound and you unroll the loop, it might not make a difference on current hardware. I have a massive loop in one of my kernels, but it only runs 6-16 times, so I decided to keep it in one thread. If you have a kernel with a loop that runs 2000 times, then that sounds like a sure candidate for parallelizing.

I've run into a problem trying to get a block to do what a thread is doing now. To get each output element (around 200k elements), I have to go through each element of an input array (also around 200k elements). The way I see it, I can either have a block compute one output element, pulling the whole input array in through global memory for each block, or I can have a block compute one small section of each output element and use shared memory to store the specific tile of input. The problem with the second approach is that I have to do a reduction afterward, and either writing the output or reading the input for the reduction will be uncoalesced - I'll be skipping over a ton of elements just to get to the next one (the one from the next tile that I need to add to the other ones). Is there a clever way around this (or how badly will it affect performance? I'll end up with about 2k uncoalesced reads per output element, depending on my execution configuration), or am I stuck loading every element of the input array from global memory for each output element :( ?

What are you trying to implement? Going through 200k elements for each output element sounds like a bad approach for CUDA, no matter what you do…

Well, going through all those elements is unavoidable :( . There’s a fair amount of calculating per input element, hopefully enough to cover for all the memory access as long as I’m copying into shared… I’ll try the shared memory approach and hopefully the uncoalesced writes won’t slow me down too much.

Yes, I understand it is unavoidable, but! In a lot of cases it is possible to do it in some kind of pyramid, like scan, sort and the like… What are you doing?