Hello –
I’ve written an application to do pairwise comparisons of ~1000 arrays of floats. The CUDA performance is too slow for my purposes (though I do see a 5-6x speed-up over a serial implementation), so I’m trying to get to the bottom of the performance gap. The kernel involves a ‘for’ loop over ~1000 elements (of the particular pair of arrays it’s currently dealing with). It seems to me that the performance hit is coming from the loop, so I’ve created a test kernel to look at the effects of loop size, etc. on performance. The test kernel doesn’t do anything meaningful; I just wanted to see the effect of updating something in a loop and ultimately passing its result back to global memory.
My question is, given this sort of loop architecture, is there something I can do with memory management or otherwise to make it run faster? The kernel and my debugging steps are listed below, and I’ve added the actual program written in PyCuda (though the kernel could be easily decoupled and run from C/CUDA) at the very end of the post. Thanks so much!
#define num_loops 1024
#define MATRIX_COLS 1000
#define num_threads_per_block 20

__device__ float ReturnOut();

__global__ void CKernel(float *c)
{
    int tx = blockIdx.x * num_threads_per_block + threadIdx.x;
    int ty = blockIdx.y * num_threads_per_block + threadIdx.y;
    c[tx * MATRIX_COLS + ty] = ReturnOut();
    //c[tx * MATRIX_COLS + ty] = 1.1;
}

__device__ float ReturnOut() {
    float out = 0.0;
    for (int ii = 0; ii < num_loops; ++ii) {
        out += 1.5;
    }
    return out;
}
The kernel was launched from PyCuda with the following launch configuration:
grid_size = (50, 50)
block_size = (20, 20, 1)
So I’ve found that when I execute this thing as-is it takes ~28ms. When I comment out the second-to-last line and uncomment the last line (c[tx * MATRIX_COLS + ty] = 1.1;) it takes ~3ms. It seems to me that in this case the compiler is performing some sort of optimization and not executing the loop at all. Interestingly, if in this scenario (last line uncommented) I declare “out” to be a __shared__ variable, the execution time goes back to ~28ms. Whereas a variable defined without a specifier like __shared__ or __constant__ is by default stored in a register (presumably fast memory access), one defined like __shared__ float out should be stored in shared memory. So it seems that, if the compiler knows that out is a shared variable, it will execute the loop, even though my output array is only being written with a constant (1.1). But if it knows out is a local variable, it won’t bother executing the loop at all, because the end result (c) doesn’t depend on it; in other words the loop is dead code and gets eliminated. Additionally, if I allocate out outside of the kernel and pass it in (through global memory), things are slow.
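Roughly, the __shared__ variant I’m describing looks like this (a reconstructed sketch, with the loop written inline for brevity):

// Reconstructed sketch of the __shared__ variant described above (loop shown
// inline for brevity). The output is still written with a constant, but the
// loop now updates block-visible shared memory, so it is not optimized away.
__global__ void CKernelShared(float *c)
{
    int tx = blockIdx.x * num_threads_per_block + threadIdx.x;
    int ty = blockIdx.y * num_threads_per_block + threadIdx.y;

    __shared__ float out;                 // one copy per block, shared by all its threads
    out = 0.0f;
    for (int ii = 0; ii < num_loops; ++ii) {
        out += 1.5f;                      // every thread in the block updates out
    }

    c[tx * MATRIX_COLS + ty] = 1.1f;      // result of the loop is still unused
}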
One more thing: as expected, varying num_loops changes execution time in a predictable manner, so a smaller loop ends up taking much less time, etc. I bring this up because I’ve also tried scenarios where the loop updates an array of a particular (large) size, and found that varying the number of elements updated inside the loop scales the processing time in the same way. That rules out the possibility that execution is slow at the point of updating c simply because a large array has to be loaded into memory or something like that. To put it another way, a large, preallocated array leads to more execution time the more times its elements are updated; the cost isn’t a constant that depends only on array size.
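Roughly, that array test looked like this (again a sketch from memory; the names and sizes are just illustrative, with work_size controlling how many elements get touched each pass):

// Sketch of the array-in-the-loop test (illustrative names/sizes). buf has a
// fixed, large size; work_size controls how many of its elements are updated
// each pass, and execution time scales with work_size, not with the size of buf.
#define BUF_SIZE 1000

__device__ float ReturnOutArray(int work_size) {
    float buf[BUF_SIZE] = {0.0f};          // large per-thread array (local memory)
    for (int ii = 0; ii < num_loops; ++ii) {
        for (int jj = 0; jj < work_size; ++jj) {
            buf[jj] += 1.5f;               // only the first work_size elements are touched
        }
    }
    return buf[0];                         // use the result so the loop isn't eliminated
}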
So this leads me to the question: is there any way to make this faster, or are loops inherently just really slow in CUDA?
Thanks,
Brian
PyCuda code:
import sys
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from numpy.core import all

kernel_code = """
#define num_loops 1024
#define MATRIX_COLS 1000
#define num_threads_per_block 20

__device__ float ReturnOut();

__global__ void CKernel(float *c)
{
    int tx = blockIdx.x * num_threads_per_block + threadIdx.x;
    int ty = blockIdx.y * num_threads_per_block + threadIdx.y;
    c[tx * MATRIX_COLS + ty] = ReturnOut();
    //c[tx * MATRIX_COLS + ty] = 1.1;
}

__device__ float ReturnOut() {
    float out = 0.0;
    for (int ii = 0; ii < num_loops; ++ii) {
        out += 1.5;
    }
    return out;
}
"""

def to_gpu(numpy_array):
    # Allocate device memory and copy the host array into it.
    gpu_array = drv.mem_alloc(numpy_array.size * numpy_array.dtype.itemsize)
    drv.memcpy_htod(gpu_array, numpy_array)
    return gpu_array

if __name__ == '__main__':
    import sys
    import time
    import math

    if sys.platform == 'win32':
        now = time.clock
    else:
        now = time.time

    f_rows = 1024
    f_cols = 1000

    c = numpy.zeros((f_cols, f_cols)).astype(numpy.float32)
    c_gpu = to_gpu(c)
    #out = numpy.zeros((f_rows)).astype(numpy.float32)
    #out_gpu = to_gpu(out)

    mod = drv.SourceModule(kernel_code)
    func = mod.get_function("CKernel")

    grid_size = (50, 50)
    block_size = (20, 20, 1)

    start = now()
    func(c_gpu, grid=grid_size, block=block_size)
    drv.memcpy_dtoh(c, c_gpu)    # synchronous copy, so this also waits for the kernel
    print 'time for analysis: ', now() - start
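Lastly, a note on timing: the host-side timer above includes the device-to-host copy. If the kernel is decoupled and run from plain C/CUDA (as mentioned above), CUDA events give a kernel-only measurement, roughly like this (a sketch, assuming the kernel above is compiled in the same file and d_c is the device allocation for c):

// Sketch: kernel-only timing with CUDA events, assuming CKernel (above) is in
// the same translation unit and d_c points to the 1000x1000 float output.
#include <cstdio>
#include <cuda_runtime.h>

void TimeCKernel(float *d_c)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 grid(50, 50);
    dim3 block(20, 20, 1);

    cudaEventRecord(start, 0);
    CKernel<<<grid, block>>>(d_c);
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);               // wait for the kernel to finish
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}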