Hi,
I’m trying to implement the fitness function for a genetic programming system to run on the GPU because it’s the most computationally expensive part of the system and I’ll need to calculate about 1000 at a time and each one is independent. So, I recently got it working and have tried it out. For one test I tried, it takes 17 seconds for just one fitness function evaluation on the GPU, while when I run it in emulation mode, it takes about 0.04 seconds… so I’d be needing to run about 400 fitness evaluations at the same time to even break even with the GPU, meaning I’d need another video card. Plus, if I ran the program on the CPU, I could speed it up based on inefficiencies I have to take into account on the GPU (such as no recursion or function pointers).
Am I perhaps using memory inefficiently on the card? Here’s what I’m doing:
Most of the calculations involve data from a really big table of floating point numbers, which I pass in from MATLAB and move to the card like this:
[codebox] CUDA_SAFE_CALL(cudaMalloc((void**)&data_gpu, sizeof(double)numSymbolsnumDates*numIndicators));
checkCudaError("h1");
double *data_double = (double*)mxMalloc(sizeof(double)*numSymbols*numDates*numIndi
cators);
for (i = 0; i < numSymbols*numDates*numIndicators; i++) data_double[i] = NA; //initialize the data
for (i = 0; i < numRows; i++) {
symbol = data[i];
date = data[i+numRows];
for (j = 0; j < numIndicators; j++) {
data_double[symbol*(numDates*numIndicators) + date*(numIndicators) + j] = (double)data[i+numRows+numRows*(j+1)];
}
}
CUDA_SAFE_CALL(cudaMemcpy(data_gpu, data_double, sizeof(double)*numSymbols*numDates*numIndicators, cudaMemcpyHostToDevice));
mxFree(data_double);
checkCudaError("h2");[/codebox]
And then, in the kernel, I have a loop that does all the calculations using this table, so the bottleneck is most likely due to memory accesses. Is there a better way to allocate the memory? Would using textures or shared memory or something else speed things up?
-Ryan