I want to compute the cosine of a large number of input values as fast as possible, using asynchronous memory transfers so that the GPU can compute and transfer data at the same time. This is a starting point; later I would like to operate on several input arrays i1, i2, …, iN to produce several output arrays o1, o2, …, oM. For a given array index K, o1[K], o2[K], … would depend only on i1[K], i2[K], ….
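To make the dependency pattern concrete, here is a minimal CPU-side sketch with two inputs and two outputs. The function name elementwise and both formulas are placeholders made up for illustration, not the operations I actually plan to run.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical case with N = 2 inputs and M = 2 outputs.
// The formulas are placeholders chosen only for illustration.
void elementwise(const std::vector<float>& i1, const std::vector<float>& i2,
                 std::vector<float>& o1, std::vector<float>& o2) {
    assert(i1.size() == i2.size());
    o1.resize(i1.size());
    o2.resize(i1.size());
    for (std::size_t k = 0; k < i1.size(); ++k) {
        // Element k of every output reads only element k of the inputs,
        // so any index range can be processed independently of the rest.
        o1[k] = std::cos(i1[k]) + i2[k];
        o2[k] = i1[k] * i2[k];
    }
}
```

Because no output element depends on a neighbouring index, the arrays can be cut into contiguous chunks and each chunk handled by a different stream.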
At this point I have the following code, which computes the cosine of 31*2^20 floating-point values on the GPU and compares the results with values computed on the CPU:
#include <stdio.h>
#include <stdlib.h>  // rand, RAND_MAX, malloc
#include <cutil.h>
#include <math.h>

// Each thread computes the cosine of one element. Note that size is not used yet.
__global__ void mycos(float *input, float *output, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = cosf(input[i]);
}

// Returns a pseudo-random float in [low, high].
float RandFloat(float low, float high){
    float t = (float)rand() / (float)RAND_MAX;
    return (1.0f - t) * low + t * high;
}

int main(int argc, char *argv[]){
    cudaSetDevice(1);
    dim3 threads, blocks;
    int dataSize = 1024*1024*31;
    int myStreams = 8, errors = 0;
    float *CPU_IN, *GPU_IN, *CPU_OUT, *GPU_OUT, time_memcpy;

    // Page-locked (pinned) host buffers, needed for the copies to be truly asynchronous.
    CUDA_SAFE_CALL( cudaMallocHost((void**)&CPU_IN, sizeof(float)*dataSize) );
    CUDA_SAFE_CALL( cudaMallocHost((void**)&CPU_OUT, sizeof(float)*dataSize) );
    for(int i = 0; i < dataSize; i++){
        CPU_IN[i] = RandFloat(-1.0, 1.0);
    }

    // Device buffers.
    CUDA_SAFE_CALL( cudaMalloc((void**)&GPU_IN, dataSize*sizeof(float)) );
    CUDA_SAFE_CALL( cudaMalloc((void**)&GPU_OUT, dataSize*sizeof(float)) );

    // Start timing.
    cudaEvent_t start_event, stop_event;
    CUDA_SAFE_CALL( cudaEventCreate(&start_event) );
    CUDA_SAFE_CALL( cudaEventCreate(&stop_event) );
    cudaEventRecord(start_event, 0);

    cudaStream_t *streams2 = (cudaStream_t*) malloc(myStreams * sizeof(cudaStream_t));
    for(int i = 0; i < myStreams; i++){
        CUDA_SAFE_CALL( cudaStreamCreate(&(streams2[i])) );
    }

    // Copy one chunk of the input to the device per stream.
    for(int i = 0; i < myStreams; i++){
        cudaMemcpyAsync(GPU_IN + (i * dataSize / myStreams), CPU_IN + (i * dataSize / myStreams), sizeof(float)*dataSize/myStreams, cudaMemcpyHostToDevice, streams2[i]);
    }

    // One kernel launch per stream; each launch requests dataSize/512 blocks of 512 threads.
    for(int i = 0; i < myStreams; i++){
        mycos<<<dataSize/512, 512, 0, streams2[i]>>>(GPU_IN, GPU_OUT, dataSize/myStreams);
    }

    // Copy one chunk of the output back to the host per stream.
    for(int i = 0; i < myStreams; i++){
        cudaMemcpyAsync(CPU_OUT + i * dataSize / myStreams, GPU_OUT + i * dataSize / myStreams, sizeof(float)*dataSize/myStreams, cudaMemcpyDeviceToHost, streams2[i]);
    }

    // Stop timing.
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    CUDA_SAFE_CALL( cudaEventElapsedTime(&time_memcpy, start_event, stop_event) );
    cudaThreadSynchronize();
    printf("Total Time:\t%.2f\n", time_memcpy);

    // Compare the GPU results with the CPU cosine.
    for(int i = 0; i < dataSize; i++){
        if (fabs(cos(CPU_IN[i]) - CPU_OUT[i])>0.001){
            errors++;
        }
    }
    printf("Errors = %d out of a possible %d.\n",errors,dataSize);
    CUT_EXIT(argc, argv);
    return 0;
}
It is loosely based on the simpleStreams sample code. I call cudaSetDevice(1); so that the program runs on an 8500GT instead of the 9600GT, since the latter drives the primary display. That way, if the code crashes, the whole system won’t lock up, as it has a few times already. Both cards have 512MB of memory.
If I set the input data size to 32*2^20, the comparison starts reporting errors all over the place, even though all of that data fits on either card. How can I fix that?
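One thing worth checking here is the grid size the launch asks for. The sketch below only does the arithmetic. The 65535 limit is an assumption about these cards: it is the per-dimension grid limit on compute capability 1.x devices, and the actual value can be read from the maxGridSize field filled in by cudaGetDeviceProperties. The names kAssumedMaxGridDimX, blocksRequested, and fitsInGrid are made up for illustration.

```cpp
#include <cassert>

// ASSUMPTION: per-dimension grid limit of the device (65535 on compute
// capability 1.x hardware). Query cudaDeviceProp::maxGridSize for the real value.
const long long kAssumedMaxGridDimX = 65535;

// Number of blocks the posted launch requests: dataSize / threadsPerBlock.
long long blocksRequested(long long dataSize, long long threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return dataSize / threadsPerBlock;
}

// True if that block count stays within the assumed grid limit.
bool fitsInGrid(long long dataSize, long long threadsPerBlock) {
    return blocksRequested(dataSize, threadsPerBlock) <= kAssumedMaxGridDimX;
}
```

With 512 threads per block, 31*2^20 elements need 63488 blocks and 32*2^20 elements need 65536, one more than the assumed limit. That would be consistent with the failures starting at exactly that size, though it is not confirmed on this hardware.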
Section 4.5.2.4 of the programming guide has a code example showing how to manage streams, but it doesn’t show how to write the kernel so that streams, blocks, and threads all map correctly onto the indices of the input and output arrays. How can I fix the kernel and its launch in the main program so that I don’t get errors (caused by mismatched indices or by data not being processed at all)?
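Here is a CPU-side sketch of one way the mapping could work; it is not a tested CUDA implementation. Each stream gets a contiguous chunk, the kernel receives pointers already offset to that chunk plus the chunk length, and every thread checks its index against that length. Chunk, chunkFor, blocksFor, mycosThread, and simulateStreams are hypothetical names, and the commented launch line shows how the same values might be passed to the real kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: the slice of the arrays handled by one stream.
struct Chunk {
    std::size_t offset;  // index of the first element in the slice
    std::size_t count;   // number of elements in the slice
};

// Split n elements into numStreams contiguous slices; the first
// (n % numStreams) slices get one extra element so nothing is skipped.
Chunk chunkFor(std::size_t n, std::size_t numStreams, std::size_t s) {
    assert(numStreams > 0 && s < numStreams);
    std::size_t base = n / numStreams;
    std::size_t rem = n % numStreams;
    Chunk c;
    c.offset = s * base + std::min(s, rem);
    c.count = base + (s < rem ? 1 : 0);
    return c;
}

// Blocks needed to cover count elements, rounding up.
std::size_t blocksFor(std::size_t count, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (count + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU stand-in for the kernel body run by one thread.
// in/out already point at the start of this stream's slice.
void mycosThread(const float* in, float* out, std::size_t size,
                 std::size_t blockIdxX, std::size_t blockDimX,
                 std::size_t threadIdxX) {
    std::size_t i = blockIdxX * blockDimX + threadIdxX;
    if (i < size) {  // guard: the last block may be only partly used
        out[i] = std::cos(in[i]);
    }
}

// Simulates the per-stream launches serially on the CPU.
void simulateStreams(const std::vector<float>& in, std::vector<float>& out,
                     std::size_t numStreams, std::size_t threadsPerBlock) {
    assert(in.size() == out.size());
    for (std::size_t s = 0; s < numStreams; ++s) {
        Chunk c = chunkFor(in.size(), numStreams, s);
        std::size_t blocks = blocksFor(c.count, threadsPerBlock);
        // Corresponding CUDA launch (sketch, untested):
        // mycos<<<blocks, threadsPerBlock, 0, streams2[s]>>>(
        //     GPU_IN + c.offset, GPU_OUT + c.offset, c.count);
        for (std::size_t b = 0; b < blocks; ++b) {
            for (std::size_t t = 0; t < threadsPerBlock; ++t) {
                mycosThread(in.data() + c.offset, out.data() + c.offset,
                            c.count, b, threadsPerBlock, t);
            }
        }
    }
}
```

In this layout each launch covers only its own chunk, so the grid shrinks to roughly 1/myStreams of the full block count, and the same offset and count could be reused for the two cudaMemcpyAsync calls of that stream.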