Hello all!
I was wondering if you could help me get a performance boost…
I’m using a kernel to filter some elements of an array and write the filtered results to an output array, which is in turn used as the input buffer for a second filtering kernel…
Here is a code sample:
__global__ void first_filter(str *in, str *out, int in_size) {
    // one thread per input element
    int id = threadIdx.x + (blockDim.x * blockIdx.x);
    if (id < in_size) {
        if (in[id].a < 1000 && in[id].b > 2 && in[id].c % 2 == 0) {
            // element passes the filter: copy it and mark it as valid
            out[id].a = in[id].a;
            out[id].b = in[id].b;
            out[id].c = in[id].c;
            out[id].valid = 1;
        }
        else {
            // control flag, needed because the output array is sparse
            out[id].valid = 0;
        }
    }
}
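This is how the kernels are being called:
// round up so that a partial last block is still launched when BUFFER_SIZE
// is not a multiple of MAX_BLOCK_SIZE; the id < in_size check guards the extra threads
grid_size = (BUFFER_SIZE + MAX_BLOCK_SIZE - 1) / MAX_BLOCK_SIZE;
first_filter<<<grid_size, MAX_BLOCK_SIZE>>>(filter_1_in, filter_1_out, BUFFER_SIZE);
second_filter<<<grid_size, MAX_BLOCK_SIZE>>>(filter_1_out, filter_2_out, BUFFER_SIZE);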
As you can see, the input array of the second kernel is the output array of the first kernel.
I already removed the atomics, which is why I’m using a validity flag in each element, so that the next kernel knows which elements to process.
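To make the flag handling concrete, here is a minimal, self-contained sketch of the two-stage flow. The layout of str (four int fields) and the predicate used in second_filter are assumptions for illustration only, since neither is shown in this post; surviving_after_both is a hypothetical helper that just builds some test data and counts what is left.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed layout: the real definition of str is not shown in the post.
struct str { int a; int b; int c; int valid; };

__global__ void first_filter(const str *in, str *out, int in_size) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < in_size) {
        bool keep = in[id].a < 1000 && in[id].b > 2 && in[id].c % 2 == 0;
        if (keep) out[id] = in[id];
        out[id].valid = keep ? 1 : 0;  // flag marks the sparse output
    }
}

__global__ void second_filter(const str *in, str *out, int in_size) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < in_size) {
        // Elements rejected by the first stage are skipped via the flag.
        // "a > 25" is a hypothetical predicate, not the real second filter.
        bool keep = in[id].valid && in[id].a > 25;
        if (keep) out[id] = in[id];
        out[id].valid = keep ? 1 : 0;
    }
}

// Hypothetical helper: runs both stages on generated data, returns survivors.
int surviving_after_both(int n) {
    std::vector<str> h(n);
    for (int i = 0; i < n; ++i) h[i] = str{ i * 5, i, i, 0 };

    str *d_in, *d_mid, *d_out;
    size_t bytes = n * sizeof(str);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_mid, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 256;
    int grid = (n + block - 1) / block;  // round up to cover all elements
    first_filter<<<grid, block>>>(d_in, d_mid, n);
    second_filter<<<grid, block>>>(d_mid, d_out, n);  // same stream, runs after stage one

    cudaMemcpy(h.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_mid);
    cudaFree(d_out);

    int count = 0;
    for (int i = 0; i < n; ++i) count += h[i].valid;
    return count;
}

int main() {
    // With n = 8, elements 4 and 6 pass the first stage; only 6 has a > 25.
    printf("survivors: %d\n", surviving_after_both(8));
    return 0;
}
```

Note that with this scheme both kernels still launch one thread per slot of the full buffer, so the second stage reads every element's flag even when most elements were already rejected.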
In this approach, what I try to do is launch a number of threads equal to the size of the input array, so that each thread processes a single array element.
I should also say that the size of the input array doesn’t matter much… I mean, I don’t have a preference, but the bigger the better!
I thought I was already getting good performance, but when I compared this to a standard sequential implementation I was disappointed…
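In case the measurement itself matters for the comparison, here is a minimal sketch of timing only the kernel execution with CUDA events, assuming host-to-device copies and one-time initialization should be excluded. stand_in_kernel and kernel_time_ms are hypothetical names used for illustration; the real filters would be launched in place of the stand-in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for first_filter / second_filter.
__global__ void stand_in_kernel(int *data, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) data[id] = id % 2;
}

// Returns the elapsed GPU time in milliseconds for one kernel launch.
float kernel_time_ms(int n) {
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));

    const int block = 256;
    int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not included in the timing.
    stand_in_kernel<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    stand_in_kernel<<<grid, block>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}

int main() {
    printf("kernel time: %f ms\n", kernel_time_ms(1 << 20));
    return 0;
}
```

Timing the copies separately from the kernels would show whether the gap to the sequential version comes from the filtering itself or from moving the buffers to and from the device.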
Any hints?