Help improving performance

Hello all!

I was wondering if you could give me some help getting some performance boost…

I’m using a kernel to filter some elements of an array and put the filtered results into an output array, which will in turn be used as the input buffer for a second filtering kernel…

Here is a code sample:

__global__ void first_filter(str *in, str *out, int in_size) {

	int id = threadIdx.x + (blockDim.x * blockIdx.x);

	if (id < in_size) {

		if (in[id].a < 1000 && in[id].b > 2 && in[id].c % 2 == 0) {

			out[id].a = in[id].a;
			out[id].b = in[id].b;
			out[id].c = in[id].c;
			out[id].valid = 1;

		} else {

			//control flag due to the output array being sparse
			out[id].valid = 0;
		}
	}
}
This is how the kernels are being called:


first_filter<<<grid_size, MAX_BLOCK_SIZE>>>(filter_1_in, filter_1_out, BUFFER_SIZE);

second_filter<<<grid_size, MAX_BLOCK_SIZE>>>(filter_1_out, filter_2_out, BUFFER_SIZE);

As you can see, the Input array to the second kernel is the Output array for the first kernel.

I already removed atomics; that’s why I’m using a validation flag in each element, so that the next kernel knows which elements to process.

In this approach I launch a number of threads equal to the size of the input array, so that each thread computes a single element.

I should also say that the size of the input array doesn’t matter much…I mean, I don’t have a preference, but the bigger the better!

I thought I was getting good performance already, but when comparing this to a standard sequential implementation I was disappointed…

Any hints?

What is ‘str’ ?

I assume that str is a struct consisting of 3 ints, named a, b, c. You could use the built-in type int3 for that, but that is just a naming difference.

I see 3 things that can be improved.

    I am not sure if the compiler will optimize it, so maybe you should try it on your own: each field in[id].letter is referenced twice in your code, so you are accessing global memory twice. You could try loading in[id] into a register first (local variable) and then processing it.

    You will get better coalescing when reading from global memory if you use a separate array for each component of your str, meaning:

    int *strObj_a, int *strObj_b, int *strObj_c, int *strObj_valid, instead of str *strObj

    If you call several filters, or if the first filter reduces the size of your array a lot and the second one computes a lot, maybe you could try compacting it? You can accomplish that simply by computing a prefix sum on your strObj_valid array and then rearranging the nodes.

Also note that you lose a lot of time copying data to and from the device and initializing everything.

If your arrays are small (let’s say below 100,000 elements) you won’t get much improvement.

If I correctly assume that ‘str’ is a 16-byte structure, then make sure that your data is loaded as one 16-byte load and written as one 16-byte store (just cast your ‘in’ and ‘out’ to int4).
You could also try processing several elements in each thread, i.e. something like threadIdx.x + i * gridDim.x, to keep your data accesses coalesced.
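A rough sketch of both suggestions combined, assuming str really is 16 bytes: the int4 cast plus a standard grid-stride loop (stride of blockDim.x * gridDim.x, which keeps consecutive threads on consecutive elements), with the filter predicate taken from the original kernel:

```cuda
// Sketch only: each thread handles several elements, one int4 per access.
__global__ void first_filter_int4(const int4 *in, int4 *out, int in_size) {
    int stride = blockDim.x * gridDim.x;
    for (int id = threadIdx.x + blockDim.x * blockIdx.x;
         id < in_size;
         id += stride) {
        int4 v = in[id];   // one coalesced 16-byte load
        v.w = (v.x < 1000 && v.y > 2 && v.z % 2 == 0) ? 1 : 0;  // valid flag
        out[id] = v;       // one coalesced 16-byte store
    }
}
```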

But in general - if this is not a reduced code sample - you seem not to do a lot of stuff in your kernel, so I’d eliminate this pass completely, and merge it into the second kernel.

Given the relatively small number of instructions, kernel1 is definitely memory bandwidth bound. I also am curious what the size of the str datatype is. If it is 128 bits, then memory reads will be coalesced and there isn’t much memory read optimization to do in general. However, if you are using a GPU before the GTX 200 series, you might be suffering from an old performance bug, where 128-bit loads got about 50% of the bandwidth of 32- and 64-bit loads. If this is the case for you, there are a couple of workaround options.

If for some reason str is between 64 and 128 bits wide, then you are underutilizing the memory bus. The amount of bandwidth loss depends on your GPU. The GTX 200 series has a much smarter memory controller, so the loss is probably no more than 33%. Older GPUs were much stricter about coalescing rules, and you could be losing quite a bit more bandwidth if your datatype size isn’t 32, 64, or 128 bits. The fix for this problem is to use some pointer tricks to issue 32-bit reads and use shared memory as a staging area to reassemble your struct.
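A hedged sketch of that staging trick for a hypothetical 12-byte struct of 3 ints (str3, stage and staged_load are made-up names): the block cooperatively copies its chunk as plain 32-bit words, which coalesces, then each thread picks up its own struct from shared memory:

```cuda
struct str3 { int a, b, c; };  // 12 bytes: an awkward size for coalescing

__global__ void staged_load(const str3 *in, int *out, int in_size) {
    extern __shared__ int stage[];  // launch with blockDim.x * 3 * sizeof(int)

    int block_base = blockIdx.x * blockDim.x;
    int words = min(blockDim.x, in_size - block_base) * 3;

    // Cooperative, coalesced 32-bit loads of this block's chunk.
    const int *src = (const int *)(in + block_base);
    for (int i = threadIdx.x; i < words; i += blockDim.x)
        stage[i] = src[i];
    __syncthreads();

    int id = block_base + threadIdx.x;
    if (id < in_size) {
        // Reassemble the struct from shared memory (cheap compared to
        // an uncoalesced global read).
        str3 v = ((str3 *)stage)[threadIdx.x];
        out[id] = (v.a < 1000 && v.b > 2 && v.c % 2 == 0) ? 1 : 0;
    }
}
```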

Other than that, I also echo the suggestion of the others: Can you explain more what kernel2 does, and why kernel1 can’t be merged into it?

First, thanks in advance!

Sorry, yeah, str is a struct with 3 ints, in this case… It could actually be something else, as I’m just using this as some kind of proof of concept.

I’ll try your suggestions regarding global memory.

About the struct I’m using as the array element, I’m not sure its size really matters or whether I should be taking it into account in this case, as it’s not fixed; I could be using a completely different struct.

I’m not saying I don’t want to merge both kernels; the thing is that I want to use them in some kind of runtime environment, where I’ll be calling these kernels depending on what there is to be done… I also tried merging these kernels, but tbh didn’t notice any boost…

I know that there is a huge bottleneck while copying the memory but, from my benchmark, even if you don’t count the time spent copying memory, the sequential implementation is still faster.

EDIT: About kernel2, it can actually be like the first one, just with a different filtering condition…

There appear to be 4 fields, though (a,b,c,valid). Is it 4 ints?

yeah, it’s as follows:

typedef struct {

	 int a;
	 int b;
	 int c;
	 //Control flag due to scattered input
	 int valid;

} str;
The thing is that this struct can be anything…The one I’m using is just an example…


I’ll try to be clearer about what I’m trying to do.

Imagine I’m processing an input array. I’m trying to maximize this, so I’m passing portions of that array to the first kernel. This kernel computes the elements in its input array and writes them to an output array. This output array will be the input array of the second kernel, which will compute them and write to its output array, and so on (there may be a third kernel)…


I want to perform a query-like operation on a bunch of elements of an input array.

GET ALL FROM input WHEN a<1000 & b>2 & c%2==0

I’ll have a when_kernel, and a get_kernel…

Any hints?

Okay, so after running the profiler on this, I get about 1047 gld uncoalesced and only 2 coalesced, which is…bad :P

I’m trying to understand why this is, so if you could give me some help understanding it I’d be very grateful.

To get coalesced memory accesses you either need to use one of the built-in structs (e.g. int4) or separate your “array of structures” (AoS) into a “structure of arrays” (SoA).

An AoS approach that coalesces properly would be:


__global__ void first_filter(int4 *in, int4 *out, int in_size) {

    int id = threadIdx.x + (blockDim.x * blockIdx.x);

    if (id < in_size) {

        int4 input = in[id];

        if (input.x < 1000 && input.y > 2 && input.z % 2 == 0) {

            input.w = 1; //valid
            out[id] = input;

        } else {

            //control flag due to the output array being sparse
            out[id].w = 0; //invalid
        }
    }
}


Note that we load the entire structure once in the line int4 input = in[id]; before accessing its individual members.

The SoA approach is:


struct soa {

    int *x;
    int *y;
    int *z;
    int *w;
};

__global__ void first_filter(soa in, soa out, int in_size) {

    int id = threadIdx.x + (blockDim.x * blockIdx.x);

    if (id < in_size) {

        int x = in.x[id];
        int y = in.y[id];
        int z = in.z[id];

        if (x < 1000 && y > 2 && z % 2 == 0) {

            out.x[id] = x;
            out.y[id] = y;
            out.z[id] = z;
            out.w[id] = 1; // valid

        } else {

            //control flag due to the output array being sparse
            out.w[id] = 0; // invalid
        }
    }
}


In most cases the SoA approach is better suited for CUDA. In AoS there are only a few combinations that coalesce properly (e.g. float2, int4, etc.) whereas in SoA you can do any number of inputs (e.g. 13 inputs). From what you’ve written it sounds like you should use the SoA approach since the number of input arguments can vary.

Thanks! This works perfectly! No non-coalesced accesses now.

One question though: how flexible can this kind of structure be? I mean, can I have a struct with integers, doubles, strings, …?

Sure, just be careful of unaligned accesses (e.g., put your arrays before any other data types).

I was thinking about using mapped memory, but I’m having a problem while calling:

cudaHostAlloc((void**)&stream_in, sizeof(soa), cudaHostAllocMapped);


Which results in a: “Cuda error: cudaHostAllocMapped: feature is not yet implemented.”.

I’m using the 2.2 SDK and Toolkit, so this should work, right?


Yes, I believe so. If not, it’s probably a bug :)

Sounds like you’re using a non-2.2 driver. Are you using 185.xx?

Yeah, I have the 185.xx, but also a non-supported card :thumbsdown:

Just one more thing: would zero-copy be suited for this situation, where I’m copying about 2 MB or 4 MB of data from host to device? I have coalesced reads from this array and I only read from it exactly once each time… I haven’t had the chance to test it yet on a supported card, but just wanted to know if it would work…

Sounds like a good fit (assuming you have a card which supports host-mapped memory). When you say “read from it exactly once”, do you mean the same data once per kernel call, or that you read these particular values once, then replace them before the next kernel call?

Zero-copy over the PCI-Express bus (so not the MCP79 chipset kind) is ideal for coalesced reads and writes of “disposable” data. That is, data which is single use, then needs to be replaced anyway on the next call. If you reuse the data at all between kernel calls, then zero copy is almost certainly a net loss.
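A minimal sketch of the zero-copy setup being discussed, using the CUDA 2.2-era runtime API (error checking omitted; bytes and the pointer names are illustrative):

```cuda
// Host-side setup for zero-copy (mapped) memory.
float *h_in, *d_in;
size_t bytes = 2 << 20;  // e.g. 2 MB

cudaSetDeviceFlags(cudaDeviceMapHost);             // must run before the
                                                   // context is created
cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_in, h_in, 0); // device alias of h_in

// Fill h_in on the host, then pass d_in straight to the kernel; its
// reads stream over PCI-Express, so no cudaMemcpy is needed.
```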

Well yeah, the first kernel reads from an array in mapped memory and puts the results in a different global array to be used by the other kernels. So only one kernel will read from mapped memory, and only once…