tuning a fairly simple kernel / poor performance when returning (writing) data

Hi all,

I’m a newbie to CUDA, so I assume there’s a known (possibly very simple) solution to my problem. I already spent a few hours trying to solve it, but no luck.

The task is very simple - given a large array ‘A’ of integers (say, 800MB of 64-bit integers) and a small array ‘B’ of integers (say, 10k 64-bit integers), find out which of the values in A are also in B. If you’re familiar with SQL, it’s essentially this:

SELECT a FROM ‘A’ WHERE a IN (elements of ‘B’)

What I need to get is a bitmap of matches for the large array. I’d prefer an actual bitmap (i.e. if the array has N elements, the bitmap has N/8 bytes, with one bit for each element of A), but I can live with e.g. an array of N characters if that’s necessary for performance reasons.
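Just to make the expected output precise, this is roughly the plain CPU version of what I’m after (assuming int64 is a typedef for a 64-bit integer and N is a multiple of 8; the function name is just for illustration):

void
in_host_int64(int64 * A, int64 N, int64 * B, int64 nitems, unsigned char * result) {

    memset(result, 0, N / 8);                    /* needs <string.h> */

    for (int64 i = 0; i < N; i++) {
        for (int64 k = 0; k < nitems; k++) {
            if (A[i] == B[k]) {
                /* bit (i % 8) of byte (i / 8) says whether A[i] is in B */
                result[i / 8] |= (1 << (i % 8));
                break;
            }
        }
    }
}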

What I came up with so far is this very simple kernel:

__global__
void
in_device_int64(int64 * A, int64 * B, int64 nitems, unsigned char * result) {

    /* this is the starting position of this block */
    int64 blockstart = (gridDim.x * blockDim.x) * blockIdx.y + (blockIdx.x * blockDim.x);

    /* and this is the item this thread should deal with */
    int64 idx = blockstart + threadIdx.x;

    int64 z = A[idx];
    unsigned char x = 0;                 /* true/false for this particular item */
    for (int k = 0; k < nitems; k++) {   /* loop through B and compare it with the elements */
        if (z == B[k]) {
            x = 1;                       /* yup, so set the bit (otherwise leave it 0) */
        }
    }

    result[idx] = x;

}

The input is a 1D array of values, the array ‘B’ with ‘nitems’ elements, and a preallocated unsigned char array to store 0/1 depending on the check.

I’m running this with 1024 threads per block, each thread processes a single value, and the blocks are 1D (so blockDim.y==1 all the time), so the blockstart and thread indexes should be OK.

This works and returns the correct result, but it’s very, very slow - about twice as slow as running the same task on a regular CPU. I’ve tried it on two different machines, and the behavior is the same. One is a laptop with a Quadro NVS 4200M, the other a desktop with a GTS 450, so both are fairly old/weak GPUs by today’s standards, but I’m not sure that’s the problem here.

After a fair amount of tweaking I’ve narrowed it down to the last statement in the kernel, i.e. storing the output value:

result[idx] = x;

If I remove / comment out this line, it’s fast, but obviously it’s rather pointless without getting the data back.

I’m really wondering what’s wrong here - I’m aware of the importance of coalescing / alignment etc., and IMHO this access is perfectly coalesced (it’s basically using a sequence of consecutive thread IDs to read/write the memory).

I tried using shared memory to collect the data from the whole block (and even combine the results into an actual bitmap), but the performance remained about the same (I can post this version if needed).

The amount of output data is a few MBs and it’s taking ~20 seconds. The memory bandwidth seems to be about OK (~5GB/s in both directions).

Any ideas what’s wrong here? Is there a well-known pattern for solving this?

I suspect that with that line commented out the optimizer is also removing the entire loop, so the speed-up is not because the write has been removed but because the reading is no longer happening.

I think the killer here is that you are reading the entire B array too many times.
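To put rough numbers on it (assuming A really is ~800 MB of 64-bit integers and B has ~10k entries): that is about 100 million elements of A, each of which scans all 10,000 elements of B, i.e. on the order of 10^12 reads and compares of B. That work completely dwarfs the few MB of output written at the end, which is why removing the store (and letting the compiler throw away the loop with it) looks so dramatically faster.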

A simple suggestion is to have each thread read, say, 8 values from A and on each iteration of the loop test each of those values against the same element of B (read the element into a local variable first). The 8 values can live in a local array; with a short array the compiler will probably assign a register to each of them if you have a separate if test for each one. NB: you may find you need to reduce the number of threads per block so that each thread has more registers. More threads per block doesn’t always make things faster anyway, sometimes 32 is best!

Are either or both arrays sorted? If so, make use of that too.

Thanks for the response. Yeah, I suspected the compiler might have optimized out some of the code too. I tried to prevent that by compiling with -O0, but apparently that did not do the trick. That would explain the much better performance with the write removed.

I’ve modified the kernel to read 8 values from A into a local array (this way I can even build the actual bitmap):

__global__
void
in_device_int64(int64 * data, int64 * set, int64 nitems, unsigned char * result) {

	/* this is the starting position of this block */
	int64 blockstart = ((gridDim.x * blockDim.x) * blockIdx.y + (blockIdx.x * blockDim.x))*8;
	int64 idx  = blockstart + threadIdx.x * 8;

	int64 ldata[8];

	for (int i = 0; i < 8; i++) {
		ldata[i] = data[idx + i];
	}

	unsigned char r = 0;

	for (int k = 0; k < nitems; k++) {
		int64 v = set[k];
		for (int i = 0; i < 8; i++) {
			if (ldata[i] == v)
				r |= (1 << i);
		}
	}
	
	result[blockstart/8 + threadIdx.x] = r;
	
}

And it seems to perform much better! I see a ~5x speedup, and on the ‘larger’ GPU I’m playing with it actually seems to scale with the number of cores.

Kudos to you for the hints so far! Is there something else that might improve the performance? Is it possible to guess how this will perform on “current” GPUs with more cores (say 1024, compared to the 192 cores on the GPU I’ve been fiddling with so far)?

Also, is there some general rule of thumb for whether to preload 8, 16 or more items? The main constraint here is the amount of per-SM memory, right?

As for the sorting - yes, I could sort the smaller array, but I’m not sure how that could be used on the GPU. On the CPU it’s clear - use bsearch - but on the GPU wouldn’t that cause a lot of overhead because of branching? Or not?

The current GPU implementation does an O(n) linear search per element, while the CPU implementation does an O(log(n)) binary search per element. I guess I can live with that - I’ll have to find the sweet spot (it’s somewhere around 512 items in B, but it probably depends on the GPU model) and then use either the CPU or the GPU.

A simple and fast way of doing this on the GPU is to binary search from B into A if A is sorted, or to sort B and binary search from A into B if A is not sorted. Branch divergence doesn’t matter here; the straightforward, naive implementation gets pretty decent performance:

sort(B)

parallel for all a in A, result in Results:
   result = binary-search(a, B)

parallel-gather(Results)
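For concreteness, a minimal (untested) sketch of that kernel - keeping the same 1D indexing as your original kernel, writing one byte per element of A, and assuming B (‘set’) is already sorted - could look like this:

__global__
void
in_device_bsearch_int64(int64 * data, int64 * set, int64 nitems, unsigned char * result) {

    /* same indexing scheme as the original kernel: one element of A per thread */
    int64 blockstart = (gridDim.x * blockDim.x) * blockIdx.y + (blockIdx.x * blockDim.x);
    int64 idx = blockstart + threadIdx.x;

    int64 value = data[idx];

    /* plain binary search over the sorted array 'set' */
    int64 lo = 0;
    int64 hi = nitems - 1;
    unsigned char found = 0;

    while (lo <= hi) {
        int64 mid = lo + (hi - lo) / 2;
        if (set[mid] == value) {
            found = 1;
            break;
        } else if (set[mid] < value) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }

    result[idx] = found;
}

Every thread does at most ~log2(nitems) iterations, so the divergence stays small.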

See this code for a hybrid approach somewhere between linear scan and binary search that is somewhat faster than vanilla binary search: http://nvlabs.github.io/moderngpu/sortedsearch.html

You can also do this directly with thrust library calls if you don’t want to write the kernel yourself. Use thrust::sort and thrust::set_intersection or thrust::lower_bound.
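For instance, something along these lines (just a sketch; thrust::binary_search here is the vectorized overload that writes one true/false per looked-up value, which maps directly onto your membership test):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>

// d_A and d_B are thrust::device_vector<long long> already filled with the data,
// d_match is a thrust::device_vector<bool> with the same size as d_A
thrust::sort(d_B.begin(), d_B.end());
thrust::binary_search(d_B.begin(), d_B.end(),   // sorted haystack (B)
                      d_A.begin(), d_A.end(),   // values to look up (A)
                      d_match.begin());         // true/false per element of A

You would still need a small extra step to pack the resulting bools into your bitmap format.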

If you really just want a relational/SQL join, then see this code: http://nvlabs.github.io/moderngpu/join.html

It should be fairly straightforward to adapt these ideas to work with bitmasks rather than arrays of elements or indices.

Awesome! If this were reddit, I’d send gold to both of you. Thanks for your advice.

Using a very naive binsearch implementation resulted in a significant speedup: while before it was ~3x slower than the CPU, now it’s ~5x faster than the CPU on the lousiest GPU on earth (the NVS 4200 has just 48 cores), and ~20x faster on the pretty lousy GTS 450 in my desktop. I really didn’t expect this to work so well, and my impression is that with more cores the performance will be even better. Nice!

I can’t really use the hybrid approach, though, because I can’t sort the first (large) array. I need to get the bitmap of ‘matching rows’, combine it with some other bitmaps and keep the original indexes, and sorting would make that impossible / very difficult to do.

I’ll probably spend some more time tweaking the implementation, but I’m already quite happy with the current performance and generally confident that buying a current / more expensive GPU won’t be a waste of money.

Yeah, it doesn’t work for every application. Sometimes the arrays end up being persistent and you can keep them sorted. One nice property is that the intermediate results generated by multiple JOINs on sorted inputs are also sorted.

If you absolutely can’t sort the data, but one array is much smaller than the other, then the binary search approach works fine, and you can do even better with a hash join implementation. Basically you build a hash table for B, then do parallel probes from A into B (something like one per thread). This should be even faster than the binary search.

You mean building a linear hash table on the CPU, and then using it from the GPU with open addressing? Good idea, I guess - the number of values in B is known in advance and tends to be reasonably small, so I can make the hash table large enough to be very effective (say, 2x the number of elements).

I’ll keep this up my sleeve as an optimization.

Just to answer this:

I would have suggested 4, except that 8 allows you to build the bitmap. The larger the number you preload, the more registers you use, and at some point performance will drop - so there’s probably little advantage in going beyond 4 other than that 8 gives you the bitmap.

You could try adding a #pragma unroll just before the inner loop; the manual says small loops are automatically unrolled, but I’m not sure whether 8 counts as “small”.

OK, this is my attempt at implementing the hash table approach:

#define HASH_STEP 1

#define HASH_TABLE_SIZE(nitems) \
    (offsetof(hash_table_t, elements) + nitems * 2 * sizeof(hash_element_t))

typedef struct hash_element_t {
    int      used;
    int64    value;
} hash_element_t;

typedef struct hash_table_t {
    int nelements;
    hash_element_t elements[1];
} hash_table_t;

hash_table_t * build_hash_table(int64 * set, int64 nitems) {
    hash_table_t * table = (hash_table_t *)malloc(HASH_TABLE_SIZE(nitems));
    memset(table, 0, HASH_TABLE_SIZE(nitems));
    table->nelements = 2 * nitems;   /* 2x the number of elements, matching the allocation in HASH_TABLE_SIZE */
    
    for (int i = 0; i < nitems; i++) {
        int64 hash = (set[i] % table->nelements);
        
        while (table->elements[hash].used != 0) {
            if (table->elements[hash].value == set[i])
                break;
            hash = (hash + HASH_STEP) % table->nelements;
        }
        
        table->elements[hash].used = 1;
        table->elements[hash].value = set[i];
    }
    return table;
}

__device__
int hash_find(hash_table_t * table, int64 value) {
    int64 hash = (value % table->nelements);

    while (table->elements[hash].used != 0) {
        if (table->elements[hash].value == value)
            return 1;
        hash = (hash + HASH_STEP) % table->nelements;
    }

    return 0;
}

__global__
void
in_device_hash_int64(int64 * data, hash_table_t * htab, unsigned char * result) {

    /* this is the starting position of this block */
    int64 blockstart = ((gridDim.x * blockDim.x) * blockIdx.y + (blockIdx.x * blockDim.x))*8;
    int64 idx  = blockstart + threadIdx.x * 8;

    unsigned char r = 0;

    for (int i = 0; i < 8; i++) {
        /* this is the key we're looking for */
        if (hash_find(htab, data[idx + i]))
            r |= (1 << i);
    }
    
    result[blockstart/8 + threadIdx.x] = r;

}

In short, it builds a linear hash table of hash_element_t elements, and the threads then do linear probing. It works, but I haven’t observed any performance improvement - at best it’s about as fast as the binary search, usually slightly slower.

I’ve tried to tweak it in different ways (lowering the load factor to 25%, using different data structures for the hash table, etc.), but none of that resulted in a significant improvement.

I’ve added the pragma like this:

__global__
void
in_device_int64(int64 * data, int64 * set, int64 nitems, unsigned char * result) {

	/* this is the starting position of this block */
	int64 blockstart = ((gridDim.x * blockDim.x) * blockIdx.y + (blockIdx.x * blockDim.x))*8;
	int64 idx  = blockstart + threadIdx.x * 8;

	int64 ldata[8];

	for (int i = 0; i < 8; i++) {
		ldata[i] = data[idx + i];
	}

	unsigned char r = 0;

	for (int k = 0; k < nitems; k++) {
		int64 v = set[k];
#pragma unroll
		for (int i = 0; i < 8; i++) {
			if (ldata[i] == v)
				r |= (1 << i);
		}
	}
	
	result[blockstart/8 + threadIdx.x] = r;

}

No difference at all, so I guess it was already being unrolled. I even tried unrolling it manually - same result.


I would guess that the hash table is slower because the buckets are larger than the elements themselves, and the while loop for linear probing adds some overhead. If you are happy with the performance of the binary search you might just want to stick with that since it is much simpler.

Getting the hash table code to be faster probably involves shrinking the size of the buckets, and separating the probing code into a fast case where there is no collision and separate cleanup code that handles collisions. It probably isn’t worth the added complexity if B is sufficiently small.
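For example (just a sketch, not tested): drop the ‘used’ flag and store only the 64-bit values, reserving one value that can never occur in B as an “empty slot” marker. That halves each entry to 8 bytes and lets the common no-collision case finish after a single load:

#define EMPTY_SLOT ((int64) -1)   /* any value guaranteed not to appear in B */

/* 'table' is a plain array of 'nelements' int64 values; the host-side build
 * fills the unused slots with EMPTY_SLOT instead of zeroing the whole table */
__device__
int hash_find_compact(int64 * table, int nelements, int64 value) {
    int64 hash = value % nelements;   /* like the original, this assumes non-negative keys */

    /* fast path: the first slot either holds the value or is empty */
    int64 slot = table[hash];
    if (slot == value)
        return 1;
    if (slot == EMPTY_SLOT)
        return 0;

    /* slow path: walk the probe chain only when there actually was a collision */
    do {
        hash = (hash + HASH_STEP) % nelements;
        slot = table[hash];
        if (slot == value)
            return 1;
    } while (slot != EMPTY_SLOT);

    return 0;
}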


What do you mean by ‘bucket size’? I’ve implemented the hash table using open addressing / linear probing, which doesn’t really use buckets.