Hi all,
I’m a newbie to CUDA, so I assume there’s a known (possibly very simple) solution to my problem. I already spent a few hours trying to solve it, but no luck.
The task is very simple - given a large array 'A' of integers (say, 800MB of 64-bit integers) and a small array 'B' of integers (say, 10k of 64-bit integers), find out which of the values in A are also in B. If you're familiar with SQL, it's essentially this:
SELECT a FROM A WHERE a IN (elements of B)
What I need to get is a bitmap of matches for the large array. I'd prefer an actual bitmap (i.e. if the array has N elements, the bitmap has N/8 bytes with one bit per element of A), but I can live with e.g. an array of N characters if that's necessary for performance reasons.
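To illustrate what I mean by an actual bitmap, this is the host-side packing I have in mind (plain C sketch, the function name `pack_flags` is just mine):

```c
#include <stdint.h>
#include <string.h>

/* Pack n 0/1 byte flags into a bitmap of (n + 7) / 8 bytes.
 * Bit i of the bitmap is set iff flags[i] is non-zero. */
static void pack_flags(const unsigned char *flags, int64_t n, unsigned char *bitmap)
{
    memset(bitmap, 0, (size_t)((n + 7) / 8));
    for (int64_t i = 0; i < n; i++) {
        if (flags[i])
            bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
    }
}
```

(Whether this packing happens on the host or in the kernel is exactly part of my question.)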
What I've come up with so far is this very simple kernel:
__global__ void
in_device_int64(int64_t *A, int64_t *B, int64_t nitems, unsigned char *result)
{
    /* starting position of this block (grid may be split over x and y) */
    int64_t blockstart = (gridDim.x * blockDim.x) * blockIdx.y
                       + (blockIdx.x * blockDim.x);

    /* the item this thread deals with (the grid exactly covers A) */
    int64_t idx = blockstart + threadIdx.x;
    int64_t z = A[idx];

    unsigned char x = 0;    /* true/false for this particular item */

    /* loop through B and compare each element with z */
    for (int k = 0; k < nitems; k++) {
        if (z == B[k]) {
            x = 1;          /* match - set the flag (otherwise it stays 0) */
        }
    }

    result[idx] = x;
}
The inputs are the 1D array of values, the array 'B' with 'nitems' elements, and a preallocated `unsigned char` array to store 0/1 for each element of A depending on the check.
I’m running this with 1024 threads / block, each thread processes a single value and the blocks are 1D (so blockDim.y==1 all the time), so the blockstart and thread indexes are OK.
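For completeness, the launch looks roughly like this (a sketch from memory; N is the number of elements in A and is assumed to be a multiple of 1024 here, and 65535 is the per-dimension grid limit on these GPUs):

```cuda
int threads = 1024;
int64_t blocks = N / threads;            /* N is a multiple of 1024 here */
dim3 grid;
if (blocks <= 65535) {                   /* fits in a single grid dimension */
    grid = dim3((unsigned)blocks, 1);
} else {                                 /* split the blocks over x and y */
    grid = dim3(65535, (unsigned)((blocks + 65534) / 65535));
}
in_device_int64<<<grid, threads>>>(d_A, d_B, nitems, d_result);
```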
This works and returns the correct result, but it's very very slow - about twice as slow as running the same task on a regular CPU. I've tried it on two different machines and the behavior is the same. One is a laptop with a Quadro NVS 4200M, the other is a desktop with a GTS 450, so both are fairly old/weak GPUs by today's standards, but I'm not sure that's the problem here.
After a fair amount of tweaking I've narrowed it down to the last statement in the kernel, i.e. storing the output value:
result[idx] = x;
If I remove / comment out this line, it's fast, but obviously that's pointless since I don't get the results back.
I'm really wondering what's wrong here - I'm aware of the importance of coalescing / alignment etc., and IMHO this access is perfectly coalesced (it's basically using a sequence of consecutive thread IDs to read/write the memory).
I tried using shared memory to collect the data from the whole block (and even combine the results into an actual bitmap there), but the performance remained about the same (can post this version, if needed).
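In fact, here's a rough sketch of that shared-memory variant (tile size is my choice; B at 10k x 8 bytes doesn't fit into shared memory at once, so it's staged in chunks):

```cuda
#define TILE 1024                         /* elements of B staged per iteration */

__global__ void
in_device_int64_shared(const int64_t *A, const int64_t *B,
                       int64_t nitems, unsigned char *result)
{
    __shared__ int64_t tile[TILE];

    int64_t blockstart = (int64_t)(gridDim.x * blockDim.x) * blockIdx.y
                       + (int64_t)blockIdx.x * blockDim.x;
    int64_t idx = blockstart + threadIdx.x;
    int64_t z = A[idx];
    unsigned char x = 0;

    for (int64_t base = 0; base < nitems; base += TILE) {
        /* stage the next chunk of B into shared memory cooperatively */
        int64_t len = nitems - base;
        if (len > TILE)
            len = TILE;
        for (int64_t k = threadIdx.x; k < len; k += blockDim.x)
            tile[k] = B[base + k];
        __syncthreads();

        /* compare z against the staged chunk */
        for (int64_t k = 0; k < len; k++)
            if (z == tile[k])
                x = 1;
        __syncthreads();                  /* before the tile gets overwritten */
    }

    result[idx] = x;
}
```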
The output data is only a few MBs, yet the kernel takes ~20 seconds. Memory bandwidth between host and device seems about OK (~5GB/s in both directions).
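In case the methodology itself is the problem, this is roughly how I'm timing the kernel (event-based; variable names are from my test harness):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
in_device_int64<<<grid, threads>>>(d_A, d_B, nitems, d_result);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              /* wait for the kernel to finish */

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  /* milliseconds between the events */
printf("kernel time: %.1f ms\n", ms);
```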
Any ideas what’s wrong here? Any well known pattern how to solve this?