That is something I was about to ask: If you have an upper limit for str_len so that the inner loop and the update loop over z can be fully unrolled, the compiler should be able to place matrix_buffer in registers (which should really help with memory bandwidth) and also eliminate a lot of the bookkeeping. Something like
#pragma unroll
for (j = 0; j < 140; j++) {
    if (j >= *str_len)
        break;
Compiling with --ptxas-options=-v should show you whether matrix_buffer ends up in registers or in local memory.
The other option is to use the byte array variant of your code and place matrix_buffer in shared memory. You’ll then have to bite the bullet and make the blocksize a less-than-optimal 50, but not wasting local memory bandwidth should more than outweigh this:
__shared__ char matrix_buffer[161][50][2];
I’ve reordered the indices to reduce all shared memory bank conflicts to 2-way. A bit of threadIdx rearrangement would allow removing the bank conflicts entirely, but given the low occupancy this would probably not make any difference.
Interesting…with shared memory it takes twice as long :) …Elapsed=[6.503296msec] …now I’m really confused…maybe because I lowered the thread count per block from 500 to 50?
Maybe I’m just asking for too much…I mean…9.8 billion characters per second is a great achievement…since this is my first experience with CUDA, I don’t have any reference points for comparison…
The test data consists of 1 000 000 messages, each 144 bytes in size…each of those messages is compared with another test message, which is also 140 bytes in length…given that for each comparison a 140x140 loop has to be executed, do you think that 2 seconds for 1 million records is something I should be happy with, or is there room for improvement?
Yes, lowering the blocksize to as low as 50 should take quite a hit. I’m surprised though that we don’t seem to see any benefits from the reduced local memory bandwidth consumption. Is the matrix_buffer really in shared memory now?
It seems that it would be quite possible to work on a whole message per block, spreading the inner loop amongst the threads. That would allow a much larger blocksize and give better occupancy while still keeping matrix_buffer in shared memory.
I’m a bit surprised: Is it the overwhelmingly common case that messages have the same length? If yes, that data implies you are using about 19.6 GByte/s of local memory bandwidth, which is close to the specified range of 24-32 GB/sec. If not, the kernel would use twice the local mem bandwidth, which obviously isn’t possible. Note that the compiler seems to have automatically moved the memory accesses for m1 and m2 under the (ch==ch2) conditional.
Whether you are happy with this is your call. I think there should still be quite a bit of improvement possible. I’d also guess that the CPU version is still a lot faster, because it has the benefit of a large cache while working on only one message at a time?
It was the byte array one…The problem with this algorithm is that the inner loop cannot be spread among different threads, because each value in matrix_buffer is calculated based on the previous one…so I’m guessing I would have to use synchronisation, and that would probably slow everything down even more…maybe I’m wrong…haven’t tried it yet…
No…messages are not the same size…this is just …hmm…well…worst case scenario…
About the numbers…
On my laptop the memory bandwidth is 20 GB/s and it takes 3 s for 100 000 records (330M, 48 CUDA cores). 1 000 000 records in 2 s is the time I got on a different machine, which has much higher memory bandwidth and 448 CUDA cores.
Memory access patterns are everything in CUDA. Consider these two model kernels:
#define msglen 32          /* message length in 32-bit words (128 bytes) */
#define nmessages 1048576  /* total number of messages */

__global__ void k1(unsigned int *in, unsigned int *out)
{
    volatile unsigned int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    int localmin = in[msglen * tidx];
#pragma unroll
    for (int i = 1; i < msglen; i++) {
        localmin = min(localmin, in[msglen * tidx + i]);
    }
    out[tidx] = localmin;
}

__global__ void k2(unsigned int *in, unsigned int *out)
{
    volatile __shared__ unsigned int buffer[msglen];
    for (int midx = blockIdx.x; midx < nmessages; midx += gridDim.x) {
        /* coalesced load of message midx by the whole block */
        buffer[threadIdx.x] = in[midx * msglen + threadIdx.x];
        /* blocks are a single warp (msglen = 32 threads), so the volatile
           shared buffer is coherent here without __syncthreads() */
        if (threadIdx.x == 0) {
            int localmin = buffer[0];
#pragma unroll
            for (int i = 1; i < msglen; i++) {
                localmin = min(localmin, buffer[i]);
            }
            out[midx] = localmin;
        }
    }
}
Both do the same thing, finding the minimum of a “message” of 32 integers. One does roughly what your kernel does: one thread per message with the thread reading the message a word at a time to find the minimum. The other uses the same serial minimum search, the only difference being that the whole message is loaded into a shared memory buffer by a block of threads in a coalesced pattern. In the second case, only one thread actually does any computation, the rest of the block does nothing but the initial load.
Running both kernels on 1048576 messages, each 128 bytes long, like this:
avidday@cuda:~$ nvcc -arch=sm_20 -Xptxas="-v" -o comp comp.cu
ptxas info : Compiling entry function '_Z2k2PjS_' for 'sm_20'
ptxas info : Used 12 registers, 128+0 bytes smem, 48 bytes cmem[0], 12 bytes cmem[16]
ptxas info : Compiling entry function '_Z2k1PjS_' for 'sm_20'
ptxas info : Used 10 registers, 48 bytes cmem[0]
avidday@cuda:~$ ./comp
k1 throughput = 3133.366099 Mb/s
k2 throughput = 13027.027200 Mb/s
A fourfold performance difference for the same serial algorithm, with the only difference being the memory access pattern used to load the message from global memory. On an older card, without the luxury of L1 and L2 caches and relaxed coalescing rules, the difference will be even larger. I believe there are still wins to be had with your code.
…this is the actual algorithm…the only difference is that in my case I am not comparing two strings, but one string (str) against a list of strings (data_lst)…that’s why I think CUDA is an excellent choice, since it can do a lot of those comparisons in parallel…
also…one more difference from the original algorithm…the actual LD algorithm keeps the entire matrix in memory…in my case, I only have two rows (the current one and the one before - matrix_buffer[2][161])
From Wikipedia, under “Possible improvements”:
We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.