Memory problem? ...incredible slowdown

That is something I was about to ask: if you have an upper limit for str_len, so that the inner loop and the update loop over z can be fully unrolled, the compiler should be able to place matrix_buffer in registers (which should really help with memory bandwidth) and also eliminate a lot of the bookkeeping. Something like

#pragma unroll
for (j = 0; j < 140; j++) {
    if (j >= *str_len)
        break;

Compiling with --ptxas-options=-v should show you whether matrix_buffer ends up in registers or in local memory.
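To make the pattern concrete, here is a toy, self-contained sketch (not your kernel - unroll_sketch, buf and MAX_LEN are made-up names): the trip count is a compile-time constant and the runtime length only feeds the early break, so after full unrolling every index into the per-thread array is a constant. Whether buf then really ends up in registers (it is fairly large) is exactly what the ptxas output will tell you.

#define MAX_LEN 140

__global__ void unroll_sketch(const char *in, const int *str_len, int *out)
{
    int len = *str_len;
    int buf[MAX_LEN];               // candidate for promotion to registers
    int acc = 0;

#pragma unroll
    for (int j = 0; j < MAX_LEN; j++) {
        if (j >= len)
            break;                  // runtime bound, compile-time trip count
        buf[j] = in[j] + j;
    }

#pragma unroll
    for (int j = 0; j < MAX_LEN; j++) {
        if (j >= len)
            break;
        acc += buf[j];
    }

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}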

The other option is to use the byte array variant of your code and place matrix_buffer in shared memory. You’ll then have to bite the bullet and make the blocksize a less-than-optimal 50, but not wasting local memory bandwidth should more than outweigh this:

__shared__ char matrix_buffer[161][50][2];

I’ve reordered the indices to reduce all shared memory bank conflicts to 2-way. A bit of threadIdx rearrangement would allow you to remove the bank conflicts entirely, but given the low occupancy this would probably not make any difference.
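If it helps, here is a rough but self-contained sketch of how the two-row byte-array variant might look with matrix_buffer in that shared-memory layout. It is only a sketch under my own assumptions (blockDim.x == 50, one message per thread, a 161-byte message record with a leading length byte, and unsigned char so distances up to 160 fit), so treat it as a starting point rather than a drop-in replacement:

__global__ void ld_shared_sketch(const char *data_lst, const char *str,
                                 const int *str_len_p, int *result)
{
    // [column][thread][row], reordered as suggested above
    __shared__ unsigned char matrix_buffer[161][50][2];

    int str_len = *str_len_p;
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    const char *msg = data_lst + thread_id * 161;   // 1 length byte + up to 160 chars
    int msg_len = msg[0] & 0xff;

    // row 0: distance from the empty message prefix to str[0..j)
    for (int j = 0; j <= str_len; j++)
        matrix_buffer[j][threadIdx.x][0] = j;

    int cur = 0;
    for (int i = 1; i <= msg_len; i++) {
        cur ^= 1;
        int prev = cur ^ 1;
        matrix_buffer[0][threadIdx.x][cur] = i;     // column 0 of the current row
        char ch = msg[i];
        for (int j = 1; j <= str_len; j++) {
            int del = matrix_buffer[j][threadIdx.x][prev] + 1;
            int ins = matrix_buffer[j - 1][threadIdx.x][cur] + 1;
            int sub = matrix_buffer[j - 1][threadIdx.x][prev] + (ch != str[j - 1]);
            matrix_buffer[j][threadIdx.x][cur] = min(min(del, ins), sub);
        }
    }
    // no __syncthreads() needed: each thread only touches its own column
    result[thread_id] = matrix_buffer[str_len][threadIdx.x][cur];
}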

Interesting… with shared memory it takes twice as long :) … Elapsed=[6.503296msec] … now I’m really confused… maybe because I lowered the thread count per block from 500 to 50?
Maybe I’m just asking for too much… I mean… 9.8 billion characters per second is a great achievement… since this is my first experience with CUDA, I don’t have any reference points for comparison…
The test data consists of 1 000 000 messages, each 144 bytes in size… each of those messages is compared with another test message which is 140 bytes in length… given that for each comparison a 140x140 loop has to be executed, do you think that 2 seconds for 1 million records is something I should be happy with, or is there room for improvement?

Is that the byte array variant or the packed one?

Yes, lowering the blocksize to as low as 50 should cost quite a bit of performance. I’m surprised though that we don’t seem to see any benefit from the reduced local memory bandwidth consumption. Is matrix_buffer really in shared memory now?

It seems that it would be quite possible to work on a whole message per block, spreading the inner loop amongst the threads. That would allow a much larger blocksize and give better occupancy while still keeping matrix_buffer in shared memory.

I’m a bit surprised: is it the overwhelmingly common case that messages have the same length? If yes, that data implies that you are using about 19.6 GB/s of local memory bandwidth, which is close to the specified range of 24-32 GB/s. If not, the kernel would use twice the local memory bandwidth, which obviously isn’t possible. Note that the compiler seems to have automatically moved the memory accesses for m1 and m2 under the (ch==ch2) conditional.
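For reference, the back-of-the-envelope arithmetic behind that estimate (assuming the byte-array variant and a full 140x140 comparison for every message):

1 000 000 messages x 140 x 140 = 1.96e10 cell updates in 2 s, i.e. about 9.8e9 cells/s
9.8e9 cells/s x (1 byte read + 1 byte written of matrix_buffer per cell) = about 19.6 GB/s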

Whether you are happy with this is your call. I think there should still be quite a bit of improvement possible. I’d also guess that the CPU version is still a lot faster, because it has the benefit of a large cache while working on only one message at a time?

It was the byte array one… The problem with this algorithm is that the inner loop cannot be spread among different threads, because each value in matrix_buffer is calculated based on the previous one… so I’m guessing I would have to use synchronisation, and that would probably slow everything down even more… maybe I’m wrong… haven’t tried it yet…

No… messages are not the same size… this is just… hmm… well… the worst case scenario…
About the numbers…
On my laptop the memory bandwidth is 20 GB/s and it takes 3 s for 100 000 records (330M, 48 CUDA cores). 1 000 000 records in 2 s is the time I got on a different machine which has a much higher memory bandwidth and 448 CUDA cores.

LAPTOP (330M, 48 cores, 20 GB/s)
16 MB input data, 100 000 messages - 3 s

SERVER (GTX 470, 448 cores)
160 MB input data, 1 000 000 messages - 2 s

Oh, I didn’t know some of your numbers were for a GTX 470. On the GTX 470, you can of course set the block size to 150 in the shared memory version.

How fast is this on the CPU?

The CPU version is much slower: 10 s for 100 000 records…

Memory access patterns are everything in CUDA. Consider these two model kernels:

// msglen (words per message, 32 here for 128-byte messages) and nmessages
// are assumed to be compile-time constants, e.g. defined via #define
__global__ void k1(unsigned int *in, unsigned int *out)
{
	// one thread per message, striding through the message a word at a time
	volatile unsigned int tidx = threadIdx.x + blockIdx.x * blockDim.x;
	int localmin = in[msglen * tidx];
	#pragma unroll
	for(int i = 1; i < msglen; i++) {
		localmin = min(localmin, in[msglen * tidx + i]);
	}
	out[tidx] = localmin;
}

__global__ void k2(unsigned int *in, unsigned int *out)
{
	volatile __shared__ unsigned int buffer[msglen];
	for(int midx = blockIdx.x; midx < nmessages; midx += gridDim.x) {
		// the whole block loads message midx into shared memory in a coalesced pattern
		buffer[threadIdx.x] = in[midx * msglen + threadIdx.x];
		// blockDim.x == 32, i.e. a single warp, so the volatile buffer is
		// visible to thread 0 without an explicit __syncthreads()
		if (threadIdx.x == 0) {
			int localmin = buffer[0];
			#pragma unroll
			for(int i = 1; i < msglen; i++) {
				localmin = min(localmin, buffer[i]);
			}
			out[midx] = localmin;
		}
	}
}

Both do the same thing, finding the minimum of a “message” of 32 integers. One does roughly what your kernel does: one thread per message with the thread reading the message a word at a time to find the minimum. The other uses the same serial minimum search, the only difference being that the whole message is loaded into a shared memory buffer by a block of threads in a coalesced pattern. In the second case, only one thread actually does any computation, the rest of the block does nothing but the initial load.

Running both kernels on 1048576 messages, each 128 bytes long like this:

double k1start = systimer();
k1 <<< nmessages/128, 128 >>> (_mbuffer, _output1);
gpuAssert( cudaGetLastError() );
gpuAssert( cudaThreadSynchronize() );
double k1end = systimer();
gpuAssert( cudaMemcpy(output1, _output1, outsize, cudaMemcpyDeviceToHost) );
for (unsigned int i = 0; i < nmessages; i++)
	assert( output1[i] == ans );
fprintf(stdout,"k1 throughput = %f Mb/s\n", 1e-6 * float(datasize) / (k1end-k1start));

double k2start = systimer();
k2 <<< 112, 32 >>> (_mbuffer, _output2);
gpuAssert( cudaGetLastError() );
gpuAssert( cudaThreadSynchronize() );
double k2end = systimer();
gpuAssert( cudaMemcpy(output2, _output2, outsize, cudaMemcpyDeviceToHost) );
for (unsigned int i = 0; i < nmessages; i++)
	assert( output2[i] == ans );
fprintf(stdout,"k2 throughput = %f Mb/s\n", 1e-6 * float(datasize) / (k2end-k2start));

yields this on a GTX 470:

avidday@cuda:~$ nvcc -arch=sm_20 -Xptxas="-v" -o comp comp.cu
ptxas info    : Compiling entry function '_Z2k2PjS_' for 'sm_20'
ptxas info    : Used 12 registers, 128+0 bytes smem, 48 bytes cmem[0], 12 bytes cmem[16]
ptxas info    : Compiling entry function '_Z2k1PjS_' for 'sm_20'
ptxas info    : Used 10 registers, 48 bytes cmem[0]
avidday@cuda:~$ ./comp
k1 throughput = 3133.366099 Mb/s
k2 throughput = 13027.027200 Mb/s

A four-fold performance difference for the same serial algorithm, with the only difference being the memory access pattern used to load the message from global memory. On an older card without the luxury of L1 and L2 caches and relaxed coalescing rules, the difference will be even larger. I believe there are still wins to be had with your code.

I’ve quickly gone over the kernel to indicate where I think some gains could be had. I’ve almost certainly made mistakes, so you’d better check it yourself.

// call with a blocksize of (4, 12, 1) on compute capability 1.x
// or (4, 36, 1) on compute capability 2.x devices
__global__ void LD(char *data_lst, char *str, int *str_len_p, int *result){
        char ch;
        char ch2;
        int i;
        int j;
        int tmp_min;
        int str_len = *str_len_p;
        // byte-array matrix buffer shared by the whole block, one column per thread;
        // unsigned char keeps distances up to 160 and the buffer within the
        // 16 KB (CC 1.x) resp. 48 KB (CC 2.x) shared memory limit
#if __CUDA_ARCH__ < 200
        __shared__ unsigned char matrix_buffer[160][48];
#else
        __shared__ unsigned char matrix_buffer[160][144];
#endif
        int thread_id = (((blockIdx.y * gridDim.x) + blockIdx.x) * blockDim.x + threadIdx.x) * blockDim.y + threadIdx.y;
        int tid = threadIdx.x * blockDim.y + threadIdx.y;   // index of this thread within its block
        int lst_msg_id = thread_id * 161;                   // 1 length byte + up to 160 message bytes
        int msg_length = data_lst[lst_msg_id] & 0xff;
        int m1;
        int m2;
        int m3;

        // init matrix buffer (row 0 of the distance matrix)
        for(i = 0; i < str_len; i++) matrix_buffer[i][tid] = i + 1;

        // y axis
        for(i = 0; i < msg_length; i++){
                ch = data_lst[lst_msg_id + i + 1];
                // x axis
                m1 = i;
                tmp_min = i + 1;
#pragma unroll 2
                for(j = 0; j < str_len; j++){
                        ch2 = str[j];
                        m3 = m1;                        // previous row, previous column (diagonal)
                        m1 = matrix_buffer[j][tid];     // previous row, current column
                        m2 = tmp_min;                   // current row, previous column
                        // calculate min
                        if(ch == ch2) tmp_min = m3; else tmp_min = min(min(m1, m2), m3) + 1;
                        // set current matrix buffer value
                        matrix_buffer[j][tid] = tmp_min;
                }
        }
        result[thread_id] = tmp_min;
}
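And a hypothetical host-side launch to go with the blocksize comment at the top (num_messages and the d_* device pointers are placeholders of my own; the grid here is sized for the 48-threads-per-block compute capability 1.x case):

dim3 block(4, 12, 1);                          // 48 threads per block; use (4, 36, 1) on CC 2.x
dim3 grid((num_messages + 47) / 48, 1, 1);     // one message per thread
LD<<<grid, block>>>(d_data_lst, d_str, d_str_len, d_result);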

Levenshtein_distance

…this is the actual algorithm… the only difference is that in my case I am not comparing two strings but one string (str) against a list of strings (data_lst)… that’s why I think CUDA is an excellent choice, since it can do a lot of those comparisons in parallel…
Also, one more difference from the original algorithm… the actual LD algorithm keeps the entire matrix in memory… in my case I only keep two rows (the current one and the one before - matrix_buffer[2][161]); see the sketch after the quote below…

  • from Wikipedia under possible improvements
    • We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
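For comparison, here is a small CPU-side sketch of that two-row formulation (plain C, my own function and variable names, with matrix_buffer[2][161] as in the description above) - it can serve as a reference for validating the kernel results:

// two-row Levenshtein distance, O(str_len) space as in the Wikipedia note
int ld_two_rows(const char *msg, int msg_len, const char *str, int str_len)
{
    int matrix_buffer[2][161];     // previous row and current row, up to 160 columns + 1
    int cur = 0;

    for (int j = 0; j <= str_len; j++)
        matrix_buffer[0][j] = j;   // distance from the empty message prefix

    for (int i = 1; i <= msg_len; i++) {
        cur ^= 1;
        int prev = cur ^ 1;
        matrix_buffer[cur][0] = i; // distance to the empty str prefix
        for (int j = 1; j <= str_len; j++) {
            int del = matrix_buffer[prev][j] + 1;
            int ins = matrix_buffer[cur][j - 1] + 1;
            int sub = matrix_buffer[prev][j - 1] + (msg[i - 1] != str[j - 1]);
            int best = del < ins ? del : ins;
            matrix_buffer[cur][j] = best < sub ? best : sub;
        }
    }
    return matrix_buffer[cur][str_len];
}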

P.S.
Thank you very much for all your help!