Memory problem? ...incredible slowdown

Hi …i have a very strange problem…well since I’m new to CUDA maybe it’s not all that strange and i’m sure there’s a very simple answer to my problem…
I have a case where i execute a grid of 5x40 x 500 threads…and each thread fetches an array of ints from global memory starting at its own position(based on thread_id*SOME_LENGTH, running in a loop reading next int in every iteration).
The problem I’m having is that the kernel runtime is around 0.5msec if i don’t write anything back as a result…if i do, runtime jumps to around 3000msec…

the problematic line(tmp_min is a local variable that gets calculated in each thread separately):

  1. result[thread_id] = tmp_min;

…without it, runtime = 0.5msec
…with it, runtime = 2850msec

Can somebody help?..i have tried everything, even wrote 3 completely different versions of algorithm but i keep ending up with the same result…
Thank you

I am testing this on 330M on my laptop, these are the results of memorybandidthtest

Running on…

Device 0: GeForce GT 330M
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2465.1

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2401.2

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 20501.2

[bandwidthTest] - Test results:

If you don’t write back anything, the compiler optimizes away the whole kernel, resulting in excellent runtimes. The longer time is the real time your kernel needs to get it’s job done.

Probably compiler optimization. The compiler is smart enough to realize that if you don’t write tmp_min to global or shared memory, then all the code used to produce it is redundant and can be removed. If that code makes up the bulk of the kernel execution time, your kernel will run much faster without it.

You should be able to confirm this by compiling the “slow” and “fast” case to PTX and examining the output. Even if you don’t understand much of the assembler output, there should be a pretty obvious different in instruction count and the absence of things like looping branches.

Also…the slower version of kernel completely freezes my laptop…is it possible that i’m facing some driver issues?

No. The freeze happens because you kernel takes 3 seconds to run, and during those 3 seconds the GPU is unavailable to the display manager to do any display refreshing, so the display looks frozen until the kernel finishes. Be aware that there is a watchdog timer that will kick in after about 5 seconds and kill you kernel if it is taking too long. You will get a launch timeoout error returned by the runtime, and if you are using windows, an system error will pop up telling you the GPU driver was reset.

If the responsiveness of the system is an issue during runs or you start hitting the watchdog timer limit, consider designing “re-entrant” code or structuring the algorithm in such a way so you can use multiple kernel launches to cover the input data set. Kernel launch overhead isn’t large, so running the same kernel many times to process a give set of inputs isn’t much less efficient that one gigantic kernel call, and it will let the display manager steal time between kernels at shorter intervals, reducing the screen freeze effect as well.

device void min2(int a, int b, int c, int *res){
*res = a;
if(b < *res) { *res = b; }
if(c < *res) { *res = c; }

this is the function which gets called in a loop…and this is the line that calls it.
1. if(ch == ch2) tmp_min = m3; else min2(m1+1, m2+1, m3+1, &tmp_min);

…and this is the last line which gets called outside of loop, when everything is done…
2. result[thread_id] = tmp_min;

…on gtx470 with grid 50x40 x 500threads i am getting 2s for kernel execution time…
if i comment out either of those two lines(line 1. or line 2.)…the execution time drops to a few msec…any ideas on how to resolve this…i was hoping to accomplish runtime of around 5msec …with those lines included of course :)…??

As was pointed out to you (twice) the execution speed difference is due to compiler optimization. “Resolving this”, by which I presume you mean making the code go faster, requires analysis of the code and the algorithm which it implements. That needs far more detail than you have provided.

This is the function …

__global__ void LD2(int *data_lst, int *str, int *str_len, int *result){

        char ch;

        char ch2;

        int i;

        int j;

        int z;

        int thread_id = (blockIdx.y * 5*500) + (blockIdx.x*500) + threadIdx.x;

        int msg_length = data_lst[thread_id*41] >> 24  & 0xff;

        int lst_msg_id = thread_id*41;

        int matrix_buffer[2][41]; // 2 rows, 161 columns

        int tmp_min;

        int m1;

        int m2;

        int m3;

        int y_byte_pos;

        int x_byte_pos;

        int x_byte_counter;

        int y_byte_counter;

        int x_packet_len = (int)ceil((float)*str_len/4);

        int x_msg;

        int y_msg;

        // init buffer matrix

        for(i = 0; i<=x_packet_len; i++) matrix_buffer[0][i] = 0x00000000 + ((i*4) << 24) + ((i*4+1) << 16) + ((i*4+2) << 8) + (i*4+3);

        matrix_buffer[1][0] = 0x010000000;

// y axis

        i = 0;

        j = 0;

        y_byte_pos = 0;

        x_byte_pos = 0;

        x_byte_counter = 0;

        y_byte_counter = 0;

        //printf("%s\n", "M1 = UP, M2 = LEFT, M3 = UP LEFT");

        while(y_byte_counter < msg_length){

                //get packed int if positioned on first packed byte

                if(y_byte_pos == 0) y_msg = data_lst[lst_msg_id + i + 1];

                // get packed byte

                ch = ((y_msg >> (y_byte_pos)*8) & 0x000000ff);

                // x axis

while(x_byte_counter < *str_len){

                        //get packed int if positioned on first packed byte

                        if(x_byte_pos == 0) x_msg = str[j];

                        // get packed byte

                        ch2 = (x_msg >> (x_byte_pos)*8) & 0x000000ff;

                        m1 = ((matrix_buffer[0][(x_byte_pos < 3 ? j : j + 1)] >> ((4-x_byte_pos-2)*8)) & 0x000000ff); //U

                        m2 = ((matrix_buffer[1][(x_byte_pos < 3 ? j : j )] >> ((4-x_byte_pos-1)*8)) & 0x000000ff); //L

                        m3 = ((matrix_buffer[0][(x_byte_pos < 4 ? j : j + 1)] >> ((4-x_byte_pos-1)*8)) & 0x000000ff); //UL

if(ch == ch2) tmp_min = m3; else min2(m1+1, m2+1, m3+1, &tmp_min);

                        // set current matrix buffer position value to tmp_min

                        matrix_buffer[1][(x_byte_pos < 3 ? j : j + 1)] &= ~((0x000000ff << ((x_byte_pos < 3 ? 2-x_byte_pos : x_byte_pos))*8));

                        matrix_buffer[1][(x_byte_pos < 3 ? j : j + 1)] |=     ((tmp_min << ((x_byte_pos < 3 ? 2-x_byte_pos : x_byte_pos))*8));

// next packed byte                     


                        // next packed int

                        if(x_byte_pos == 4){ j++; x_byte_pos = 0; }

                        //x  byte counter



                // update matrix buffer

                // move row 1 to row 0, update row 1

                for(z = 0; z<=x_packet_len; z++) matrix_buffer[0][z] = matrix_buffer[1][z];

                matrix_buffer[1][0] = ((y_byte_counter+2)<<24); //i + 1;

// reset x axis counters

                j = 0;

                x_byte_pos = 0;

                x_byte_counter = 0;

// next packed byte


                // next packed int

                if(y_byte_pos == 4){ i++; y_byte_pos = 0; }

                // y byte counter



  result[thread_id] = tmp_min;


There looks to be a lot of low hanging fruit in that code, to say the least. In no particular order:


The CUDA execution model works like a 32 wide SIMD machine, the warp size is 32. It is wasteful to use blocks that are not sized round multiples of 32. You might also look up the built in gridDim variable in the documentation.

You are doing whatever it is this code does using a local memory array, that will be very slow.

You also waste an unknown number of cycles and local memory accesses to initialize that local memory to the same starting value in every thread. Might it be more efficient to compute that once, load it into constant memory and then copy it?

You might also want to think about whether there might be more parallelism to be extracted from the code if you have a block process a “packet” or whatever is the input to the kernel, rather than using one input per thread.

Your min2 function will probably be faster if you construct it using the built in cuda min function, which compiles to a hardware instruction without branching. And don’t use pointers in that case, because it isn’t necessary and will probably be more difficult for the compiler to inline that if you don’t use them.

That should keep you busy for a while.

Thanks…well I’ll try to somehow optimize the algorithm a bit more…

  1. I have tried with 32x32x128 for example…will keep that in mind, thnx.
  2. The problem is that I have one big input and each thread reads a portion of that input and does what it does on that portion…the input for this test is 160Mb…so using shared memory is not possible…or is it?..t
  3. I have tried that…even tried commenting out that part just for benchmark purposes and haven’t achieved any improvements
  4. I am working on that but the whole thing is already completely parallel …anyway…i am contemplating on that :)
  5. Yes i have already tried that, and yes, I got some improvements but all I got was a couple of msec…still, thanks for the hint

Thank you for all your help.

part of profiler output…based on 32x32 x 128 threads configuration
occupancy=[ 0.750 ] gld_incoherent=[ 0 ] gst_incoherent=[ 0 ] cta_launched=[ 504 ] warp_serialize=[ 1021 ]

Your code currently uses an 328 byte array per thread to work on. This is stored in local memory, which is very slow. If you can move that to shared memory, there is probably an order of magnitude speed up to be had.

Your definition of parallelism must be a bit different to mine. As far as I can tell, processing of each packet or message or whatever the input represents is completely serial. And very slow as a result. You haven’t given any indication of typical value of msg_length might be, but if it is larger than 32 words, then there I guess that are far more efficient ways to do this.

By parallelism I mean that each thread is doing the same thing only on a different part of main input(160Mb), the 328 byte array is a local buffer for each thread…it would be better if i could somehow have it in faster memory…each thread needs to have it’s own.
Typical msg_length is 160 bytes, so what i have is one big input and many many threads working in parallel, doing the same calculations on a different portion of local memory input buffer. I’m using packed integers(packing 4 bytes per int), that’s why you see all those bitwise operators. How would you solve the problem of having to process one large input by having a number of threads process different portions simultaneously?

If each word in a given msg can be unpacked into bytes independently, then you could have each thread in a block or (or individual warps with a larger block) operate on its own word in shared memory to do the unpacking, then use a parallel reduction to calculate the minimum of value of the buffer to store in the output. This also has the advantage that the memory loads could be coalesced, which also should give some speed up.

All the shifts, masking off of bytes and bookkeeping on byte positions within words look like your code could achieve whatever it does much easier and faster if it operated on an array of bytes instead.

Maybe that would even give us a chance to understand what it is doing.

Don’t you think that he would gone the easier way and create array of bytes instead of shifting and masking if that would really be faster?

It is obvious that he knows what he is doing (levenshtein distance algorithm a bit different in terms of memory usage – packing), most of the proposed optimizations will not give him much (for example switching min function with integrated ones gives 14ms increase on 100k records), so I think the only remaining question remaining is if he can achieve 5-10ms number on levenshtein algorithm or not? So if anyone has any propositions let him know, I certainly don’t.

I don’t believe it is faster. The first post shows the GPU has effective global memory bandwidth of 20.5 Gb/s, and if the current code takes 2 seconds to process 160Mb of inputs, that implies it is hitting 80 Mb/s or 0.39% of global memory bandwidth. I would suggest that all of the integer unpacking and book keeping code is counter productive in this case, particularly on a pre-Fermi GPU which has limited integer arithmetic peformance.

hmm…why am getting only 80Mbps then?..any ideas?

  1. packed byte version: Elapsed=[2857.410400msec]

  2. normal char array version: Elapsed=[3203.520752msec]

method using byte array

_global__ void LD(char *data_lst, char *str, int *str_len, int *result){

        char ch;

        char ch2;

        int i;

        int j;

        int z;

        int tmp_min;

        int matrix_buffer[2][161];

        int thread_id = (blockIdx.y * 5*500) + (blockIdx.x*500) + threadIdx.x;

        int lst_msg_id = thread_id*161;

        int msg_length = data_lst[lst_msg_id] & 0xff;

        int m1;

        int m2;

        int m3;

// init matrix buffer

        for(i = 0; i<=*str_len; i++) matrix_buffer[0][i] = i;

        matrix_buffer[1][0] = 1;

// y axis 

        for(i = 0; i < msg_length; i++){

                ch = data_lst[lst_msg_id + i + 1];

                // x axis

                for(j = 0; j <*str_len; j++){

                        ch2 = str[j];

m1 = matrix_buffer[0][j+1]; //U

                        m2 = matrix_buffer[1][j]; //L

                        m3 = matrix_buffer[0][j]; //UL

                        // calculate min        

                        if(ch == ch2) tmp_min = m3; else tmp_min = min(m1+1, m2+1, m3+1);

// set current matrix buffer value

                        matrix_buffer[1][j+1] = tmp_min;


                // update atrix buffer

                for(z = 0; z<161; z++) matrix_buffer[0][z] = matrix_buffer[1][z];

                matrix_buffer[1][0] = i+2;


        result[thread_id] = tmp_min;


…I’ve also noticed that if I unroll the inner loop

#pragma unroll 140

for(j = 0; j <*str_len; j++){

…i get from 3203.520752msec down to 2496.515625msec …

Dunno, I’m not really an expert on cuda, just starting, but couldn’t it be that it just takes more time for each thread to finish. So in that case there isn’t that much memory bandwith required? Maybe I am wrong…also this code takes around 250ms on gtx470…which is still high I guess…

Any thoughts?

Yes, the cache on 2.x devices is supposed to help a lot. The challenge is to make it work well on 1.x devices.