strange problem

I wrote this very simple code to search for a 4-character word in a 768-byte file using 768 threads. Each thread starts searching from a different character in the file and checks 4 characters: for example, thread no. 100 checks characters no. 100, 101, 102 and 103, and if its counter is 4 at the end, the word is found. The code simply loads the file into shared memory (each thread loads one byte) and starts searching. It works well, but if I put the word in bytes no. 509-512 (or 510-513, 511-514 or 508-511), where thread no. 509 should find it, the word is not found!! If its length is < 4, though, thread no. 509 does find the word, because its counter doesn't go over 3! I absolutely don't understand why. Please help me!!

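__shared__ char memr[768];

int counter = 0;
int i;
int tid = blockDim.x*blockIdx.x+threadIdx.x;

// each thread copies one byte of the file into shared memory
memr[tid] = file[tid];
__syncthreads();

// compare the 4 bytes starting at this thread's position with the word
for(i=0; i<4; i++){
    if(memr[tid+i] == word[i]){
        ++counter;
    }
}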

P.S. If I change the instruction “if(memr[tid+i] == word[i])” to “if(file[tid+i] == word[i])”, so that the word is searched for in global memory and not in shared memory, everything works well!

I don’t think shared memory will work for byte-sized types; accesses need to use word-length types. You will probably need to either work with ints or pack the data into char4 vector types.

How do you spawn the threads?

Shared memory is only shared by the threads within a block, so if you launch more than one block, you end up with multiple instances of ‘memr’, each only partially filled, and, naturally, words that cross the block boundaries won’t be found.

Also, you have a buffer overrun: your 766th to 768th threads will try to access memory locations past the end of ‘memr’.
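
A minimal sketch of how both points could be addressed, assuming the kernel is launched as a single block of 768 threads. That needs a device that allows 768 threads per block (compute capability 2.0 or later; 1.x devices are limited to 512). The names search_single_block, find_word_single_block and found_at are invented for illustration and do not come from the original code.

```cuda
#include <cuda_runtime.h>

#define FILE_LEN 768   // file size from the original post
#define WORD_LEN 4     // word length from the original post

// Each thread tests whether the word starts at its own position.
// Assumption: launched as ONE block of FILE_LEN threads, so a single
// instance of memr holds the whole file.
__global__ void search_single_block(const char *file, const char *word, int *found_at)
{
    __shared__ char memr[FILE_LEN];

    int tid = threadIdx.x;          // one block only, so no blockIdx term
    memr[tid] = file[tid];
    __syncthreads();

    // Threads too close to the end cannot hold a full word: skip them
    // instead of reading past the end of memr.
    if (tid + WORD_LEN > FILE_LEN) return;

    int counter = 0;
    for (int i = 0; i < WORD_LEN; i++) {
        if (memr[tid + i] == word[i]) ++counter;
    }
    if (counter == WORD_LEN) atomicMin(found_at, tid);  // keep the first match
}

// Hypothetical host helper: returns the first match position, or -1.
// Error checking is omitted to keep the sketch short.
int find_word_single_block(const char *h_file, const char *h_word)
{
    char *d_file, *d_word;
    int *d_found;
    int h_found = FILE_LEN;         // FILE_LEN is used as "not found"

    cudaMalloc((void **)&d_file, FILE_LEN);
    cudaMalloc((void **)&d_word, WORD_LEN);
    cudaMalloc((void **)&d_found, sizeof(int));
    cudaMemcpy(d_file, h_file, FILE_LEN, cudaMemcpyHostToDevice);
    cudaMemcpy(d_word, h_word, WORD_LEN, cudaMemcpyHostToDevice);
    cudaMemcpy(d_found, &h_found, sizeof(int), cudaMemcpyHostToDevice);

    search_single_block<<<1, FILE_LEN>>>(d_file, d_word, d_found);

    cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_file);
    cudaFree(d_word);
    cudaFree(d_found);
    return h_found == FILE_LEN ? -1 : h_found;
}
```

Calling find_word_single_block with a 768-byte buffer and a 4-byte word should return the position of the first match, including positions around byte 509, because every thread now reads from the same fully loaded copy of memr.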

This works in CUDA 3.2. Section G.4.3 of the programming guide says:

A bank conflict only occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank. If two or more threads access any bytes within the same 32-bit word, there is no bank conflict between these threads.
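
To make the quoted rule concrete, here is a small sketch of how a byte offset in shared memory maps to a 32-bit word and a bank. It assumes the compute capability 2.x layout described in that section (32 banks, successive 32-bit words assigned to successive banks). The helper names are made up for illustration.

```cuda
// Assumption: compute capability 2.x layout, 32 banks of 32-bit words.
#define NUM_BANKS 32
#define BANK_WIDTH_BYTES 4

// Index of the 32-bit word that contains a given byte of shared memory.
__host__ __device__ inline int word_of_byte(int byte_offset)
{
    return byte_offset / BANK_WIDTH_BYTES;
}

// Bank that serves that word.
__host__ __device__ inline int bank_of_byte(int byte_offset)
{
    return word_of_byte(byte_offset) % NUM_BANKS;
}

// Two byte accesses conflict only if they hit the same bank
// through different 32-bit words.
__host__ __device__ inline bool bytes_conflict(int a, int b)
{
    return bank_of_byte(a) == bank_of_byte(b) &&
           word_of_byte(a) != word_of_byte(b);
}
```

Under this model, bytes 509 and 510 sit in the same word (word 127, bank 31) and do not conflict, while bytes 0 and 128 sit in different words of bank 0 and do. In the search kernel, the 32 threads of a warp read 32 consecutive bytes, which span at most 9 consecutive words in distinct banks, so byte-wise access to a char array should be conflict-free by this rule.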

I have a problem with the 509th thread (and maybe with some others too; I haven’t tried all the possibilities).

But isn’t shared memory shared within a multiprocessor (which runs one or more blocks)? Are you saying that in my code every thread allocates 768 bytes? And what if I want all 768 threads to share it within the multiprocessor?

However, I don’t get any errors at compile time or at run time.

@avidday

I tried working with integers, but the problem is the same.

Not every thread. Every thread block. So, for example, if you launch 768 threads by writing “kernel<<<3, 256>>> (…)” (3 blocks of 256 threads each), you get three instances of memr.
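
If the kernel does have to run as several blocks, as in the <<<3, 256>>> example, one possible approach is to let each block load its own slice of the file plus a few extra "halo" bytes from the start of the next slice, and to index memr with threadIdx.x rather than the global thread index. The following is only a sketch under those assumptions: it expects blockDim.x to equal BLOCK_SIZE, and the names search_multi_block, find_word_multi_block and found_at are hypothetical.

```cuda
#include <cuda_runtime.h>

#define WORD_LEN 4
#define BLOCK_SIZE 256   // matches the <<<3, 256>>> example

__global__ void search_multi_block(const char *file, int file_len,
                                   const char *word, int *found_at)
{
    // One instance per block: the block's own bytes plus a small halo.
    __shared__ char memr[BLOCK_SIZE + WORD_LEN - 1];

    int lid = threadIdx.x;                     // index inside this block's memr
    int gid = blockDim.x * blockIdx.x + lid;   // index inside the file

    if (gid < file_len) memr[lid] = file[gid];
    // The first WORD_LEN - 1 threads also fetch the halo bytes that
    // belong to the next block's slice.
    if (lid < WORD_LEN - 1 && gid + BLOCK_SIZE < file_len)
        memr[lid + BLOCK_SIZE] = file[gid + BLOCK_SIZE];
    __syncthreads();

    // No full word fits at this position: nothing to compare.
    if (gid + WORD_LEN > file_len) return;

    int counter = 0;
    for (int i = 0; i < WORD_LEN; i++) {
        if (memr[lid + i] == word[i]) ++counter;
    }
    if (counter == WORD_LEN) atomicMin(found_at, gid);  // keep the first match
}

// Hypothetical host helper: returns the first match position, or -1.
// Error checking is omitted to keep the sketch short.
int find_word_multi_block(const char *h_file, int file_len, const char *h_word)
{
    char *d_file, *d_word;
    int *d_found;
    int h_found = file_len;                    // file_len is used as "not found"

    cudaMalloc((void **)&d_file, file_len);
    cudaMalloc((void **)&d_word, WORD_LEN);
    cudaMalloc((void **)&d_found, sizeof(int));
    cudaMemcpy(d_file, h_file, file_len, cudaMemcpyHostToDevice);
    cudaMemcpy(d_word, h_word, WORD_LEN, cudaMemcpyHostToDevice);
    cudaMemcpy(d_found, &h_found, sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (file_len + BLOCK_SIZE - 1) / BLOCK_SIZE;   // 3 for 768 bytes
    search_multi_block<<<blocks, BLOCK_SIZE>>>(d_file, file_len, d_word, d_found);

    cudaMemcpy(&h_found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_file);
    cudaFree(d_word);
    cudaFree(d_found);
    return h_found == file_len ? -1 : h_found;
}
```

With the halo in place, a word that straddles the boundary between two slices (for example bytes 509-512 when the slices are 256 bytes long) is fully visible to the block that owns its first byte.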

Now I understand, thanks a lot!
