I am having some trouble with a big kernel called from Matlab (unspecified launch error), so I have made a smaller and simpler version and put it in the SDK directory to check that I am doing things correctly; I will slowly expand it until I have what I need.
The simple code is:
for (index = 0; index < array_size; index++) {
    for (k = 0; k < vec_size; k++)
        out_array[index] += in_array[index] * in_vec[k];
}
(In my kernel I take index = blockIdx.x, so I let each block calculate one value in the output array, and each block also only needs one value from the input array.)
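For concreteness, here is a minimal sketch of the setup described so far; the names (in_array, in_vec, vec_size, out_array) and the single-thread launch are assumptions based on this thread, not the actual code:

__global__ void
naiveKernel(const float* in_array, const float* in_vec,
            unsigned int vec_size, float* out_array)
{
    const unsigned int index = blockIdx.x;   // one block per output element
    float acc = 0.0f;
    for (unsigned int k = 0; k < vec_size; k++)
        acc += in_array[index] * in_vec[k];  // needs one input value per block
    out_array[index] = acc;
}
// launched as, e.g.: naiveKernel<<<array_size, 1>>>(d_in, d_vec, vec_size, d_out);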
Now, when I run my code with a grid of 256 blocks everything runs fine, and the results are the same as a C version. But with a grid of 1536 blocks the results are not the same.
I have attached my code, as I am completely baffled: the maximum grid size is 65535, so I am still far from that… Does anyone have a clue what I am doing wrong?
In the second case I get my famous unspecified launch failure. Does anybody have a clue why? Changing the number of blocks does not change my kernel in any way.
No, only 6 registers. I still had a bug in the kernel. I will show the code here (it is simple enough, right?):
#define SDATA(index) CUT_BANK_CHECKER(sdata, index)
#define NUM_THREADS 256
////////////////////////////////////////////////////////////////////////////////
//! Simple test kernel for device functionality
//! @param g_idata   input data in global memory (one value per block)
//! @param g_ivec    input vector in global memory
//! @param num_adds  length of g_ivec
//! @param g_odata   output data in global memory (one value per block)
////////////////////////////////////////////////////////////////////////////////
__global__ void
testKernel(float* g_idata, float* g_ivec, unsigned int num_adds, float* g_odata)
{
    __shared__ float sdata[NUM_THREADS];
    __shared__ float inp;
    const unsigned int tid = threadIdx.x;
    const unsigned int index = blockIdx.x;

    // read this block's single input value from global memory
    // (the bank checker macro checks for bank conflicts during host emulation)
    if (tid == 0)
        inp = g_idata[index];
    SDATA(tid) = 0.0f;
    __syncthreads();

    // each thread accumulates a strided partial sum of inp * g_ivec[...]
    for (unsigned int offset = 0; offset < num_adds; offset += NUM_THREADS) {
        unsigned int arr_index = tid + offset;
        if (arr_index < num_adds)
            SDATA(tid) += inp * g_ivec[arr_index];
    }
    __syncthreads();

    // do a parallel reduction of the partial sums in shared memory
    for (unsigned int s = NUM_THREADS / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // thread 0 writes this block's result to global memory
    if (tid == 0)
        g_odata[index] = SDATA(0);
}
Hmm, I guess I have to try to speed it up another way. I think I need to load the vector into shared memory as well and just process everything in one block; right now I read the complete vector in every block.
How long is the kernel execution taking? Nearly 5s? You might be hitting the timeout with the big computation.
Here are a few tips to improve performance:
1) Don't use a block size of 1. Make the block size a variable and benchmark for performance (64 is a good starting point; 128 or 256 might turn out to be optimal). You will need index = blockIdx.x * blockDim.x + threadIdx.x to get coalesced memory reads (see the sketch after this list).
2) Don't do out_array[index] += … on global memory. It wastes memory bandwidth, and you probably aren't even guaranteed correct results. Instead, read out_array[index] once at the beginning, accumulate the value in a register, and write it once at the end.
3) How big is in_vec[k]? Is it constant over the life of your program? Depending on the answers, in_vec would be much better off in shared, constant, or texture memory. ANYTHING but global memory: as you have it now, the read is uncoalesced and drops your performance by a factor of 20.
Edit: an addition to 1): blocks are processed in warps of 32 threads, so it makes absolutely no sense to have a block size that is not a multiple of 32. You can add an if (index >= array_size) guard around the meat of the kernel to avoid processing off the end of the array, which, incidentally, would also give you an unspecified launch failure.
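Something like this minimal sketch, combining 1), 2), and the bounds guard; array_size, vec_size, in_array, in_vec, and out_array are assumed names rather than your actual variables:

__global__ void
dotScaleKernel(const float* in_array, const float* in_vec,
               unsigned int array_size, unsigned int vec_size,
               float* out_array)
{
    // one thread per output element, coalesced across the block
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= array_size)       // guard against running off the array
        return;
    float acc = out_array[index];  // read once ...
    for (unsigned int k = 0; k < vec_size; k++)
        acc += in_array[index] * in_vec[k];  // ... accumulate in a register ...
    out_array[index] = acc;        // ... write once
}
// launched with a multiple-of-32 block size, e.g. 128:
// dotScaleKernel<<<(array_size + 127) / 128, 128>>>(...);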
I have more than one block (I let every block process a separate element of the matrix), and I also do not do += … on global memory. But I do make the mistake in 3) with in_vec[k]: that is global memory right now, and I read it in every thread… (stupid me). The only problem is that it will be about 5000 elements, so I am still thinking about how to do that part smarter without using up enormous amounts of shared memory.
It is of constant size and does not change. I guess the size prevents me from putting it in shared memory, so that leaves constant memory or texture memory. I will look into constant memory, since all my input arrays are constant throughout the program.
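A minimal sketch of the constant-memory route, assuming a compile-time bound MAX_VEC and hypothetical host names (5000 floats is about 20 KB, well within the 64 KB constant space); note that the constant cache is fastest when all threads of a warp read the same address, which is not the access pattern in the kernel above:

#define MAX_VEC 5000
__constant__ float c_ivec[MAX_VEC];

// host side, once, before the first launch:
// cudaMemcpyToSymbol(c_ivec, h_ivec, vec_size * sizeof(float));

// device side: replace g_ivec[arr_index] with c_ivec[arr_index]
// and drop the g_ivec kernel parameter.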
Luckily my home computer broke down, so I got myself a laptop with an 8600M GT; now I can try things at home too, after I install Linux next to this Vista thingie…
I see I was mistaken on 1) and 2), as you say. I was basing those comments on the first code snippet you posted; I hadn't seen the second one before I wrote them.
Constant memory might give you some benefit for 3), but not much. Your "simple" code had an uncoalesced read, but now that I see your real kernel, the way you are accessing g_ivec IS coalesced. You can double-check by running your code through the CUDA 1.1 profiler with the proper counter activated. The only uncoalesced accesses in the entire kernel are the g_idata and g_odata ones, but those are only one per block and should not change performance significantly.
That takes us back to my first question, then: is the execution time about 5 s before you get the unspecified launch failure?
Sorry, I thought I wrote that earlier, but I cannot find it in the thread: the cudaTimer around the kernel call tells me that 4500 ms have passed before the error.
The strange thing is that the normal C code on the CPU takes much less time… I have run the 1.1 profiler, by the way, and will post the results on Friday when I am back at the system. There were quite a few uncoalesced reads in the output, which surprised me.
One problem I do see is that I load g_ivec in every block, which is not really optimal, since that is my large vector.
Now I am really confused. I will attach my code, but here is some output (and the profiler results that go with it). There is apparently a maximum number of blocks that I can process; with 896 it still works.
As far as I can see, I do my timing the way I should, so why does the first kernel invocation take more time according to the timer and less according to the profiler?
And why does it not work with 1024 blocks?
Does anybody want to test my code and tell me what I am doing wrong? I am drawing a complete blank right now.
It has been reported on this forum that the first invocation is much slower than the following ones. This may be caused by converting cubin code to actual device code. There is not much information about this, however :(
Which function call fails for 1024? Is it the kernel invocation or something else?
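One way to narrow that down, sketched with plain runtime-API calls; num_blocks and the d_* host-side names are placeholders, while the kernel and its parameters are from the code above:

testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
cudaError_t err = cudaGetLastError();  // reports launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();         // reports errors raised during execution
if (err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(err));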
Yeah, after writing that I realized the first call takes the timing hit; I should run the kernel once beforehand to get accurate timing.
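A sketch of that warm-up pattern (the timer calls are left abstract, since they depend on the SDK; num_blocks and the d_* names are placeholders):

// warm-up launch absorbs the one-time initialization cost
testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
cudaThreadSynchronize();

// ... start timer here ...
testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
cudaThreadSynchronize();  // launches are asynchronous: sync before stopping
// ... stop timer here ... // the timer, or you only measure the launch overhead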
But the strange thing is that I can run the code with 256 or 512 (even 896) blocks, yet with 1024 blocks the kernel fails to run. The code attached to my previous post tries to run the kernel with 256, 512, 1024 (and 2048) blocks, and the latter two fail to run.
So for 256 blocks it takes 120 ms (including the first-invocation penalty),
for 512 blocks it takes 2.7 ms,
and for 1024 it fails to run… so it is probably not the 5 s limitation ;)
Are there any other reasons it might fail? The kernel is exactly the same; I am just increasing the number of blocks.
EDIT:
I have modified my code to try N*16 blocks, and it starts to fail at 1024 blocks. Here is an output snippet from the end of a debug run: the call with 1008 blocks takes 3.7 ms, so I am certainly not being bitten by a 5 s limitation, but I have not found any other possible reason for the unspecified launch failure in the forums/internet/documentation:
I have checked my code for memory-allocation trouble (to rule out writing to memory that has not been allocated), but I could not find a mistake.
So I am still baffled as to why I get a Kernel Launch Failure with 1024 blocks, while it works great with 1-1023 blocks.
It is a good thing I do not have a webcam, otherwise you would see me sitting here with a brown paper bag over my head…
It turned out I was launching 4 times as many blocks as I needed, so I was indeed writing to memory I was not allowed to touch. I guess there is a minimum amount of memory allocated by cudaMalloc, so it kept working until I hit 1024 blocks…
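A sketch of the fix as described, with placeholder names: one block per output element means the grid size must match the allocated output length exactly:

cudaMalloc((void**)&d_odata, array_size * sizeof(float));
unsigned int num_blocks = array_size;  // NOT 4 * array_size
testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
// every blockIdx.x now maps to a valid element of d_odata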
So this turned out to be a brown-paper-bag bug for me…