I am having some trouble with a big kernel called from Matlab (unspecified launch error), so I have made a smaller and simpler version and put it in the SDK directory to check that I am doing things correctly; I will slowly expand it until I have what I need.
The simple code is:
for (index = 0; index < array_size; index++) {
    for (k = 0; k < vec_size; k++)
        out_array[index] += in_array[index] * in_vec[k];
}
(In my kernel I take index = blockIdx.x, so I let each block calculate one value in the output array, and each block also only needs one value from the input array.)
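For concreteness, here is a minimal sketch of the setup described so far; the names (in_array, in_vec, vec_size, out_array) and the single-thread launch are assumptions based on this thread, not the actual code:

__global__ void
naiveKernel(const float* in_array, const float* in_vec,
            unsigned int vec_size, float* out_array)
{
    const unsigned int index = blockIdx.x;   // one block per output element
    float acc = 0.0f;
    for (unsigned int k = 0; k < vec_size; k++)
        acc += in_array[index] * in_vec[k];  // needs one input value per block
    out_array[index] = acc;
}
// launched as, e.g.: naiveKernel<<<array_size, 1>>>(d_in, d_vec, vec_size, d_out);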
Now, when I run my code with a grid of 256 blocks everything runs fine, and the results are the same as a C version. But with a grid of 1536 blocks the results are not the same.
I have attached my code, as I am completely baffled: the maximum grid size is 65535, so I am still far from that… Does anyone have a clue what I am doing wrong?
In the second case I get my famous unspecified launch failure. Does anybody have a clue why? Changing the number of blocks does not change my kernel in any way.
No, only 6 registers. I still had a bug in the kernel. I will show the code here (it is simple enough, right?):
#define SDATA(index) CUT_BANK_CHECKER(sdata, index)
#define NUM_THREADS 256
////////////////////////////////////////////////////////////////////////////////
//! Simple test kernel for device functionality
//! @param g_idata   input data in global memory (one value per block)
//! @param g_ivec    input vector in global memory
//! @param num_adds  length of g_ivec
//! @param g_odata   output data in global memory (one value per block)
////////////////////////////////////////////////////////////////////////////////
__global__ void
testKernel(float* g_idata, float* g_ivec, unsigned int num_adds, float* g_odata)
{
    __shared__ float sdata[NUM_THREADS];
    __shared__ float inp;
    const unsigned int tid = threadIdx.x;
    const unsigned int index = blockIdx.x;

    // read this block's single input value from global memory
    // (the bank checker macro checks for bank conflicts during host emulation)
    if (tid == 0)
        inp = g_idata[index];
    SDATA(tid) = 0.0f;
    __syncthreads();

    // each thread accumulates a strided partial sum of inp * g_ivec[...]
    for (unsigned int offset = 0; offset < num_adds; offset += NUM_THREADS) {
        unsigned int arr_index = tid + offset;
        if (arr_index < num_adds)
            SDATA(tid) += inp * g_ivec[arr_index];
    }
    __syncthreads();

    // do a parallel reduction of the partial sums in shared memory
    for (unsigned int s = NUM_THREADS / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // thread 0 writes this block's result to global memory
    if (tid == 0)
        g_odata[index] = SDATA(0);
}
Hmm, I guess I have to try to speed it up another way. I think I need to load the vector into shared memory as well and just process everything in one block; right now I read the complete vector in every block.
How long is the kernel execution taking? Nearly 5s? You might be hitting the timeout with the big computation.
Here are a few tips to improve performance:
1) Don't use a block size of 1. Make the block size a variable and benchmark for performance (64 is a good starting point; 128 or 256 might turn out to be optimal). You will need index = blockIdx.x * blockDim.x + threadIdx.x to get coalesced memory reads (see the sketch after this list).
2) Don't do out_array[index] += … on global memory. It wastes memory bandwidth, and you probably aren't even guaranteed correct results. Instead, read out_array[index] once at the beginning, accumulate the value in a register, and write it once at the end.
3) How big is in_vec[k]? Is it constant over the life of your program? Depending on the answers, in_vec would be much better off in shared, constant, or texture memory. ANYTHING but global memory: as you have it now, the read is uncoalesced and drops your performance by a factor of 20.
Edit: an addition to 1): blocks are processed in warps of 32 threads, so it makes absolutely no sense to have a block size that is not a multiple of 32. You can add an if (index >= array_size) guard around the meat of the kernel to avoid processing off the end of the array, which, incidentally, would also give you an unspecified launch failure.
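Something like this minimal sketch, combining 1), 2), and the bounds guard; array_size, vec_size, in_array, in_vec, and out_array are assumed names rather than your actual variables:

__global__ void
dotScaleKernel(const float* in_array, const float* in_vec,
               unsigned int array_size, unsigned int vec_size,
               float* out_array)
{
    // one thread per output element, coalesced across the block
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= array_size)       // guard against running off the array
        return;
    float acc = out_array[index];  // read once ...
    for (unsigned int k = 0; k < vec_size; k++)
        acc += in_array[index] * in_vec[k];  // ... accumulate in a register ...
    out_array[index] = acc;        // ... write once
}
// launched with a multiple-of-32 block size, e.g. 128:
// dotScaleKernel<<<(array_size + 127) / 128, 128>>>(...);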
I have more than one block (I let every block process a separate element of the matrix), and I also do not do += … on global memory. But I do make the mistake in 3) with in_vec[k]: that is global memory right now, and I read it in every thread… (stupid me). The only problem is that it will be about 5000 elements, so I am still thinking about how to do that part smarter without using up enormous amounts of shared memory.
It is of constant size and does not change. I guess the size prevents me from putting it in shared memory, so that leaves constant memory or texture memory. I will look into constant memory, since all my input arrays are constant throughout the program.
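A minimal sketch of the constant-memory route, assuming a compile-time bound MAX_VEC and hypothetical host names (5000 floats is about 20 KB, well within the 64 KB constant space); note that the constant cache is fastest when all threads of a warp read the same address, which is not the access pattern in the kernel above:

#define MAX_VEC 5000
__constant__ float c_ivec[MAX_VEC];

// host side, once, before the first launch:
// cudaMemcpyToSymbol(c_ivec, h_ivec, vec_size * sizeof(float));

// device side: replace g_ivec[arr_index] with c_ivec[arr_index]
// and drop the g_ivec kernel parameter.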
Luckily my home computer broke down, so I got myself a laptop with an 8600M GT; now I can try things at home too, after I install Linux next to this Vista thingie…
I see I was mistaken on 1) and 2), as you say. I was basing those comments on the first code snippet you posted; I hadn't seen the second one before I wrote them.
Constant memory might give you some benefit for 3), but not much. Your "simple" code had an uncoalesced read, but now that I see your real kernel, the way you are accessing g_ivec IS coalesced. You can double-check by running your code through the CUDA 1.1 profiler with the proper counter activated. The only uncoalesced accesses in the entire kernel are the g_idata and g_odata ones, but those are only one per block and should not change performance significantly.
That takes us back to my first question, then: is the execution time about 5 s before you get the unspecified launch failure?
Sorry, I thought I wrote that earlier, but I cannot find it in the thread: the cudaTimer around the kernel call tells me that 4500 ms have passed before the error.
The strange thing is that the normal C code on the CPU takes much less time… I have run the 1.1 profiler, by the way, and will post the results on Friday when I am back at the system. There were quite a few uncoalesced reads in the output, which surprised me.
One problem I do see is that I load g_ivec in every block, which is not really optimal, since that is my large vector.
Now I am really confused. I will attach my code, but here is some output (and the profiler results that go with it). There is apparently a maximum number of blocks that I can process; with 896 it still works.
As far as I can see, I do my timing the way I should, so why does the first kernel invocation take more time according to the timer and less according to the profiler?
And why does it not work with 1024 blocks?
Does anybody want to test my code and tell me what I am doing wrong? I am drawing a complete blank right now.
It has been reported on this forum that the first invocation is much slower than the following ones. This may be caused by converting cubin code to actual device code. There is not much information about this, however :(
Which function call fails for 1024? Is it the kernel invocation or something else?
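One way to narrow that down, sketched with plain runtime-API calls; num_blocks and the d_* host-side names are placeholders, while the kernel and its parameters are from the code above:

testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
cudaError_t err = cudaGetLastError();  // reports launch/configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();         // reports errors raised during execution
if (err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(err));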
Yeah, after writing that I realized the first call takes the timing hit; I should run the kernel once beforehand to get accurate timing.
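A sketch of that warm-up pattern (the timer calls are left abstract, since they depend on the SDK; num_blocks and the d_* names are placeholders):

// warm-up launch absorbs the one-time initialization cost
testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
cudaThreadSynchronize();

// ... start timer here ...
testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
cudaThreadSynchronize();  // launches are asynchronous: sync before stopping
// ... stop timer here ... // the timer, or you only measure the launch overhead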
But the strange thing is that I can run the code with 256 or 512 (even 896) blocks, yet with 1024 blocks the kernel fails to run. The code attached to my previous post tries to run the kernel with 256, 512, 1024 (and 2048) blocks, and the latter two fail to run.
So for 256 blocks it takes 120 ms (including the first-invocation penalty),
for 512 blocks it takes 2.7 ms,
and for 1024 it fails to run… so it is probably not the 5 s limitation ;)
Are there any other reasons it might fail? The kernel is exactly the same; I am just increasing the number of blocks.
EDIT:
I have modified my code to try N*16 blocks, and it starts to fail at 1024 blocks. Here is an output snippet from the end of a debug run: the call with 1008 blocks takes 3.7 ms, so I am certainly not being bitten by a 5 s limitation, but I have not found any other possible reason for the unspecified launch failure in the forums/internet/documentation:
I have checked my code for memory-allocation trouble (to rule out writing to memory that has not been allocated), but I could not find a mistake.
So I am still baffled as to why I get a Kernel Launch Failure with 1024 blocks, while it works great with 1-1023 blocks.
It is a good thing I do not have a webcam, otherwise you would see me sitting here with a brown paper bag over my head…
It turned out I was launching 4 times as many blocks as I needed, so I was indeed writing to memory I was not allowed to touch. I guess there is a minimum amount of memory allocated by cudaMalloc, so it kept working until I hit 1024 blocks…
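A sketch of the fix as described, with placeholder names: one block per output element means the grid size must match the allocated output length exactly:

cudaMalloc((void**)&d_odata, array_size * sizeof(float));
unsigned int num_blocks = array_size;  // NOT 4 * array_size
testKernel<<<num_blocks, NUM_THREADS>>>(d_idata, d_ivec, num_adds, d_odata);
// every blockIdx.x now maps to a valid element of d_odata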
So this turned out to be a brown-paper-bag bug for me…