Help with a problem sizing threads and blocks

I’m getting “ERROR: Kernel renderKernel failed” when nP exceeds 512. When nP > 512, the code caps the number of threads per block at 512 and increases the number of blocks to two, and then I get the error message.

This is a code snippet

__global__ void rPolyKernel(float* Areal, float* Aimag,
                            float* Breal, float* Bimag,
                            float* Creal, float* Cimag,
                            size_t nS, size_t nP)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;

    Creal[i*nP] = Areal[i];
    Cimag[i*nP] = Aimag[i];

    for (size_t j = 1; j < nP; j++)
    {
        // C = A*B
        Creal[i*nP + j] = (Creal[i*nP + j-1] * Breal[i]) -
                          (Cimag[i*nP + j-1] * Bimag[i]);

        Cimag[i*nP + j] = (Creal[i*nP + j-1] * Bimag[i]) +
                          (Cimag[i*nP + j-1] * Breal[i]);
    }
}

mathutils::splitComplex* scatsD = new mathutils::splitComplex;
scatsD->realp = new float[numP];
scatsD->imagp = new float[numP];

cudaMalloc((void**)&Areal, memSizeS);
cudaMalloc((void**)&Aimag, memSizeS);
cudaMalloc((void**)&Breal, memSizeS);
cudaMalloc((void**)&Bimag, memSizeS);
cudaMalloc((void**)&Creal, memSizeSBP);
cudaMalloc((void**)&Cimag, memSizeSBP);
cudaMalloc((void**)&Dreal, memSizeP);
cudaMalloc((void**)&Dimag, memSizeP);

cudaMemcpy(Areal, scatsA->realp, memSizeS,   cudaMemcpyHostToDevice);
cudaMemcpy(Aimag, scatsA->imagp, memSizeS,   cudaMemcpyHostToDevice);
cudaMemcpy(Breal, scatsB->realp, memSizeS,   cudaMemcpyHostToDevice);
cudaMemcpy(Bimag, scatsB->imagp, memSizeS,   cudaMemcpyHostToDevice);
cudaMemcpy(Creal, scatsC->realp, memSizeSBP, cudaMemcpyHostToDevice);
cudaMemcpy(Cimag, scatsC->imagp, memSizeSBP, cudaMemcpyHostToDevice);
cudaMemcpy(Dreal, scatsD->realp, memSizeP,   cudaMemcpyHostToDevice);
cudaMemcpy(Dimag, scatsD->imagp, memSizeP,   cudaMemcpyHostToDevice);


if (numP > 512)
{
    numThreads = 512;

    if (numP % 512 == 0)
    {
        numBlocks = numP/numThreads;
    }
    else
    {
        numBlocks = numP/numThreads + 1;
    }
}
else
{
    numThreads = numP;
    numBlocks = numT;
}



rPolyKernel<<<numBlocks, numThreads>>>(Areal, Aimag,
                                       Breal, Bimag,
                                       Creal, Cimag,
                                       numS,  numP);

You probably need to put if (i<nP) {…} around the body of your kernel to avoid out-of-bounds memory accesses.

Here is what I’m trying to accomplish

Let A be two arrays of floats of length nS. A represents an array of complex numbers.

A->real[0, 1, …, nS-1]
A->imag[0, 1, …, nS-1]

Let A(j) = M(j) + N(j)i for j = [0, 1, …, nS-1]

Let B be two arrays of floats of length nS. B represents an array of complex numbers.

B->real[0, 1, …, nS-1]
B->imag[0, 1, …, nS-1]

Let B(j) = R(j) + S(j)i for j = [0, 1, …, nS-1]

C is an array of complex numbers that represents a [nP x nS] matrix of complex numbers
FIRST COL
C[0] = A[0]*B[0]^0 = A[0]
C[1] = A[0]*B[0]^1 = C[0]*B[0]
C[2] = A[0]*B[0]^2 = C[1]*B[0]

C[nP-1] = A[0]*B[0]^(nP-1) = C[nP-2]*B[0]

SECOND COL
C[nP] = A[1]*B[1]^(0) = A[1]
C[nP+1] = A[1]*B[1]^(1) = C[nP]*B[1]

C[2nP-1] = A[1]*B[1]^(nP-1) = C[2nP-2]*B[1]

and so forth for nS COLS
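
The layout described above can be written as a small host-side reference, which is handy for checking GPU results. This is only a sketch: referencePowers is a hypothetical name, and it uses std::complex<float> rather than the split real/imaginary arrays of the code above.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Host-side reference: C[col*nP + j] = A[col] * B[col]^j,
// built with the recurrence C[k] = C[k-1] * B[col] within each column.
// Assumes A and B have the same length nS.
std::vector<std::complex<float>> referencePowers(
    const std::vector<std::complex<float>>& A,
    const std::vector<std::complex<float>>& B,
    std::size_t nP)
{
    const std::size_t nS = A.size();
    std::vector<std::complex<float>> C(nS * nP);
    if (nP == 0) return C;                    // nothing to fill

    for (std::size_t col = 0; col < nS; ++col)
    {
        const std::size_t base = col * nP;    // first element of this column
        C[base] = A[col];                     // B[col]^0 = 1
        for (std::size_t j = 1; j < nP; ++j)
        {
            C[base + j] = C[base + j - 1] * B[col];
        }
    }
    return C;
}
```

Comparing this against the values copied back from the device (within a small floating-point tolerance) is one way to tell a wrong result from a failed launch.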

So if nS = 4 and nP = 512, nPolyKernel<<<4, 512>>> works fine,
but if nS = 4 and nP = 1024, nPolyKernel<<<2, 512>>> fails with ERROR 4.

the suggestion above has no effect

OK, for nP a multiple of 512 this indeed doesn’t matter (although it sets a trap for the future: the code looks as if it is prepared to handle other values of nP, but without the if it is not).

The only thing I see is that, while you write about C being an nS x nP matrix (and the kernel does take both parameters), the code actually treats C as a square matrix and disregards nS. Also, the kernel invocation for nS=4 and nP=512 should be nPolyKernel<<<1, 512>>>.
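
One way to size the launch can be sketched as follows, under the assumption that each thread handles one column, so the number of work items is nS rather than nP. LaunchConfig and launchConfigFor are hypothetical names, and 512 is simply the per-block limit mentioned in the question.

```cpp
#include <cstddef>

// Grid and block sizes for a 1-D launch.
struct LaunchConfig {
    unsigned blocks;
    unsigned threads;
};

// Choose a 1-D launch that covers `items` work items with at most
// `maxThreads` threads per block. When `items` is not a multiple of
// `maxThreads`, the last block has surplus threads, so the kernel
// still needs its own bounds check.
inline LaunchConfig launchConfigFor(std::size_t items, unsigned maxThreads = 512)
{
    if (items == 0 || maxThreads == 0) return {0u, 0u};   // nothing to launch
    if (items <= maxThreads) return {1u, static_cast<unsigned>(items)};

    // Ceiling division: round the block count up.
    const std::size_t blocks = (items + maxThreads - 1) / maxThreads;
    return {static_cast<unsigned>(blocks), maxThreads};
}
```

With cfg = launchConfigFor(numS), the call would become rPolyKernel<<<cfg.blocks, cfg.threads>>>(Areal, Aimag, Breal, Bimag, Creal, Cimag, numS, numP), and the grid no longer depends on numP at all.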

C is not a square matrix
Here is the output from the code

nS = 4, nP = 8
A(0) = 1 + 1i B(0) = 0.9 + 0.1i
A(1) = 2 + -1i B(1) = 0.9 + -0.1i
A(2) = 3 + 1i B(2) = 0.9 + 0.1i
A(3) = 4 + -1i B(3) = 0.9 + -0.1i

C(0) = 1 + 1i
C(1) = 0.8 + 1i
C(2) = 0.62 + 0.98i
C(3) = 0.46 + 0.944i
C(4) = 0.3196 + 0.8956i
C(5) = 0.19808 + 0.838i
C(6) = 0.0944719 + 0.774008i
C(7) = 0.00762393 + 0.706054i
C(8) = 2 + -1i
C(9) = 1.7 + -1.1i
C(10) = 1.42 + -1.16i
C(11) = 1.162 + -1.186i
C(12) = 0.9272 + -1.1836i
C(13) = 0.71612 + -1.15796i
C(14) = 0.528712 + -1.11378i
C(15) = 0.364463 + -1.05527i
C(16) = 3 + 1i
C(17) = 2.6 + 1.2i
C(18) = 2.22 + 1.34i
C(19) = 1.864 + 1.428i
C(20) = 1.5348 + 1.4716i
C(21) = 1.23416 + 1.47792i
C(22) = 0.962952 + 1.45354i
C(23) = 0.721302 + 1.40448i
C(24) = 4 + -1i
C(25) = 3.5 + -1.3i
C(26) = 3.02 + -1.52i
C(27) = 2.566 + -1.67i
C(28) = 2.1424 + -1.7596i
C(29) = 1.7522 + -1.79788i
C(30) = 1.39719 + -1.79331i
C(31) = 1.07814 + -1.7537i

Are you using CUDA 3.2 on a 64-bit OS? Is it a Fermi GPU?