Correct output with emulation mode, wrong with GPU/Execution

Hello Everyone,

          I have written a kernel which reflects correct result to host with emulation mode, but when i execute on device it returns all zero bits.
          Please suggest, what is going wrong with this case.

Here is my kernel
const int BLOCK_SIZE= 8;

global void CUDA_substitution_kernel(int *d_sub,int *xor1key,int s1[16],int s2[16],int s3[16],int s4[16],int s5[16],int s6[16],int s7[16],int s8[16])
{
int index,i,j,k,lb,ub;

    index=blockIdx.x * blockDim.x + threadIdx.x;

//**/ printf(“index=%d\n”,index);
// shared int Stemp[8][6];
int Stemp[8][6];
lb=((4index)+1)-1;
ub=(4
(index+1))-1;
k=0;

//**/ printf(“\nSubstituion Box Temporary:\n”);

// if(index==0)
for(i=0;i<8;i++)
for(j=0;j<6;j++)
{
Stemp[i][j]=xor1key[k++]; //putting 48 bit data into Stemp[8][6]
//**/ printf(“%d\t”,Stemp[i][j]);

            }

    int p,q,multi,l,Boxnum,cross,s;
    int c,a[4],r;   //a[4] to conatain final 4 bit from each SBox and appended to SubChoice[32]
 

    q=0;
    multi=1;
    j=4;
    c=3;

    p=(Stemp[index][0]*2)+(Stemp[index][5]*1);

//**/ printf(“\nindex=%d p=%d\n”,index,p);

     while(j>0) //target to get decimal of 4 bits among six
     {
        k=Stemp[index][j]; //accepts 4 bits one by one reversely
        l=k*multi;    //
        q=q+l;

//**/ printf(“\nFor index=%d k=%d l=%d q=%d multi=%d j=%d\n”,index,k,l,q,multi,j);
multi=multi*2;
j–;

      }

//**/ printf(“\nindex=%d q=%d\n”,index,q);

    //we get final num in q
    Boxnum=index+1;
                    switch(Boxnum)
                    {
                            case 1: cross=s1[p][q]; break;
                            case 2: cross=s2[p][q]; break;
                            case 3: cross=s3[p][q]; break;
                            case 4: cross=s4[p][q]; break;
                            case 5: cross=s5[p][q]; break;
                            case 6: cross=s6[p][q]; break;
                            case 7: cross=s7[p][q]; break;
                            case 8: cross=s8[p][q]; break;
                     }

//**/ printf(“\n index=%d and cross=%d \n”,index,cross);
while(cross>0)
{
r=cross%2;
a[c–]=r;
cross=cross/2;
}
__syncthreads();
while(c>=0)
{
a[c–]=0;
}

       for(l=lb,s=0;l<=ub;l++)
       {
              d_sub[l]=a[s];

//**/ printf(“\n index= %d , d_sub[%d]=%d and s= %d \n”,index,l,d_sub[l],s);
s++;
}

            __syncthreads();

/* printf(“\n\n\nFinal:d_sub\n”);
for(i=0;i<32;i++)
{
printf(“d_sub[%d]=%d\t”,i,d_sub[i]);
}
*/

    return;

}

And this how i call my kernel from main

int xor1key[48]=
{ 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0 };

    dim3 dimblock(BLOCK_SIZE);
    dim3 dimgrid(8/BLOCK_SIZE);

    cudaMallocHost((void **)&SubChoice,32 * sizeof(int));
    memset(SubChoice,0,32 * sizeof(int));

    int* d_sub;
    cudaMalloc((void **)&d_sub,32 * sizeof(int));

    cudaMemcpy(d_sub,SubChoice,32 * sizeof(int),cudaMemcpyHostToDevice);

            printf("\ncuda work\n");
            CUDA_substitution_kernel<<<dimgrid,dimblock>>>(d_sub,xor1key,s1,s2,s3,s4,s5,s6,s7,s8);
            CUT_CHECK_ERROR("kernel error");
            cudaMemcpy(SubChoice,d_sub,32 * sizeof(int),cudaMemcpyDeviceToHost);
            printf("\ncuda done\n");

    printf("\nSubstitution-choice-32-final\n");
    for(i=0;i<32;i++)
    {
            printf("%d\t",SubChoice[i]);
    }
    printf("\n\n");

    cudaFree(d_sub);

This is just the part of encryption algorithm, all the arrays from s1 to s8 are two dimensional arrayof 4*16.

with emulation code i get correct bits as
Substitution-choice-32-final
0 0 0 1 1 0 1 1 1 1 0 0 0 1 0 0 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0

but on device i get all zeros.

Thanks in advance.

Regards,
Deepti

Have you tried running this through CUDA-GDB?

–Cliff

Thanks Cliff.

My next question is the basic one, as first time I am going through this.

After inserting break points it shows this warning.

(cuda-gdb) l 99

94

95 int* d_sub;

96 cudaMalloc((void **)&d_sub,32 * sizeof(int));

97

98 cudaMemcpy(d_sub,SubChoice,32 * sizeof(int),cudaMemcpyHostToDevice);

99

100 printf(“\ncuda work\n”);

101 CUDA_substitution_kernel<<<dimgrid,dimblock>>>(d_sub,xor1key,s1,s2,s3,s4,s5,s6,s7,s8);

102 CUT_CHECK_ERROR(“kernel error”);

103 cudaMemcpy(SubChoice,d_sub,32 * sizeof(int),cudaMemcpyDeviceToHost);

(cuda-gdb) b 101

(cuda-gdb) r

Starting program: /data4/deepti/NVIDIA_GPU_Computing_SDK/C/bin/linux/debug/demo

[Thread debugging using libthread_db enabled]

[New process 32607]

[New Thread 47844348397056 (LWP 32607)]

Warning: a GPU was made unavailable to the application due to debugging

constraints. This may change the application behaviour!

Using device 0: Tesla C1060

cuda work

[Switching to Thread 47844348397056 (LWP 32607)]

Breakpoint 1, fnDemoEncrypt () at demo.cu:101

101 CUDA_substitution_kernel<<<dimgrid,dimblock>>>(d_sub,xor1key,s1,s2,s3,s4,s5,s6,s7,s8);

Current language: auto; currently c++

(cuda-gdb) continue

Continuing.

Cuda error: kernel error in file ‘demo.cu’ in line 102 : unspecified launch failure.

Program exited with code 01.[/b]

Many people get the same warning, but i didn’t find the solution , other than this line on nvidia forum itself - “The next version of the CUDA tools, whose beta version will be available next month, fixes the issue.”

Here are vesion for cuda-gdb and nvcc.

cuda-gdb --version

NVIDIA ® CUDA Debugger

BETA release

Portions Copyright © 2008,2009 NVIDIA Corporation

GNU gdb 6.6

Copyright © 2006 Free Software Foundation, Inc.

GDB is free software, covered by the GNU General Public License, and you are

welcome to change it and/or distribute copies of it under certain conditions.

Type “show copying” to see the conditions.

There is absolutely no warranty for GDB. Type “show warranty” for details.

This GDB was configured as “x86_64-unknown-linux-gnu”.

nvcc --version

nvcc: NVIDIA ® Cuda compiler driver

Copyright © 2005-2009 NVIDIA Corporation

Built on Thu_Jul_30_09:24:36_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

Please suggest me, whether it is the version issue or I am going wrong because at line 102 launch failure is coming.

This has nothing to do with cuda-gdb. I would guess your kernel is never launching because invalid arguments or excessive argument length. But it is impossible to say because the code you posted is incomplete.

I reduce the argument length by defining all the arrays from s1 to s8 within the kernel. But I got the same error - launch failure. External Media

Actually after its successful run, I will merge this code with my actual program. I don’t want to keep array definition in kernel, because I want to call same kernel multiple times also.This time I am attaching my code.

Thanks in advance.

My Makefile contents:

EXECUTABLE := demo

CUFILES := demo.cu

include …/…/common/common.mk

demo.cu (3.11 KB)

demo_kernel.cu (2.68 KB)

  • Deepti

Hmm… I’m not seeing a kernel launch failure. Also one thing I notice right off is that it looks like your arrays are declared in CPU mem rather than in GPU mem (as in with the device qualifier), so the pointer to the arrays are not valid from the GPU’s point of view.

Thanks Everyone,

      The problem is solved. I was passing an array 'xor1key' without allocating memory on device. Since this array is required just for initializing, so didnt gave the memory and thats why kernel is not launching.
  • Deepti