Sequential execution from within the kernel fails

nkumar · March 23, 2012, 12:01pm

Hello,
I am seeing some really strange behavior when running a CUDA program. The example code below illustrates the problem.
The kernel adds two arrays, with each thread summing one pair of elements from ‘a’ and ‘b’ into ‘c’. If I also add up the elements of ‘c’ within the same kernel using only one thread, it works fine up to about N=2048. When I increase the value to, say, N=3200, the cuPrintf is either not executed or its output is not displayed.
NOTE: In my actual production code, I see this behavior when I try to do the final computation within the kernel using only a small subset of the total threads (i.e. letting the other threads return early). There the problem is more serious, as the final computed result is always zero or ‘nan’ once the size exceeds a certain threshold. I will post info about my production code after I receive a reply.

Here is the code in question:
++++++++++++++++++++++++++++++
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cutil_inline_runtime.h>
#include <stdlib.h>
#include <sys/time.h>
#include "cuPrintf.cu"

#define N 3200

__global__ void
add(int* a, int* b, int* c)
{
    int tid = blockIdx.x;
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        /* Alternate way of computing final result */
        __threadfence();
        if (tid != 0)
            return;
        int sum = 0;
        for (int count = 0; count < N; ++count)
        {
            sum = sum + c[count];
        }
        cuPrintf("Dev sum: %d\n", sum);
    }
}

int main()
{
    int* a;
    int* b;
    int* c;
    int* dev_a;
    int* dev_b;
    int* dev_c;
    cudaEvent_t start, stop;
    float time;
    struct timeval tv1, tv2;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaPrintfInit();
    a = (int*)malloc(N*sizeof(int));
    b = (int*)malloc(N*sizeof(int));
    c = (int*)malloc(N*sizeof(int));

    /* Fill the host inputs and clear the output */
    for (int i = 0; i < N; ++i)
    {
        a[i] = i;
        b[i] = i * 2;
        c[i] = 0;
    }

    cutilSafeCall(cudaMalloc( (void**)&dev_a, N*sizeof(int) ));
    cutilSafeCall(cudaMalloc( (void**)&dev_b, N*sizeof(int) ));
    cutilSafeCall(cudaMalloc( (void**)&dev_c, N*sizeof(int) ));

    gettimeofday(&tv1, NULL);
    cutilSafeCall(cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice));

    gettimeofday(&tv2, NULL);
    printf("Time taken in copying to device: %f\n", (float)(tv2.tv_usec) - (float)(tv1.tv_usec));
    cudaEventRecord(start, 0);

    add<<<N/16,16>>>(dev_a, dev_b, dev_c);
    cudaPrintfDisplay(stdout, true);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Time taken by kernel: %f\n", time);

    /* Copy the result back and compute the reference sum on the host */
    cutilSafeCall(cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost));
    unsigned int sum = 0;
    for(int i = 0; i < N; ++i)
    {
        sum = sum + c[i];
    }
    printf("Sum is: %d\n", sum);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    free(a);
    free(b);
    free(c);
    return 0;
}
++++++++++++++++++++++++++++++++
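
For comparison, here is a minimal sketch of one way to leave the final sum to a single thread inside the same kernel. It is not a confirmed diagnosis of the behavior described above. It rests on two observations: the kernel above indexes with blockIdx.x alone although it launches 16 threads per block, and __threadfence() only orders the calling thread's own writes, so it does not wait for other blocks to finish. The sketch uses a global index and lets the last block to finish do the sum, along the lines of the memory fence example in the CUDA C Programming Guide. The names addThenSum, blocksDone, total and THREADS are made up for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 3200
#define THREADS 16

// Hypothetical counter: number of blocks that have finished writing c[].
__device__ unsigned int blocksDone = 0;

__global__ void addThenSum(const int* a, const int* b, int* c, int* total)
{
    // Global index: one array element per thread.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];

    // Publish this thread's write device-wide, then wait for the whole block.
    __threadfence();
    __syncthreads();
    if (threadIdx.x != 0)
        return;

    // Thread 0 of every block takes a ticket; atomicInc returns the old count.
    unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
    if (ticket != gridDim.x - 1)
        return;  // at least one other block has not taken its ticket yet

    // Only thread 0 of the last block to finish reaches this point.
    const volatile int* vc = c;  // volatile: force reads from memory
    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum += vc[i];
    *total = sum;
    blocksDone = 0;  // reset so the kernel can be launched again
}

int main()
{
    int* a = (int*)malloc(N * sizeof(int));
    int* b = (int*)malloc(N * sizeof(int));
    int hostSum = 0;
    for (int i = 0; i < N; ++i) {
        a[i] = i;
        b[i] = i * 2;
        hostSum += a[i] + b[i];  // reference result computed on the host
    }

    int *dev_a, *dev_b, *dev_c, *dev_total;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    cudaMalloc((void**)&dev_total, sizeof(int));
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (N + THREADS - 1) / THREADS;  // round up to cover all N elements
    addThenSum<<<blocks, THREADS>>>(dev_a, dev_b, dev_c, dev_total);

    // The blocking copy waits for the kernel and reports any error it raised.
    int devSum = 0;
    cudaError_t err = cudaMemcpy(&devSum, dev_total, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device sum: %d, host sum: %d (%s)\n", devSum, hostSum,
           devSum == hostSum ? "sums match" : "MISMATCH");

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c); cudaFree(dev_total);
    free(a); free(b);
    return devSum == hostSum ? 0 : 1;
}
```

A simpler alternative under the same assumptions would be to launch a second, single-thread kernel for the sum, since kernels issued to the same stream run one after the other.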

I am using:
CentOS 5.6
Tesla M2070
CUDA release 4.0
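
With a setup like the one listed above, one way to tell "the kernel did not run" apart from "the output was not displayed" is to check the error state right after the launch. The following is only a sketch: the check helper and the touch kernel are hypothetical stand-ins, while cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are CUDA runtime calls.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed runtime call and return 0 on failure.
static int check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 0;
    }
    return 1;
}

// Stand-in kernel; each thread writes one element.
__global__ void touch(int* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1;
}

int main()
{
    const int n = 3200;  // same size as in the post; a multiple of 16
    int* dev_out = NULL;
    if (!check(cudaMalloc((void**)&dev_out, n * sizeof(int)), "cudaMalloc"))
        return 1;

    touch<<<n / 16, 16>>>(dev_out);

    // Launch errors (for example a bad configuration) show up here.
    if (!check(cudaGetLastError(), "kernel launch"))
        return 1;
    // Errors raised while the kernel runs show up once it has finished.
    if (!check(cudaDeviceSynchronize(), "kernel execution"))
        return 1;

    printf("kernel completed without error\n");
    cudaFree(dev_out);
    return 0;
}
```

If both checks pass for the real kernel, the kernel itself ran to completion and the missing output has to be explained some other way.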

Thanks,
Nikhil
