Question about an error in a CUDA sort

#include <iostream>
#include <cstdlib>   // rand()
#include <cuda.h>
#include <time.h>
#include <math.h>

using namespace std;

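// This is for testing, so the data size is set to 10000 for now.
// It is 1D, so the block size is simply 512.
// Errors started at 100,000 elements. The problem is probably either in the
// random number generation or in GPU memory.
// If this were sorting screen data, 2560x1600 = 4,096,000 pixels, so the problem
// may not be in GPU memory but in the random number generation...
#define DATASIZE   20100
#define BLOCK_SIZE    512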

__global__ void oddevensort(int * input, int * output, unsigned int len)
{
	// Index within the block
	unsigned int tx = threadIdx.x;
	// Index within the whole data set (image)
	unsigned int x = tx + blockDim.x * blockIdx.x;
	// Temporary variable used for swapping
	int temp;

	// Loop once per element. When the pass index i is even, compare each even
	// position with the element after it. When i is odd, compare each odd
	// position with the element after it, and sort that way.
	for (int i = 0; i < len; i++)
	{
		if (i % 2 == 0)
		{
			// Careful: without the length check, leftover data in the unused
			// slots of the block can show up in the result.
			if (input[x] > input[x + 1] && x < len && x % 2 == 0)
			{
				temp = input[x + 1];
				input[x + 1] = input[x];
				input[x] = temp;
			}
		}
		if (input[x] > input[x + 1] && x < len && x % 2 != 0)
		{
			temp = input[x + 1];
			input[x + 1] = input[x];
			input[x] = temp;
		}
	}
	output[x] = input[x];
}

int main()
{
	// Generate the numbers used for the test
	int TestInput[DATASIZE], TestOutput[DATASIZE];

	for (int i = 0; i < DATASIZE; i++)
	{
		TestInput[i] = rand() % 500;
	}

	// Device setup
	int *devInput, *devOutput;
	// The size is already known
	unsigned int MemDataSize = DATASIZE * sizeof(float);
	// Allocate space on the device
	cudaMalloc((void**)&devInput, MemDataSize);
	cudaMalloc((void**)&devOutput, MemDataSize);
	cudaMemset(devOutput, 0, MemDataSize);
	// Once the space is allocated, copy the data over
	cudaMemcpy(devInput, TestInput, MemDataSize, cudaMemcpyHostToDevice);

	// Set the block size
	// It is 1D, so keep it simple...
	dim3    dimBlocksize(BLOCK_SIZE);
	dim3    dimGridsize(ceil((DATASIZE - 1) / (float)BLOCK_SIZE) + 1);

	// The max and min values need to be found first.
	// That only takes one scan from start to end, so it is left to the CPU.
	oddevensort<<<dimGridsize, dimBlocksize>>>(devInput, devOutput, DATASIZE);

	// Copy the result back
	cudaMemcpy(TestOutput, devOutput, MemDataSize, cudaMemcpyDeviceToHost);

	for (int i = 0; i < DATASIZE; i++)
	{
		cout << TestOutput[i] << ", ";
		if ((i + 1) % 10 == 0)
		{
			cout << endl;
		}
	}

	// Free the space allocated on the GPU above. Probably easier to do this as you go.

	return 0;
}

If I set DATASIZE to more than 100,000, I get an error. Please tell me the cause and a solution.

What kind of error?

Are you on Windows? If so, check whether you are running into a Windows TDR event:

By default, the Windows graphical subsystem limits kernel execution time to about 2 seconds, before the watchdog timer times out and resets the driver, destroying the CUDA context in the process.

In that case, you can either keep each kernel under 2 seconds of run time (via any combination of: smaller problem size, more highly tuned code, faster hardware), use a GPU supported by the TCC driver (and thus not associated with the Windows GUI), or experiment with changing the Windows TDR settings.

When I compile and run your code with cuda-memcheck, I get all sorts of errors.

There is no point exploring a larger DATASIZE yet; debug what you have now.
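One place to start looking, judging only from the posted code: the grid is sized as `ceil((DATASIZE - 1) / (float)BLOCK_SIZE) + 1`, the kernel reads `input[x + 1]` before it tests `x < len`, and `output[x] = input[x]` has no guard at all. The sketch below just redoes that launch arithmetic on the host. `postedGridSize` mirrors the posted expression; `ceilDivGridSize` and `threadsPastEnd` are hypothetical helper names introduced here for illustration.

```cpp
#include <cmath>

// Grid size exactly as the posted code computes it: ceil((n - 1) / block) + 1.
unsigned postedGridSize(unsigned n, unsigned block) {
    return static_cast<unsigned>(std::ceil((n - 1) / static_cast<float>(block))) + 1;
}

// Conventional integer ceiling division (hypothetical helper, not in the post).
unsigned ceilDivGridSize(unsigned n, unsigned block) {
    return (n + block - 1) / block;
}

// Number of launched threads whose global index x is >= n, i.e. threads that
// would touch memory past the end of an n-element buffer if left unguarded.
unsigned threadsPastEnd(unsigned blocks, unsigned block, unsigned n) {
    unsigned launched = blocks * block;
    return launched > n ? launched - n : 0;
}
```

With the posted values (20100 elements, 512 threads per block), `postedGridSize` gives 41 blocks, so `threadsPastEnd(41, 512, 20100)` threads have an index outside the buffers. Even with a tighter grid some threads fall past the end, so the kernel would still need an `x < len` test before any access, and an `x + 1 < len` test before touching the neighbour.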

Any time you are having trouble with a CUDA code, it's good practice to use proper CUDA error checking and to run your code with cuda-memcheck.

It’s fairly evident you are on Windows. You may wish to enable the memory checker in Visual Studio to get similar information to cuda-memcheck. But cuda-memcheck can be used on Windows also.

Having large stack-based variables like this:

int TestInput[DATASIZE], TestOutput[DATASIZE];

is generally a bad idea. Use dynamic allocation instead.
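A minimal sketch of what that could look like with `std::vector`, assuming the same test data as the posted code (values from `rand() % 500`). The function name `makeTestInput` is made up for this example.

```cpp
#include <cstdlib>
#include <vector>

// Fill a heap-allocated buffer with test values in [0, 500), as the posted
// code does for its stack array.
std::vector<int> makeTestInput(std::size_t count) {
    std::vector<int> data(count);  // element storage lives on the heap, not the stack
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = std::rand() % 500;
    }
    return data;
}
```

The result can then stand in for the array: `TestInput.data()` is the host pointer and `TestInput.size() * sizeof(int)` is the byte count to hand to `cudaMemcpy`. The size is then limited by available heap memory rather than by the much smaller default stack.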

Thank you for the answers.

The error is that the output is -842150421 -842150421 -842150421 -842150421 …

I tried the suggestions above but still can't get it to work. Maybe it’s a memory problem.
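A side note on that number, offered as a guess rather than a diagnosis: it differs by one digit from -842150451, which is what the byte pattern 0xCDCDCDCD looks like when printed as a signed 32-bit int. Microsoft's debug runtime fills freshly allocated heap memory with 0xCD bytes (and uninitialized stack memory with 0xCC). Assuming a debug build and a small typing slip in the post, the output would mean the host buffer was never written, for example because an earlier CUDA call failed. The helper name `asSignedInt` is invented for this sketch.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit fill pattern as a signed int, the way printing an
// uninitialized int buffer would show it.
std::int32_t asSignedInt(std::uint32_t pattern) {
    std::int32_t value;
    std::memcpy(&value, &pattern, sizeof value);  // bit-for-bit copy, no conversion
    return value;
}
```

Checking the return value of each `cudaMalloc`, `cudaMemcpy` and kernel launch, as suggested earlier in the thread, would confirm or rule this out.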

  1. An odd-even sort cannot work correctly without some form of synchronization between the loops. You don’t have any synchronization, so your code cannot work reliably in any fashion.

  2. Even if we had a convenient grid-wide sync mechanism (there are at least 2 possibilities), we would still have to address the question of what happens at threadblock boundaries.

  3. __syncthreads() as the very last line in any kernel serves no useful purpose. I’m not sure why people write code like that.

  4. Creating large stack-based allocations (in host code) is a bad idea. Use dynamic allocation (e.g. malloc, new, std::vector, etc.) instead.
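To make points 1 and 2 concrete, here is a sequential CPU sketch of odd-even transposition sort. It is only a reference for the algorithm's structure, not the GPU code referred to in this thread, and `oddEvenSort` is a name chosen for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential odd-even transposition sort. Each pass of the outer loop is one
// phase. In a parallel version, every compare-and-swap of a phase must be
// finished before any thread starts the next phase.
void oddEvenSort(std::vector<int>& data) {
    const std::size_t n = data.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        // Even phases compare pairs (0,1), (2,3), ...
        // Odd phases compare pairs (1,2), (3,4), ...
        for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
            if (data[i] > data[i + 1]) {
                std::swap(data[i], data[i + 1]);
            }
        }
    }
}
```

On the CPU the inner loop finishes before the next phase begins, which is the ordering a GPU version has to enforce explicitly. The `i + 1 < n` bound keeps the last pair in range. With 512 threads per block, an odd phase pairs elements 511 and 512, which belong to different threadblocks, so a block-level `__syncthreads()` alone cannot order that pair.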

Essentially all of these problems are addressed in this code:

and there is a CUDA sample code demonstrating odd-even sorting:

If you still have problems, make sure you are not running into a WDDM TDR timeout:


Thank you very much. The odd-even sort code was very helpful. I wish you good luck.