Should a kernel initializing random states with curand_init be so slow?

I run a kernel to initialize a 512^3 grid of random states for curand:

__global__ void curandInit(curandState *state)
{
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	curand_init(seed, idx, 0, &state[idx]);
}

for some pre-chosen seed. I launch with 512-length blocks (and so a 512^2 grid). It takes 770 seconds. Does this not seem a little ridiculous?

It seems slow. But that’s a lot of states (512^3 = 134 million or so). You don’t mention what hardware you use or what PRNG you specified. You also don’t show the complete code. Is this a release build?

Try the Philox generator, that provides for fast initialization. It also has the second fastest generation after XORWOW, I think.

Sorry, I forgot - this is on a P100, hence my concern. I haven’t specified a generator, but thank you for the suggestions.

did you time a debug build by accident?

I don’t have debug builds, I’m just running the code with nvprof.

Incidentally, I am also finding that drawing random numbers (at least with curand_uniform_double) is dramatically slower in code compiled by CUDA 9.2 than 9.1, 9.0, or 8.0 - by about a factor of 8. The initialization timing is unafffected.

(1) Make sure your performance measurements are reproducible in a standalone application outside the profiler

(2) Double and triple check all relevant software configuration (e.g. driver persistency, or logging) and hardware settings in your environment: It is very unlikely, though not impossible, for NVIDIA engineering to have overlooked a massive performance regression

(3) If, after performing due diligence,massive performance regression persists, consider filing a bug with NVIDIA

Okay, I have made a MWE which does reproduce the strangely(?) long initialization time, but not the slowness of drawing on 9.2. So I started tinkering with my full code’s kernel which calls a function that draws 12 random numbers (per thread); changing the number of draws per thread proportionately changes the kernel time. Thus, I suppose some sort of caching is no longer happening on 9.2? I.e., on 9.2 the global array is being read to (and written to, as the curandState is updated each time a number is drawn) each of the 12 times.

I was able to erase the performance loss by, rather than passing the pointer to the global curandState array to this function, passing the curandState itself, and then at the end set the global array’s value to the (updated) local curandState.

Does this assessment seem reasonable?

In the below MWE I implement both methods (always reading the global array vs. storing the curandState in registers):

// main.cu

// testing curand

#include <stdio.h>
#include <cuda.h>
#include <cuda_profiler_api.h>
#include <curand.h>
#include <curand_kernel.h>

#define N1 128
#define N (N1*N1*N1)
#define randseed 1230123

// wrapper for CUDA API functions to check for errors
#define gpuERR(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, const int line, bool abort=true)
{
	if(code != cudaSuccess) 
	{
		fprintf(stderr,"%s:%d: CUDA API error: %s\n",file,line,cudaGetErrorString(code));
		if(abort)
		{
			cudaDeviceReset();
			exit(code);
		}
	}
}

__global__ void curandInit(curandState *state)
{
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	if(idx < N) curand_init(randseed, idx, 0, &state[idx]);
}

__global__ void curandDraw(curandState *state, double *out)
{
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	double rand = 0.;
	if(idx < N) 
	{
		for(int i = 0; i < 12; i++) rand += curand_uniform_double(state + idx) - .5;
		out[idx] = rand;
	}
}

__global__ void curandDraw_fast(curandState *state, double *out)
{
	int idx = threadIdx.x + blockDim.x * blockIdx.x;
	double rand = 0.;
	if(idx < N) 
	{
		curandState tstate = state[idx];
		for(int i = 0; i < 12; i++) rand += curand_uniform_double(&tstate) - .5;
		out[idx] = rand;
		state[idx] = tstate;
	}
}

curandState *randState;
double *out;

int main()
{
	gpuERR( cudaMallocManaged((void**) &randState, sizeof(curandState) * N ) );
	gpuERR( cudaMallocManaged((void**) &out, sizeof(double) * N ) );

	cudaEvent_t t0, tf;
	float time;
	cudaEventCreate(&t0);
	cudaEventCreate(&tf);
	cudaEventRecord(t0,0);

	curandInit<<< N / 512, 512 >>>(randState);
	gpuERR( cudaPeekAtLastError() ); gpuERR( cudaDeviceSynchronize() );
	
	cudaEventRecord(tf,0);
	cudaEventSynchronize(tf);
	cudaEventElapsedTime(&time,t0,tf);
	printf("Initialization took %f seconds\n",time/1000.);
	
	cudaEventRecord(t0,0);
	int nloop = 1000;
	for(int i = 0; i < nloop; i++)
	{
		curandDraw<<< N / 32, 32 >>>(randState, out);
		gpuERR( cudaPeekAtLastError() ); gpuERR( cudaDeviceSynchronize() );
	}
	cudaEventRecord(tf,0);
	cudaEventSynchronize(tf);
	cudaEventElapsedTime(&time,t0,tf);
	printf("Draw kernel took %f ms\n",time/(double)nloop);
	
	cudaEventRecord(t0,0);
	for(int i = 0; i < nloop; i++)
	{
		curandDraw_fast<<< N / 32, 32 >>>(randState, out);
		gpuERR( cudaPeekAtLastError() ); gpuERR( cudaDeviceSynchronize() );
	}
	cudaEventRecord(tf,0);
	cudaEventSynchronize(tf);
	cudaEventElapsedTime(&time,t0,tf);
	printf("Fast draw kernel took %f ms\n",time/(double)nloop);

	cudaEventDestroy(t0);
	cudaEventDestroy(tf);
	cudaDeviceReset();
}

Compile (for Pascal) with

nvcc -dc -m64 -lineinfo -arch=sm_60 main.cu -o main.o
nvcc  main.o -arch=sm_60 -o rand

On my P100, CUDA v9.2 the result is

Initialization took 4.159226 seconds
Draw kernel took 0.551846 ms
Fast draw kernel took 0.703808 ms

So something (register usage?) in my full code’s kernel must be making the difference, because the “fast” kernel is not in fact fast in this MWE. The results are consistent across CUDA versions.

I have since discovered that my above fix does not work. Even though I reset the global state to the local value which was used (and thus updated) in the thread (see line 57), each instance of the kernel produces the same sequence of random numbers. I.e., every time I call curandDraw_fast, the same random values are produced at each thread as previous calls to the kernel. This makes no sense to me.

I don’t witness that behavior with the code you provided in comment 8.

I changed the line of code in your draw fast kernel to this:

for(int i = 0; i < 12; i++) {rand += curand_uniform_double(&tstate) - .5; if ((i==0)&&(idx==0)) printf("%f\n", rand);}

and didn’t observe any patterns in the output:

$ ./t144
Initialization took 3.869868 seconds
Draw kernel took 0.519415 ms
0.254432
0.480150
0.235742
0.271145
0.383922
-0.283426
0.353317
-0.067818
0.108961
0.151046
0.141724
0.058686
-0.273926
-0.288751
-0.265600
0.106202
-0.262321
-0.480962
-0.323642
0.409950
0.413447
-0.009740
-0.058418
-0.073311
-0.404135
-0.197344
-0.423619
-0.219676
-0.140044
-0.159406
-0.047196
-0.327258
-0.333314
-0.087159
-0.162112
-0.090140
0.051705
0.330248
-0.233667
-0.494851
-0.233700
-0.337157
-0.142655
0.076512
0.481324
0.064226
-0.219573
0.086828
-0.022594
0.160498
0.090945
0.043817
-0.427470
-0.078277
0.106263
-0.110973
0.152164
-0.468579
0.358656
0.136153
0.070742
0.235788
-0.115601
0.102296
0.044166
-0.241934
-0.399718
0.081609
-0.333306
-0.429697
-0.206239
-0.055501
-0.086255
-0.108718
0.126634
0.068967
0.280887
-0.409269
-0.174758
0.196415
0.002310
0.342451
0.076953
0.481056
-0.140345
0.129785
0.383440
-0.122864
-0.143088
0.018115
-0.302004
0.359464
-0.033840
-0.005112
0.145533
0.136565
0.290213
-0.292002
0.158456
0.354945
0.406768
-0.181459
0.443567
-0.469035
0.085801
-0.047423
-0.315561
-0.363869
-0.207379
-0.308434
0.468380
-0.132517
0.334825
-0.358398
-0.482861
0.199240
-0.244831
-0.476305
-0.037016
-0.291478
-0.006101
0.199342
-0.016194
0.443472
0.347504
0.261927
-0.189205
-0.293328
0.281962
-0.051866
0.216369
-0.331211
-0.353497
0.446182
0.139139
-0.091227
-0.479185
0.307424
0.193139
-0.475357
-0.204784
-0.310183
-0.444653
0.096959
0.407450
0.400280
-0.323714
0.308113
-0.061610
0.200411
-0.001438
0.171085
0.360167
-0.344397
-0.116415
-0.006442
-0.151812
-0.141332
0.336556
0.088113
-0.152023
0.330021
0.335188
0.138943
-0.273040
0.068964
0.336138
-0.442906
-0.366624
0.171702
0.249512
0.175467
0.192278
0.373944
0.477236
-0.082484
-0.241121
0.227470
-0.498001
0.186681
-0.150172
0.318294
0.184876
0.220352
-0.222860
0.173613
0.143596
-0.278733
-0.145868
0.089585
-0.191609
-0.025599
-0.343290
-0.175902
0.356219
-0.443927
-0.468961
0.046371
-0.183649
0.031109
-0.319568
0.374548
0.143277
-0.143840
0.138287
0.388451
0.177719
-0.429862
0.429399
-0.119153
0.091108
0.285985
-0.369951
-0.456249
0.265256
-0.032840
0.407070
-0.427541
-0.373480
-0.027046
0.404019
-0.387288
0.347482
0.498810
0.124749
-0.459629
-0.285639
0.260283
-0.139314
0.466143
0.342089
-0.024130
0.250381
0.387966
0.487158
0.464135
-0.093861
0.202425
-0.390916
-0.478823
-0.126236
-0.195968
-0.416271
-0.247836
0.466960
-0.317937
0.343566
-0.326403
0.493859
0.331239
0.267819
0.430350
0.399706
0.100491
0.128105
-0.158084
-0.390077
-0.229476
0.299016
0.081832
-0.423998
-0.161197
-0.414676
0.014414
0.372268
-0.413061
0.284039
-0.302660
0.249174
0.400827
-0.058950
-0.044480
0.024679
0.460447
-0.111877
0.120620
0.135036
-0.498771
-0.324111
0.341738
-0.353403
0.171917
-0.408190
-0.114473
0.170948
-0.371433
0.431278
-0.399372
-0.011192
-0.074886
0.262014
0.034662
0.189733
0.491691
-0.063436
-0.369668
0.128167
0.254283
-0.200993
-0.118755
0.339436
-0.192885
0.181423
-0.003678
0.088794
-0.006472
-0.417468
0.434182
0.331750
0.016597
-0.108331
-0.000288
0.157720
-0.486156
-0.069004
0.176525
-0.308721
-0.375654
-0.180811
-0.366828
-0.337537
-0.193444
0.202111
0.193166
-0.307786
-0.090756
-0.315993
-0.048451
0.363610
-0.405718
0.231772
0.468720
0.328538
0.118268
0.154299
0.447421
-0.219422
0.469690
-0.437850
-0.043725
0.001307
0.425522
0.072893
0.137555
0.048809
-0.485274
-0.063265
0.263935
-0.411678
0.201690
0.022843
-0.293158
0.419058
-0.255604
-0.252635
0.193727
0.067367
0.419321
-0.296780
-0.487667
-0.474434
-0.376958
0.454855
0.316018
0.425847
-0.237484
0.471259
-0.179054
0.040747
0.458777
0.367011
-0.386391
0.245962
-0.406444
-0.297108
-0.274599
-0.232452
0.268008
-0.325753
-0.004212
0.435545
-0.328743
0.277363
-0.297676
0.347638
-0.202725
0.187805
-0.472686
0.176993
-0.288686
-0.372050
-0.393081
-0.427131
0.493996
0.415194
0.133361
0.245019
0.444675
0.066273
0.293809
-0.048723
-0.097171
0.253644
0.078863
-0.492975
0.448581
0.174805
0.391672
0.177340
0.116530
0.314582
0.238712
-0.450293
0.290709
-0.274478
0.069266
-0.319543
0.347151
0.371422
-0.395960
-0.482110
0.284428
-0.303197
0.357506
-0.489337
-0.110879
0.472483
0.050603
-0.402977
0.404934
-0.401872
-0.084540
-0.412079
-0.287485
-0.125754
-0.405754
-0.212777
-0.371262
-0.153480
-0.426351
-0.300870
0.418417
0.271637
-0.105063
-0.130321
0.330667
0.110544
-0.109093
0.164932
0.140228
0.414500
0.173068
-0.386533
-0.104722
-0.154805
-0.441535
-0.068342
0.069727
0.173604
0.419824
0.470039
-0.079380
0.246435
-0.103791
-0.497907
-0.295420
0.154576
-0.298694
-0.030008
-0.382016
0.381990
0.343780
0.400424
-0.371243
0.279185
-0.471098
0.465716
-0.124154
0.208164
-0.313506
-0.187128
0.358458
-0.429898
0.192163
0.377424
-0.323086
-0.182122
-0.372185
0.251995
-0.086354
0.338180
0.339806
0.070344
0.277037
0.152607
-0.263169
-0.262908
-0.100752
-0.015261
0.373344
0.333249
-0.043611
0.090385
-0.125464
-0.322370
0.134778
-0.491178
-0.457745
-0.166234
0.165920
-0.487114
-0.000063
-0.090716
0.270139
0.023351
0.426919
-0.074187
-0.280917
-0.098410
0.200487
-0.256988
-0.361365
0.179012
-0.468289
0.438142
-0.097994
-0.425503
-0.066791
0.372060
-0.403911
0.417636
0.003627
0.336361
-0.032861
-0.100288
-0.148208
-0.392776
-0.337015
-0.413874
0.249128
-0.236127
-0.444377
-0.109127
-0.054248
0.199558
-0.395396
0.265575
-0.459768
0.257069
-0.060016
-0.117114
-0.480535
0.431014
0.099994
-0.333722
-0.308605
-0.169497
0.248924
0.372919
0.440998
0.196945
0.219079
-0.303968
-0.379030
-0.457418
-0.148415
-0.018057
-0.230533
0.332789
-0.408732
0.466937
-0.212484
-0.394012
0.245576
-0.217706
0.339293
-0.152930
0.252280
0.335313
0.403877
0.254670
0.279665
-0.034133
-0.071420
0.161613
-0.397135
-0.307027
0.098833
0.024219
-0.017732
0.105384
0.337753
-0.495706
0.417201
0.281291
0.342700
-0.342358
0.092346
0.307098
-0.042362
0.104754
-0.097845
0.229677
0.086869
-0.105407
0.401045
-0.431481
0.411926
0.174836
-0.267432
0.050452
0.047581
-0.152235
-0.461847
-0.375564
0.098516
-0.371466
0.470089
0.335755
0.215223
-0.323820
-0.140284
0.064145
0.029470
0.064537
0.488474
-0.124825
-0.328638
-0.220769
0.080623
-0.497659
0.472834
0.358291
0.088897
-0.015537
0.231562
0.152074
0.271584
-0.421085
0.270073
0.069573
-0.468636
-0.024085
-0.197771
-0.450329
0.444962
0.305250
-0.466055
0.440494
0.209141
0.409724
-0.388384
-0.076962
-0.480255
-0.051325
-0.271341
0.315388
-0.458865
-0.287467
-0.027409
-0.340939
0.463128
0.199187
0.349931
0.497137
0.405218
-0.060850
-0.205671
0.070589
0.493642
-0.283476
0.296999
-0.430289
-0.293330
0.225621
-0.235967
-0.294634
-0.006777
-0.023855
-0.471342
0.227148
0.274792
0.352525
-0.112928
0.375656
-0.105891
-0.066048
0.063208
0.331113
-0.428446
-0.405721
-0.381796
0.121706
-0.314004
-0.273877
0.484054
0.457523
-0.484641
-0.236822
0.481570
-0.355056
-0.160009
-0.418500
0.147714
-0.316031
-0.143576
0.422962
-0.453994
0.461375
0.034270
-0.201136
-0.405556
0.070753
0.264757
0.185434
0.165293
-0.216711
-0.310905
0.344771
-0.018091
0.471089
-0.085502
-0.404293
-0.244556
0.420534
-0.391240
0.322098
-0.282958
-0.131471
0.451069
-0.340133
-0.122081
-0.095658
0.255636
-0.403987
0.205397
-0.032767
0.490919
-0.497427
-0.283121
0.156052
0.386312
0.157536
-0.373780
-0.479166
0.271995
-0.224686
0.137637
0.011261
0.308756
-0.072619
-0.337257
-0.210995
0.358993
0.337976
-0.221586
0.077954
-0.386515
0.310300
-0.489848
0.203684
-0.496103
-0.182000
0.192231
-0.216468
-0.355868
-0.280156
-0.463064
0.440352
-0.170590
0.114757
-0.357692
-0.478157
0.269101
-0.022997
-0.236311
-0.170337
-0.362426
-0.436738
-0.089238
0.319386
0.012416
-0.188145
0.386822
0.164877
-0.264850
-0.211034
-0.334115
-0.174733
0.163156
0.280281
-0.152529
0.013585
0.400874
0.230582
0.174578
0.174571
0.316898
0.269466
0.078417
0.479280
-0.177603
-0.101757
0.493786
0.214483
-0.277533
-0.499541
0.365554
-0.477545
0.056479
0.299854
0.032345
-0.258731
0.405779
-0.357366
-0.460739
-0.492726
0.407390
-0.048636
0.446019
0.055330
0.326518
-0.444401
-0.341943
0.363396
0.312451
0.086660
-0.217533
0.011850
0.403842
0.233884
-0.369777
-0.454769
-0.369805
0.164833
0.494073
-0.163646
0.270536
-0.345083
0.125132
-0.324942
-0.272818
0.160769
-0.206272
0.130475
-0.112110
-0.395193
-0.137619
0.285568
0.399252
-0.247156
0.316599
-0.100191
-0.255546
0.230293
0.464896
0.189753
-0.304772
0.497975
-0.161042
-0.175726
0.216129
-0.166230
-0.443784
-0.152519
-0.220281
0.067646
-0.087349
-0.389251
0.481274
-0.215370
-0.374573
0.229073
0.321152
-0.474816
-0.138685
0.013781
0.373231
0.302161
0.454883
-0.285142
0.060563
-0.247685
0.445111
-0.003627
0.067666
-0.422982
-0.095384
-0.355679
0.115839
0.102655
-0.145974
0.027378
0.497386
-0.400236
0.409284
-0.391120
-0.038925
-0.149363
0.098158
-0.114165
0.459471
-0.371831
0.390304
-0.017673
-0.001583
-0.180428
-0.080482
0.010287
-0.323602
-0.417459
0.083680
-0.496296
-0.062174
-0.150899
0.017588
0.214158
0.495686
0.273252
0.100606
-0.134419
-0.353933
-0.282670
0.485529
0.093296
0.187343
0.121681
0.330330
0.407622
-0.351598
0.053261
-0.486755
-0.332464
-0.296843
0.062726
0.453518
0.364117
-0.397736
0.338348
0.301009
0.478763
0.248835
-0.099125
-0.447925
-0.099891
0.493288
-0.213988
0.131666
-0.448491
0.138108
0.424010
0.389637
-0.469019
-0.112729
-0.437476
0.315424
-0.004335
-0.152775
0.340883
0.104579
0.437755
-0.005632
-0.251491
0.145181
-0.013297
-0.097221
0.481691
-0.326087
0.434664
0.479463
0.472976
-0.484371
-0.112073
0.237779
-0.258461
0.303477
-0.298417
-0.179569
0.482526
-0.378995
0.407500
0.469759
0.393375
0.386734
0.480048
-0.477414
0.008668
0.315613
-0.344589
-0.408609
-0.489845
-0.380185
-0.209526
-0.037200
-0.244076
0.048549
-0.447043
-0.440255
0.488943
-0.305319
Fast draw kernel took 0.695714 ms

Thanks for running the test. I agree with your results. In the full code in which I use this, I was passing the state by value to a function which would draw and process the random numbers. Of course, one should pass by reference so that the correct address is updated (for, as it occurs to me, the same reason the states are passed to curand_uniform_double by reference).