java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI

While running the spark-xgboost example, we get the error below when training the XGBoost model.

2/11/28 05:17:44 ERROR Executor: Exception in task 3.0 in stage 54.0 (TID 3401)
java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
	at ml.dmlc.xgboost4j.java.Rabit.shutdown(Rabit.java:83)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.buildDistributedBooster(GpuXGBoost.scala:327)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.$anonfun$trainOnGpuInternal$1(GpuXGBoost.scala:254)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
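
A note on reading this trace: in Java, "NoClassDefFoundError: Could not initialize class" means the class's static initializer already failed on an earlier attempt, so the first failure (typically an ExceptionInInitializerError or UnsatisfiedLinkError) appears earlier in the executor log. The sketch below is one way to search a saved log for it. The function name and the pattern list are illustrative assumptions, not part of any Spark or XGBoost tooling.

```python
import re
from typing import Iterable, Optional

# Error names that usually mark the first failed class initialization.
# This list is an assumption for illustration; extend it as needed.
ROOT_CAUSE = re.compile(r"ExceptionInInitializerError|UnsatisfiedLinkError")


def first_root_cause(lines: Iterable[str]) -> Optional[str]:
    """Return the first log line that names a likely root cause, or None."""
    for line in lines:
        if ROOT_CAUSE.search(line):
            return line.rstrip("\n")
    return None
```

For a log saved as a text file, something like `first_root_cause(open("error.txt", errors="replace"))` would return the first matching line, if there is one.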

Here are the settings we used when creating the Spark session:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Executor sizing: 4 cores, 16 GB of memory, and 1 GPU per executor
conf.set("spark.executor.cores", "4")
conf.set("spark.executor.memory", "16g")
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.executor.resource.gpu.discoveryScript", "/rapids/notebooks/getGpusResources.sh")
# RAPIDS Accelerator plugin settings
conf.set("spark.rapids.sql.concurrentGpuTasks", "2")
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.rapids.sql.enabled", "true")
conf.set("spark.rapids.sql.explain", "ALL")
conf.set("spark.rapids.sql.hasNans", "false")
conf.set("spark.rapids.sql.csv.read.double.enabled", "true")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

Here are the packages/jar files used:

  • pyspark==3.3.0
  • com.nvidia:rapids-4-spark_2.12:22.10.0
  • ai.rapids:cudf:22.10.0
  • com.nvidia:xgboost4j_3.0:1.4.2-0.3.0
  • com.nvidia:xgboost4j-spark_3.0:1.4.2-0.3.0

Here are the hardware specifications:

  • Machine: Ubuntu 18.04
  • CUDA: 11.2
  • GPUs: 4 NVIDIA Tesla V100, 16 GB each

We also tried changing the configuration, but XGBoost is still not able to use the GPUs.

Could you share the complete log? Thanks

Hi,

Here is the complete log, which I am sharing in a file.
error.docx (282.9 KB)

Thanks

Could you please share the log as a text file? Sorry for the extra work. My Linux system does not have a good converter for the docx file format.

No worries, I will do that. Sorry for the inconvenience.

Here is the text file:
error.txt (854.5 KB)

Thanks

Got it, thank you.

Any findings on the issue?

Thanks

You need to use version 22.04, as follows:

Python 3.8
Spark 3.2.1
rapids-4-spark_2.12-22.04.0.jar
rapids-4-spark-ml_2.12-22.02.0-cuda11.jar
xgboost4j_3.0-1.4.2-0.3.0.jar
xgboost4j-spark_3.0-1.4.2-0.3.0.jar
cudf-22.04.0-cuda11.jar
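
If the jars are pulled from Maven rather than copied by hand, the same set can be expressed as a comma-separated list of coordinates. The sketch below builds that string. The group IDs and versions are the ones that appear in the dependency-resolution log later in this thread; the names PACKAGES and packages_option are made up for this illustration.

```python
# Maven coordinates as resolved in the Ivy log shown later in this thread.
PACKAGES = [
    "com.nvidia:rapids-4-spark_2.12:22.04.0",
    "com.nvidia:rapids-4-spark-ml_2.12:22.02.0",
    "ai.rapids:cudf:22.04.0",
    "com.nvidia:xgboost4j_3.0:1.4.2-0.3.0",
    "com.nvidia:xgboost4j-spark_3.0:1.4.2-0.3.0",
]


def packages_option(coords):
    """Join Maven coordinates into the comma-separated form Spark expects."""
    return ",".join(coords)
```

The resulting string would be passed as the value of spark.jars.packages, for example conf.set("spark.jars.packages", packages_option(PACKAGES)), before the Spark session is created.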

Here is the complete guide: spark-rapids-examples/python-notebook.md at branch-22.04 · NVIDIA/spark-rapids-examples · GitHub

Attention: Python 3.9 will not work with this sample.

Hello Team,
Even after trying with the shared configurations, we are facing the same issue. Here is the screenshot:

Could you add these lines to the Jupyter notebook and share the full log back? Thanks.

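from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# setLogLevel is a SparkContext method and takes the level as a string
spark.sparkContext.setLogLevel("INFO")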

Yes, we are working on collecting the complete log.

Here are the logs generated after creating the Spark session:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.nvidia#rapids-4-spark_2.12 added as a dependency
com.nvidia#rapids-4-spark-ml_2.12 added as a dependency
ai.rapids#cudf added as a dependency
com.nvidia#xgboost4j_3.0 added as a dependency
com.nvidia#xgboost4j-spark_3.0 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-187d07ca-4ebb-4af7-9ece-3c7e2f1d8c04;1.0
	confs: [default]
	found com.nvidia#rapids-4-spark_2.12;22.04.0 in central
	found ai.rapids#cudf;22.04.0 in central
	found org.slf4j#slf4j-api;1.7.30 in local-m2-cache
	found com.nvidia#rapids-4-spark-ml_2.12;22.02.0 in central
	found com.nvidia#xgboost4j_3.0;1.4.2-0.3.0 in central
	found com.typesafe.akka#akka-actor_2.12;2.5.23 in central
	found com.typesafe#config;1.3.3 in central
	found org.scala-lang.modules#scala-java8-compat_2.12;0.8.0 in central
	found com.fasterxml.jackson.core#jackson-databind;2.10.3 in central
	found com.fasterxml.jackson.core#jackson-annotations;2.10.3 in central
	found com.fasterxml.jackson.core#jackson-core;2.10.3 in central
	found org.scalatest#scalatest_2.12;3.0.5 in central
	found org.scalactic#scalactic_2.12;3.0.5 in central
	found org.scala-lang.modules#scala-xml_2.12;1.0.6 in local-m2-cache
	found com.esotericsoftware#kryo;4.0.2 in central
	found com.esotericsoftware#reflectasm;1.11.3 in central
	found org.ow2.asm#asm;5.0.4 in central
	found com.esotericsoftware#minlog;1.3.0 in local-m2-cache
	found org.objenesis#objenesis;2.5.1 in local-m2-cache
	found org.scala-lang#scala-reflect;2.12.8 in local-m2-cache
	found commons-logging#commons-logging;1.2 in local-m2-cache
	found com.nvidia#xgboost4j-spark_3.0;1.4.2-0.3.0 in central
:: resolution report :: resolve 500ms :: artifacts dl 20ms
	:: modules in use:
	ai.rapids#cudf;22.04.0 from central in [default]
	com.esotericsoftware#kryo;4.0.2 from central in [default]
	com.esotericsoftware#minlog;1.3.0 from local-m2-cache in [default]
	com.esotericsoftware#reflectasm;1.11.3 from central in [default]
	com.fasterxml.jackson.core#jackson-annotations;2.10.3 from central in [default]
	com.fasterxml.jackson.core#jackson-core;2.10.3 from central in [default]
	com.fasterxml.jackson.core#jackson-databind;2.10.3 from central in [default]
	com.nvidia#rapids-4-spark-ml_2.12;22.02.0 from central in [default]
	com.nvidia#rapids-4-spark_2.12;22.04.0 from central in [default]
	com.nvidia#xgboost4j-spark_3.0;1.4.2-0.3.0 from central in [default]
	com.nvidia#xgboost4j_3.0;1.4.2-0.3.0 from central in [default]
	com.typesafe#config;1.3.3 from central in [default]
	com.typesafe.akka#akka-actor_2.12;2.5.23 from central in [default]
	commons-logging#commons-logging;1.2 from local-m2-cache in [default]
	org.objenesis#objenesis;2.5.1 from local-m2-cache in [default]
	org.ow2.asm#asm;5.0.4 from central in [default]
	org.scala-lang#scala-reflect;2.12.8 from local-m2-cache in [default]
	org.scala-lang.modules#scala-java8-compat_2.12;0.8.0 from central in [default]
	org.scala-lang.modules#scala-xml_2.12;1.0.6 from local-m2-cache in [default]
	org.scalactic#scalactic_2.12;3.0.5 from central in [default]
	org.scalatest#scalatest_2.12;3.0.5 from central in [default]
	org.slf4j#slf4j-api;1.7.30 from local-m2-cache in [default]
	:: evicted modules:
	com.nvidia#rapids-4-spark_2.12;22.02.0 by [com.nvidia#rapids-4-spark_2.12;22.04.0] in [default]
	ai.rapids#cudf;22.02.0 by [ai.rapids#cudf;22.04.0] in [default]
	org.scala-lang#scala-reflect;2.12.4 by [org.scala-lang#scala-reflect;2.12.8] in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   25  |   0   |   0   |   3   ||   23  |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-187d07ca-4ebb-4af7-9ece-3c7e2f1d8c04
	confs: [default]
	0 artifacts copied, 23 already retrieved (0kB/13ms)
22/12/08 18:24:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/08 18:24:31 WARN ResourceProfile: The executor resource config for resource: gpu was specified but no corresponding task resource request was specified.
22/12/08 18:24:33 WARN RapidsPluginUtils: RAPIDS Accelerator 22.04.0 using cudf 22.04.0.
22/12/08 18:24:33 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
22/12/08 18:24:45 WARN Plugin: Installing rapids UDF compiler extensions to Spark. The compiler is disabled by default. To enable it, set `spark.rapids.sql.udfCompiler.enabled` to true

I will send the logs generated while training the XGBoost model shortly, as a text file.

Hello Team,
Which CUDA version is best for these configurations?

CUDA 11.x on Enterprise GPU

error1.txt (4.2 MB)
Here are the training logs.

The error message is "AttributeError: 'Thread' object has no attribute 'isAlive'. Did you mean: 'is_alive'?"

Please use Python 3.8 or below; do not use Python 3.9.
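
For background, threading.Thread.isAlive() was removed in Python 3.9 in favor of is_alive(), which is why code that still calls the old name fails on newer interpreters. A quick check like the following can be run in the notebook before training. The two helper names are made up for this illustration.

```python
import sys
import threading


def python_supported(version_info=sys.version_info) -> bool:
    """True for interpreters at or below 3.8, per the advice in this thread."""
    return tuple(version_info[:2]) <= (3, 8)


def thread_has_is_alive_alias() -> bool:
    """True if the legacy Thread.isAlive method still exists."""
    return hasattr(threading.Thread, "isAlive")


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Supported by this sample:", python_supported())
    print("Thread.isAlive available:", thread_has_is_alive_alias())
```

If the first check reports False, switching the notebook kernel to a Python 3.8 environment should avoid the AttributeError.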

Hello Team,

With these configurations, we are able to run our training process, but our prediction step is taking a lot of time to execute. Also, the code is not using all the GPUs. What settings or configurations would you suggest so that the code uses all 4 GPUs that we have?
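
One thing visible in the session log above is the warning that an executor GPU resource was configured without a matching task resource request. As a starting point only, the sketch below collects settings for a one-GPU-per-executor layout with one task per executor. It is not a configuration confirmed anywhere in this thread: the function name is hypothetical, the values are assumptions to adjust, and spark.executor.instances only takes effect on cluster managers that honor it. The XGBoost worker count (a training parameter not shown in this thread) would presumably also need to match the number of GPUs.

```python
def gpu_executor_settings(num_gpus: int, cores_per_executor: int) -> dict:
    """Build illustrative Spark settings: one executor, and one task, per GPU."""
    if num_gpus < 1 or cores_per_executor < 1:
        raise ValueError("num_gpus and cores_per_executor must be positive")
    return {
        # One executor per GPU (assumes the cluster manager honors this key).
        "spark.executor.instances": str(num_gpus),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.resource.gpu.amount": "1",
        # Task-level GPU request; this is what the ResourceProfile warning
        # in the log says is missing.
        "spark.task.resource.gpu.amount": "1",
        # Give each task all executor cores so only one task runs per GPU.
        "spark.task.cpus": str(cores_per_executor),
    }
```

With 4 GPUs and 4 cores per executor, the settings could be applied with a loop such as: for key, value in gpu_executor_settings(4, 4).items(): conf.set(key, value). Running one task per executor suits GPU training but limits parallelism for other stages, so these values may need to differ between training and prediction.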