java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI

anuja.fole · November 28, 2022, 5:33am

While running the spark-xgboost example we are getting below issue while training xgboost model.

2/11/28 05:17:44 ERROR Executor: Exception in task 3.0 in stage 54.0 (TID 3401)
java.lang.NoClassDefFoundError: Could not initialize class ml.dmlc.xgboost4j.java.XGBoostJNI
	at ml.dmlc.xgboost4j.java.Rabit.shutdown(Rabit.java:83)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.buildDistributedBooster(GpuXGBoost.scala:327)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuXGBoost$.$anonfun$trainOnGpuInternal$1(GpuXGBoost.scala:254)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:378)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1508)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1435)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1499)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1322)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:376)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:327)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Here is the setting we did while creating the spark session:

conf = SparkConf()
conf.set("spark.executor.cores",4)
conf.set("spark.executor.memory", "16g")
conf.set("spark.executor.resource.gpu.amount", 1)
conf.set("spark.executor.resource.gpu.discoveryScript", "/rapids/notebooks/getGpusResources.sh")
# # Plugin settings
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.rapids.sql.concurrentGpuTasks", 2)
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.rapids.sql.enabled", "true")
conf.set("spark.rapids.sql.explain", "ALL")
conf.set("spark.rapids.sql.hasNans", "false")
conf.set("spark.rapids.sql.csv.read.double.enabled", "true")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

Here are the packages/jar files used:

pyspark==3.3.0
com.nvidia:rapids-4-spark_2.12:22.10.0
ai.rapids:cudf:22.10.0
com.nvidia:xgboost4j_3.0:1.4.2-0.3.0
com.nvidia:xgboost4j-spark_3.0:1.4.2-0.3.0

Here are the hardware specifications:

Machine: Ubuntu 18.04
Cuda: 11.2
GPUs: 4 NVIDIA Tesla V100 GPUs each 16GB.

We even tried changing the configurations but xgboost is not able to use the GPUs.

sluo · November 29, 2022, 12:26pm

Could you share the complete log? Thanks

anuja.fole · November 29, 2022, 1:09pm

Hi,

Here is the complete log, which I am sharing in a file.
error.docx (282.9 KB)

Thanks

bfurtaw · November 29, 2022, 1:30pm

Please share the log as a text file? Sorry for the extra work. My Linux system does not have a good converter for docx file format.

anuja.fole · November 29, 2022, 1:31pm

No worry, I will do the same, sorry for the inconvenience caused.

anuja.fole · November 29, 2022, 1:34pm

Here is the text file:
error.txt (854.5 KB)

Thanks

bfurtaw · November 29, 2022, 1:45pm

Got it thank you

anuja.fole · December 1, 2022, 10:56am

Any findings on the issue?

Thanks

sluo · December 5, 2022, 2:19pm

You need use the version 22.04 as follows:

Python 3.8
Spark 3.2.1
rapids-4-spark_2.12-22.04.0.jar
rapids-4-spark-ml_2.12-22.02.0-cuda11.jar
xgboost4j_3.0-1.4.2-0.3.0.jar
xgboost4j-spark_3.0-1.4.2-0.3.0.jar
cudf-22.04.0-cuda11.jar

Here is the complete guide: spark-rapids-examples/python-notebook.md at branch-22.04 · NVIDIA/spark-rapids-examples · GitHub

Attention: Python 3.9 will not work with this sample.

anuja.fole · December 8, 2022, 11:08am

Hello Team,
Even after trying with the shared configurations, we are facing the same issue. Here is the screenshot:

sluo · December 8, 2022, 12:31pm

Could you add those two lines into the jupyter notebook and share the full log back? Thanks.

spark = SparkSession.builder.getOrCreate()
spark.setLogLevel(INFO)

anuja.fole · December 8, 2022, 12:33pm

Yes working on collecting the complete log.

anuja.fole · December 8, 2022, 12:56pm

Here is the logs generated after creating spark session:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.nvidia#rapids-4-spark_2.12 added as a dependency
com.nvidia#rapids-4-spark-ml_2.12 added as a dependency
ai.rapids#cudf added as a dependency
com.nvidia#xgboost4j_3.0 added as a dependency
com.nvidia#xgboost4j-spark_3.0 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-187d07ca-4ebb-4af7-9ece-3c7e2f1d8c04;1.0
	confs: [default]
	found com.nvidia#rapids-4-spark_2.12;22.04.0 in central
	found ai.rapids#cudf;22.04.0 in central
	found org.slf4j#slf4j-api;1.7.30 in local-m2-cache
	found com.nvidia#rapids-4-spark-ml_2.12;22.02.0 in central
	found com.nvidia#xgboost4j_3.0;1.4.2-0.3.0 in central
	found com.typesafe.akka#akka-actor_2.12;2.5.23 in central
	found com.typesafe#config;1.3.3 in central
	found org.scala-lang.modules#scala-java8-compat_2.12;0.8.0 in central
	found com.fasterxml.jackson.core#jackson-databind;2.10.3 in central
	found com.fasterxml.jackson.core#jackson-annotations;2.10.3 in central
	found com.fasterxml.jackson.core#jackson-core;2.10.3 in central
	found org.scalatest#scalatest_2.12;3.0.5 in central
	found org.scalactic#scalactic_2.12;3.0.5 in central
	found org.scala-lang.modules#scala-xml_2.12;1.0.6 in local-m2-cache
	found com.esotericsoftware#kryo;4.0.2 in central
	found com.esotericsoftware#reflectasm;1.11.3 in central
	found org.ow2.asm#asm;5.0.4 in central
	found com.esotericsoftware#minlog;1.3.0 in local-m2-cache
	found org.objenesis#objenesis;2.5.1 in local-m2-cache
	found org.scala-lang#scala-reflect;2.12.8 in local-m2-cache
	found commons-logging#commons-logging;1.2 in local-m2-cache
	found com.nvidia#xgboost4j-spark_3.0;1.4.2-0.3.0 in central
:: resolution report :: resolve 500ms :: artifacts dl 20ms
	:: modules in use:
	ai.rapids#cudf;22.04.0 from central in [default]
	com.esotericsoftware#kryo;4.0.2 from central in [default]
	com.esotericsoftware#minlog;1.3.0 from local-m2-cache in [default]
	com.esotericsoftware#reflectasm;1.11.3 from central in [default]
	com.fasterxml.jackson.core#jackson-annotations;2.10.3 from central in [default]
	com.fasterxml.jackson.core#jackson-core;2.10.3 from central in [default]
	com.fasterxml.jackson.core#jackson-databind;2.10.3 from central in [default]
	com.nvidia#rapids-4-spark-ml_2.12;22.02.0 from central in [default]
	com.nvidia#rapids-4-spark_2.12;22.04.0 from central in [default]
	com.nvidia#xgboost4j-spark_3.0;1.4.2-0.3.0 from central in [default]
	com.nvidia#xgboost4j_3.0;1.4.2-0.3.0 from central in [default]
	com.typesafe#config;1.3.3 from central in [default]
	com.typesafe.akka#akka-actor_2.12;2.5.23 from central in [default]
	commons-logging#commons-logging;1.2 from local-m2-cache in [default]
	org.objenesis#objenesis;2.5.1 from local-m2-cache in [default]
	org.ow2.asm#asm;5.0.4 from central in [default]
	org.scala-lang#scala-reflect;2.12.8 from local-m2-cache in [default]
	org.scala-lang.modules#scala-java8-compat_2.12;0.8.0 from central in [default]
	org.scala-lang.modules#scala-xml_2.12;1.0.6 from local-m2-cache in [default]
	org.scalactic#scalactic_2.12;3.0.5 from central in [default]
	org.scalatest#scalatest_2.12;3.0.5 from central in [default]
	org.slf4j#slf4j-api;1.7.30 from local-m2-cache in [default]
	:: evicted modules:
	com.nvidia#rapids-4-spark_2.12;22.02.0 by [com.nvidia#rapids-4-spark_2.12;22.04.0] in [default]
	ai.rapids#cudf;22.02.0 by [ai.rapids#cudf;22.04.0] in [default]
	org.scala-lang#scala-reflect;2.12.4 by [org.scala-lang#scala-reflect;2.12.8] in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   25  |   0   |   0   |   3   ||   23  |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-187d07ca-4ebb-4af7-9ece-3c7e2f1d8c04
	confs: [default]
	0 artifacts copied, 23 already retrieved (0kB/13ms)
22/12/08 18:24:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/08 18:24:31 WARN ResourceProfile: The executor resource config for resource: gpu was specified but no corresponding task resource request was specified.
22/12/08 18:24:33 WARN RapidsPluginUtils: RAPIDS Accelerator 22.04.0 using cudf 22.04.0.
22/12/08 18:24:33 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
22/12/08 18:24:45 WARN Plugin: Installing rapids UDF compiler extensions to Spark. The compiler is disabled by default. To enable it, set `spark.rapids.sql.udfCompiler.enabled` to true

I will be sending the logs generated while training the xgboost model shortly in text file

anuja.fole · December 8, 2022, 1:18pm

Hello Team,
What will be the best Cuda version for these configurations?

sluo · December 8, 2022, 1:23pm

CUDA 11.x on Enterprise GPU

anuja.fole · December 8, 2022, 1:45pm

error1.txt (4.2 MB)
here is the training logs

sluo · December 8, 2022, 1:50pm

ERROR Message is “AttributeError: ‘Thread’ object has no attribute ‘isAlive’. Did you mean: ‘is_alive’?”

Please use python 3.8 or below, do not use Python 3.9

anuja.fole · December 16, 2022, 9:54am

Hello Team,

With these configurations, we are able to run our training process, but our prediction step is taking a lot of time to execute. Also, the code is not using all the GPUs. What settings or configurations would you suggest so that the code uses all 4 GPUs that we have?