Spark RAPIDS workload fails to run on Kubernetes cluster

Hi,
I am configuring a Spark deployment to run GPU workloads using RAPIDS. I tried following the documentation available here.

Versions:
Kubernetes: v1.22.12-eks-ba74326 (Running on AWS EKS)
Spark: 3.1.1
RAPIDS Jar: rapids-4-spark_2.12-22.12.0.jar

Current Process:

  1. Created a Spark cluster with the Bitnami Helm chart on Kubernetes v1.22.12 on AWS EKS.
  2. Ran a non-GPU test workload to confirm that the Spark cluster operates as intended. (This ran without issues.)
  3. Created a Docker image following the RAPIDS documentation for Spark.
  4. Used spark-submit to submit the test workload available on the RAPIDS page. This step failed with the error below (see the command sketch after this list).
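
The submission was modeled on the sample command in the RAPIDS documentation and looked roughly like the sketch below. The master address is the one that also appears in the log; the plugin path, main class, and application jar are placeholders rather than my literal values:

$SPARK_HOME/bin/spark-submit \
  --master spark://katonic-spark-operator-master-svc:7077 \
  --deploy-mode cluster \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --jars local:///opt/sparkRapidsPlugin/rapids-4-spark_2.12-22.12.0.jar \
  --class <main-class-of-test-workload> \
  local:///path/to/test-workload.jar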

Error Dump for spark-submit:
23/01/10 08:37:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/10 08:37:04 INFO SecurityManager: Changing view acls to: spark
23/01/10 08:37:04 INFO SecurityManager: Changing modify acls to: spark
23/01/10 08:37:04 INFO SecurityManager: Changing view acls groups to:
23/01/10 08:37:04 INFO SecurityManager: Changing modify acls groups to:
23/01/10 08:37:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
23/01/10 08:37:04 INFO Utils: Successfully started service 'driverClient' on port 45019.
23/01/10 08:37:04 INFO TransportClientFactory: Successfully created connection to katonic-spark-operator-master-svc/172.20.18.180:7077 after 26 ms (0 ms spent in bootstraps)
23/01/10 08:37:04 INFO ClientEndpoint: ... waiting before polling master for driver state
23/01/10 08:37:04 INFO ClientEndpoint: Driver successfully submitted as driver-20230110083704-0001
23/01/10 08:37:09 INFO ClientEndpoint: State of driver-20230110083704-0001 is ERROR
23/01/10 08:37:09 ERROR ClientEndpoint: Exception from cluster was: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "local"
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "local"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1980)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:817)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:557)
at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:162)
at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:179)
at org.apache.spark.deploy.worker.DriverRunner$$anon$2.run(DriverRunner.scala:99)
23/01/10 08:37:09 INFO ShutdownHookManager: Shutdown hook called
23/01/10 08:37:09 INFO ShutdownHookManager: Deleting directory /tmp/spark-9f37bad9-c981-4194-a448-2c6936e8f7bd

I cannot reproduce your issue. Please follow the guide here and use the provided Dockerfile to run the sample code.
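
In case it helps, building and publishing the image from that Dockerfile is a standard Docker workflow, roughly like the following (the registry name and tag are placeholders for your own values):

docker build -t <your-registry>/spark-rapids:22.12.0 -f Dockerfile .
docker push <your-registry>/spark-rapids:22.12.0

The resulting image needs to be the one your Spark pods run, so that the RAPIDS jar and scripts inside it are available on every node.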

Could you share the EXACT command you used with spark-submit?

Hi,

I was able to run the test job successfully on vanilla Kubernetes using spark-submit. My issue is using this configuration to run Spark on managed cloud Kubernetes services such as AWS EKS and Azure AKS, which do not expose their Kubernetes API for spark-submit to use.

Do you have any guide on how we can use an in-cluster configuration (running the Spark job from inside the cluster) or run it from code directly (for example, using the pyspark package in Python)?

Please note that we cannot use the Kubernetes API.
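
To make it concrete, what I mean by running it from code directly is roughly the PySpark sketch below; the master address, jar path, and configuration values are assumptions for illustration, not a working setup:

from pyspark.sql import SparkSession

# Hypothetical in-cluster submission from a pod: the master service name and
# the RAPIDS jar path are assumed to match the image, not verified values.
spark = (
    SparkSession.builder
    .appName("rapids-gpu-test")
    .master("spark://katonic-spark-operator-master-svc:7077")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-22.12.0.jar")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .getOrCreate()
)

# Simple sanity check that the session starts and can run a job.
spark.range(0, 1000).selectExpr("sum(id) AS total").show()
spark.stop()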

Thanks