See also: How to extract application ID from the PySpark context
Create a Spark session (what is a Spark session?).
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("App ID") \
.getOrCreate()
Get the session's context (what is a Spark context? see also the detailed documentation).
sc = spark.sparkContext
sc
Get applicationId from the context.
sc.applicationId
'local-1667946735598'
All in one step:
spark.sparkContext.applicationId
'local-1667946735598'
Note: if you're using the pyspark shell (see using the shell), the SparkContext is created automatically and can be accessed through the variable sc.
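If your code may run either inside the shell (where a session already exists) or standalone, here is a minimal sketch, assuming PySpark 3.x where SparkSession.getActiveSession() is available:
from pyspark.sql import SparkSession

# Reuse the session that is already active (e.g. the one the pyspark shell
# created), otherwise build a new one; getOrCreate() covers both cases.
active = SparkSession.getActiveSession()
spark = active if active is not None else SparkSession.builder.getOrCreate()
print(spark.sparkContext.applicationId)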
Create a Spark session.
spark = SparkSession \
.builder \
.appName("defaultParallelism") \
.getOrCreate()
Check the value of defaultParallelism:
spark.sparkContext.defaultParallelism
8
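defaultParallelism is, for example, the number of partitions that sc.parallelize creates when you don't pass an explicit numSlices. A minimal sketch (the number you get depends on your machine):
# Without an explicit numSlices argument, parallelize() splits the data into
# defaultParallelism partitions.
rdd = spark.sparkContext.parallelize(range(100))
rdd.getNumPartitions()   # equals spark.sparkContext.defaultParallelism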
To change a property, it is necessary to stop the current session and start a new one.
spark = SparkSession \
.builder \
.appName("Set parallelism") \
.config("spark.default.parallelism", 4) \
.getOrCreate()
Default parallelism hasn't changed!
spark.sparkContext.defaultParallelism
8
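Why not? getOrCreate() found the session that was already running and returned it instead of building a new one, so the underlying SparkContext, and with it the static configuration, stayed the same. A small sketch to verify this, assuming the session from above is still active (spark_again is just an illustrative name):
# The builder hands back the existing session: same context object,
# so spark.default.parallelism keeps its old value.
old_sc = spark.sparkContext
spark_again = SparkSession.builder.config("spark.default.parallelism", 4).getOrCreate()
spark_again.sparkContext is old_sc   # True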
Stop the session and start it anew.
spark.stop()
spark = SparkSession \
.builder \
.appName("Set parallelism") \
.config("spark.default.parallelism", 4) \
.getOrCreate()
spark.sparkContext.defaultParallelism
4
Great! Now the context has been changed (and the application's name has been updated, too).
spark.sparkContext
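Both changes can also be checked programmatically; a quick sketch using the context's appName attribute and its configuration:
# Read the new application name and the applied property back from the context.
print(spark.sparkContext.appName)                                     # 'Set parallelism'
print(spark.sparkContext.getConf().get("spark.default.parallelism"))  # '4'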
What is spark.default.parallelism?
This property determines the default number of partitions (chunks) into which an RDD (Resilient Distributed Dataset) is split.
Unless specified by the user, the value of spark.default.parallelism is set based on the cluster manager:
- local mode: the number of cores on the local machine
- Mesos fine-grained mode: 8
- other cluster managers: the total number of cores on all executors, or 2, whichever is larger
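Note that spark.sparkContext.defaultParallelism always reports the computed value, while the property itself only shows up in the configuration if it was set explicitly. A small sketch of the difference (the second call uses SparkConf.get with a fallback value):
sc = spark.sparkContext
# Always defined: the value Spark computed (or the one you set).
print(sc.defaultParallelism)
# Only present in the conf if it was set explicitly; otherwise the fallback is returned.
print(sc.getConf().get("spark.default.parallelism", "not set"))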
spark-defaults.conf
The file spark-defaults.conf contains the default Spark configuration properties and is located by default in Spark's configuration directory $SPARK_HOME/conf (see Spark Configuration).
The format of spark-defaults.conf is whitespace-separated lines containing a property name and its value, for instance:
spark.master spark://5.6.7.8:7077
spark.executor.memory 4g
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
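To make the format concrete, here is a small plain-Python sketch (the helper parse_spark_defaults is hypothetical, not part of PySpark) that reads such a file into a dictionary, skipping comments and blank lines:
def parse_spark_defaults(path):
    # One "property value" pair per line, separated by whitespace;
    # lines starting with '#' and empty lines are ignored.
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                props[parts[0]] = parts[1].strip()
    return props
Called on a file with the four lines above, it would return entries such as {'spark.executor.memory': '4g', ...}.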
What if there is no spark-defaults.conf?
If no spark-defaults.conf file is contained in Spark's configuration directory, you should find a template configuration file spark-defaults.conf.template. You can rename this to spark-defaults.conf and use it as the default configuration file.
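A sketch of that step from Python, copying instead of renaming so the template is kept (paths assume the standard $SPARK_HOME/conf layout used throughout this page):
import os, shutil

conf_dir = os.path.join(os.environ["SPARK_HOME"], "conf")
template = os.path.join(conf_dir, "spark-defaults.conf.template")
target = os.path.join(conf_dir, "spark-defaults.conf")

# Create spark-defaults.conf from the template only if it doesn't exist yet.
if not os.path.exists(target):
    shutil.copy(template, target)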
Let's look for all files called spark-defaults* in Spark's configuration directory:
import glob, os
glob.glob(os.path.join(os.environ["SPARK_HOME"], "conf", "spark-defaults*"))
['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']
We found a template spark-defaults.conf file.
Save the output from the last cell (_) in the variable conf_file.
conf_file = _
Look at the contents of spark-defaults.conf.template:
with open(conf_file[0], 'r') as f:
    print(f.read())
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
As you see, everything is commented out in the template file. To set some of the properties, rename spark-defaults.conf.template to spark-defaults.conf, then uncomment and edit the properties you want to set as defaults.
If you want to have your Spark configuration files in a directory other than $SPARK_HOME/conf, you can set the environment variable SPARK_CONF_DIR.
Spark will then look in $SPARK_CONF_DIR for all of its configuration files: spark-defaults.conf, spark-env.sh, log4j2.properties, etc. (see https://spark.apache.org/docs/latest/configuration.html#overriding-configuration-directory).
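For example, a sketch of pointing Spark at a custom configuration directory; the path is just a placeholder, and the variable has to be set before the SparkContext (and its JVM) is started for it to have any effect:
import os

# Must happen before the first SparkSession/SparkContext is created,
# e.g. at the very top of the script or in the launching shell.
os.environ["SPARK_CONF_DIR"] = "/path/to/my/spark-conf"   # placeholder path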
Here's the list of files in the default Spark configuration directory:
os.listdir(os.path.join(os.environ["SPARK_HOME"],"conf"))
['spark-env.sh.template', 'fairscheduler.xml.template', 'metrics.properties.template', 'workers.template', 'log4j.properties.template', 'spark-defaults.conf.template', 'log4j.properties']
But now assume that you have no spark-defaults.conf and did not configure Spark anywhere else. Spark still has default values for many properties.
Where are those properties defined, and how can you get their default values?
Spark's documentation provides the list of all available properties, grouped into several categories.
All properties have a default value that should accommodate most situations.
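For SQL-related properties, the effective value, which falls back to the built-in default when nothing was configured, can be read at runtime through spark.conf; a small sketch:
# RuntimeConfig.get returns the current value, or the built-in default
# if the property was never set.
spark.conf.get("spark.sql.shuffle.partitions")   # '200' unless overridden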
As a beginner you might want to give your application a name by configuring spark.app.name and perhaps change the default values of the following properties:
- spark.master and spark.submit.deployMode to define where the application should be deployed
- spark.driver.memory and spark.driver.maxResultSize to control the memory usage of the driver
- spark.executor.memory and spark.executor.cores to control executors
For instance, let's create a new session:
spark.stop()
spark = SparkSession \
.builder \
.appName("my_app") \
.config("spark.driver.memory", "2g") \
.getOrCreate()
Show the properties included in the Spark context:
spark.sparkContext.getConf().getAll()
[('spark.driver.port', '50313'), ('spark.app.id', 'local-1667946737593'), ('spark.sql.warehouse.dir', 'file:/Users/x/Documents/notebooks/spark-warehouse'), ('spark.executor.id', 'driver'), ('spark.app.startTime', '1667946737538'), ('spark.app.name', 'my_app'), ('spark.rdd.compress', 'True'), ('spark.driver.memory', '2g'), ('spark.serializer.objectStreamReset', '100'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.default.parallelism', '4'), ('spark.ui.showConsoleProgress', 'true'), ('spark.driver.host', '192.168.0.199')]
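getAll() returns a list of (key, value) pairs, so turning it into a dictionary makes single lookups easier; a short sketch:
# Convert the (key, value) pairs to a dict for convenient access.
conf = dict(spark.sparkContext.getConf().getAll())
print(conf.get("spark.driver.memory"))   # '2g'
print(conf.get("spark.app.name"))        # 'my_app'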