Data Generation and Aggregation with Python's Faker Library and PySpark¶

Explore the capabilities of the Python Faker library (https://faker.readthedocs.io/) for dynamic data generation!

Whether you're a data scientist, engineer, or analyst, this tutorial will guide you through the process of creating realistic and diverse datasets using Faker and then harnessing the distributed computing capabilities of PySpark to aggregate and analyze the generated data. Throughout this guide, you will explore effective techniques for data generation that enhance performance and optimize resource usage. Whether you're working with large datasets or simply seeking to streamline your data generation process, this tutorial offers valuable insights to elevate your skills.

Note: This is not synthetic data, as it is generated using straightforward methods and is unlikely to conform to any real-life distribution. Still, it serves as a valuable resource for testing purposes when authentic data is unavailable.

Install Faker¶

The Python faker module needs to be installed. Note that on Google Colab you can use !pip as well as just pip (no exclamation mark).

In [1]:

!pip install faker

Collecting faker
  Downloading Faker-30.3.0-py3-none-any.whl.metadata (15 kB)
Requirement already satisfied: python-dateutil>=2.4 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from faker) (2.9.0.post0)
Requirement already satisfied: typing-extensions in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from faker) (4.12.2)
Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.4->faker) (1.16.0)
Downloading Faker-30.3.0-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 31.7 MB/s eta 0:00:00
Installing collected packages: faker
Successfully installed faker-30.3.0

Generate a Pandas dataframe with fake data¶

Import Faker and set a random seed ($42$).

In [2]:

from faker import Faker
# Set the seed value of the shared `random.Random` object
# across all internal generators that will ever be created
Faker.seed(42)

fake is a fake data generator with DE_de locale.

In [3]:

fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
# Creates and seeds a unique `random.Random` object for
# each internal generator of this `Faker` instance
fake.seed_instance(42)

With fake you can generate fake data, such as name, email, etc.

In [4]:

print(f"A fake name: {fake.name()}")
print(f"A fake email: {fake.email()}")

A fake name: Aleksandr Weihmann
A fake email: ioannis32@example.net

Import Pandas to save data into a dataframe

In [5]:

# true if running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules
if not IN_COLAB:
 !pip install pandas==1.5.3

import pandas as pd

Collecting pandas==1.5.3
  Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (2024.2)
Requirement already satisfied: numpy>=1.20.3 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas==1.5.3) (1.24.4)
Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas==1.5.3) (1.16.0)
Downloading pandas-1.5.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.2/12.2 MB 107.9 MB/s eta 0:00:00
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.3
    Uninstalling pandas-2.0.3:
      Successfully uninstalled pandas-2.0.3
Successfully installed pandas-1.5.3

The function create_row_faker creates one row of fake data. Here we choose to generate a row containing the following fields:

fake.name()
fake.postcode()
fake.email()
fake.country().

In [6]:

def create_row_faker(num=1):
    fake = Faker('de_DE')
    fake.seed_locale('de_DE', 42)
    fake.seed_instance(42)
    output = [{"name": fake.name(),
               "age": fake.random_int(0, 100),
               "postcode": fake.postcode(),
               "email": fake.email(),
               "nationality": fake.country(),
              } for x in range(num)]
    return output

Generate a single row

In [7]:

create_row_faker()

Out[7]:

[{'name': 'Aleksandr Weihmann',
  'age': 35,
  'postcode': '32181',
  'email': 'bbeckmann@example.org',
  'nationality': 'Fidschi'}]

Generate n=3 rows

In [8]:

create_row_faker(3)

Out[8]:

[{'name': 'Aleksandr Weihmann',
  'age': 35,
  'postcode': '32181',
  'email': 'bbeckmann@example.org',
  'nationality': 'Fidschi'},
 {'name': 'Prof. Kurt Bauer B.A.',
  'age': 91,
  'postcode': '37940',
  'email': 'hildaloechel@example.com',
  'nationality': 'Guatemala'},
 {'name': 'Ekkehart Wiek-Kallert',
  'age': 13,
  'postcode': '61559',
  'email': 'maja07@example.net',
  'nationality': 'Brasilien'}]

Generate a dataframe df_fake of 5000 rows using create_row_faker.

We're using the cell magic %%time to time the operation.

In [9]:

%%time
df_fake = pd.DataFrame(create_row_faker(5000))

CPU times: user 276 ms, sys: 6.14 ms, total: 282 ms
Wall time: 282 ms

View dataframe

In [10]:

df_fake

Out[10]:

	name	age	postcode	email	nationality
0	Aleksandr Weihmann	35	32181	bbeckmann@example.org	Fidschi
1	Prof. Kurt Bauer B.A.	91	37940	hildaloechel@example.com	Guatemala
2	Ekkehart Wiek-Kallert	13	61559	maja07@example.net	Brasilien
3	Annelise Rohleder-Hornig	80	93103	daniel31@example.com	Guatemala
4	Magrit Knappe B.A.	47	34192	gottliebmisicher@example.com	Guadeloupe
...	...	...	...	...	...
4995	Hanno Jopich-Rädel	99	13333	keudelstanislaus@example.org	Syrien
4996	Herr Arno Ebert B.A.	63	36790	josefaebert@example.org	Slowenien
4997	Miroslawa Schüler	22	11118	ruppersbergerbetina@example.org	Republik Moldau
4998	Janusz Nerger	74	33091	ann-kathrinseip@example.net	Belarus
4999	Frau Cathleen Bähr	97	89681	hethurhubertus@example.org	St. Barthélemy

5000 rows × 5 columns

For more fake data generators see Faker's standard providers as well as community providers.

Generate PySpark dataframe with fake data¶

Install PySpark.

In [11]:

!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.3.tar.gz (317.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 107.6 MB/s eta 0:00:00
  Installing build dependencies ... - \ | / done
  Getting requirements to build wheel ... - done
  Preparing metadata (pyproject.toml) ... - done
Collecting py4j==0.10.9.7 (from pyspark)
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)
Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (pyproject.toml) ... - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done
  Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840626 sha256=c569321f8b984ed317b59e5511ef6bc2d2e0ca78caa0b3fc06c77628726af6b5
  Stored in directory: /home/runner/.cache/pip/wheels/94/3e/42/5eee4ed6246b61022f0335dcf22bb1a4a3915c45c0135cdc6f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.7 pyspark-3.5.3

In [12]:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Faker demo") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/13 21:33:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [13]:

df = spark.createDataFrame(create_row_faker(5000))

To avoid getting the warning, either use pyspark.sql.Row and let Spark infer datatypes or create a schema for the dataframe specifying the datatypes of all fields (here's the list of all datatypes).

In [14]:

from pyspark.sql.types import *
schema = StructType([StructField('name', StringType()),
                     StructField('age',IntegerType()),
                     StructField('postcode',StringType()),
                     StructField('email', StringType()),
                     StructField('nationality',StringType())])

In [15]:

df = spark.createDataFrame(create_row_faker(5000), schema)

In [16]:

df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- postcode: string (nullable = true)
 |-- email: string (nullable = true)
 |-- nationality: string (nullable = true)

Let's generate some more data (dataframe with $5\cdot10^4$ rows). The file will be partitioned by Spark.

In [17]:

%%time
n = 5*10**4
df = spark.createDataFrame(create_row_faker(n), schema)

CPU times: user 3.01 s, sys: 8.67 ms, total: 3.02 s
Wall time: 3.04 s

In [18]:

df.show(10, truncate=False)

+---------------------------------+---+--------+----------------------------+----------------------+
|name                             |age|postcode|email                       |nationality           |
+---------------------------------+---+--------+----------------------------+----------------------+
|Aleksandr Weihmann               |35 |32181   |bbeckmann@example.org       |Fidschi               |
|Prof. Kurt Bauer B.A.            |91 |37940   |hildaloechel@example.com    |Guatemala             |
|Ekkehart Wiek-Kallert            |13 |61559   |maja07@example.net          |Brasilien             |
|Annelise Rohleder-Hornig         |80 |93103   |daniel31@example.com        |Guatemala             |
|Magrit Knappe B.A.               |47 |34192   |gottliebmisicher@example.com|Guadeloupe            |
|Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413   |heini76@example.net         |Litauen               |
|Franjo Etzold-Hentschel          |95 |96965   |frederikpechel@example.com  |Belize                |
|Steffen Dörschner                |19 |69166   |qraedel@example.net         |Tunesien              |
|Milos Ullmann                    |14 |51462   |uadler@example.net          |Griechenland          |
|Prof. Urban Döring               |80 |89325   |augustewulff@example.net    |Vereinigtes Königreich|
+---------------------------------+---+--------+----------------------------+----------------------+
only showing top 10 rows

It took a long time (~4 sec. for 50000 rows)!

Can we do better?

The function create_row_faker() returns a list. This is not efficient, what we need is a generator instead.

In [19]:

d = create_row_faker(5)
# what type is d?
type(d)

Out[19]:

list

Let us turn d into a generator

In [20]:

d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()} for i in range(5))
# what type is d?
type(d)

Out[20]:

generator

In [21]:

%%time
n = 5*10**4
fake = Faker('de_DE')
fake.seed_locale('de_DE', 42)
fake.seed_instance(42)
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()}
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 3.1 s, sys: 25.7 ms, total: 3.13 s
Wall time: 3.15 s

In [22]:

df.show(10, truncate=False)

+---------------------------------+---+--------+----------------------------+----------------------+
|name                             |age|postcode|email                       |nationality           |
+---------------------------------+---+--------+----------------------------+----------------------+
|Aleksandr Weihmann               |35 |32181   |bbeckmann@example.org       |Fidschi               |
|Prof. Kurt Bauer B.A.            |91 |37940   |hildaloechel@example.com    |Guatemala             |
|Ekkehart Wiek-Kallert            |13 |61559   |maja07@example.net          |Brasilien             |
|Annelise Rohleder-Hornig         |80 |93103   |daniel31@example.com        |Guatemala             |
|Magrit Knappe B.A.               |47 |34192   |gottliebmisicher@example.com|Guadeloupe            |
|Univ.Prof. Gotthilf Wilmsen B.Sc.|29 |56413   |heini76@example.net         |Litauen               |
|Franjo Etzold-Hentschel          |95 |96965   |frederikpechel@example.com  |Belize                |
|Steffen Dörschner                |19 |69166   |qraedel@example.net         |Tunesien              |
|Milos Ullmann                    |14 |51462   |uadler@example.net          |Griechenland          |
|Prof. Urban Döring               |80 |89325   |augustewulff@example.net    |Vereinigtes Königreich|
+---------------------------------+---+--------+----------------------------+----------------------+
only showing top 10 rows

This wasn't faster.

Let us look at how one can leverage Hadoop's parallelism to generate dataframes and speed up the process.

A more efficient way to generate a large amount of records¶

We are going to use Spark's RDD and the function parallelize. In order to do this, we are going to need to extract the Spark context from the current session.

In [23]:

sc = spark.sparkContext
sc

Out[23]:

SparkContext

Spark UI

Version: v3.5.3
Master: local[*]
AppName: Faker demo

In order to decide on the number of partitions, we are going to look at the number of (virtual) CPU's on the local machine. If you have a cluster you can have a larger number of CPUs across multiple nodes but this is not the case here.

The standard Google Colab virtual machine has $2$ virtual CPUs (one CPU with two threads), so that is the maximum parallelization that you can achieve.

Note:

CPUs = threads per core × cores per socket × sockets

In [24]:

!lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('

CPU(s):                             4
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1

Due to the limited number of CPUs on this machine, we'll use only $2$ partitions. Even so, the data generation timing has improved dramatically!

In [25]:

%%time
n = 5*10**4
num_partitions = 2
# Create an empty RDD with the specified number of partitions
empty_rdd = sc.parallelize(range(num_partitions), num_partitions)
# Define a function that will run on each partition to generate the fake data
def generate_fake_data(_):
    fake = Faker('de_DE') # Create a new Faker instance per partition
    fake.seed_locale('de_DE', 42)
    fake.seed_instance(42)
    for _ in range(n // num_partitions):  # Divide work across partitions
        yield {
            "name": fake.name(),
            "age": fake.random_int(0, 100),
            "postcode": fake.postcode(),
            "email": fake.email(),
            "nationality": fake.country()
        }

# Use mapPartitions to generate fake data for each partition
rdd = empty_rdd.mapPartitions(generate_fake_data)
# Convert the RDD to a DataFrame
df = rdd.toDF()

CPU times: user 10.7 ms, sys: 0 ns, total: 10.7 ms
Wall time: 208 ms

I'm convinced that the reason everyone always looks at the first $5$ rows in Spark's RDDs is an homage to the classic jazz piece 🎷🎶.

In [26]:

rdd.take(5)

Out[26]:

[{'name': 'Aleksandr Weihmann',
  'age': 35,
  'postcode': '32181',
  'email': 'bbeckmann@example.org',
  'nationality': 'Fidschi'},
 {'name': 'Prof. Kurt Bauer B.A.',
  'age': 91,
  'postcode': '37940',
  'email': 'hildaloechel@example.com',
  'nationality': 'Guatemala'},
 {'name': 'Ekkehart Wiek-Kallert',
  'age': 13,
  'postcode': '61559',
  'email': 'maja07@example.net',
  'nationality': 'Brasilien'},
 {'name': 'Annelise Rohleder-Hornig',
  'age': 80,
  'postcode': '93103',
  'email': 'daniel31@example.com',
  'nationality': 'Guatemala'},
 {'name': 'Magrit Knappe B.A.',
  'age': 47,
  'postcode': '34192',
  'email': 'gottliebmisicher@example.com',
  'nationality': 'Guadeloupe'}]

In [27]:

df.show()

+---+--------------------+--------------------+--------------------+--------+
|age|               email|                name|         nationality|postcode|
+---+--------------------+--------------------+--------------------+--------+
| 35|bbeckmann@example...|  Aleksandr Weihmann|             Fidschi|   32181|
| 91|hildaloechel@exam...|Prof. Kurt Bauer ...|           Guatemala|   37940|
| 13|  maja07@example.net|Ekkehart Wiek-Kal...|           Brasilien|   61559|
| 80|daniel31@example.com|Annelise Rohleder...|           Guatemala|   93103|
| 47|gottliebmisicher@...|  Magrit Knappe B.A.|          Guadeloupe|   34192|
| 29| heini76@example.net|Univ.Prof. Gotthi...|             Litauen|   56413|
| 95|frederikpechel@ex...|Franjo Etzold-Hen...|              Belize|   96965|
| 19| qraedel@example.net|   Steffen Dörschner|            Tunesien|   69166|
| 14|  uadler@example.net|       Milos Ullmann|        Griechenland|   51462|
| 80|augustewulff@exam...|  Prof. Urban Döring|Vereinigtes König...|   89325|
| 62|polinarosenow@exa...|    Krzysztof Junitz|             Belarus|   15430|
| 16|   ihahn@example.net|Frau Zita Wesack ...|               Samoa|   82489|
| 85|carlokambs@exampl...|    Olaf Jockel MBA.|      Nordmazedonien|   78713|
| 90|karl-heinrichstau...|Prof. Emil Albers...|      Falklandinseln|   31051|
| 60|  bklapp@example.com|Otfried Rudolph-Rust|          Madagaskar|   76311|
| 82|hans-hermannreisi...|      Michail Söding|           Bulgarien|   06513|
| 31|davidssusan@examp...|   Dr. Erna Misicher|       Côte d’Ivoire|   78108|
| 51| bhoerle@example.net|Dipl.-Ing. Jana H...|    Äußeres Ozeanien|   26064|
|  7|bjoernpechel@exam...|Dr. Cordula Hübel...| Trinidad und Tobago|   50097|
| 86|scholldarius@exam...|Herr Konstantinos...|      Republik Korea|   61939|
+---+--------------------+--------------------+--------------------+--------+
only showing top 20 rows

Filter and aggregate with PySpark¶

Show the first five records in the dataframe of fake data.

In [28]:

df.show(n=5, truncate=False)

+---+----------------------------+------------------------+-----------+--------+
|age|email                       |name                    |nationality|postcode|
+---+----------------------------+------------------------+-----------+--------+
|35 |bbeckmann@example.org       |Aleksandr Weihmann      |Fidschi    |32181   |
|91 |hildaloechel@example.com    |Prof. Kurt Bauer B.A.   |Guatemala  |37940   |
|13 |maja07@example.net          |Ekkehart Wiek-Kallert   |Brasilien  |61559   |
|80 |daniel31@example.com        |Annelise Rohleder-Hornig|Guatemala  |93103   |
|47 |gottliebmisicher@example.com|Magrit Knappe B.A.      |Guadeloupe |34192   |
+---+----------------------------+------------------------+-----------+--------+
only showing top 5 rows

Do some data aggregation:

group by postcode
count the number of persons and the average age for each postcode
filter out postcodes with less than 4 persons
sort by average age descending
show the first 5 entries

In [29]:

import pyspark.sql.functions as F
df.groupBy('postcode') \
  .agg(F.count('postcode').alias('Count'), F.round(F.avg('age'), 2).alias('Average age')) \
  .filter('Count>3') \
  .orderBy('Average age', ascending=False) \
  .show(5)

[Stage 6:>                                                          (0 + 2) / 2]

+--------+-----+-----------+
|postcode|Count|Average age|
+--------+-----+-----------+
|   60653|    4|       98.5|
|   59679|    4|       98.5|
|   37287|    4|       98.5|
|   63287|    4|       98.0|
|   37841|    4|       97.5|
+--------+-----+-----------+
only showing top 5 rows

Postcode $18029$ has the highest average age ($91.75$). Show all entries for postcode $18029$ using filter.

In [30]:

df.filter('postcode==18029').show(truncate=False)

[Stage 10:>                                                         (0 + 1) / 1]

+---+-----+----+-----------+--------+
|age|email|name|nationality|postcode|
+---+-----+----+-----------+--------+
+---+-----+----+-----------+--------+

Another example with multiple locales and weights¶

We are going to use multiple locales with weights (following the examples in the documentation).

Here's the list of all available locales.

In [31]:

from faker import Faker
# set a seed for the random generator
Faker.seed(0)

Generate data with locales de_DE and de_AT with weights respectively $5$ and $2$.

The distribution of locales will be:

de_DE - $71.43\%$ of the time ($5 / (5+2)$)
de_AT - $28.57\%$ of the time ($2 / (5+2)$)

In [32]:

from collections import OrderedDict
locales = OrderedDict([
    ('de_DE', 5),
    ('de_AT', 2),
])
fake = Faker(locales)
fake.seed_instance(42)
fake.locales

Out[32]:

['de_DE', 'de_AT']

In [33]:

fake.seed_locale('de_DE', 0)
fake.seed_locale('de_AT', 0)

In [34]:

fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                     'mail', 'current_location'])

Out[34]:

{'current_location': (Decimal('26.547114'), Decimal('-10.243190')),
 'blood_group': 'B-',
 'name': 'Axel Jung',
 'sex': 'M',
 'mail': 'claragollner@gmail.com',
 'birthdate': datetime.date(2004, 1, 27)}

In [35]:

from pyspark.sql.types import *
location = StructField('current_location',
                       StructType([StructField('lat', DecimalType()),
                                   StructField('lon', DecimalType())])
                      )
schema = StructType([StructField('name', StringType()),
                     StructField('birthdate', DateType()),
                     StructField('sex', StringType()),
                     StructField('blood_group', StringType()),
                     StructField('mail', StringType()),
                     location
                     ])

In [36]:

fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                     'mail', 'current_location'])

Out[36]:

{'current_location': (Decimal('79.153888'), Decimal('-0.003034')),
 'blood_group': 'B-',
 'name': 'Dr. Anita Suppan',
 'sex': 'F',
 'mail': 'schauerbenedict@kabsi.at',
 'birthdate': datetime.date(1980, 10, 9)}

In [37]:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Faker demo - part 2") \
    .getOrCreate()

24/10/13 21:34:04 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.

Create dataframe with $5\cdot10^3$ rows.

In [38]:

%%time
n = 5*10**3
d = (fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group',
                          'mail', 'current_location'])
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 1.6 s, sys: 0 ns, total: 1.6 s
Wall time: 1.68 s

In [39]:

df.printSchema()

root
 |-- name: string (nullable = true)
 |-- birthdate: date (nullable = true)
 |-- sex: string (nullable = true)
 |-- blood_group: string (nullable = true)
 |-- mail: string (nullable = true)
 |-- current_location: struct (nullable = true)
 |    |-- lat: decimal(10,0) (nullable = true)
 |    |-- lon: decimal(10,0) (nullable = true)

Note how location represents a tuple data structure (a StructType of StructFields).

In [40]:

df.show(n=10, truncate=False)

+---------------------------+----------+---+-----------+-------------------------+----------------+
|name                       |birthdate |sex|blood_group|mail                     |current_location|
+---------------------------+----------+---+-----------+-------------------------+----------------+
|Prof. Valentine Niemeier   |1979-11-12|F  |B-         |maricagotthard@aol.de    |{74, 164}       |
|Magrit Graf                |1943-09-15|F  |A-         |hartungclaudio@web.de    |{-86, -34}      |
|Harriet Weiß-Liebelt       |1960-07-24|F  |AB+        |heserhilma@gmail.com     |{20, 126}       |
|Marisa Heser               |1919-08-27|F  |B-         |meinhard55@web.de        |{73, 169}       |
|Alexa Loidl-Schönberger    |1934-08-30|F  |O-         |hannafroehlich@gmail.com |{-23, -117}     |
|Rosa-Maria Schwital B.Sc.  |1928-02-08|F  |O-         |johannessauer@yahoo.de   |{2, -113}       |
|Herr Roland Caspar B.Sc.   |1932-09-09|M  |O-         |weinholdslawomir@yahoo.de|{24, 100}       |
|Bernard Mude               |1943-03-01|M  |O-         |stollinka@hotmail.de     |{22, -104}      |
|Prof. Violetta Eberl       |1913-09-09|F  |B+         |lars24@chello.at         |{80, -135}      |
|Alexandre Oestrovsky B.Eng.|1925-03-29|M  |A-         |schleichholger@yahoo.de  |{-62, 43}       |
+---------------------------+----------+---+-----------+-------------------------+----------------+
only showing top 10 rows

Save to Parquet¶

Write to parquet file (Parquet is a compressed, efficient columnar data representation compatible with all frameworks in the Hadoop ecosystem):

In [41]:

df.write.mode("overwrite").parquet("fakedata.parquet")

Check the size of parquet file (it is actually a directory containing the partitions):

In [42]:

!du -h fakedata.parquet

212K	fakedata.parquet

In [43]:

!ls -lh fakedata.parquet

total 188K
-rw-r--r-- 1 runner docker   0 Oct 13 21:34 _SUCCESS
-rw-r--r-- 1 runner docker 39K Oct 13 21:34 part-00000-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet
-rw-r--r-- 1 runner docker 39K Oct 13 21:34 part-00001-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet
-rw-r--r-- 1 runner docker 39K Oct 13 21:34 part-00002-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet
-rw-r--r-- 1 runner docker 67K Oct 13 21:34 part-00003-d6685452-1d7a-4788-94a6-be51583e7864-c000.snappy.parquet

Stop Spark session¶

Don't forget to close the Spark session when you're done!

Why you should stop your Spark session¶

Even when no jobs are running, the Spark session holds memory resources, that get released only when the session is properly stopped.

In [44]:

# Function to check memory usage
import subprocess

def get_memory_usage_ratio():
    # Run the 'free -h' command
    result = subprocess.run(['free', '-h'], stdout=subprocess.PIPE, text=True)

    # Parse the output
    lines = result.stdout.splitlines()

    # Initialize used and total memory
    used_memory = None
    total_memory = None

    # The second line contains the memory information
    if len(lines) > 1:
        # Split the line into parts
        memory_parts = lines[1].split()
        total_memory = memory_parts[1]  # Total memory
        used_memory = memory_parts[2]   # Used memory

    return used_memory, total_memory

Stop the session and compare.

In [45]:

# Check memory usage before stopping the Spark session
used_memory, total_memory = get_memory_usage_ratio()
print(f"Memory used before stopping Spark session: {used_memory}")
print(f"Total Memory: {total_memory}")

Memory used before stopping Spark session: 1.6Gi
Total Memory: 15Gi

In [46]:

# Stop the Spark session
spark.stop()

# Check memory usage after stopping the Spark session
used_memory, total_memory = get_memory_usage_ratio()
print(f"Memory used after stopping Spark session: {used_memory}")
print(f"Total Memory: {total_memory}")

Memory used after stopping Spark session: 1.5Gi
Total Memory: 15Gi

The amount of memory released may not be impressive in this case, but holding onto unnecessary resources is inefficient. Also, memory waste can add up quickly when multiple sessions are running.