Based on tf.data.
The tf.data API
lets you build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them into embedding identifiers with a lookup table, and batching together sequences of different lengths. A minimal sketch of this chaining style follows the imports below.
import tensorflow as tf
import pathlib
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
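As a hedged sketch of that composability (toy numbers, not one of the pipelines described above), transformations chain off a source dataset one method call at a time:
# Minimal sketch: chain reusable pieces off an in-memory source. The
# map/shuffle/batch stages stand in for the heavier steps (decoding,
# augmentation) a real pipeline would perform.
sketch_ds = (tf.data.Dataset.range(10)   # source
             .map(lambda x: x * 2)       # per-element transformation
             .shuffle(buffer_size=10)    # randomize element order
             .batch(4))                  # merge elements into batches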
To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored in a file in TensorFlow's TFRecord format, you can use tf.data.TFRecordDataset().
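As a quick hedged sketch (assumed toy data) of the difference between the two in-memory constructors: from_tensors keeps its input as a single element, while from_tensor_slices slices it along the first axis into one element per row.
# One 2x2 element vs. two length-2 elements from the same tensor.
t = tf.constant([[1, 2], [3, 4]])
print(tf.data.Dataset.from_tensors(t).element_spec)        # TensorSpec(shape=(2, 2), ...)
print(tf.data.Dataset.from_tensor_slices(t).element_spec)  # TensorSpec(shape=(2,), ...)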
The Dataset object is a Python iterable, which makes it possible to consume its elements with a for loop:
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset
<TensorSliceDataset shapes: (), types: tf.int32>
len(dataset)
6
for elem in dataset:
    print(elem.numpy())

8
3
0
8
2
1
Or you can explicitly create an iterator:
it = iter(dataset)
print(next(it).numpy())
print(next(it).numpy())
8
3
print(dataset.reduce(0, lambda state, value: state+value).numpy())
22
A dataset produces a sequence of elements, where each element has the same (nested) structure of components.
The individual components of the structure can be of any type representable by tf.TypeSpec, including tf.Tensor, tf.sparse.SparseTensor, tf.RaggedTensor, tf.TensorArray, or tf.data.Dataset.
The Python constructs that can be used to express the (nested) structure of elements include tuple, dict, NamedTuple, and OrderedDict.
In particular, list is not a valid construct for expressing the structure of dataset elements.
If you want a list input to be treated as a structure, you must convert it into a tuple; and if you want a list output emitted as a single component, you must pack it explicitly with tf.stack, as in the sketch below.
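A small sketch (toy data, not from the example above) of how list and tuple inputs differ, and how tf.stack packs a list into a single tensor:
# A list is read as a single tensor: the elements are its rows.
ds_list = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])
print(ds_list.element_spec)    # TensorSpec(shape=(2,), dtype=tf.int32, name=None)

# A tuple is read as a two-component structure: the elements are pairs.
ds_tuple = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4]))
print(ds_tuple.element_spec)   # (TensorSpec(shape=()), TensorSpec(shape=()))

# To emit a single packed tensor from a map, use tf.stack explicitly.
ds_packed = ds_tuple.map(lambda x, y: tf.stack([x, y]))
print(ds_packed.element_spec)  # TensorSpec(shape=(2,), dtype=tf.int32, name=None)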
The Dataset.element_spec property lets you inspect the type of each element component. It returns a nested structure of tf.TypeSpec objects that matches the structure of the element, which may be a single component, a tuple of components, or a nested tuple of components. For example:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
print(dataset1)
<TensorSliceDataset shapes: (10,), types: tf.float32>
for i in dataset1:
    print(i.numpy())

[0.6143 0.1727 0.6508 0.4864 0.3716 0.9486 0.5665 0.5594 0.0576 0.9023]
[1.1729e-02 9.0851e-01 3.3666e-01 6.3419e-04 2.4943e-01 4.6697e-02 9.7493e-01 5.1887e-01 6.0174e-01 7.2745e-01]
[0.4474 0.0227 0.0012 0.5103 0.099 0.1649 0.9021 0.2701 0.4714 0.7356]
[0.4228 0.2625 0.7598 0.8483 0.6128 0.3952 0.6487 0.3378 0.3738 0.09 ]
len(dataset1)
4
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4]),  # y
     tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))  # x
dataset2.element_spec
(TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(100,), dtype=tf.int32, name=None))
len(dataset2)
4
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
dataset3.element_spec
(TensorSpec(shape=(10,), dtype=tf.float32, name=None), (TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(100,), dtype=tf.int32, name=None)))
len(dataset3)
4
type(dataset3)
tensorflow.python.data.ops.dataset_ops.ZipDataset
i = iter(dataset3)

print(next(i), "\n")
print(next(i), "\n")
print(next(i), "\n")
print(next(i), "\n")
(<tf.Tensor: shape=(10,), dtype=float32, numpy= array([0.6143, 0.1727, 0.6508, 0.4864, 0.3716, 0.9486, 0.5665, 0.5594, 0.0576, 0.9023], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.89718556>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([59, 45, 62, 51, 24, 76, 83, 94, 97, 89, 21, 60, 86, 43, 55, 94, 30, 1, 59, 35, 67, 99, 61, 63, 30, 0, 65, 89, 81, 35, 10, 67, 15, 20, 85, 0, 59, 77, 47, 19, 15, 58, 47, 21, 94, 58, 2, 48, 37, 71, 19, 35, 76, 7, 11, 23, 80, 11, 40, 23, 73, 59, 24, 56, 1, 70, 51, 46, 87, 98, 20, 30, 4, 55, 18, 20, 43, 56, 43, 98, 3, 54, 82, 31, 6, 77, 67, 88, 70, 48, 76, 88, 66, 20, 6, 1, 43, 46, 72, 27], dtype=int32)>)) (<tf.Tensor: shape=(10,), dtype=float32, numpy= array([1.1729e-02, 9.0851e-01, 3.3666e-01, 6.3419e-04, 2.4943e-01, 4.6697e-02, 9.7493e-01, 5.1887e-01, 6.0174e-01, 7.2745e-01], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.2631073>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([96, 43, 25, 28, 58, 92, 61, 18, 23, 46, 71, 30, 9, 14, 98, 13, 90, 30, 34, 60, 49, 11, 13, 24, 79, 36, 80, 39, 99, 58, 53, 44, 73, 21, 17, 81, 59, 37, 45, 36, 32, 99, 63, 2, 43, 0, 23, 42, 88, 5, 25, 37, 89, 13, 36, 55, 52, 31, 98, 78, 49, 8, 6, 89, 53, 50, 55, 45, 28, 93, 11, 38, 86, 55, 39, 58, 28, 8, 36, 33, 35, 45, 10, 71, 34, 22, 61, 26, 28, 30, 49, 78, 55, 75, 9, 80, 59, 78, 30, 77], dtype=int32)>)) (<tf.Tensor: shape=(10,), dtype=float32, numpy= array([0.4474, 0.0227, 0.0012, 0.5103, 0.099 , 0.1649, 0.9021, 0.2701, 0.4714, 0.7356], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.5659776>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([ 2, 62, 96, 82, 22, 31, 97, 29, 14, 0, 28, 85, 36, 19, 88, 26, 95, 33, 93, 90, 24, 19, 80, 6, 98, 16, 57, 86, 63, 54, 44, 62, 18, 14, 95, 0, 29, 12, 5, 50, 33, 38, 32, 29, 56, 70, 42, 76, 88, 59, 42, 51, 53, 47, 28, 9, 39, 51, 15, 17, 21, 67, 98, 97, 97, 70, 6, 41, 37, 47, 49, 0, 6, 32, 71, 20, 34, 45, 27, 60, 72, 45, 56, 13, 2, 61, 31, 4, 78, 74, 2, 31, 80, 59, 96, 32, 20, 16, 52, 76], dtype=int32)>)) (<tf.Tensor: shape=(10,), dtype=float32, numpy= array([0.4228, 0.2625, 0.7598, 0.8483, 0.6128, 0.3952, 0.6487, 0.3378, 0.3738, 0.09 ], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.7929852>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([76, 51, 57, 30, 84, 84, 50, 92, 7, 75, 92, 6, 16, 54, 52, 1, 26, 94, 52, 63, 70, 26, 63, 25, 34, 34, 95, 46, 81, 48, 8, 31, 56, 44, 31, 51, 27, 37, 69, 60, 71, 67, 59, 92, 80, 9, 43, 44, 17, 27, 23, 9, 13, 7, 28, 45, 96, 40, 42, 13, 19, 19, 27, 81, 26, 39, 52, 65, 82, 38, 87, 66, 3, 38, 14, 58, 83, 0, 31, 7, 67, 98, 94, 17, 95, 93, 59, 63, 63, 75, 21, 34, 81, 14, 36, 96, 11, 0, 18, 15], dtype=int32)>))
for a, (b, c) in dataset3:
    print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))

shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
# A dataset with sparse tensors
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(
    indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))
dataset4.element_spec
SparseTensorSpec(TensorShape([3, 4]), tf.int32)
dataset4.element_spec.value_type
tensorflow.python.framework.sparse_tensor.SparseTensor
If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().
train, test = tf.keras.datasets.fashion_mnist.load_data()
imagenes, labels = train
imagenes = imagenes / 255.

dataset = tf.data.Dataset.from_tensor_slices((imagenes, labels))
dataset
<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>
def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

for n in count(5):
    print(n)

0
1
2
3
4
The Dataset.from_generator constructor converts the Python generator into a fully functional tf.data.Dataset.
The constructor takes a callable as input, not an iterator. This allows it to restart the generator when it reaches the end. It takes an optional args argument, which is passed on as the callable's arguments.
The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype.
ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes=())

for count_batch in ds_counter.repeat().batch(10).take(10):
    print(count_batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
The output_shapes argument is not required, but it is highly recommended, since many TensorFlow operations do not support tensors with an unknown rank. If the length of a particular axis is unknown or variable, set it to None in output_shapes.
It is also important to note that output_shapes and output_types follow the same nesting rules as other dataset methods, as the short sketch below illustrates.
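For example (a hedged sketch with an assumed dict-valued generator), a generator that yields dictionaries takes dictionaries of dtypes and shapes:
# The types/shapes mirror the element structure: here, a dict of two fields.
def gen_dict():
    yield {"x": 1.0, "y": [1, 2]}

ds_dict = tf.data.Dataset.from_generator(
    gen_dict,
    output_types={"x": tf.float32, "y": tf.int32},
    output_shapes={"x": (), "y": (2,)})
print(ds_dict.element_spec)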
Here is an example generator that demonstrates both aspects: it returns tuples of arrays, where the second array is a vector of unknown length.
def gen_series():
    i = 0
    while True:
        size = np.random.randint(0, 10)
        yield i, np.random.normal(size=(size,))
        i += 1

for i, series in gen_series():
    print(i, ":", str(series))
    if i > 5:
        break

0 : [-0.2311]
1 : [-0.9686 -0.6723 0.6278 -0.761 -0.1297 -0.6381]
2 : []
3 : []
4 : [-0.134]
5 : [-0.8309 0.4063 0.5506 -0.3408]
6 : [-1.3336 1.3925 0.1096 -1.1965 -0.2353 1.4199 0.9267 -0.6835]
The first output is an int32 and the second is a float32.
The first item is a scalar, shape (), and the second is a vector of unknown length, shape (None,):
ds_series = tf.data.Dataset.from_generator(
    gen_series,
    output_types=(tf.int32, tf.float32),
    output_shapes=((), (None,)))
ds_series
<FlatMapDataset shapes: ((), (None,)), types: (tf.int32, tf.float32)>
It can now be used like a regular tf.data.Dataset. Note that when batching a dataset with a variable shape, you need to use Dataset.padded_batch:
ds_series_batch = ds_series.shuffle(20).padded_batch(10)

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())

[ 1 14 17 7 21 20 19 24 26 23]

[[ 0.872 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0.5344 -0.8575 0. 0. 0. 0. 0. 0. 0. ]
 [ 1.424 0.0269 -1.2718 0. 0. 0. 0. 0. 0. ]
 [ 0.0719 -0.701 -1.1496 -0.6294 0. 0. 0. 0. 0. ]
 [ 0.4681 0.4923 -1.4863 0.1351 -0.5685 0.4341 -0.5812 0. 0. ]
 [-0.0529 1.4876 0.8683 0.4717 -1.1321 1.6762 0.0838 -1.0582 -0.4435]
 [ 0.0471 -2.5569 0.3248 -1.5189 0. 0. 0. 0. 0. ]]
it = iter(ds_series_batch)

for i in range(10):
    ids, sequence_batch = next(it)
    print(ids.numpy())
    print()
    print(sequence_batch.numpy())
    print()
[ 7 16 12 5 11 15 6 13 2 18] [[-1.4672 -0.1405 0. 0. 0. 0. 0. 0. ] [-0.6291 -0.648 1.1858 -0.0339 1.0398 0.5672 1.1503 -0.9136] [ 0.1728 0. 0. 0. 0. 0. 0. 0. ] [-0.107 -0.8513 0. 0. 0. 0. 0. 0. ] [ 0.9371 0.8138 0.5843 0.2523 -1.4394 1.4096 0. 0. ] [ 0.4807 0.7075 -0.0508 -0.2766 1.1682 0. 0. 0. ] [-0.1746 1.3332 -0.168 -0.2606 0.2766 0. 0. 0. ] [ 0.2357 1.0988 -1.2427 0.9682 0. 0. 0. 0. ] [-0.3176 -0.4375 -1.1106 0. 0. 0. 0. 0. ] [-0.9847 -0.0585 1.9546 0.44 -0.0685 -1.7938 0. 0. ]] [25 23 8 20 21 22 19 10 0 26] [[-0.6809 -0.0455 0.1088 -0.8211 1.2413 -1.5584 0. 0. 0. ] [-0.4125 -1.905 -1.2539 0.0093 1.3305 0.1917 0.1565 -0.7806 0.3128] [-2.05 -2.2507 -0.2782 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [-1.4544 0.8821 0. 0. 0. 0. 0. 0. 0. ] [ 0.903 1.2739 -0.1799 0.5255 -0.2661 0. 0. 0. 0. ] [ 0.0865 -0.6743 1.2116 0.4853 1.3283 0. 0. 0. 0. ] [ 0.5732 -1.6206 -1.2255 -1.4218 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.3569 0.1471 0. 0. 0. 0. 0. 0. 0. ]] [32 33 36 29 37 35 9 4 40 43] [[ 6.5692e-01 9.4795e-01 -2.0083e+00 4.3408e-01 -1.1075e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 6.7936e-03 1.6619e+00 -1.8120e+00 -1.1778e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 2.9384e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 2.5649e+00 2.2146e-01 -6.0475e-01 1.5246e+00 6.3542e-01 1.7047e+00 -4.0017e-01 0.0000e+00 0.0000e+00] [ 1.7746e+00 -1.8441e-03 4.8787e-01 -2.8778e-01 -1.6560e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [-5.8409e-01 -1.7815e+00 1.3852e+00 -1.7356e-01 2.2389e-01 1.1106e+00 1.5763e+00 8.9979e-01 -5.7616e-02] [-6.0780e-01 -1.1874e+00 2.7903e+00 8.6838e-01 5.3041e-01 -1.0636e+00 -4.8197e-01 0.0000e+00 0.0000e+00] [ 9.7730e-01 -1.7599e+00 -7.4557e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 4.2674e-01 1.0619e+00 -2.9815e-01 -9.8932e-01 -1.2172e+00 2.7212e+00 -1.8119e-01 0.0000e+00 0.0000e+00]] [42 41 46 45 31 39 52 49 47 58] [[-0.9908 -0.336 0.4805 0.0548 0. 0. 0. 0. 0. ] [-0.1697 0.9944 0. 0. 0. 0. 0. 0. 0. ] [ 1.391 -1.3834 -0.0722 0.8418 -0.2624 -0.6159 0.3003 -0.4918 0.9743] [ 0.0696 1.8761 -0.4255 -2.0904 0.0713 0. 0. 0. 0. ] [ 1.7669 0. 0. 0. 0. 0. 0. 0. 0. ] [ 1.0693 -0.394 0.33 0.6187 0.8511 -1.3357 -0.3105 0. 0. ] [ 0.0976 -0.6148 0. 0. 0. 0. 0. 0. 0. ] [-1.2994 0.7602 0. 0. 0. 0. 0. 0. 0. ] [-0.1547 0.539 -0.7047 -0.9207 -0.5796 0. 0. 0. 0. ] [-1.0514 -1.5457 0. 0. 0. 0. 0. 0. 0. ]] [57 14 30 60 54 17 50 34 24 3] [[-1.977 0. 0. 0. 0. 0. 0. 0. 0. ] [-0.3461 0.8693 -0.1612 -0.6677 -1.9281 1.2742 0.2442 0. 0. ] [ 0.1438 -0.5108 -2.0957 -0.7766 1.3033 0.8368 1.4962 1.1181 0. ] [ 0.3655 -1.531 1.7767 1.3026 0. 0. 0. 0. 0. ] [-0.7794 0.2927 -0.1646 0. 0. 0. 0. 0. 0. ] [-0.0701 -0.7325 -0.7293 -1.6565 0.5733 -0.5944 -0.0645 0. 0. ] [ 0.4288 -0.456 0. 0. 0. 0. 0. 0. 0. ] [ 1.4133 0.5523 0. 0. 0. 0. 0. 0. 0. ] [ 0.8531 -2.4146 1.9072 -0.0497 0.2455 0.0722 -0.4228 0.6441 0. ] [-0.7503 -0.1603 1.1878 -1.0804 -0.0119 -1.0453 -0.9199 0.7854 0.2143]] [64 65 61 68 48 53 72 56 67 78] [[ 0.8044 0.4001 1.3644 -1.6187 -2.415 0.1207 -1.5868 0.3856 -1.4043] [ 1.8468 0.8731 -0.7306 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.7539 -0.1132 -0.8311 -1.6697 -0.2072 -0.8393 0. 0. 0. ] [ 0.384 -0.8923 -2.187 0. 0. 0. 0. 0. 0. ] [ 0.4833 -0.6216 0.5257 0.5456 -0.1609 -1.3584 0.1504 0.8463 0. ] [-1.0084 0. 0. 0. 0. 0. 0. 0. 0. 
] [ 0.4967 0.1975 -1.9036 0.8044 2.2763 0.1914 -0.724 -0.3081 0. ] [-0.9175 1.0792 1.3103 -0.7467 1.4616 0. 0. 0. 0. ] [ 0.9968 -0.6953 0. 0. 0. 0. 0. 0. 0. ]] [73 38 1 69 44 80 83 70 27 51] [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 2.3107 0.0715 0.2946 -0.1409 -1.131 -1.5621 -0.7402 -0.0731 -1.1154] [-0.3471 2.3058 0. 0. 0. 0. 0. 0. 0. ] [-0.1019 -0.8327 -0.2087 1.6352 0.5381 -0.952 0.9108 1.4301 0. ] [ 1.287 -0.0254 0. 0. 0. 0. 0. 0. 0. ] [ 0.6785 0.763 0. 0. 0. 0. 0. 0. 0. ] [ 0.8218 0.6403 -0.3814 -0.0312 0.6034 -0.6502 0. 0. 0. ] [ 1.2741 0.1714 1.5828 0.3291 0.7437 0. 0. 0. 0. ] [-0.2218 0.4225 -0.0733 -1.2244 1.2675 0.9041 -0.8855 -1.7838 0. ] [-0.5838 0.3734 -0.0028 0. 0. 0. 0. 0. 0. ]] [55 82 28 84 71 89 66 94 74 62] [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.0854 0.595 0. 0. 0. 0. 0. 0. 0. ] [-0.9566 -0.4682 0.7301 -0.3302 0.5903 0.7432 0.5291 0.7341 0. ] [-0.9837 2.2034 -1.5052 -0.1325 0. 0. 0. 0. 0. ] [-0.6405 0.6618 -0.1518 0. 0. 0. 0. 0. 0. ] [ 1.8122 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.3219 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.6959 0. 0. 0. 0. 0. 0. 0. 0. ] [-0.0814 -1.7332 1.1618 1.9169 -0.502 0.6514 0.3035 1.1991 1.0977] [-0.6982 1.1725 -2.9672 1.0565 0.1042 1.8373 0.0698 0.0449 0. ]] [ 79 93 92 88 91 77 102 95 76 103] [[-0.9368 1.6439 -0.3363 0.0307 -0.3371 2.1286 2.2109 0. 0. ] [-0.6054 -0.5188 -0.5866 -2.0569 0.2665 -0.2518 -2.462 0.3667 0. ] [ 0.4067 0.9339 -0.4276 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [-0.6667 -0.3288 -0.063 0.3355 -1.494 -1.313 1.8643 0.1276 0. ] [-0.5116 0.1103 -2.0797 0. 0. 0. 0. 0. 0. ] [-0.3753 0.3928 1.644 0.8229 -0.509 -0.1774 0.9693 -1.4699 0. ] [-1.2648 0.1248 -0.2893 -1.1508 1.5604 0.7645 0.6532 0.9058 1.4649] [ 1.0857 -0.5312 0.4447 0.0733 0.2424 -1.2602 -1.0954 -2.6836 2.2212] [ 0.1423 0.6116 1.0632 -1.6233 -0.2096 0. 0. 0. 0. ]] [ 98 110 90 109 59 63 105 106 99 117] [[ 0.2846 -0.3919 1.6536 -0.0652 -0.109 -0.9641 -0.3283] [ 0.2527 1.8681 0.558 0.1796 1.1727 -1.2765 0. ] [ 0.7116 -0.7849 1.2876 -0.204 0.0144 0. 0. ] [ 0.3741 0.4815 0. 0. 0. 0. 0. ] [ 0.8294 0.6944 -0.1788 -0.2033 -2.1197 -0.4089 0.6421] [ 0.4336 -0.3892 2.2207 -0.9811 0. 0. 0. ] [-0.7891 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. ] [ 0.9166 -0.6065 0.0133 1.3288 0.004 -0.675 0.1131] [ 1.211 -1.0916 0. 0. 0. 0. 0. ]]
For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator as a tf.data.Dataset.
First download the data:
flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    cache_dir='/media/storage',  # extraction location
    cache_subdir='Datasets',     # folder created for the extraction
    untar=True)
print(flowers)
/tmp/.keras/Datasets/flower_photos
Create the image.ImageDataGenerator:
image_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)
images, labels = next(image_gen.flow_from_directory(flowers))
Found 3670 images belonging to 5 classes.
print(images.dtype, images.shape)
print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)
float32 (32, 5)
ds = tf.data.Dataset.from_generator(
    lambda: image_gen.flow_from_directory(flowers),
    output_types=(tf.float32, tf.float32),
    output_shapes=([32, 256, 256, 3], [32, 5]))
ds.element_spec
(TensorSpec(shape=(32, 256, 256, 3), dtype=tf.float32, name=None), TensorSpec(shape=(32, 5), dtype=tf.float32, name=None))
Many datasets are distributed as one or more text files. tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files.
directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']
file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]
file_paths
['/home/thejarmanitor/.keras/datasets/cowper.txt', '/home/thejarmanitor/.keras/datasets/derby.txt', '/home/thejarmanitor/.keras/datasets/butler.txt']
dataset = tf.data.TextLineDataset(file_paths)
Here are the first few lines of the first file:
for line in dataset.take(5):
    print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia's host, sent many a soul"
b'Illustrious into Ades premature,'
b'And Heroes gave (so stood the will of Jove)'
To alternate lines between files, use Dataset.interleave. This makes it easier to shuffle files together. Here are the first, second, and third lines of each translation:
file_ds = tf.data.Dataset.from_tensor_slices(file_paths)
for i in file_ds:
    print(i.numpy())

b'/home/thejarmanitor/.keras/datasets/cowper.txt'
b'/home/thejarmanitor/.keras/datasets/derby.txt'
b'/home/thejarmanitor/.keras/datasets/butler.txt'
line_ds = file_ds.interleave(tf.data.TextLineDataset, cycle_length=3)

for i, line in enumerate(line_ds.take(9)):
    if i % 3 == 0:
        print()
    print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,"
b'\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought'

b'His wrath pernicious, who ten thousand woes'
b'The vengeance, deep and deadly; whence to Greece'
b'countless ills upon the Achaeans. Many a brave soul did it send'

b"Caused to Achaia's host, sent many a soul"
b'Unnumbered ills arose; which many a soul'
b'hurrying down to Hades, and many a hero did it yield a prey to dogs and'
By default, a TextLineDataset yields every line of each file, which may not be what you want: the file may start with a header, or contain comments. Such lines can be removed or skipped with the Dataset.skip() or Dataset.filter() transformations.
Below, we work with the Titanic disaster file. We skip the first line (the header) and filter so that only the survivors remain.
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
for line in titanic_lines.take(10):
    print(line.numpy())

b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone'
b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n'
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y'
b'0,male,2.0,3,1,21.075,Third,unknown,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
def survived(line):
    return tf.not_equal(tf.strings.substr(line, 0, 1), '0')

survivors = titanic_lines.skip(1).filter(survived)

for line in survivors.take(10):
    print(line.numpy())

b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
b'1,male,28.0,0,0,13.0,Second,unknown,Southampton,y'
b'1,female,28.0,0,0,7.225,Third,unknown,Cherbourg,y'
b'1,male,28.0,0,0,35.5,First,A,Southampton,y'
b'1,female,38.0,1,5,31.3875,Third,unknown,Southampton,n'
The CSV format is a very popular way to store tabular data as plain text.
We already downloaded the Titanic file, which is a CSV. We can load it in this same format using pandas:
df = pd.read_csv(titanic_file)
df.head()
| | survived | sex | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.2500 | Third | unknown | Southampton | n |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | C | Cherbourg | n |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.9250 | Third | unknown | Southampton | y |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | First | C | Southampton | n |
| 4 | 0 | male | 28.0 | 0 | 0 | 8.4583 | Third | unknown | Queenstown | y |
If you have enough memory, you can turn the DataFrame into a dictionary and import the data with ease:
titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))

for feature_batch in titanic_slices.take(1):
    for key, value in feature_batch.items():
        print("  {!r:20s}: {}".format(key, value))

'survived'          : 0
'sex'               : b'male'
'age'               : 22.0
'n_siblings_spouses': 1
'parch'             : 0
'fare'              : 7.25
'class'             : b'Third'
'deck'              : b'unknown'
'embark_town'       : b'Southampton'
'alone'             : b'n'
A friendlier approach is to load the data from disk only as it is needed.
The tf.data module has methods for extracting records from one or more CSV files that comply with RFC 4180.
The experimental.make_csv_dataset function is a high-level interface for reading sets of CSV files; it supports column-type inference and batches the data.
The select_columns argument can be used if only some of the columns are needed:
titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived", select_columns=['class', 'fare', 'survived'])

for feature_batch, label_batch in titanic_batches.take(1):
    print("'survived': {}".format(label_batch))
    for key, value in feature_batch.items():
        print("  {!r:20s}: {}".format(key, value))
It is common for a dataset to be distributed across multiple files, with each file holding a number of examples.
flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
flowers_root = pathlib.Path(flowers_root)
The root folder contains one directory per class:
for item in flowers_root.glob("*"):
    print(item.name)

daisy
tulips
roses
sunflowers
dandelion
LICENSE.txt
The files inside each class directory are examples:
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
for f in list_ds.take(5):
    print(f.numpy())

b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/14313509432_6f2343d6c8_m.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/tulips/8603340662_0779bd87fd.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/19613308325_a67792d889.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/3496258301_ca5f168306.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/8979087213_28f572174c.jpg'
Using tf.io.read_file we can read the data and extract the label from the path, obtaining (image, label) pairs:
def process_path(file_path):
    label = tf.strings.split(file_path, os.sep)[-2]
    return tf.io.read_file(file_path), label

labeled_ds = list_ds.map(process_path)
for image_raw, label_text in labeled_ds.take(1):
    print(repr(image_raw.numpy()[:100]))
    print()
    print(label_text.numpy())

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xed\x00\xf8Photoshop 3.0\x008BIM\x04\x04\x00\x00\x00\x00\x00\xae\x1c\x01\x00\x00\x02\x00\x04\x1c\x02\x00\x00\x02\x00\x04\x1c\x02\x19\x00\x04blur\x1c\x02\x19\x00\x05bokeh\x1c\x02\x19\x00\nCalifornia\x1c\x02'

b'roses'
The Dataset.batch() transformation is the simplest way to form a batch of n consecutive elements. For each component, all elements must have a tensor of exactly the same shape.
inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)
for batch in batched_dataset.take(4):
    print([arr.numpy() for arr in batch])

[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8, 9, 10, 11]), array([ -8, -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]
With simple batching, all tensors must have the same shape, but that will not always be the case. Dataset.padded_batch pads tensors of different shapes, letting you specify the dimensions along which padding is applied:
dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))
for batch in dataset.take(2):
    print(batch.numpy())
    print()

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]

[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
The API offers two ways to process multiple epochs of the same data.
The first is to iterate over the dataset using Dataset.repeat(). We return to the Titanic text example:
def plot_batch_sizes(ds):
    batch_sizes = [batch.shape[0] for batch in ds]
    plt.bar(range(len(batch_sizes)), batch_sizes)
    plt.xlabel('Batch number')
    plt.ylabel('Batch size')
Dataset.repeat concatenates its repetitions without signaling the start or end of an epoch, so a Dataset.batch applied after it will produce batches that straddle epoch boundaries.
If repeat is given no arguments, it repeats indefinitely; a short sketch of that case follows.
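A minimal sketch of the no-argument case: bound the iteration with take(), since the repetition itself never ends.
# repeat() cycles forever; take(8) stops after eight elements.
ds = tf.data.Dataset.range(3).repeat()
print([x.numpy() for x in ds.take(8)])  # [0, 1, 2, 0, 1, 2, 0, 1]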
titanic_batches = titanic_lines.repeat(3).batch(128)
plot_batch_sizes(titanic_batches)
If you want a clear separation between epochs, apply Dataset.batch before repeat:
titanic_batches = titanic_lines.batch(128).repeat(3)
plot_batch_sizes(titanic_batches)
If, for example, you want to collect statistics at the end of each epoch, you can iterate manually and restart the iteration on each epoch:
epochs = 3
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
    for batch in dataset:
        print(batch.shape)
    print("End of epoch: ", epoch)

(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 0
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 1
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 2
The Dataset.shuffle() transformation keeps a buffer of a fixed size and picks the next element uniformly at random from that buffer.
We add an index to the Titanic data so that the effect is visible:
lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()
dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset
<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.string)>
n, line_batch = next(iter(dataset))
print(n.numpy())

[ 60 71 23 11 103 25 46 18 48 45 107 17 56 80 5 43 76 106 39 47]
shuffle does not signal the end of an epoch until the shuffle buffer is empty, so a shuffle placed before a repeat shows where one epoch ends and the next begins:
dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item IDs near the end of the epoch:\n")
for n, line_batch in shuffled.skip(60).take(5):
    print(n.numpy())

Here are the item IDs near the end of the epoch:

[610 615 266 557 573 40 547 625 362 520]
[616 436 502 602 482 507 593 589 470 523]
[477 569 617 583 517 476 503 543]
[87 17 30 60 32 86 74 64 34 65]
[ 98 80 55 5 69 113 7 46 79 110]
This is easier to see graphically:
shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()
<matplotlib.legend.Legend at 0x7f715029f460>
If we place repeat before the shuffle, the epoch boundaries get mixed together: elements from the next epoch start appearing while items from the previous one are still in the buffer:
dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item IDs near the end of the epoch:\n")
for n, line_batch in shuffled.skip(55).take(15):
    print(n.numpy())

Here are the item IDs near the end of the epoch:

[ 19 18 564 616 620 17 431 612 603 611]
[594 24 15 609 600 491 483 627 624 30]
[ 31 493 525 379 425 5 29 558 14 48]
[ 44 563 97 625 453 42 501 500 486 57]
[ 8 33 524 25 610 34 28 55 50 553]
[ 60 574 551 584 598 9 530 554 54 68]
[583 46 537 40 626 13 43 587 22 308]
[ 27 516 41 62 577 88 406 92 595 35]
[ 91 69 84 51 4 602 83 107 36 20]
[ 58 21 532 16 569 81 90 567 535 49]
[562 474 120 94 76 1 66 109 613 575]
[601 129 99 113 23 87 619 85 64 508]
[138 621 78 37 108 517 39 108 239 494]
[131 133 124 72 590 59 71 134 100 96]
[126 454 136 32 98 137 7 114 592 142]
repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()
<matplotlib.legend.Legend at 0x7f71502de670>
To apply a function to the data, use the Dataset.map(f) transformation. It takes the tf.Tensor objects of a single element, applies f, and returns the new objects as a new dataset.
Here are two very common preprocessing examples.
When working with real-world images, you will most likely need to standardize their sizes to a common one. We use the flower file list for this example:
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
We write a function to manipulate the data:
# Reads an image from a file, decodes it into a tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
    parts = tf.strings.split(filename, os.sep)
    label = parts[-2]

    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [128, 128])
    return image, label
file_path = next(iter(list_ds))
image, label = parse_image(file_path)

def show(image, label):
    plt.figure()
    plt.imshow(image)
    plt.title(label.numpy().decode('utf-8'))
    plt.axis('off')

show(image, label)
images_ds = list_ds.map(parse_image)

for image, label in images_ds.take(2):
    show(image, label)
For performance reasons it is best to use only TensorFlow functions to manipulate data, but sometimes tools from other Python packages are needed. For that, wrap the function with tf.py_function() and use it inside Dataset.map().
Suppose we want to apply a random rotation to a set of images. TensorFlow only has tf.image.rot90, which does not serve this purpose. Fortunately, the scipy package provides scipy.ndimage.rotate:
import scipy.ndimage as ndimage

def random_rotate_image(image):
    image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
    return image
image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
def tf_random_rotate_image(image, label):
    im_shape = image.shape
    [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
    image.set_shape(im_shape)
    return image, label
rot_ds = images_ds.map(tf_random_rotate_image)

for image, label in rot_ds.take(2):
    show(image, label)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
For time-series models, the data is organized with the time axis intact, and models are often fed adjacent slices of time as input. There are two ways to generate these slices. The first is using batches:
range_ds = tf.data.Dataset.range(100000)
batches = range_ds.batch(10, drop_remainder=True)
for batch in batches.take(5):
    print(batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
To make predictions one step into the future, shift the features and the labels one step relative to each other:
def dense_1_step(batch):
    # Shift the features and labels one step relative to each other.
    return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)
for features, label in predict_dense_1_step.take(3):
    print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8] => [1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18] => [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28] => [21 22 23 24 25 26 27 28 29]
To predict a whole window of time, split each batch into two parts:
batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
    return (batch[:-5],  # Take the first 10 steps
            batch[-5:])  # Take the remainder

predict_5_steps = batches.map(label_next_5_steps)
for features, label in predict_5_steps.take(3):
    print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9] => [10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24] => [25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39] => [40 41 42 43 44]
To allow the features of one batch to overlap with the labels of another, use Dataset.zip():
feature_length = 10
label_length = 3
features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:label_length])
predicted_steps = tf.data.Dataset.zip((features, labels))
for features, label in predicted_steps.take(5):
    print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9] => [10 11 12]
[10 11 12 13 14 15 16 17 18 19] => [20 21 22]
[20 21 22 23 24 25 26 27 28 29] => [30 31 32]
[30 31 32 33 34 35 36 37 38 39] => [40 41 42]
[40 41 42 43 44 45 46 47 48 49] => [50 51 52]
Of course, sometimes you need more control over the windows, which is where Dataset.window comes in. Using it correctly takes some care with the data, because this transformation returns a dataset of datasets:
window_size = 5
windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
    print(sub_ds)

<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
But what happened here? To view the data as a single dataset, use Dataset.flat_map. At the same time, it is almost always necessary to batch:
for x in windows.flat_map(lambda x: x).take(30):
    print(x.numpy(), end=' ')
0 1 2 3 4 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
    print(example.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
Putting it all together, we end up with a function like this:
def make_window_dataset(ds, window_size=5, shift=1, stride=1):
    windows = ds.window(window_size, shift=shift, stride=stride)

    def sub_to_batch(sub):
        return sub.batch(window_size, drop_remainder=True)

    windows = windows.flat_map(sub_to_batch)
    return windows
ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)
for example in ds.take(10):
    print(example.numpy())

[ 0 3 6 9 12 15 18 21 24 27]
[ 5 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]
It is then simple to extract labels from this data:
dense_labels_ds = ds.map(dense_1_step)

for inputs, labels in dense_labels_ds.take(3):
    print(inputs.numpy(), "=>", labels.numpy())

[ 0 3 6 9 12 15 18 21 24] => [ 3 6 9 12 15 18 21 24 27]
[ 5 8 11 14 17 20 23 26 29] => [ 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]
It is common to encounter datasets whose classes are unbalanced. A good approach here is to resample the dataset, and tf.data provides two methods for doing so.
The credit-card fraud dataset is a perfect way to demonstrate this:
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    cache_dir='/media/storage',
    cache_subdir='Datasets',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')
creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])
Now check the distribution of the target classes to see how skewed they are:
def count(counts, batch):
    features, labels = batch
    class_1 = labels == 1
    class_1 = tf.cast(class_1, tf.int32)

    class_0 = labels == 0
    class_0 = tf.cast(class_0, tf.int32)

    counts['class_0'] += tf.reduce_sum(class_0)
    counts['class_1'] += tf.reduce_sum(class_1)
    return counts

counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func=count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)
[0.9958 0.0042]
To work with unbalanced data, the best idea is to balance it. Here are a couple of methods for doing so.
The simplest is to use sample_from_datasets. It works particularly well when you have a separate dataset per class, so here we use a filter to split the fraud data by class:
negative_ds = (
    creditcard_ds
    .unbatch()
    .filter(lambda features, label: label == 0)
    .repeat())

positive_ds = (
    creditcard_ds
    .unbatch()
    .filter(lambda features, label: label == 1)
    .repeat())
for features, label in positive_ds.batch(10).take(1):
    print(label.numpy())

[1 1 1 1 1 1 1 1 1 1]
Pass the datasets, along with the weight you want for each, to tf.data.experimental.sample_from_datasets:
balanced_ds = tf.data.experimental.sample_from_datasets(
    [negative_ds, positive_ds], [0.5, 0.5]).batch(10)
The dataset now produces examples of each class with 50/50 probability:
for features, labels in balanced_ds.take(10):
    print(labels.numpy())

[1 0 0 1 0 1 1 1 1 1]
[1 0 1 1 1 0 0 1 0 1]
[1 0 0 1 0 1 1 0 1 1]
[0 1 0 1 0 0 0 0 0 1]
[1 1 0 0 0 1 0 0 0 1]
[0 0 1 0 1 0 1 1 0 1]
[0 0 1 1 1 0 0 0 0 1]
[0 1 1 0 1 1 0 0 0 1]
[0 1 0 1 0 1 0 0 0 0]
[1 0 0 1 0 0 0 0 1 1]
As noted, sample_from_datasets needs the datasets to be separated by class. We could use Dataset.filter for that, but it would load the data twice.
The data.experimental.rejection_resample function lets you rebalance the data while loading it only once. It achieves the balance by dropping elements from the dataset.
rejection_resample takes a class_func argument. This function is applied to each dataset element to determine which class it belongs to.
The elements of creditcard_ds are already (features, label) pairs, so class_func just has to return the label:

def class_func(features, label):
    return label
A target distribution, and preferably an estimate of the initial distribution, are also needed:
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)
resampler works on individual observations, so you must unbatch() the dataset before applying it:
resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)
The resampler returns (class, example) pairs built from the output of class_func. In this case we already had (features, label) pairs, so we use map to drop the extra copy of the labels:
balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)

for features, labels in balanced_ds.take(10):
    print(labels.numpy())
Proportion of examples rejected by sampler is high: [0.995800793][0.995800793 0.00419921894][0 1]
[1 1 1 1 0 1 0 0 1 1]
[0 1 0 1 0 0 1 1 0 0]
[0 1 1 0 1 1 0 0 0 1]
[0 1 1 0 1 1 0 1 0 1]
[0 0 1 1 1 1 1 0 0 1]
[0 1 1 0 0 1 0 0 0 0]
[0 0 0 1 1 1 1 1 1 1]
[1 1 1 1 0 1 0 0 1 1]
[0 0 1 0 1 0 0 1 0 1]
[1 0 0 0 0 1 0 0 0 1]
What is the drawback of this method? If the imbalance is very large, a very large amount of data is discarded. So which matters more: the amount of data, or the amount of compute?