Based on tf.data.
The tf.data API
lets you build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them into embedding identifiers with a lookup table, and batching together sequences of different lengths. A minimal sketch of this chaining style follows the imports below.
import tensorflow as tf
import pathlib
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
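As a hedged sketch of that composability (toy numbers, not one of the pipelines described above), transformations chain off a source dataset one method call at a time:
# Minimal sketch: chain reusable pieces off an in-memory source. The
# map/shuffle/batch stages stand in for the heavier steps (decoding,
# augmentation) a real pipeline would perform.
sketch_ds = (tf.data.Dataset.range(10)   # source
             .map(lambda x: x * 2)       # per-element transformation
             .shuffle(buffer_size=10)    # randomize element order
             .batch(4))                  # merge elements into batches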
To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored in a file in TensorFlow's TFRecord format, you can use tf.data.TFRecordDataset().
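As a quick hedged sketch (assumed toy data) of the difference between the two in-memory constructors: from_tensors keeps its input as a single element, while from_tensor_slices slices it along the first axis into one element per row.
# One 2x2 element vs. two length-2 elements from the same tensor.
t = tf.constant([[1, 2], [3, 4]])
print(tf.data.Dataset.from_tensors(t).element_spec)        # TensorSpec(shape=(2, 2), ...)
print(tf.data.Dataset.from_tensor_slices(t).element_spec)  # TensorSpec(shape=(2,), ...)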
The Dataset object is a Python iterable, which makes it possible to consume its elements with a for loop:
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset
<TensorSliceDataset shapes: (), types: tf.int32>
len(dataset)
6
for elem in dataset:
    print(elem.numpy())

8
3
0
8
2
1
Or you can explicitly create an iterator:
it = iter(dataset)
print(next(it).numpy())
print(next(it).numpy())
8
3
print(dataset.reduce(0, lambda state, value: state+value).numpy())
22
A dataset produces a sequence of elements, where each element has the same (nested) structure of components.
The individual components of the structure can be of any type representable by tf.TypeSpec, including tf.Tensor, tf.sparse.SparseTensor, tf.RaggedTensor, tf.TensorArray, or tf.data.Dataset.
The Python constructs that can be used to express the (nested) structure of elements include tuple, dict, NamedTuple, and OrderedDict.
In particular, list is not a valid construct for expressing the structure of dataset elements.
If you want a list input to be treated as a structure, you must convert it into a tuple; and if you want a list output emitted as a single component, you must pack it explicitly with tf.stack, as in the sketch below.
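A small sketch (toy data, not from the example above) of how list and tuple inputs differ, and how tf.stack packs a list into a single tensor:
# A list is read as a single tensor: the elements are its rows.
ds_list = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])
print(ds_list.element_spec)    # TensorSpec(shape=(2,), dtype=tf.int32, name=None)

# A tuple is read as a two-component structure: the elements are pairs.
ds_tuple = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4]))
print(ds_tuple.element_spec)   # (TensorSpec(shape=()), TensorSpec(shape=()))

# To emit a single packed tensor from a map, use tf.stack explicitly.
ds_packed = ds_tuple.map(lambda x, y: tf.stack([x, y]))
print(ds_packed.element_spec)  # TensorSpec(shape=(2,), dtype=tf.int32, name=None)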
The Dataset.element_spec property lets you inspect the type of each element component. It returns a nested structure of tf.TypeSpec objects that matches the structure of the element, which may be a single component, a tuple of components, or a nested tuple of components. For example:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
print(dataset1)
<TensorSliceDataset shapes: (10,), types: tf.float32>
for i in dataset1:
    print(i.numpy())

[0.6143 0.1727 0.6508 0.4864 0.3716 0.9486 0.5665 0.5594 0.0576 0.9023]
[1.1729e-02 9.0851e-01 3.3666e-01 6.3419e-04 2.4943e-01 4.6697e-02 9.7493e-01 5.1887e-01 6.0174e-01 7.2745e-01]
[0.4474 0.0227 0.0012 0.5103 0.099 0.1649 0.9021 0.2701 0.4714 0.7356]
[0.4228 0.2625 0.7598 0.8483 0.6128 0.3952 0.6487 0.3378 0.3738 0.09 ]
len(dataset1)
4
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4]),  # y
     tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))  # x
dataset2.element_spec
(TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(100,), dtype=tf.int32, name=None))
len(dataset2)
4
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
dataset3.element_spec
(TensorSpec(shape=(10,), dtype=tf.float32, name=None), (TensorSpec(shape=(), dtype=tf.float32, name=None), TensorSpec(shape=(100,), dtype=tf.int32, name=None)))
len(dataset3)
4
type(dataset3)
tensorflow.python.data.ops.dataset_ops.ZipDataset
i = iter(dataset3)

print(next(i), "\n")
print(next(i), "\n")
print(next(i), "\n")
print(next(i), "\n")
(<tf.Tensor: shape=(10,), dtype=float32, numpy= array([0.6143, 0.1727, 0.6508, 0.4864, 0.3716, 0.9486, 0.5665, 0.5594, 0.0576, 0.9023], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.89718556>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([59, 45, 62, 51, 24, 76, 83, 94, 97, 89, 21, 60, 86, 43, 55, 94, 30, 1, 59, 35, 67, 99, 61, 63, 30, 0, 65, 89, 81, 35, 10, 67, 15, 20, 85, 0, 59, 77, 47, 19, 15, 58, 47, 21, 94, 58, 2, 48, 37, 71, 19, 35, 76, 7, 11, 23, 80, 11, 40, 23, 73, 59, 24, 56, 1, 70, 51, 46, 87, 98, 20, 30, 4, 55, 18, 20, 43, 56, 43, 98, 3, 54, 82, 31, 6, 77, 67, 88, 70, 48, 76, 88, 66, 20, 6, 1, 43, 46, 72, 27], dtype=int32)>)) (<tf.Tensor: shape=(10,), dtype=float32, numpy= array([1.1729e-02, 9.0851e-01, 3.3666e-01, 6.3419e-04, 2.4943e-01, 4.6697e-02, 9.7493e-01, 5.1887e-01, 6.0174e-01, 7.2745e-01], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.2631073>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([96, 43, 25, 28, 58, 92, 61, 18, 23, 46, 71, 30, 9, 14, 98, 13, 90, 30, 34, 60, 49, 11, 13, 24, 79, 36, 80, 39, 99, 58, 53, 44, 73, 21, 17, 81, 59, 37, 45, 36, 32, 99, 63, 2, 43, 0, 23, 42, 88, 5, 25, 37, 89, 13, 36, 55, 52, 31, 98, 78, 49, 8, 6, 89, 53, 50, 55, 45, 28, 93, 11, 38, 86, 55, 39, 58, 28, 8, 36, 33, 35, 45, 10, 71, 34, 22, 61, 26, 28, 30, 49, 78, 55, 75, 9, 80, 59, 78, 30, 77], dtype=int32)>)) (<tf.Tensor: shape=(10,), dtype=float32, numpy= array([0.4474, 0.0227, 0.0012, 0.5103, 0.099 , 0.1649, 0.9021, 0.2701, 0.4714, 0.7356], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.5659776>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([ 2, 62, 96, 82, 22, 31, 97, 29, 14, 0, 28, 85, 36, 19, 88, 26, 95, 33, 93, 90, 24, 19, 80, 6, 98, 16, 57, 86, 63, 54, 44, 62, 18, 14, 95, 0, 29, 12, 5, 50, 33, 38, 32, 29, 56, 70, 42, 76, 88, 59, 42, 51, 53, 47, 28, 9, 39, 51, 15, 17, 21, 67, 98, 97, 97, 70, 6, 41, 37, 47, 49, 0, 6, 32, 71, 20, 34, 45, 27, 60, 72, 45, 56, 13, 2, 61, 31, 4, 78, 74, 2, 31, 80, 59, 96, 32, 20, 16, 52, 76], dtype=int32)>)) (<tf.Tensor: shape=(10,), dtype=float32, numpy= array([0.4228, 0.2625, 0.7598, 0.8483, 0.6128, 0.3952, 0.6487, 0.3378, 0.3738, 0.09 ], dtype=float32)>, (<tf.Tensor: shape=(), dtype=float32, numpy=0.7929852>, <tf.Tensor: shape=(100,), dtype=int32, numpy= array([76, 51, 57, 30, 84, 84, 50, 92, 7, 75, 92, 6, 16, 54, 52, 1, 26, 94, 52, 63, 70, 26, 63, 25, 34, 34, 95, 46, 81, 48, 8, 31, 56, 44, 31, 51, 27, 37, 69, 60, 71, 67, 59, 92, 80, 9, 43, 44, 17, 27, 23, 9, 13, 7, 28, 45, 96, 40, 42, 13, 19, 19, 27, 81, 26, 39, 52, 65, 82, 38, 87, 66, 3, 38, 14, 58, 83, 0, 31, 7, 67, 98, 94, 17, 95, 93, 59, 63, 63, 75, 21, 34, 81, 14, 36, 96, 11, 0, 18, 15], dtype=int32)>))
for a, (b, c) in dataset3:
    print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))

shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
# A dataset with sparse tensors
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(
    indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))
dataset4.element_spec
SparseTensorSpec(TensorShape([3, 4]), tf.int32)
dataset4.element_spec.value_type
tensorflow.python.framework.sparse_tensor.SparseTensor
If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().
train, test = tf.keras.datasets.fashion_mnist.load_data()
imagenes, labels = train
imagenes = imagenes / 255.

dataset = tf.data.Dataset.from_tensor_slices((imagenes, labels))
dataset
<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>
def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

for n in count(5):
    print(n)

0
1
2
3
4
The Dataset.from_generator constructor converts the Python generator into a fully functional tf.data.Dataset.
The constructor takes a callable as input, not an iterator. This allows it to restart the generator when it reaches the end. It takes an optional args argument, which is passed on as the callable's arguments.
The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype.
ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes=())

for count_batch in ds_counter.repeat().batch(10).take(10):
    print(count_batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
The output_shapes argument is not required, but it is highly recommended, since many TensorFlow operations do not support tensors with an unknown rank. If the length of a particular axis is unknown or variable, set it to None in output_shapes.
It is also important to note that output_shapes and output_types follow the same nesting rules as other dataset methods, as the short sketch below illustrates.
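For example (a hedged sketch with an assumed dict-valued generator), a generator that yields dictionaries takes dictionaries of dtypes and shapes:
# The types/shapes mirror the element structure: here, a dict of two fields.
def gen_dict():
    yield {"x": 1.0, "y": [1, 2]}

ds_dict = tf.data.Dataset.from_generator(
    gen_dict,
    output_types={"x": tf.float32, "y": tf.int32},
    output_shapes={"x": (), "y": (2,)})
print(ds_dict.element_spec)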
Here is an example generator that demonstrates both aspects: it returns tuples of arrays, where the second array is a vector of unknown length.
def gen_series():
    i = 0
    while True:
        size = np.random.randint(0, 10)
        yield i, np.random.normal(size=(size,))
        i += 1

for i, series in gen_series():
    print(i, ":", str(series))
    if i > 5:
        break

0 : [-0.2311]
1 : [-0.9686 -0.6723 0.6278 -0.761 -0.1297 -0.6381]
2 : []
3 : []
4 : [-0.134]
5 : [-0.8309 0.4063 0.5506 -0.3408]
6 : [-1.3336 1.3925 0.1096 -1.1965 -0.2353 1.4199 0.9267 -0.6835]
The first output is an int32 and the second is a float32.
The first item is a scalar, shape (), and the second is a vector of unknown length, shape (None,):
ds_series = tf.data.Dataset.from_generator(
    gen_series,
    output_types=(tf.int32, tf.float32),
    output_shapes=((), (None,)))
ds_series
<FlatMapDataset shapes: ((), (None,)), types: (tf.int32, tf.float32)>
It can now be used like a regular tf.data.Dataset. Note that when batching a dataset with a variable shape, you need to use Dataset.padded_batch:
ds_series_batch = ds_series.shuffle(20).padded_batch(10)

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())

[ 1 14 17 7 21 20 19 24 26 23]

[[ 0.872 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
 [ 0.5344 -0.8575 0. 0. 0. 0. 0. 0. 0. ]
 [ 1.424 0.0269 -1.2718 0. 0. 0. 0. 0. 0. ]
 [ 0.0719 -0.701 -1.1496 -0.6294 0. 0. 0. 0. 0. ]
 [ 0.4681 0.4923 -1.4863 0.1351 -0.5685 0.4341 -0.5812 0. 0. ]
 [-0.0529 1.4876 0.8683 0.4717 -1.1321 1.6762 0.0838 -1.0582 -0.4435]
 [ 0.0471 -2.5569 0.3248 -1.5189 0. 0. 0. 0. 0. ]]
it = iter(ds_series_batch)

for i in range(10):
    ids, sequence_batch = next(it)
    print(ids.numpy())
    print()
    print(sequence_batch.numpy())
    print()
[ 7 16 12 5 11 15 6 13 2 18] [[-1.4672 -0.1405 0. 0. 0. 0. 0. 0. ] [-0.6291 -0.648 1.1858 -0.0339 1.0398 0.5672 1.1503 -0.9136] [ 0.1728 0. 0. 0. 0. 0. 0. 0. ] [-0.107 -0.8513 0. 0. 0. 0. 0. 0. ] [ 0.9371 0.8138 0.5843 0.2523 -1.4394 1.4096 0. 0. ] [ 0.4807 0.7075 -0.0508 -0.2766 1.1682 0. 0. 0. ] [-0.1746 1.3332 -0.168 -0.2606 0.2766 0. 0. 0. ] [ 0.2357 1.0988 -1.2427 0.9682 0. 0. 0. 0. ] [-0.3176 -0.4375 -1.1106 0. 0. 0. 0. 0. ] [-0.9847 -0.0585 1.9546 0.44 -0.0685 -1.7938 0. 0. ]] [25 23 8 20 21 22 19 10 0 26] [[-0.6809 -0.0455 0.1088 -0.8211 1.2413 -1.5584 0. 0. 0. ] [-0.4125 -1.905 -1.2539 0.0093 1.3305 0.1917 0.1565 -0.7806 0.3128] [-2.05 -2.2507 -0.2782 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [-1.4544 0.8821 0. 0. 0. 0. 0. 0. 0. ] [ 0.903 1.2739 -0.1799 0.5255 -0.2661 0. 0. 0. 0. ] [ 0.0865 -0.6743 1.2116 0.4853 1.3283 0. 0. 0. 0. ] [ 0.5732 -1.6206 -1.2255 -1.4218 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.3569 0.1471 0. 0. 0. 0. 0. 0. 0. ]] [32 33 36 29 37 35 9 4 40 43] [[ 6.5692e-01 9.4795e-01 -2.0083e+00 4.3408e-01 -1.1075e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 6.7936e-03 1.6619e+00 -1.8120e+00 -1.1778e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 2.9384e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 2.5649e+00 2.2146e-01 -6.0475e-01 1.5246e+00 6.3542e-01 1.7047e+00 -4.0017e-01 0.0000e+00 0.0000e+00] [ 1.7746e+00 -1.8441e-03 4.8787e-01 -2.8778e-01 -1.6560e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [-5.8409e-01 -1.7815e+00 1.3852e+00 -1.7356e-01 2.2389e-01 1.1106e+00 1.5763e+00 8.9979e-01 -5.7616e-02] [-6.0780e-01 -1.1874e+00 2.7903e+00 8.6838e-01 5.3041e-01 -1.0636e+00 -4.8197e-01 0.0000e+00 0.0000e+00] [ 9.7730e-01 -1.7599e+00 -7.4557e-01 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00] [ 4.2674e-01 1.0619e+00 -2.9815e-01 -9.8932e-01 -1.2172e+00 2.7212e+00 -1.8119e-01 0.0000e+00 0.0000e+00]] [42 41 46 45 31 39 52 49 47 58] [[-0.9908 -0.336 0.4805 0.0548 0. 0. 0. 0. 0. ] [-0.1697 0.9944 0. 0. 0. 0. 0. 0. 0. ] [ 1.391 -1.3834 -0.0722 0.8418 -0.2624 -0.6159 0.3003 -0.4918 0.9743] [ 0.0696 1.8761 -0.4255 -2.0904 0.0713 0. 0. 0. 0. ] [ 1.7669 0. 0. 0. 0. 0. 0. 0. 0. ] [ 1.0693 -0.394 0.33 0.6187 0.8511 -1.3357 -0.3105 0. 0. ] [ 0.0976 -0.6148 0. 0. 0. 0. 0. 0. 0. ] [-1.2994 0.7602 0. 0. 0. 0. 0. 0. 0. ] [-0.1547 0.539 -0.7047 -0.9207 -0.5796 0. 0. 0. 0. ] [-1.0514 -1.5457 0. 0. 0. 0. 0. 0. 0. ]] [57 14 30 60 54 17 50 34 24 3] [[-1.977 0. 0. 0. 0. 0. 0. 0. 0. ] [-0.3461 0.8693 -0.1612 -0.6677 -1.9281 1.2742 0.2442 0. 0. ] [ 0.1438 -0.5108 -2.0957 -0.7766 1.3033 0.8368 1.4962 1.1181 0. ] [ 0.3655 -1.531 1.7767 1.3026 0. 0. 0. 0. 0. ] [-0.7794 0.2927 -0.1646 0. 0. 0. 0. 0. 0. ] [-0.0701 -0.7325 -0.7293 -1.6565 0.5733 -0.5944 -0.0645 0. 0. ] [ 0.4288 -0.456 0. 0. 0. 0. 0. 0. 0. ] [ 1.4133 0.5523 0. 0. 0. 0. 0. 0. 0. ] [ 0.8531 -2.4146 1.9072 -0.0497 0.2455 0.0722 -0.4228 0.6441 0. ] [-0.7503 -0.1603 1.1878 -1.0804 -0.0119 -1.0453 -0.9199 0.7854 0.2143]] [64 65 61 68 48 53 72 56 67 78] [[ 0.8044 0.4001 1.3644 -1.6187 -2.415 0.1207 -1.5868 0.3856 -1.4043] [ 1.8468 0.8731 -0.7306 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.7539 -0.1132 -0.8311 -1.6697 -0.2072 -0.8393 0. 0. 0. ] [ 0.384 -0.8923 -2.187 0. 0. 0. 0. 0. 0. ] [ 0.4833 -0.6216 0.5257 0.5456 -0.1609 -1.3584 0.1504 0.8463 0. ] [-1.0084 0. 0. 0. 0. 0. 0. 0. 0. 
] [ 0.4967 0.1975 -1.9036 0.8044 2.2763 0.1914 -0.724 -0.3081 0. ] [-0.9175 1.0792 1.3103 -0.7467 1.4616 0. 0. 0. 0. ] [ 0.9968 -0.6953 0. 0. 0. 0. 0. 0. 0. ]] [73 38 1 69 44 80 83 70 27 51] [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 2.3107 0.0715 0.2946 -0.1409 -1.131 -1.5621 -0.7402 -0.0731 -1.1154] [-0.3471 2.3058 0. 0. 0. 0. 0. 0. 0. ] [-0.1019 -0.8327 -0.2087 1.6352 0.5381 -0.952 0.9108 1.4301 0. ] [ 1.287 -0.0254 0. 0. 0. 0. 0. 0. 0. ] [ 0.6785 0.763 0. 0. 0. 0. 0. 0. 0. ] [ 0.8218 0.6403 -0.3814 -0.0312 0.6034 -0.6502 0. 0. 0. ] [ 1.2741 0.1714 1.5828 0.3291 0.7437 0. 0. 0. 0. ] [-0.2218 0.4225 -0.0733 -1.2244 1.2675 0.9041 -0.8855 -1.7838 0. ] [-0.5838 0.3734 -0.0028 0. 0. 0. 0. 0. 0. ]] [55 82 28 84 71 89 66 94 74 62] [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.0854 0.595 0. 0. 0. 0. 0. 0. 0. ] [-0.9566 -0.4682 0.7301 -0.3302 0.5903 0.7432 0.5291 0.7341 0. ] [-0.9837 2.2034 -1.5052 -0.1325 0. 0. 0. 0. 0. ] [-0.6405 0.6618 -0.1518 0. 0. 0. 0. 0. 0. ] [ 1.8122 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.3219 0. 0. 0. 0. 0. 0. 0. 0. ] [ 0.6959 0. 0. 0. 0. 0. 0. 0. 0. ] [-0.0814 -1.7332 1.1618 1.9169 -0.502 0.6514 0.3035 1.1991 1.0977] [-0.6982 1.1725 -2.9672 1.0565 0.1042 1.8373 0.0698 0.0449 0. ]] [ 79 93 92 88 91 77 102 95 76 103] [[-0.9368 1.6439 -0.3363 0.0307 -0.3371 2.1286 2.2109 0. 0. ] [-0.6054 -0.5188 -0.5866 -2.0569 0.2665 -0.2518 -2.462 0.3667 0. ] [ 0.4067 0.9339 -0.4276 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [-0.6667 -0.3288 -0.063 0.3355 -1.494 -1.313 1.8643 0.1276 0. ] [-0.5116 0.1103 -2.0797 0. 0. 0. 0. 0. 0. ] [-0.3753 0.3928 1.644 0.8229 -0.509 -0.1774 0.9693 -1.4699 0. ] [-1.2648 0.1248 -0.2893 -1.1508 1.5604 0.7645 0.6532 0.9058 1.4649] [ 1.0857 -0.5312 0.4447 0.0733 0.2424 -1.2602 -1.0954 -2.6836 2.2212] [ 0.1423 0.6116 1.0632 -1.6233 -0.2096 0. 0. 0. 0. ]] [ 98 110 90 109 59 63 105 106 99 117] [[ 0.2846 -0.3919 1.6536 -0.0652 -0.109 -0.9641 -0.3283] [ 0.2527 1.8681 0.558 0.1796 1.1727 -1.2765 0. ] [ 0.7116 -0.7849 1.2876 -0.204 0.0144 0. 0. ] [ 0.3741 0.4815 0. 0. 0. 0. 0. ] [ 0.8294 0.6944 -0.1788 -0.2033 -2.1197 -0.4089 0.6421] [ 0.4336 -0.3892 2.2207 -0.9811 0. 0. 0. ] [-0.7891 0. 0. 0. 0. 0. 0. ] [ 0. 0. 0. 0. 0. 0. 0. ] [ 0.9166 -0.6065 0.0133 1.3288 0.004 -0.675 0.1131] [ 1.211 -1.0916 0. 0. 0. 0. 0. ]]
For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator as a tf.data.Dataset.
First download the data:
flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    cache_dir='/media/storage',  # extraction location
    cache_subdir='Datasets',     # folder created for the extraction
    untar=True)
print(flowers)
/tmp/.keras/Datasets/flower_photos
Create the image.ImageDataGenerator:
image_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)
images, labels = next(image_gen.flow_from_directory(flowers))
Found 3670 images belonging to 5 classes.
print(images.dtype, images.shape)
print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)
float32 (32, 5)
ds = tf.data.Dataset.from_generator(
    lambda: image_gen.flow_from_directory(flowers),
    output_types=(tf.float32, tf.float32),
    output_shapes=([32, 256, 256, 3], [32, 5]))
ds.element_spec
(TensorSpec(shape=(32, 256, 256, 3), dtype=tf.float32, name=None), TensorSpec(shape=(32, 5), dtype=tf.float32, name=None))
Many datasets are distributed as one or more text files. tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files.
directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']
file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]
file_paths
['/home/thejarmanitor/.keras/datasets/cowper.txt', '/home/thejarmanitor/.keras/datasets/derby.txt', '/home/thejarmanitor/.keras/datasets/butler.txt']
dataset = tf.data.TextLineDataset(file_paths)
Here are the first few lines of the first file:
for line in dataset.take(5):
    print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia's host, sent many a soul"
b'Illustrious into Ades premature,'
b'And Heroes gave (so stood the will of Jove)'
To alternate lines between files, use Dataset.interleave. This makes it easier to shuffle files together. Here are the first, second, and third lines of each translation:
file_ds = tf.data.Dataset.from_tensor_slices(file_paths)
for i in file_ds:
    print(i.numpy())

b'/home/thejarmanitor/.keras/datasets/cowper.txt'
b'/home/thejarmanitor/.keras/datasets/derby.txt'
b'/home/thejarmanitor/.keras/datasets/butler.txt'
line_ds = file_ds.interleave(tf.data.TextLineDataset, cycle_length=3)

for i, line in enumerate(line_ds.take(9)):
    if i % 3 == 0:
        print()
    print(line.numpy())

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,"
b'\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought'

b'His wrath pernicious, who ten thousand woes'
b'The vengeance, deep and deadly; whence to Greece'
b'countless ills upon the Achaeans. Many a brave soul did it send'

b"Caused to Achaia's host, sent many a soul"
b'Unnumbered ills arose; which many a soul'
b'hurrying down to Hades, and many a hero did it yield a prey to dogs and'
By default, a TextLineDataset yields every line of each file, which may not be what you want: the file may start with a header, or contain comments. Such lines can be removed or skipped with the Dataset.skip() or Dataset.filter() transformations.
Below, we work with the Titanic disaster file. We skip the first line (the header) and filter so that only the survivors remain.
titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
for line in titanic_lines.take(10):
    print(line.numpy())

b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone'
b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n'
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y'
b'0,male,2.0,3,1,21.075,Third,unknown,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
def survived(line):
    return tf.not_equal(tf.strings.substr(line, 0, 1), '0')

survivors = titanic_lines.skip(1).filter(survived)

for line in survivors.take(10):
    print(line.numpy())

b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
b'1,male,28.0,0,0,13.0,Second,unknown,Southampton,y'
b'1,female,28.0,0,0,7.225,Third,unknown,Cherbourg,y'
b'1,male,28.0,0,0,35.5,First,A,Southampton,y'
b'1,female,38.0,1,5,31.3875,Third,unknown,Southampton,n'
The CSV format is a very popular way to store tabular data as plain text.
We already downloaded the Titanic file, which is a CSV. We can load it in this same format using pandas:
df = pd.read_csv(titanic_file)
df.head()
| | survived | sex | age | n_siblings_spouses | parch | fare | class | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | male | 22.0 | 1 | 0 | 7.2500 | Third | unknown | Southampton | n |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | First | C | Cherbourg | n |
| 2 | 1 | female | 26.0 | 0 | 0 | 7.9250 | Third | unknown | Southampton | y |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 | First | C | Southampton | n |
| 4 | 0 | male | 28.0 | 0 | 0 | 8.4583 | Third | unknown | Queenstown | y |
If you have enough memory, you can turn the DataFrame into a dictionary and import the data with ease:
titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))

for feature_batch in titanic_slices.take(1):
    for key, value in feature_batch.items():
        print("  {!r:20s}: {}".format(key, value))

'survived'          : 0
'sex'               : b'male'
'age'               : 22.0
'n_siblings_spouses': 1
'parch'             : 0
'fare'              : 7.25
'class'             : b'Third'
'deck'              : b'unknown'
'embark_town'       : b'Southampton'
'alone'             : b'n'
A friendlier approach is to load the data from disk only as it is needed.
The tf.data module has methods for extracting records from one or more CSV files that comply with RFC 4180.
The experimental.make_csv_dataset function is a high-level interface for reading sets of CSV files; it supports column-type inference and batches the data.
The select_columns argument can be used if only some of the columns are needed:
titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived", select_columns=['class', 'fare', 'survived'])

for feature_batch, label_batch in titanic_batches.take(1):
    print("'survived': {}".format(label_batch))
    for key, value in feature_batch.items():
        print("  {!r:20s}: {}".format(key, value))
It is common for a dataset to be distributed across multiple files, with each file holding a number of examples.
flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
flowers_root = pathlib.Path(flowers_root)
The root folder contains one directory per class:
for item in flowers_root.glob("*"):
    print(item.name)

daisy
tulips
roses
sunflowers
dandelion
LICENSE.txt
The files inside each class directory are examples:
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
for f in list_ds.take(5):
    print(f.numpy())

b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/14313509432_6f2343d6c8_m.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/tulips/8603340662_0779bd87fd.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/19613308325_a67792d889.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/3496258301_ca5f168306.jpg'
b'/home/thejarmanitor/.keras/datasets/flower_photos/dandelion/8979087213_28f572174c.jpg'
Using tf.io.read_file we can read the data and extract the label from the path, obtaining (image, label) pairs:
def process_path(file_path):
    label = tf.strings.split(file_path, os.sep)[-2]
    return tf.io.read_file(file_path), label

labeled_ds = list_ds.map(process_path)
for image_raw, label_text in labeled_ds.take(1):
    print(repr(image_raw.numpy()[:100]))
    print()
    print(label_text.numpy())

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xed\x00\xf8Photoshop 3.0\x008BIM\x04\x04\x00\x00\x00\x00\x00\xae\x1c\x01\x00\x00\x02\x00\x04\x1c\x02\x00\x00\x02\x00\x04\x1c\x02\x19\x00\x04blur\x1c\x02\x19\x00\x05bokeh\x1c\x02\x19\x00\nCalifornia\x1c\x02'

b'roses'
The Dataset.batch() transformation is the simplest way to form a batch of n consecutive elements. For each component, all elements must have a tensor of exactly the same shape.
inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)
for batch in batched_dataset.take(4):
    print([arr.numpy() for arr in batch])

[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8, 9, 10, 11]), array([ -8, -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]
With simple batching, all tensors must have the same shape, but that will not always be the case. Dataset.padded_batch pads tensors of different shapes, letting you specify the dimensions along which padding is applied:
dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))
for batch in dataset.take(2):
    print(batch.numpy())
    print()

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]

[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
The API offers two ways to process multiple epochs of the same data.
The first is to iterate over the dataset using Dataset.repeat(). We return to the Titanic text example:
def plot_batch_sizes(ds):
    batch_sizes = [batch.shape[0] for batch in ds]
    plt.bar(range(len(batch_sizes)), batch_sizes)
    plt.xlabel('Batch number')
    plt.ylabel('Batch size')
Dataset.repeat concatenates its repetitions without signaling the start or end of an epoch, so a Dataset.batch applied after it will produce batches that straddle epoch boundaries.
If repeat is given no arguments, it repeats indefinitely; a short sketch of that case follows.
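A minimal sketch of the no-argument case: bound the iteration with take(), since the repetition itself never ends.
# repeat() cycles forever; take(8) stops after eight elements.
ds = tf.data.Dataset.range(3).repeat()
print([x.numpy() for x in ds.take(8)])  # [0, 1, 2, 0, 1, 2, 0, 1]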
titanic_batches = titanic_lines.repeat(3).batch(128)
plot_batch_sizes(titanic_batches)
If you want a clear separation between epochs, apply Dataset.batch before repeat:
titanic_batches = titanic_lines.batch(128).repeat(3)
plot_batch_sizes(titanic_batches)
If, for example, you want to collect statistics at the end of each epoch, you can iterate manually and restart the iteration on each epoch:
epochs = 3
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
    for batch in dataset:
        print(batch.shape)
    print("End of epoch: ", epoch)

(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 0
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 1
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 2
The Dataset.shuffle() transformation keeps a buffer of a fixed size and picks the next element uniformly at random from that buffer.
We add an index to the Titanic data so that the effect is visible:
lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()
dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset
<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.string)>
n, line_batch = next(iter(dataset))
print(n.numpy())

[ 60 71 23 11 103 25 46 18 48 45 107 17 56 80 5 43 76 106 39 47]
shuffle does not signal the end of an epoch until the shuffle buffer is empty, so a shuffle placed before a repeat shows where one epoch ends and the next begins:
dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item IDs near the end of the epoch:\n")
for n, line_batch in shuffled.skip(60).take(5):
    print(n.numpy())

Here are the item IDs near the end of the epoch:

[610 615 266 557 573 40 547 625 362 520]
[616 436 502 602 482 507 593 589 470 523]
[477 569 617 583 517 476 503 543]
[87 17 30 60 32 86 74 64 34 65]
[ 98 80 55 5 69 113 7 46 79 110]
This is easier to see graphically:
shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()
<matplotlib.legend.Legend at 0x7f715029f460>
If we place repeat before the shuffle, the epoch boundaries get mixed together: elements from the next epoch start appearing while items from the previous one are still in the buffer:
dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item IDs near the end of the epoch:\n")
for n, line_batch in shuffled.skip(55).take(15):
    print(n.numpy())

Here are the item IDs near the end of the epoch:

[ 19 18 564 616 620 17 431 612 603 611]
[594 24 15 609 600 491 483 627 624 30]
[ 31 493 525 379 425 5 29 558 14 48]
[ 44 563 97 625 453 42 501 500 486 57]
[ 8 33 524 25 610 34 28 55 50 553]
[ 60 574 551 584 598 9 530 554 54 68]
[583 46 537 40 626 13 43 587 22 308]
[ 27 516 41 62 577 88 406 92 595 35]
[ 91 69 84 51 4 602 83 107 36 20]
[ 58 21 532 16 569 81 90 567 535 49]
[562 474 120 94 76 1 66 109 613 575]
[601 129 99 113 23 87 619 85 64 508]
[138 621 78 37 108 517 39 108 239 494]
[131 133 124 72 590 59 71 134 100 96]
[126 454 136 32 98 137 7 114 592 142]
repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()
<matplotlib.legend.Legend at 0x7f71502de670>
To apply a function to the data, use the Dataset.map(f) transformation. It takes the tf.Tensor objects of a single element, applies f, and returns the new objects as a new dataset.
Here are two very common preprocessing examples.
When working with real-world images, you will most likely need to standardize their sizes to a common one. We use the flower file list for this example:
list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
We write a function to manipulate the data:
# Reads an image from a file, decodes it into a tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
    parts = tf.strings.split(filename, os.sep)
    label = parts[-2]

    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [128, 128])
    return image, label
file_path = next(iter(list_ds))
image, label = parse_image(file_path)

def show(image, label):
    plt.figure()
    plt.imshow(image)
    plt.title(label.numpy().decode('utf-8'))
    plt.axis('off')

show(image, label)
images_ds = list_ds.map(parse_image)

for image, label in images_ds.take(2):
    show(image, label)
For performance reasons it is best to use only TensorFlow functions to manipulate data, but sometimes tools from other Python packages are needed. For that, wrap the function with tf.py_function() and use it inside Dataset.map().
Suppose we want to apply a random rotation to a set of images. TensorFlow only has tf.image.rot90, which does not serve this purpose. Fortunately, the scipy package provides scipy.ndimage.rotate:
import scipy.ndimage as ndimage

def random_rotate_image(image):
    image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
    return image
image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
def tf_random_rotate_image(image, label):
    im_shape = image.shape
    [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
    image.set_shape(im_shape)
    return image, label
rot_ds = images_ds.map(tf_random_rotate_image)

for image, label in rot_ds.take(2):
    show(image, label)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
For time-series models, the data is organized with the time axis intact, and models are often fed adjacent slices of time as input. There are two ways to generate these slices. The first is using batches:
range_ds = tf.data.Dataset.range(100000)
batches = range_ds.batch(10, drop_remainder=True)
for batch in batches.take(5):
    print(batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
To make predictions one step into the future, shift the features and the labels one step relative to each other:
def dense_1_step(batch):
    # Shift the features and labels one step relative to each other.
    return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)
for features, label in predict_dense_1_step.take(3):
    print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8] => [1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18] => [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28] => [21 22 23 24 25 26 27 28 29]
To predict a whole window of time, split each batch into two parts:
batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
    return (batch[:-5],  # Take the first 10 steps
            batch[-5:])  # Take the remainder

predict_5_steps = batches.map(label_next_5_steps)
for features, label in predict_5_steps.take(3):
    print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9] => [10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24] => [25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39] => [40 41 42 43 44]
To allow the features of one batch to overlap with the labels of another, use Dataset.zip():
feature_length = 10
label_length = 3
features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:label_length])
predicted_steps = tf.data.Dataset.zip((features, labels))
for features, label in predicted_steps.take(5):
    print(features.numpy(), " => ", label.numpy())

[0 1 2 3 4 5 6 7 8 9] => [10 11 12]
[10 11 12 13 14 15 16 17 18 19] => [20 21 22]
[20 21 22 23 24 25 26 27 28 29] => [30 31 32]
[30 31 32 33 34 35 36 37 38 39] => [40 41 42]
[40 41 42 43 44 45 46 47 48 49] => [50 51 52]
Of course, sometimes you need more control over the windows, which is where Dataset.window comes in. Using it correctly takes some care with the data, because this transformation returns a dataset of datasets:
window_size = 5
windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
    print(sub_ds)

<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
But what happened here? To view the data as a single dataset, use Dataset.flat_map. At the same time, it is almost always necessary to batch:
for x in windows.flat_map(lambda x: x).take(30):
    print(x.numpy(), end=' ')
0 1 2 3 4 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
    print(example.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
Putting it all together, we end up with a function like this:
def make_window_dataset(ds, window_size=5, shift=1, stride=1):
    windows = ds.window(window_size, shift=shift, stride=stride)

    def sub_to_batch(sub):
        return sub.batch(window_size, drop_remainder=True)

    windows = windows.flat_map(sub_to_batch)
    return windows
ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)
for example in ds.take(10):
    print(example.numpy())

[ 0 3 6 9 12 15 18 21 24 27]
[ 5 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]
It is then simple to extract labels from this data:
dense_labels_ds = ds.map(dense_1_step)

for inputs, labels in dense_labels_ds.take(3):
    print(inputs.numpy(), "=>", labels.numpy())

[ 0 3 6 9 12 15 18 21 24] => [ 3 6 9 12 15 18 21 24 27]
[ 5 8 11 14 17 20 23 26 29] => [ 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]
It is common to encounter datasets whose classes are unbalanced. A good approach here is to resample the dataset, and tf.data provides two methods for doing so.
The credit-card fraud dataset is a perfect way to demonstrate this:
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    cache_dir='/media/storage',
    cache_subdir='Datasets',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')
creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])
Now check the distribution of the target classes to see how skewed they are:
def count(counts, batch):
    features, labels = batch
    class_1 = labels == 1
    class_1 = tf.cast(class_1, tf.int32)

    class_0 = labels == 0
    class_0 = tf.cast(class_0, tf.int32)

    counts['class_0'] += tf.reduce_sum(class_0)
    counts['class_1'] += tf.reduce_sum(class_1)
    return counts

counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func=count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)
[0.9958 0.0042]
To work with unbalanced data, the best idea is to balance it. Here are a couple of methods for doing so.
The simplest is to use sample_from_datasets. It works particularly well when you have a separate dataset per class, so here we use a filter to split the fraud data by class:
negative_ds = (
    creditcard_ds
    .unbatch()
    .filter(lambda features, label: label == 0)
    .repeat())

positive_ds = (
    creditcard_ds
    .unbatch()
    .filter(lambda features, label: label == 1)
    .repeat())
for features, label in positive_ds.batch(10).take(1):
    print(label.numpy())

[1 1 1 1 1 1 1 1 1 1]
Pass the datasets, along with the weight you want for each, to tf.data.experimental.sample_from_datasets:
balanced_ds = tf.data.experimental.sample_from_datasets(
    [negative_ds, positive_ds], [0.5, 0.5]).batch(10)
The dataset now produces examples of each class with 50/50 probability:
for features, labels in balanced_ds.take(10):
    print(labels.numpy())

[1 0 0 1 0 1 1 1 1 1]
[1 0 1 1 1 0 0 1 0 1]
[1 0 0 1 0 1 1 0 1 1]
[0 1 0 1 0 0 0 0 0 1]
[1 1 0 0 0 1 0 0 0 1]
[0 0 1 0 1 0 1 1 0 1]
[0 0 1 1 1 0 0 0 0 1]
[0 1 1 0 1 1 0 0 0 1]
[0 1 0 1 0 1 0 0 0 0]
[1 0 0 1 0 0 0 0 1 1]
As noted, sample_from_datasets needs the datasets to be separated by class. We could use Dataset.filter for that, but it would load the data twice.
The data.experimental.rejection_resample function lets you rebalance the data while loading it only once. It achieves the balance by dropping elements from the dataset.
rejection_resample takes a class_func argument. This function is applied to each dataset element to determine which class it belongs to.
The elements of creditcard_ds are already (features, label) pairs, so class_func just has to return the label:

def class_func(features, label):
    return label
A target distribution, and preferably an estimate of the initial distribution, are also needed:
resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)
resampler works on individual observations, so you must unbatch() the dataset before applying it:
resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)
The resampler returns (class, example) pairs built from the output of class_func. In this case we already had (features, label) pairs, so we use map to drop the extra copy of the labels:
balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)

for features, labels in balanced_ds.take(10):
    print(labels.numpy())
Proportion of examples rejected by sampler is high: [0.995800793][0.995800793 0.00419921894][0 1]
[1 1 1 1 0 1 0 0 1 1]
[0 1 0 1 0 0 1 1 0 0]
[0 1 1 0 1 1 0 0 0 1]
[0 1 1 0 1 1 0 1 0 1]
[0 0 1 1 1 1 1 0 0 1]
[0 1 1 0 0 1 0 0 0 0]
[0 0 0 1 1 1 1 1 1 1]
[1 1 1 1 0 1 0 0 1 1]
[0 0 1 0 1 0 0 1 0 1]
[1 0 0 0 0 1 0 0 0 1]
What is the drawback of this method? If the imbalance is very large, a very large amount of data is discarded. So which matters more: the amount of data, or the amount of compute?