The TFRecord Format

The TFRecord format is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently. It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record consists of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and finally a CRC checksum for the data). You can easily create a TFRecord file using the tf.io.TFRecordWriter class:

with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

And you can then use a tf.data.TFRecordDataset to read one or more TFRecord files:

filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

This will output:

tf.Tensor(b'This is the first record', shape=(), dtype=string) 
tf.Tensor(b'And this is the second record', shape=(), dtype=string)

By default, a TFRecordDataset will read files one by one, but you can make it read multiple files in parallel and interleave their records by setting num_parallel_reads. Alternatively, you could obtain the same result by using list_files() and interleave() as we did earlier to read multiple CSV files.
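
For example, here is a minimal sketch (the extra file names are just for illustration) that reads three TFRecord files in parallel and interleaves their records:

filepaths = ["my_data_0.tfrecord", "my_data_1.tfrecord", "my_data_2.tfrecord"]
# Read up to three files in parallel and interleave their records
dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)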

Compressed TFRecord Files

It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection. You can create a compressed TFRecord file by setting the options argument:

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    [...]

When reading a compressed TFRecord file, you need to specify the compression type:

dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                  compression_type="GZIP")

A Brief Introduction to Protocol Buffers

Even though each record can use any binary format you want, TFRecord files usually contain serialized protocol buffers (also called protobufs). This is a portable, extensible, and efficient binary format developed at Google back in 2001 and made open source in 2008; protobufs are now widely used, in particular in gRPC, Google’s remote procedure call system. They are defined using a simple language that looks like this:

syntax = "proto3"; 
message Person {  
string name = 1;  
int32 id = 2;  
repeated string email = 3; 
} 

This definition says we are using version 3 of the protobuf format, and it specifies that each Person object may (optionally) have a name of type string, an id of type int32, and zero or more email fields, each of type string. The numbers 1, 2, and 3 are the field identifiers: they will be used in each record’s binary representation. Once you have a definition in a .proto file, you can compile it with protoc, the protobuf compiler, to generate access classes in Python (or some other language). Note that the protobuf definitions we will use have already been compiled for you, and their Python classes are part of TensorFlow, so you will not need to use protoc.
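
For reference, compiling such a person.proto file yourself would look roughly like this (assuming protoc is installed and person.proto is in the current directory); it generates the person_pb2.py module used below:

protoc --python_out=. person.proto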

All you need to know is how to use protobuf access classes in Python. To illustrate the basics, let’s look at a simple example that uses the access classes generated for the Person protobuf (the code is explained in the comments):

>>> from person_pb2 import Person  # import the generated access class
>>> person = Person(name="Al", id=123, email=["a@b.com"])  # create a Person
>>> print(person)  # display the Person
name: "Al"
id: 123
email: "a@b.com"
>>> person.name  # read a field
'Al'
>>> person.name = "Alice"  # modify a field
>>> person.email[0]  # repeated fields can be accessed like arrays
'a@b.com'
>>> person.email.append("c@d.com")  # add an email address
>>> s = person.SerializeToString()  # serialize the object to a byte string
>>> s
b'\n\x05Alice\x10{\x1a\x07a@b.com\x1a\x07c@d.com'
>>> person2 = Person()  # create a new Person
>>> person2.ParseFromString(s)  # parse the byte string (27 bytes long)
27
>>> person == person2  # now they are equal
True

In short, we import the Person class generated by protoc, we create an instance and play with it, visualizing it and reading and writing some fields, then we serialize it using the SerializeToString() method. This is the binary data that is ready to be saved or transmitted over the network. When reading or receiving this binary data, we can parse it using the ParseFromString() method, and we get a copy of the object that was serialized.

We could save the serialized Person object to a TFRecord file, then we could load and parse it: everything would work fine. However, SerializeToString() and ParseFromString() are not TensorFlow operations (and neither are the other operations in this code), so they cannot be included in a TensorFlow Function (except by wrapping them in a tf.py_function() operation, which would make the code slower and less portable). Fortunately, TensorFlow does include special protobuf definitions for which it provides parsing operations.
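
Just to illustrate, here is a rough sketch of what such wrapping could look like (parse_person_py is a hypothetical helper, and s is the serialized byte string from the previous example):

def parse_person_py(serialized):
    # Plain Python code, not TensorFlow operations
    person = Person()
    person.ParseFromString(serialized.numpy())
    return person.name, person.id

# Wrap the Python function so it can be called from TensorFlow code
name, person_id = tf.py_function(parse_person_py, inp=[s],
                                 Tout=[tf.string, tf.int64])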

TensorFlow Protobufs

The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset. It contains a list of named features, where each feature can be a list of byte strings, a list of floats, or a list of integers. Here is the protobuf definition:

syntax = "proto3"; 
message BytesList { repeated bytes value = 1; } 
message FloatList { repeated float value = 1 [packed = true]; } 
message Int64List { repeated int64 value = 1 [packed = true]; } 
message Feature { 
 oneof kind {        
BytesList bytes_list = 1;        
FloatList float_list = 2;        
Int64List int64_list = 3; 
  } 
}; 
message Features { map<string, Feature> feature = 1; }; 
message Example { Features features = 1; }; 

The definitions of BytesList, FloatList, and Int64List are straightforward enough. Note that [packed = true] is used for repeated numerical fields, for a more efficient encoding. A Feature contains either a BytesList, a FloatList, or an Int64List. A Features (with an s) contains a dictionary that maps a feature name to the corresponding feature value. And finally, an Example contains only a Features object. Here is how you could create a tf.train.Example representing the same person as earlier and write it to a TFRecord file:

from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com",
                                                          b"c@d.com"])),
        }))

The code is a bit verbose and repetitive, but it’s rather straightforward (and you could easily wrap it inside a small helper function). Now that we have an Example protobuf, we can serialize it by calling its SerializeToString() method, then write the resulting data to a TFRecord file:

with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

Normally you would write much more than one Example! Typically, you would create a conversion script that reads from your current format (say, CSV files), creates an Example protobuf for each instance, serializes them, and saves them to several TFRecord files, ideally shuffling them in the process. This requires a bit of work, so once again make sure it is really necessary (perhaps your pipeline works fine with CSV files).
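
As a rough sketch, such a conversion script could look like the following (the contacts.csv file, its "name,id,emails" layout, and the create_example() helper are all hypothetical; a real script would typically shard the output across several TFRecord files and shuffle the instances):

import csv

import tensorflow as tf
from tensorflow.train import BytesList, Int64List
from tensorflow.train import Feature, Features, Example

def create_example(name, person_id, emails):
    # Build one Example protobuf, just like in the previous code example
    return Example(
        features=Features(
            feature={
                "name": Feature(bytes_list=BytesList(value=[name.encode("utf-8")])),
                "id": Feature(int64_list=Int64List(value=[int(person_id)])),
                "emails": Feature(bytes_list=BytesList(
                    value=[email.encode("utf-8") for email in emails])),
            }))

# Each CSV row is assumed to be "name,id,email1|email2|..."
with open("contacts.csv") as csv_file, \
     tf.io.TFRecordWriter("contacts.tfrecord") as writer:
    for name, person_id, emails in csv.reader(csv_file):
        example = create_example(name, person_id, emails.split("|"))
        writer.write(example.SerializeToString())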

Now that we have a nice TFRecord file containing a serialized Example, let’s try to load it.

Loading and Parsing Examples

To load the serialized Example protobufs, we will use a tf.data.TFRecordDataset once again, and we will parse each Example using tf.io.parse_single_example(). This is a TensorFlow operation, so it can be included in a TF Function. It requires at least two arguments: a string scalar tensor containing the serialized data, and a description of each feature. The description is a dictionary that maps each feature name to either a tf.io.FixedLenFeature descriptor indicating the feature’s shape, type, and default value, or a tf.io.VarLenFeature descriptor indicating only the type (if the length of the feature’s list may vary, such as for the “emails” feature).

The following code defines a description dictionary, then it iterates over the TFRecordDataset and parses the serialized Example protobuf this dataset contains:

feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example,
                                                feature_description)

The fixed-length features are parsed as regular tensors, but the variable-length features are parsed as sparse tensors. You can convert a sparse tensor to a dense tensor using tf.sparse.to_dense(), but in this case it is simpler to just access its values:

>>> tf.sparse.to_dense(parsed_example["emails"], default_value=b"")
<tf.Tensor: [...] dtype=string, numpy=array([b'a@b.com', b'c@d.com'], [...])>
>>> parsed_example["emails"].values
<tf.Tensor: [...] dtype=string, numpy=array([b'a@b.com', b'c@d.com'], [...])>

A BytesList can contain any binary data you want, including any serialized object. For example, you can use tf.io.encode_jpeg() to encode an image using the JPEG format and put this binary data in a BytesList. Later, when your code reads the TFRecord, it will start by parsing the Example, then it will need to call tf.io.decode_jpeg() to parse the data and get the original image (or you can use tf.io.decode_image(), which can decode any BMP, GIF, JPEG, or PNG image). You can also store any tensor you want in a BytesList by serializing the tensor using tf.io.serialize_tensor() then putting the resulting byte string in a BytesList feature. Later, when you parse the TFRecord, you can parse this data using tf.io.parse_tensor().
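
Here is a minimal sketch of the tensor case (the tensor value and the "my_tensor" feature name are just for illustration; it reuses the tf.train classes imported earlier):

# Serialize a tensor to a byte string and store it in a BytesList feature
t = tf.constant([[1.0, 2.0], [3.0, 4.0]])
serialized_t = tf.io.serialize_tensor(t)
tensor_example = Example(
    features=Features(
        feature={
            "my_tensor": Feature(bytes_list=BytesList(value=[serialized_t.numpy()])),
        }))

# Later, after parsing the Example, recover the original tensor
parsed = tf.io.parse_single_example(
    tensor_example.SerializeToString(),
    {"my_tensor": tf.io.FixedLenFeature([], tf.string)})
recovered_t = tf.io.parse_tensor(parsed["my_tensor"], out_type=tf.float32)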

Instead of parsing examples one by one using tf.io.parse_single_example(), you may want to parse them batch by batch using tf.io.parse_example():

dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)
for serialized_examples in dataset:
    parsed_examples = tf.io.parse_example(serialized_examples,
                                          feature_description)

As you can see, the Example protobuf will probably be sufficient for most use cases. However, it may be a bit cumbersome to use when you are dealing with lists of lists. For example, suppose you want to classify text documents. Each document may be represented as a list of sentences, where each sentence is represented as a list of words. And perhaps each document also has a list of comments, where each comment is represented as a list of words. There may be some contextual data too, such as the document’s author, title, and publication date. TensorFlow’s SequenceExample protobuf is designed for such use cases.

Handling Lists of Lists Using the SequenceExample Protobuf

Here is the definition of the SequenceExample protobuf:

message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
    Features context = 1;
    FeatureLists feature_lists = 2;
};

A SequenceExample contains a Features object for the contextual data and a FeatureLists object that contains one or more named FeatureList objects (e.g., a FeatureList named “content” and another named “comments”). Each FeatureList contains a list of Feature objects, each of which may be a list of byte strings, a list of 64-bit integers, or a list of floats (in this example, each Feature would represent a sentence or a comment, perhaps in the form of a list of word identifiers). Building a SequenceExample, serializing it, and parsing it is similar to building, serializing, and parsing an Example, but you must use tf.io.parse_single_sequence_example() to parse a single SequenceExample or tf.io.parse_sequence_example() to parse a batch. Both functions return a tuple containing the context features (as a dictionary) and the feature lists (also as a dictionary). If the feature lists contain sequences of varying sizes (as in the preceding example), you may want to convert them to ragged tensors, using tf.RaggedTensor.from_sparse() (see the notebook for the full code):

parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, context_feature_descriptions,
    sequence_feature_descriptions)
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
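
To illustrate the construction side as well, here is a rough sketch of building and serializing a SequenceExample (the context values and the "content" sentences are made up for illustration; it reuses the tf.train classes imported earlier):

from tensorflow.train import FeatureList, FeatureLists, SequenceExample

# Contextual data about the document
context = Features(feature={
    "author_id": Feature(int64_list=Int64List(value=[123])),
    "title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
})

# Each sentence of the document becomes one Feature in the "content" FeatureList
sentences = [["When", "shall", "we", "three", "meet", "again", "?"],
             ["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
content_features = [
    Feature(bytes_list=BytesList(value=[word.encode("utf-8") for word in sentence]))
    for sentence in sentences]

sequence_example = SequenceExample(
    context=context,
    feature_lists=FeatureLists(feature_list={
        "content": FeatureList(feature=content_features),
    }))

serialized_sequence_example = sequence_example.SerializeToString()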

Now that you know how to efficiently store, load, and parse data, the next step is to prepare it so that it can be fed to a neural network.
