TPP Topic 16: The Power of Plain Text

See the first post in The Pragmatic Programmer 20th Anniversary Edition series for an introduction.

Challenge 1

Design a small address book database (name, phone number, and so on) using a straightforward binary representation in your language of choice. Do this before reading the rest of this challenge.

  • Translate that format into a plain-text format using XML or JSON.
  • For each version, add a new, variable-length field called directions in which you might enter directions to each person’s house.

What issues come up regarding versioning and extensibility? Which form was easier to modify? What about converting existing data?

Full code can be found on GitHub.

Version 1

Data Model

Each address book record is represented by a Person class containing basic personal information and address fields. Each record also gets a unique id, generated as a UUID. Storing addresses in a universal way is quite complex; however, as this is not a challenge about data modelling, I have assumed a very basic model of a UK address:

# address_book/models.py

from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    id: str = field(default_factory=lambda: str(uuid4()))

I’m using Python 3.7 dataclasses because Person is mainly (apart from id generation) a Data Transfer Object (DTO). Usage:

>>> Person("Ben", "Steadman", "+1-087-184-1440", "1", "A Road", "My Town", "CB234")
Person(first_name='Ben', last_name='Steadman', phone_number='+1-087-184-1440', house_number='1', street='A Road', town='My Town', postcode='CB234', id='a14fe77b-b5d2-46e7-b42c-9392b4bbec28')

To aid testing, generate_people generates arbitrary Person instances using the excellent Faker library:

# address_book/models.py

from typing import Iterable

from faker import Faker

fake = Faker("en_GB")


def generate_people(n: int) -> Iterable[Person]:
    for _ in range(n):
        yield Person(
            fake.first_name(),
            fake.last_name(),
            fake.phone_number(),
            fake.building_number(),
            fake.street_name(),
            fake.city(),
            fake.postcode(),
        )

Usage:

>>> list(generate_people(2))
[
    Person(
        first_name="Victor",
        last_name="Pearce",
        phone_number="01184960739",
        house_number="2",
        street="Mohamed divide",
        town="Charleneburgh",
        postcode="LS7 0DJ",
        id="cb242277-44dd-4836-98c7-ddbe10183fb4",
    ),
    Person(
        first_name="Stanley",
        last_name="Ashton",
        phone_number="(0131) 496 0908",
        house_number="2",
        street="Karen bridge",
        town="Port Gailland",
        postcode="L3J 2YF",
        id="ef85cfd1-08eb-4629-8747-3d8be1580fc7",
    ),
]

Binary Representation

As this challenge is about data formats and not building a database, I’m interpreting address book database as a file containing a list of address book records - not a DBMS.

To convert between the Person class and a binary representation, the Python struct module can be used:

Performs conversions between Python values and C structs represented as Python bytes objects.

Python struct documentation

Person can be represented using the following Struct:

import struct

PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")

Which corresponds to the following C struct:

struct Person {
    char first_name[50];
    char last_name[50];
    char phone_number[30];
    char house_number[10];
    char street[50];
    char town[50];
    char postcode[10];
    char id[36];
};

Binary packing/unpacking usage:

>>> as_bytes = PersonStruct.pack(b'Ben', b'Steadman', b'+44(0)116 4960124', b'1', b'A Road', b'My Town', b'CB234', b'b36cb798-946e-4dca-b89c-f393616feb7b')
>>> as_bytes
b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x00\x00A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00CB234\x00\x00\x00\x00\x00b36cb798-946e-4dca-b89c-f393616feb7b'
>>> PersonStruct.unpack(as_bytes)
(b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'1\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', b'CB234\x00\x00\x00\x00\x00', b'b36cb798-946e-4dca-b89c-f393616feb7b')

  • Note how the values in the tuple returned from PersonStruct.unpack are padded with \x00 (null bytes), because the struct format specifies fields that are longer than the values provided. These will need to be removed when unpacking into Person objects.
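
The padding behaviour described above can be checked in isolation. The following is a small sketch using only the standard library; it reuses the same format string as PersonStruct, while the sample values are arbitrary. It also shows the flip side of fixed-width fields: a value longer than its field is silently truncated.

```python
import struct

# Same layout as PersonStruct in the post: eight fixed-width byte strings.
PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")

# The record size is the sum of the field widths:
# 50 + 50 + 30 + 10 + 50 + 50 + 10 + 36 = 286 bytes.
record_size = PersonStruct.size

# A value shorter than its field is padded with null bytes.
padded = struct.pack("10s", b"CB234")  # b'CB234\x00\x00\x00\x00\x00'

# A value longer than its field is truncated without any error.
truncated = struct.pack("5s", b"Charleneburgh")  # b'Charl'

# Stripping trailing null bytes recovers the original short value.
recovered = struct.unpack("10s", padded)[0].rstrip(b"\x00")  # b'CB234'
```

Truncation is worth keeping in mind when choosing field widths: nothing in the packing step warns that data was lost.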

To provide a higher level of abstraction over these raw bytes, the conversion functionality can be wrapped up into some functions which deal with Person objects:

# address_book/binary.py

import struct
from dataclasses import astuple

from .models import Person

PersonStruct = struct.Struct("50s50s30s10s50s50s10s36s")


def from_bytes(buffer: bytes) -> Person:
    return Person(
        *(
            # remove null bytes added by string packing
            x.decode("utf-8").rstrip("\x00")
            for x in PersonStruct.unpack(buffer)
        )
    )


def to_bytes(p: Person) -> bytes:
    return PersonStruct.pack(
        *(s.encode("utf-8") for s in astuple(p))
    )

Usage:

>>> me = Person("Ben", "Steadman", "+44(0)116 4960124", "1", "A Road", "My Town", "CB234")
>>> as_bytes = to_bytes(me)
>>> me_again = from_bytes(as_bytes)
>>> me == me_again
True

These Person conversion functions can be used in higher level functions to read and write an entire address book database:

# address_book/binary.py

from functools import partial
from pathlib import Path
from typing import Iterable, List

def read_address_book(db: Path) -> List[Person]:
    people = []
    with db.open("rb") as f:
        for chunk in iter(partial(f.read, PersonStruct.size), b""):
            people.append(from_bytes(chunk))
    return people


def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("wb") as f:
        f.write(b"".join(to_bytes(p) for p in people))

Usage:

>>> people = list(generate_people(50))
>>> db = Path("data/address-book.bin")
>>> write_address_book(db, people)
>>> people_again = read_address_book(db)
>>> people == people_again
True
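
The chunked read in read_address_book relies on the two-argument form of iter. As a minimal sketch of that idiom on its own, the example below uses an in-memory file and a made-up 4-byte record width (RECORD_SIZE is a stand-in for PersonStruct.size, not a value from the real code):

```python
import io
from functools import partial

RECORD_SIZE = 4  # hypothetical record width, standing in for PersonStruct.size

# An in-memory "file" holding three 4-byte records.
f = io.BytesIO(b"aaaabbbbcccc")

# iter(callable, sentinel) calls f.read(RECORD_SIZE) repeatedly and stops as
# soon as the call returns the sentinel b"", which read() does at end of file.
chunks = list(iter(partial(f.read, RECORD_SIZE), b""))

# A file whose length is not a multiple of the record size yields a short
# final chunk; passing that chunk to Struct.unpack would raise struct.error.
short = list(iter(partial(io.BytesIO(b"aaaabb").read, RECORD_SIZE), b""))
```

Here chunks holds the three full records, while short ends with a 2-byte fragment, which is how a truncated database file would show up.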

Plain Text Representation

I’ve chosen JSON as the plain text format due to the excellent Python standard library json module making it easy to work with. Using the same Person model, the functions from_dict and to_dict are analogous to from_bytes and to_bytes respectively as the json module converts JSON objects to and from Python dictionaries.

# address_book/plain_text.py

from dataclasses import asdict

from .models import Person


def from_dict(d: dict) -> Person:
    return Person(**d)


def to_dict(p: Person) -> dict:
    return asdict(p)

Usage:

>>> me = Person("Ben", "Steadman", "+44(0)116 4960124", "1", "A Road", "My Town", "CB234")
>>> as_dict = to_dict(me)
>>> me_again = from_dict(as_dict)
>>> me == me_again
True

These can then be used to create JSON versions of read_address_book and write_address_book:

# address_book/plain_text.py

import json
from pathlib import Path
from typing import Iterable, List


def read_address_book(db: Path) -> List[Person]:
    with db.open() as f:
        return [from_dict(d) for d in json.load(f)]


def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("w") as f:
        json.dump([to_dict(p) for p in people], f)

Usage:

>>> people = list(generate_people(50))
>>> db = Path("data/address-book.json")
>>> write_address_book(db, people)
>>> people_again = read_address_book(db)
>>> people == people_again
True
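
To get a feel for what ends up on disk, here is a small sketch that serialises a record to a string rather than a file. Contact is a hypothetical, trimmed-down stand-in for Person, and indent=2 is my own choice to make the output easier to read; the functions above write compact JSON.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Contact:  # hypothetical, trimmed-down stand-in for Person
    first_name: str
    phone_number: str


people = [Contact("Ben", "+44(0)116 4960124")]

# indent=2 is optional; it only makes the text easier to read and diff.
text = json.dumps([asdict(p) for p in people], indent=2)
print(text)

# Loading the text back yields objects equal to the originals.
people_again = [Contact(**d) for d in json.loads(text)]
```

Unlike the binary file, the result can be inspected and edited with any text editor.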

Tests

Each implementation is also covered by a set of simple unit tests, asserting the correctness of the conversions to and from their respective formats:

import pytest

from address_book import binary, plain_text
from address_book.models import Person, generate_people


@pytest.mark.parametrize("p", generate_people(50))
def test_to_bytes_inverts_from_bytes(p):
    p_bytes = binary.to_bytes(p)
    p_again = binary.from_bytes(p_bytes)
    assert p == p_again


@pytest.mark.parametrize("p", generate_people(50))
def test_to_dict_inverts_from_dict(p):
    p_dict = plain_text.to_dict(p)
    p_again = plain_text.from_dict(p_dict)
    assert p == p_again


@pytest.mark.parametrize(
    "module,fname", [(binary, "address-book.bin"), (plain_text, "address-book.json")]
)
def test_write_address_book_inverts_read_address_book(module, fname, tmp_path):
    db = tmp_path / fname
    # sanity check
    assert db.exists() is False

    people = list(generate_people(50))
    module.write_address_book(db, people)

    assert db.exists() is True
    assert db.stat().st_size > 0

    people_again = module.read_address_book(db)

    assert people == people_again

Version 2 (variable length directions)

Adding the additional directions field to the model is simple enough:

from dataclasses import dataclass, field
from typing import Iterable
from uuid import uuid4

from faker import Faker

fake = Faker("en_GB")


@dataclass
class Person:
    first_name: str
    last_name: str
    phone_number: str
    house_number: str
    street: str
    town: str
    postcode: str
    directions: str # new
    id: str = field(default_factory=lambda: str(uuid4()))


def generate_people(n: int) -> Iterable[Person]:
    for _ in range(n):
        yield Person(
            fake.first_name(),
            fake.last_name(),
            fake.phone_number(),
            fake.building_number(),
            fake.street_name(),
            fake.city(),
            fake.postcode(),
            # new
            fake.text(),  # random latin is about as useful as most directions
        )

Binary Representation

Since the struct module deals with C structs, strings are represented as C char arrays whose fixed length is specified in the format string, e.g. struct.pack("11s", b"hello world"), so the size of every field is fixed by the format. Supporting variable-length fields in general is quite an involved process, and if you need to do this for a real application, using a third-party library such as NetStruct would be recommended. For the purpose of this challenge, however, I won’t be using it, nor will I be implementing a general solution: the code for packing/unpacking records is very tightly coupled to the structure of the records, and I would not recommend following this approach in a real application. However, it does demonstrate the difficulties that can arise when using binary formats.

Since the size of the directions field is variable, the complete format string for packing/unpacking of records using struct must be dynamically created:

>>> me = Person(
    "Ben",
    "Steadman",
    "+44(0)116 4960124",
    "1",
    "A Road",
    "My Town",
    "CB234",
    "Take a left at the roundabout",
)
>>> fmt = "50s50s30s10s50s50s10s{}s36s".format(len(me.directions))
>>> struct.pack(fmt, *(s.encode("utf-8") for s in astuple(me)))
b'Ben\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00Steadman\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00+44(0)116 4960124\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x00\x00A Road\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00My Town\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00CB234\x00\x00\x00\x00\x00Take a left at the roundaboutbfe3c3e5-8b65-4e49-8d26-3981257a0dee'
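
One caveat when sizing a field this way: the "s" format counts bytes, while len on a str counts characters. The two agree for ASCII text such as the example above, but not in general. The sketch below uses a made-up directions string to illustrate the difference.

```python
import struct

directions = "Turn left at the café"  # made-up example with one non-ASCII character

n_chars = len(directions)             # 21 characters
encoded = directions.encode("utf-8")
n_bytes = len(encoded)                # 22 bytes, since "é" takes two bytes in UTF-8

# Sizing the field by character count silently drops the final byte.
too_short = struct.pack("{}s".format(n_chars), encoded)

# Sizing the field by encoded length keeps every byte.
exact = struct.pack("{}s".format(n_bytes), encoded)
```

In this case too_short ends halfway through the encoded "é", so decoding it as UTF-8 would fail.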

Furthermore, since each packed record will be of a different size, the database file cannot simply be read in equal-sized chunks and passed to from_bytes as in the first implementation. To solve this, each record is preceded by its size in bytes. This value can be used to determine the size of the next chunk to read from the file and pass to from_bytes:

# address_book/binary.py

from typing import Tuple

PERSON_STRUCT_FMT = "50s50s30s10s50s50s10s{}s36s"


def to_bytes(p: Person) -> Tuple[bytes, int]:
    # dynamically add size to format for variable length directions field;
    # measure the encoded bytes, not the characters, so that non-ASCII
    # directions are not truncated
    fmt = PERSON_STRUCT_FMT.format(len(p.directions.encode("utf-8")))
    return (
        struct.pack(fmt, *(s.encode("utf-8") for s in astuple(p))),
        struct.calcsize(fmt),
    )


RecordSizeStruct = struct.Struct("I")


def write_address_book(db: Path, people: Iterable[Person]):
    with db.open("wb") as f:
        records_with_sizes = (
            RecordSizeStruct.pack(size) + p_bytes
            for p_bytes, size in (to_bytes(p) for p in people)
        )
        f.write(b"".join(records_with_sizes))

from_bytes still receives a buffer of bytes representing an entire packed record. However, to handle the variable-length directions field, it needs to calculate the positions within the buffer at which the directions field begins and ends, split the buffer accordingly, and unpack each section individually:

# address_book/binary.py

def from_bytes(buffer: bytes) -> Person:
    # calculate sizes of non-variable formats
    before_fmt, after_fmt = PERSON_STRUCT_FMT.split("{}s")
    before_start = struct.calcsize(before_fmt)
    after_start = len(buffer) - struct.calcsize(after_fmt)

    before, direction, after = (
        buffer[:before_start],
        buffer[before_start:after_start],
        buffer[after_start:],
    )

    # dynamically build struct format string for variable length field
    direction_fmt = "{}s".format(len(direction))
    data = (
        struct.unpack(before_fmt, before)
        + struct.unpack(direction_fmt, direction)
        + struct.unpack(after_fmt, after)
    )
    return Person(*(x.decode("utf-8").rstrip("\x00") for x in data))


def read_address_book(db: Path) -> List[Person]:
    people = []
    with db.open("rb") as f:
        while True:
            # each record preceded by its size in bytes, use to determine number
            # of bytes to read from db for the entire record
            size_buf = f.read(RecordSizeStruct.size)
            if not size_buf:
                break  # reached end of db
            record_size = RecordSizeStruct.unpack(size_buf)[0]
            people.append(from_bytes(f.read(record_size)))
    return people
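
To make the offset arithmetic in from_bytes concrete, here is a worked sketch with the numbers spelled out. It uses the same format string as PERSON_STRUCT_FMT; the field values are the sample ones used earlier in this post.

```python
import struct

FMT = "50s50s30s10s50s50s10s{}s36s"  # same layout as PERSON_STRUCT_FMT

before_fmt, after_fmt = FMT.split("{}s")
fixed_before = struct.calcsize(before_fmt)  # 250 bytes ahead of directions
fixed_after = struct.calcsize(after_fmt)    # 36 bytes for the trailing id

directions = b"Take a left at the roundabout"
fields = [
    b"Ben", b"Steadman", b"+44(0)116 4960124", b"1", b"A Road", b"My Town",
    b"CB234", directions, b"b36cb798-946e-4dca-b89c-f393616feb7b",
]
record = struct.pack(FMT.format(len(directions)), *fields)

# Whatever lies between the two fixed-width regions is the directions field.
middle = record[fixed_before:len(record) - fixed_after]
```

This only works because there is exactly one variable-length field; a second one would leave no way to tell where the first ends without storing more lengths.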

A slight adjustment to the tests is needed to account for to_bytes now returning a tuple:

@pytest.mark.parametrize("p", generate_people(50))
def test_to_bytes_inverts_from_bytes(p):
    p_bytes, size = binary.to_bytes(p)
    p_again = binary.from_bytes(p_bytes)
    assert p == p_again

Plain Text Representation

Other than the changes to the Person class, no further changes are required to support the new variable length field.
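
If the new field were given a default value, the same from_dict approach would also read records written before the field existed. The following sketch assumes that choice; Contact is a hypothetical, trimmed-down stand-in for Person, which in this post declares directions as a required field.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4


@dataclass
class Contact:  # hypothetical, trimmed-down stand-in for Person
    first_name: str
    postcode: str
    directions: Optional[str] = None  # assumption: the field is optional
    id: str = field(default_factory=lambda: str(uuid4()))


# A record as version 1 would have written it, and its version 2 equivalent.
v1_record = {"first_name": "Ben", "postcode": "CB234", "id": "abc"}
v2_record = {**v1_record, "directions": "Take a left at the roundabout"}

old = Contact(**v1_record)  # directions falls back to None
new = Contact(**v2_record)
```

Because JSON objects are keyed by name rather than by position, the missing key simply falls back to the default.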

Summary

Though I already agreed with the authors’ preference for plain text formats, this challenge certainly demonstrated that in most cases plain text is the appropriate format to use.

The binary representation is more difficult to extend and (at least in this example) required breaking changes to do so. This made any data written using the first version (prior to the introduction of the variable length directions field) incompatible with any data written using the second version. A versioning scheme would need to be devised and represented within the binary format, for example using a pre-defined ‘header’ block of bytes to contain some metadata.
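
As a rough sketch of what such a header might look like (this is not part of the code for this post), the example below writes a small fixed-size block at the start of the file. The ADBK marker, the HeaderStruct layout and both function names are invented for illustration.

```python
import io
import struct

# Hypothetical header: a 4-byte magic marker plus an unsigned 16-bit version,
# little-endian with no padding ("<"), 6 bytes in total.
HeaderStruct = struct.Struct("<4sH")
MAGIC = b"ADBK"  # made-up marker for this sketch


def write_header(f, version: int) -> None:
    f.write(HeaderStruct.pack(MAGIC, version))


def read_version(f) -> int:
    raw = f.read(HeaderStruct.size)
    if len(raw) != HeaderStruct.size:
        raise ValueError("file too short to contain a header")
    magic, version = HeaderStruct.unpack(raw)
    if magic != MAGIC:
        raise ValueError("not an address book file")
    return version


# Round trip through an in-memory file.
f = io.BytesIO()
write_header(f, 2)
f.seek(0)
version = read_version(f)
```

A reader could then dispatch on the version to pick the matching record layout, though files written before the header was introduced would still need a one-off conversion.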

The plain text representation was simple to implement using standard, built-in tools and was simple to extend. If the directions field is deemed optional, any data written in the first version is fully compatible with the second version. Converting the data would be a simple text transformation and could in fact be achieved directly in the shell using a tool such as jq. Here’s an example that adds the directions field, setting it to a default of null:

$ cat data/address-book.json | jq 'map(. + {"directions": null})'
[
  {
    "first_name": "Fiona",
    "last_name": "Power",
    "phone_number": "01314960440",
    "house_number": "91",
    "street": "Sam fields",
    "town": "North Shanebury",
    "postcode": "M38 1FH",
    "directions": null,
    "id": "264bfab6-f1a5-4adc-a86b-28ae8e41817b"
  },
  {
    "first_name": "Lorraine",
    "last_name": "Richards",
    "phone_number": "+448081570114",
    "house_number": "9",
    "street": "Ashleigh loaf",
    "town": "North William",
    "postcode": "M4H 5PW",
    "directions": null,
    "id": "b0b98056-c8ff-4b4e-a68b-b31e8ae43ac3"
  },
  ...
]
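
The same migration can be sketched in Python using only the json module. The records here are inlined and trimmed down for brevity rather than read from data/address-book.json.

```python
import json

# Two trimmed-down version 1 records, inlined instead of read from disk.
v1_text = '[{"first_name": "Fiona", "id": "1"}, {"first_name": "Lorraine", "id": "2"}]'

records = json.loads(v1_text)
for record in records:
    # setdefault only adds the key when it is missing, so rerunning the
    # migration never overwrites directions that have since been filled in.
    record.setdefault("directions", None)

v2_text = json.dumps(records, indent=2)
```

Note the small difference from the jq filter above, which would reset directions to null on every record it is applied to.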