Thomas M. Kehrenberg

Beyond YAML: Towards a config file format without syntax typing and without significant whitespace

The question of configuration file formats can be a very emotional one. And rightly so! A good configuration experience is important. Every time I have to edit a JSON config file, I weep. When the shortcomings of JSON are mentioned, someone is bound to suggest YAML. So, what is wrong with YAML?

The two most common criticisms I hear about YAML are that it has significant whitespace and that the syntax determines the type (aka syntax typing). I agree with these criticisms, so let’s try to do better. We’ll start with the issue of significant whitespace.

Hjson

The one format that, in my opinion, comes closest to the Platonic ideal of good configuration syntax is Hjson (“Human JSON”). It looks like this (I’m using the syntax variant without root braces):

# comments are useful
"rate": 1000

# key names do not need to be placed in quotes
key: "value"

# you don't need quotes for strings
text: look ma, no quotes!

# note that for quoteless strings everything up
# to the next line is part of the string!

# nesting
dict:
{
one: 1
two: 2
}

# multiline string
# the string is "dedented" according to the indentation of the opening '''
haiku:
'''
JSON I love you.
But you strangle my expression.
This is so much better.
'''

# Obviously you can always use standard JSON syntax as well:
favNumbers: [ 1, 2, 3, 6, 42 ]

Note that Hjson isn’t entirely “whitespace-insensitive”; line breaks are very significant, but the following variations are equivalent in Hjson, and I think this is mostly what people care about:

dict1:
{
one: a
two: b
}

dict2: {
one: a # indentation doesn't matter!
two: b
}

dict3: {one: a
two: b
}

dict4: {one: "a", two: "b"
}

Note that we had to wrap the values in quotes in the last variant, because otherwise everything after one: would have been parsed as a single string.

I see two shortcomings in Hjson; the first one is minor, the second one major. The first one is that there are three (or two, depending on how you count) ways to start a comment: #, //, and /*. Just stick to one!

The second shortcoming is rooted in the fact that syntax determines type. It concerns the optional commas in Hjson. Let’s see how the following is parsed:

commas:
{
a: unquoted,
b: "quoted",
c: 2,
d: 3
e: 4,
}

The parsed JSON:

{
"commas": {
"a": "unquoted,",
"b": "quoted",
"c": 2,
"d": 3,
"e": 4
}
}

Hjson decides what to do with trailing commas based on the type of the value. I can sort of see why it was decided to do it like this, but I still think it’s bad. I would say the “a” and “b” entries are parsed correctly: if you want to use commas, you have to use quotes. This means “c” and “e” are parsed incorrectly; they should have been parsed as strings with the values “2,” and “4,”. But, as I point out in the next section, we should just parse everything as strings anyway, in which case this inconsistency goes away on its own.
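As a rough illustration of the rule I am arguing for (a quoted value may be followed by a comma; anything unquoted runs to the end of the line, comma included), here is a minimal Python sketch. `parse_value` is a hypothetical helper of my own and not part of Hjson, and it deliberately ignores escape sequences and comments.

```python
def parse_value(raw: str) -> str:
    """Return the string value for the text that follows `key:` on one line.

    Assumption: no escape sequences and no inline comments are supported.
    """
    text = raw.strip()
    if text.startswith('"'):
        end = text.find('"', 1)
        if end == -1:
            raise ValueError(f"unterminated quoted string: {raw!r}")
        rest = text[end + 1:].strip()
        # After the closing quote, only an optional trailing comma is allowed.
        if rest not in ("", ","):
            raise ValueError(f"unexpected text after closing quote: {rest!r}")
        return text[1:end]
    # Quoteless: everything up to the end of the line is the value.
    return text


# The raw right-hand sides from the "commas" example above.
raw_values = {"a": "unquoted,", "b": '"quoted",', "c": "2,", "d": "3", "e": "4,"}
parsed = {key: parse_value(value) for key, value in raw_values.items()}
print(parsed)
```

With this rule, “c” and “e” come out as the strings “2,” and “4,”, which is the same treatment “a” already gets.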

NestedText

NestedText is a project which solves the “syntax determines types” issue. If you want to know why this is a problem, read The Norway Problem: in YAML 1.1, an unquoted NO (the country code for Norway) is parsed as the boolean false. NestedText avoids this by simply parsing everything as a string and recommending data validation libraries to convert the strings to the desired values. The syntax of NestedText looks just like YAML:

debug: false
webmaster_email: admin@example.com

allowed_hosts:
- www.example.com

database:
host: db.example.com
port: 3306

But it’s parsed like this:

{"debug": "false",
"webmaster_email": "admin@example.com",
"allowed_hosts": ["www.example.com"],
"database": {"host": "db.example.com",
"port": "3306"}}

You can then use a data validation library like pydantic to convert:

#!/usr/bin/env python3

import nestedtext as nt
from pydantic import BaseModel, EmailStr
from typing import List
from pprint import pprint

class Database(BaseModel):
    host: str
    port: int

class Config(BaseModel):
    debug: bool
    webmaster_email: EmailStr # check for valid email address
    allowed_hosts: List[str]
    database: Database

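# Step 1: load the file; every leaf value is still a string
obj = nt.load('deploy.nt')
# Step 2: validate and convert (pydantic v1 API; v2 renames these to model_validate/model_dump)
config = Config.parse_obj(obj)

pprint(config.dict())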

Output:

{"debug": False,
"webmaster_email": "admin@example.com",
"allowed_hosts": ["www.example.com"],
"database": {"host": "db.example.com",
"port": 3306}}

I think it makes a lot of sense to do configuration parsing in two steps: first parse the file, then convert and validate the values. Oftentimes when I load a YAML file, I have to do some validation anyway, like checking whether the given port number is a positive integer within the allowed range. So, I might as well do the conversion there too. And all major programming languages already have these data validation libraries.
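To make the second step concrete without depending on any particular library, here is a small standard-library sketch. The helper names `to_bool` and `to_port` are my own inventions for illustration, and the accepted spellings and port range are assumptions you would adapt to your application.

```python
from dataclasses import dataclass


@dataclass
class Database:
    host: str
    port: int


def to_bool(text: str) -> bool:
    """Convert only the spellings this application chooses to accept."""
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"expected 'true' or 'false', got {text!r}")


def to_port(text: str) -> int:
    """Convert a string to a port number and check the allowed range."""
    port = int(text)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


# Step 1 (assumed already done): a type-unaware parser returned only strings.
raw = {"debug": "false", "database": {"host": "db.example.com", "port": "3306"}}

# Step 2: the application decides which fields become which types.
debug = to_bool(raw["debug"])
database = Database(host=raw["database"]["host"], port=to_port(raw["database"]["port"]))
print(debug, database)
```

Because the conversion is explicit, a value like `no` in a string field stays the string "no", and `to_bool("no")` fails loudly instead of silently becoming false.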

(I should also mention StrictYAML at this point. It’s essentially a bundling of a type-unaware parser and a data validation library, which allows you to define schemas for validating the loaded data. It’s a decent solution to the problem of syntax typing, though I think pydantic has a much nicer syntax for defining schemas, and so if I had to choose, I would go with NestedText+pydantic over StrictYAML.)

We conclude that NestedText is very nice, but it would be even better with a whitespace-insensitive syntax!

Why not XY?

I feel like it has already been discussed to death why all existing configuration formats suck, but let’s review them quickly anyway:

XML

My main complaint is that it is too verbose, though I don’t have anything against XML for very complicated configuration needs, like the configuration files for Wayland extensions. It’s hard to see how this could have been done in, e.g., YAML.

JSON

Too human-unfriendly (e.g., key names have to be quoted). And it has syntax typing. Though it’s of course fine for exchanging data between programs.

TOML

I mostly don’t like the syntax for nested configurations. The original author of PyTOML agrees:

TOML is a bad file format. It looks good at first glance, and for really really trivial things it is probably good. But once I started using it and the configuration schema became more complex, I found the syntax ugly and hard to read.

It’s fine for very simple configurations though.

INI

INI is, as far as I can tell, just a worse, un-specced TOML.

Conclusion

I propose a new configuration file format, based on the syntax of Hjson, but one that parses everything as strings.

This:

# specify rate in requests/second
"rate": 1000

# key names do not need to be placed in quotes
key: 104.45

# you don't need quotes for strings
text: look ma, no quotes!

# one way to define a mapping
commas: {
one: 1
two: 2
}

# trailing commas are allowed only when values are quoted
trailing:
{
one: 1,
two: "2",
}

# multiline string
haiku:
'''
JSON I love you.
But you strangle my expression.
This is so much better.
'''

# string continuation (will not have newlines)
long_str: this string is too long to fit on \
a single line

# to put multiple values on a single line, use quotes
favoriteNums: [
1 2
3
4, 5
"6", "42"
]

is parsed as

{
"rate": "1000",
"key": "104.45",
"text": "look ma, no quotes!",
"commas": {
"one": "1",
"two": "2"
},
"trailing": {
"one": "1,",
"two": "2"
},
"haiku": "JSON I love you.\nBut you strangle my expression.\nThis is so much better.",
"long_str": "this string is too long to fit on a single line",
"favoriteNums": [
"1 2",
"3",
"4, 5",
"6",
"42"
]
}

Unfortunately, I don’t have an implementation of this. The goal of this exercise was mostly to see whether I could come up with a config format that I liked, and in that regard it was a success. Anyone who wants to implement this has my full permission. You don’t even need to mention me; I would just be happy if this existed.
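To give a feel for how little machinery the string-only rule needs, here is a minimal Python sketch of a parser for a flat subset of the format. It is not a full implementation: it handles comments, optionally quoted keys, quoted and quoteless values, and backslash line continuation, but no nesting, no lists, and no multiline strings. The function names `unquote` and `parse_flat` are hypothetical, and the sketch assumes keys never contain a colon.

```python
def unquote(text: str, allow_comma: bool) -> str:
    """Strip surrounding double quotes; optionally drop a comma after them."""
    text = text.strip()
    if allow_comma and text.startswith('"') and text.endswith('",'):
        text = text[:-1]  # trailing commas are only allowed after quoted values
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    return text


def parse_flat(source: str) -> dict:
    """Parse `key: value` lines into a dict whose values are all strings."""
    result = {}
    lines = iter(source.splitlines())
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Join continuation lines: drop the backslash, keep the space before it.
        while line.endswith("\\"):
            line = line[:-1] + next(lines, "").strip()
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        result[unquote(key, allow_comma=False)] = unquote(value, allow_comma=True)
    return result


sample = r'''
# specify rate in requests/second
"rate": 1000
key: 104.45
text: look ma, no quotes!
quoted: "2",
bare: 1,
long_str: this string is too long to fit on \
a single line
'''

config = parse_flat(sample)
print(config)
```

Every value comes back as a string, including `1000` and `104.45`, and the unquoted `1,` keeps its comma while the quoted `"2",` does not.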