What can be so difficult about creating a sensible standard for Structured Data Documents? To understand why they tend to get improved into unusable complexity, I’ll need to explain a bit of background.
Structured Data Documents come in three different flavors: text files that represent object data, text files that represent tabular data (rows and columns), and text files that hold the settings, initialization or configuration values of applications.
So Many Formats
Formats such as XML, JSON, CSV, TOML, INI, YAML and CONF are in good everyday use. That doesn’t mean that they are always fit for purpose, though. CSV, for example, has a sensible proposed standard (RFC 4180, October 2005) that few, after two decades, conform to. JSON still doesn’t allow documentation of data. YAML lacks a built-in way of detecting the corruption or truncation of a document. TOML confusingly allows completely different ways of representing the same tables and arrays.
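To make the TOML point concrete, here is a minimal sketch in Python. It is nothing official, just an illustration using the standard tomllib module (Python 3.11 or later) to show that two visually unrelated documents parse to exactly the same data:

```python
import tomllib  # standard library from Python 3.11 onwards

# The same 'server' table, written as an inline table...
inline_style = 'server = { host = "localhost", port = 8080 }'

# ...and again as a table header with key/value pairs.
header_style = """
[server]
host = "localhost"
port = 8080
"""

# Both documents parse to exactly the same dictionary,
# which is what makes reviewing and diffing TOML awkward.
assert tomllib.loads(inline_style) == tomllib.loads(header_style)
```

The same is true of arrays of tables, which can be written either as inline arrays or as repeated [[table]] headers.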
For a system to be classed as a useful structured data document, it must have certain common features.
- You need to know whether you’ve got all the data that was intended. A file or a transmission can get truncated, and you need to be certain that it has all arrived (the sketch after this list shows one simple way of checking).
- The system you devise must be simple. Data must be visible and intelligible, and it must be easy to edit and write. You may be a propeller-head developer, but the world around you is composed of ordinary mortals who are thinking less about recursive descent parsers and more about what’s for lunch. These people will need to be able to use your clever data document. I must emphasize this point: it isn’t just ‘nice to have’, it is essential.
- The data document must not be verbose. If it is used for database data, it can get big, very big. Sure, it will work nicely when you’re testing it out on just little chunks of data, but you need to be able to accommodate huge jellyblobs of data.
- It must be possible to check that a structured data format is syntactically correct by parsing it.
- There should never be two or three syntactically different ways of representing the same structured data.
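None of the popular formats gives you the first item on that list for free. As a rough sketch of what ‘knowing you’ve got it all’ might involve, a writer can append a checksum line and a reader can refuse any document that doesn’t match. The trailer convention here is my own invention, purely for illustration, not part of any standard:

```python
import hashlib

TRAILER = "#sha256:"  # a made-up trailer convention, purely for illustration

def seal(document: str) -> str:
    """Append a checksum trailer so truncation or corruption can be detected."""
    digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
    return f"{document}\n{TRAILER}{digest}\n"

def verify(sealed: str) -> str:
    """Return the original document, or raise if it is incomplete or damaged."""
    body, _, trailer = sealed.rstrip("\n").rpartition("\n")
    if not trailer.startswith(TRAILER):
        raise ValueError("no trailer found: the document is probably truncated")
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != trailer[len(TRAILER):]:
        raise ValueError("checksum mismatch: the document is corrupt or truncated")
    return body
```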
There are virtues that are essential for particular categories of data documents, and some of them are contradictory: a virtue for one category becomes a vice for another.
- Data documents that are used for transmission downstream must have a definition of what the data means: a schema. Otherwise, you can never change them without hassle. For this, you obviously need to transmit information on the type of data, the measurement or quantity type, and probably also the distribution of the data. If it is people’s height, you need more than just whether it is metric or feet ‘n inches: you need to know that a height of 8ft is unlikely. It is also handy to have other check constraints defined for the data (the sketch after this list shows what such a constraint might look like).
- You need to be able to write to existing initialization files, not just rewrite them completely for a simple update. An INI or .conf file will be liberally commented if it is to be useful and helpful. Because the application never reads these comments, they get lost if you overwrite the file. This can be a source of error when such files need rewriting, and it makes life difficult for the person who has to track changes in the configuration of the infrastructure and who lodges all configuration files in source control.
- Some types of data documents have to be ‘strongly typed’, so that it is clear how complex data such as dates are being represented. This has to be interpreted accurately at the receiving end while the data is being converted to its object form, before the data is processed. This isn’t true of initialization or configuration files, because the application will know what format is required; in fact, the application must be tolerant of the human who is trying to configure it, and do its best to make sense of it.
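The ‘eight-foot person’ point is exactly the sort of check constraint that JSON Schema can carry along with the data. Here is a rough sketch, assuming the third-party jsonschema package is installed; the schema and its property names are invented for illustration:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# A made-up schema: heights travel as metres, and implausible values are rejected.
person_schema = {
    "type": "object",
    "properties": {
        "name":     {"type": "string"},
        "height_m": {"type": "number", "minimum": 0.3, "maximum": 2.5},
    },
    "required": ["name", "height_m"],
}

eight_footer = {"name": "A. Giant", "height_m": 2.72}  # roughly 8ft 11in

try:
    validate(instance=eight_footer, schema=person_schema)
except ValidationError as err:
    print("rejected:", err.message)  # 2.72 is greater than the schema's maximum of 2.5
```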
There is a temptation to devise a format that suits all uses and applications. The result is worthy but unusable for everyday use. I have a book by my bedside called XML in a Nutshell. Not even a coconut could accommodate that thick tome. INI and .CONF files are in such common use, and are so obviously useful, that it requires a certain insensitivity to assume they need improving.
Some great things have been achieved by personal initiative, such as JSON Schema and RFC 4180 for CSV, but generally the idea of making a change to a well-accepted standard is bad news.
Conclusion
It is easy to get carried away when writing an object notation. Markdown is really just an object notation where the author didn’t get carried away at all and had a clear idea of what his user-base wanted. YAML was wonderful for representing hierarchies in text but became prematurely flaccid with unnecessary complexity. JSON was exciting when it was proposed, because it was simple and easy to understand. I only wish it allowed comments; even so, I love it and use it with JSON Schema.
For most of the work I do, however, I already have the perfect object notation for structured data that can be used on Windows or Linux. It can be read easily. It is PowerShell’s object notation. It has a great deal of power and versatility. I wrote about it ten years ago on Simple-Talk in Getting Data Into and Out of PowerShell Objects.