Format Comparisons

Comparison to CBOR

CBOR is probably the closest in spirit to what e-NON wants to do. Some similarities:

CBOR uses the declarative notation concept to prefix values with type and size.
CBOR also has a “semantic tag” facility for metadata.
CBOR supports chunked data via the “indefinite concatenation” and “break code”.

There are a few other similarities, but some of the design choices are limiting.

First, the 3/5 split of the prefix byte is extremely wasteful due to redundancy. For example, 64 of the possible 256 prefixes are used just to introduce integers. Another 64 are reserved for strings. This means that half of the available prefix codes are consumed by 2 data types.

The problem seems to be the 3-bit pre-prefix, which divides the 256 prefixes into 8 “Major Type” zones of 32 values each. This forces redundancy. For example:

5 out of 8 zones reserve 0..23 as direct size-or-length values
all 8 zones reserve values 24-27 to define the length of the size indicator (1, 2, 4, 8 bytes, respectively)

That accounts for (5*24)+(8*4) = 152 prefixes that provide only 28 distinct values.
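
The arithmetic above can be checked with a short sketch. It uses only the figures quoted in the text; the constant names are illustrative and are not part of either format.

```python
# Count CBOR initial bytes (3-bit major type, 5-bit additional info)
# using the figures quoted in the text above.
ZONES = 8            # "Major Type" zones of 32 values each
DIRECT_ZONES = 5     # zones the text counts as reserving 0..23
DIRECT_VALUES = 24   # direct size-or-length values 0..23
LENGTH_CODES = 4     # values 24..27 -> size indicator of 1, 2, 4, 8 bytes

prefixes_used = DIRECT_ZONES * DIRECT_VALUES + ZONES * LENGTH_CODES
distinct_meanings = DIRECT_VALUES + LENGTH_CODES

print(prefixes_used, distinct_meanings)                   # 152 28
print(f"{prefixes_used / 256:.0%} of the prefix space")   # 59% of the prefix space
```

In other words, roughly three fifths of the 256 prefixes carry only 28 distinct meanings.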

Second, the 8 prefix divisions place arbitrary restrictions on prefix semantics. There are unused prefix values, but new prefixes have to fit into one of the 8 divisions. The “semantic tag” offers virtually unlimited extensions, but those are “second-class” prefixes, requiring extra bytes to specify.

e-NON takes a different approach.

e-NON interprets the 0..255 prefix range as follows:

range	ASCII block	e-NON usage
0x00..0x1F, 0x7F	control	33 control codes to introduce metadata, processing instructions, etc.
0x20..0x7E	printable	95 prefixes for data types or other uses
0x80..0xFF	extended	numeric data values defining the range -63..64
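
As a rough illustration of the table, the sketch below classifies a prefix byte into the three blocks and counts each block. It assumes the control block is the standard ASCII control range 0x00..0x1F plus 0x7F, which is what yields the 33 codes stated. The function name `classify_prefix` is hypothetical, and the sketch does not decode the numeric value carried by an extended byte, since that mapping is not given here.

```python
def classify_prefix(byte: int) -> str:
    """Return the e-NON usage block for a prefix byte, per the table above."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("prefix must fit in one byte")
    if byte <= 0x1F or byte == 0x7F:
        return "control"      # metadata, processing instructions, etc.
    if byte <= 0x7E:
        return "printable"    # data types or other uses
    return "extended"         # direct numeric values

# Count how many of the 256 prefixes fall into each block.
counts = {}
for b in range(256):
    kind = classify_prefix(b)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)  # {'control': 33, 'printable': 95, 'extended': 128}
```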

Redundancy is avoided.
Prefixes are defined only once and reused as needed.
Sizes and lengths always follow the prefix.
Sizes are never embedded in the prefix. As compared to CBOR, this requires an extra byte for types with sizes. However, this practice frees up prefix values for other uses. It also allows sizes up to 250 to be specified in a single byte. As a result, CBOR loses its 1-byte advantage for sizes 24..250.
Types have dedicated prefixes
The availability of 95 data prefix codes allows specialized types to have dedicated codes. Examples include big integer, big decimal, time. With CBOR, special data interpretations require extra bytes for “semantic tags”.
There are 128 direct numeric values
This is a slight improvement on the (very excellent) CBOR direct value concept. CBOR has 48 direct integer values [-24..23]: 0..23 as unsigned integers and -24..-1 as negative integers.
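
The size trade-off described above can be sketched as follows. `cbor_head_len` follows the CBOR rule that lengths below 24 fit in the initial byte and larger ones take 1, 2, 4 or 8 following bytes. `enon_head_len` is a hypothetical helper that models only the case stated in the text (a prefix byte followed by a one-byte size for sizes up to 250); larger e-NON sizes are deliberately not modelled.

```python
def cbor_head_len(n: int) -> int:
    """Bytes in a CBOR head (initial byte plus argument) for a length n."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 24:
        return 1          # length embedded in the initial byte
    if n < 2**8:
        return 2
    if n < 2**16:
        return 3
    if n < 2**32:
        return 5
    if n < 2**64:
        return 9
    raise ValueError("length does not fit in 8 bytes")


def enon_head_len(n: int) -> int:
    """Prefix byte plus a one-byte size; assumption: valid for n <= 250 only."""
    if not 0 <= n <= 250:
        raise ValueError("multi-byte sizes are not modelled in this sketch")
    return 2


for n in (0, 23, 24, 250):
    print(n, cbor_head_len(n), enon_head_len(n))
```

For sizes 0..23 CBOR is one byte shorter; from 24 through 250 both need two bytes.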

Comparison to BSON, BON, JSON-B, etc.

These formats are extensions to JSON. As a result, they still have text-based roots.

Comparison to JSON

Size

Container Overhead

container	entries (n)	e-NON (bytes)	JSON (bytes)	O() e-NON:JSON	notes
map	<=250	2n+3	4n+1	O(n):O(n)	Keys are strings; the figures include the string overhead for the keys. Assumes key length <= 250.
map	>65,535	2n+4	4n+1	O(n):O(n)	e-NON needs an extra byte to store the map size
map	>=2³¹	2n+11	4n+1	O(n):O(n)	e-NON needs extra bytes to store the map size
list	<=250	2	n+1	O(1):O(n)
list	>65,535	3	n+1	O(1):O(n)	e-NON needs an extra byte to store the list size
list	>=2³¹	10	n+1	O(1):O(n)	e-NON needs extra bytes to store the list size

The analysis above considers an e-NON stream with minimal use of features, only that which is necessary to replicate the equivalent JSON.

map
While both maps are O(n), the JSON map requires double the overhead of e-NON.
- e-NON has a 1-byte prefix and a 1-9 byte entry count. The e-NON map has one more byte for a map-id, even if mapping is not enabled. For keys shorter than 251 bytes, e-NON adds 2 bytes of overhead per key.
- JSON has the enclosing braces (2), 2 quote chars and a colon per key, and a comma for each entry except the last one.
list
- e-NON needs only a 2-10 byte prefix. There is no per-entry overhead.
- JSON has the enclosing brackets (2) and a comma for every entry except the last one.
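
The JSON figures can be verified empirically with Python's standard `json` module. The sketch below measures the structural bytes of compact JSON for flat containers of scalars with plain ASCII keys and at least one entry. The e-NON helpers simply restate the table's formulas for containers with at most 250 entries. All function names are illustrative.

```python
import json


def json_overhead(value) -> int:
    """Structural characters in compact JSON: total minus key and value text."""
    text = json.dumps(value, separators=(",", ":"))
    if isinstance(value, dict):
        # Key characters are payload; the quotes around them are overhead.
        payload = sum(len(k) + len(json.dumps(v)) for k, v in value.items())
    else:
        payload = sum(len(json.dumps(v)) for v in value)
    return len(text) - payload


def enon_map_overhead(n: int) -> int:
    """Table formula for maps with n <= 250 entries (assumed, not measured)."""
    return 2 * n + 3


def enon_list_overhead(n: int) -> int:
    """Table formula for lists with n <= 250 entries (assumed, not measured)."""
    return 2


n = 100
as_map = {f"k{i}": i for i in range(n)}
as_list = list(range(n))

print(json_overhead(as_map), enon_map_overhead(n))    # 401 203
print(json_overhead(as_list), enon_list_overhead(n))  # 101 2
```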

Container Optimizations

e-NON has some optimizations that can be applied. These optimizations don’t impact the overhead analysis above, but they can decrease the total bytes under certain conditions.

Map References
When enabled, an instance of any map is serialized only once. After that it is referenced by a 2-10 byte key. This can save a lot of bytes in a data set where some objects make multiple appearances. However, there will be some extra overhead when the object count exceeds 2¹⁶. This overhead may not be offset if there is little object reuse.
Glossary
The glossary will store references to map keys (and other things). In data sets where the same map keys appear multiple times, the glossary can