Speed up UTF8 validation by about 25%
Use a pseudo-SIMD (SWAR-like) trick to validate UTF-8 faster. There are essentially two improvements here:
- Use a table similar to `hexvals` to quickly get the length and knock out some of the invalid cases
- Check continuation bytes in parallel by packing them into an int and then applying a mask on all of them
All in all, we get a nice performance boost on UTF-8-heavy benchmarks. This is not just stats padding: some users have UTF-8-heavy data, so the gain seems worth the cost of making the code a tiny bit more complicated.
The idea here is not new. It was originally presented by Keiser and Lemire in “Validating UTF-8 In Less Than One Instruction Per Byte” (https://arxiv.org/pdf/2010.03090), and a practical C implementation of loading all bytes into an unsigned is from yyjson.
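The two improvements described in this change can be illustrated with a minimal sketch. This is not Jiffy's or yyjson's actual code: the names `utf8_len`, `seq_mask`, `seq_bits`, and `utf8_is_valid` are hypothetical, and the bytes are packed one at a time for clarity rather than loaded in a single read. The table maps a lead byte to its sequence length (0 means the byte can never start a sequence), and one mask-and-compare on the packed word checks the lead byte and every continuation byte together.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: sequence length implied by the lead byte.
 * 0 marks bytes that cannot start a sequence (0x80-0xC1, 0xF5-0xFF). */
static const unsigned char utf8_len[256] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4,0,0,0,0,0,0,0,0,0,0,0
};

/* Per-length masks over the packed word (lead byte in the low 8 bits).
 * Continuation bytes must match 10xxxxxx, so mask 0xC0 must yield 0x80. */
static const uint32_t seq_mask[5] = {0, 0, 0x0000C0E0u, 0x00C0C0F0u, 0xC0C0C0F8u};
static const uint32_t seq_bits[5] = {0, 0, 0x000080C0u, 0x008080E0u, 0x808080F0u};

/* Returns 1 if p[0..len) is well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const unsigned char *p, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned n = utf8_len[p[i]];
        uint32_t w = 0, cp;
        unsigned k;

        if (n == 0 || n > len - i) return 0;   /* bad lead or truncated */
        if (n == 1) { i++; continue; }         /* ASCII fast path */

        /* Pack the whole sequence into one 32-bit word. */
        for (k = 0; k < n; k++) w |= (uint32_t)p[i + k] << (8 * k);

        /* One comparison validates the lead and all continuation bytes. */
        if ((w & seq_mask[n]) != seq_bits[n]) return 0;

        /* Reject overlong forms, surrogates, and values above U+10FFFF. */
        if (n == 3) {
            cp = ((w & 0x0Fu) << 12) | (((w >> 8) & 0x3Fu) << 6)
               | ((w >> 16) & 0x3Fu);
            if (cp < 0x800u || (cp >= 0xD800u && cp <= 0xDFFFu)) return 0;
        } else if (n == 4) {
            cp = ((w & 0x07u) << 18) | (((w >> 8) & 0x3Fu) << 12)
               | (((w >> 16) & 0x3Fu) << 6) | ((w >> 24) & 0x3Fu);
            if (cp < 0x10000u || cp > 0x10FFFFu) return 0;
        }
        i += n;
    }
    return 1;
}
```

Two-byte sequences need no range check in this sketch because the table already rejects the overlong lead bytes 0xC0 and 0xC1.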
Jiffy - JSON NIFs for Erlang
A JSON parser as a NIF. This is a complete rewrite of the work I did in EEP0018 that was based on Yajl. This new version is a hand crafted state machine that does its best to be as quick and efficient as possible while not placing any constraints on the parsed JSON.
Usage
Jiffy is a simple API. The only thing that might catch you off guard is that the return type of `jiffy:encode/1` is an iolist even though it returns a binary most of the time.
A quick note on unicode. Jiffy only understands UTF-8 in binaries. End of story.
Errors are raised as error exceptions.
jiffy:decode/1,2
- `jiffy:decode(IoData)`
- `jiffy:decode(IoData, Options)`
The options for decode are:
- `return_maps` - Tell Jiffy to return objects using the maps data type on VMs that support it. This raises an error on VMs that don’t support maps.
- `{null_term, Term}` - Returns the specified `Term` instead of `null` when decoding JSON. This is for people that wish to use `undefined` instead of `null`.
- `use_nil` - Returns the atom `nil` instead of `null` when decoding JSON. This is shorthand for `{null_term, nil}`.
- `return_trailer` - If any non-whitespace is found after the first JSON term is decoded, the return value of decode/2 becomes `{has_trailer, FirstTerm, RestData::iodata()}`. This is useful to decode multiple terms in a single binary.
- `dedupe_keys` - If a key is repeated in a JSON object, this flag will ensure that the parsed object only contains a single entry containing the last value seen. This mirrors the parsing behavior of virtually every other JSON parser.
- `copy_strings` - Normally, when strings are decoded, they are created as sub-binaries of the input data. With some workloads, this leads to an undesirable bloating of memory: strings in the decode result keep a reference to the full JSON document alive. Setting this option will instead allocate new binaries for each string, so the original JSON document can be garbage collected even though the decode result is still in use.
- `{bytes_per_red, N}` where N >= 0 - This controls the number of bytes that Jiffy will process as an equivalent to a reduction. Each 20 reductions we consume 1% of our allocated time slice for the current process. When the Erlang VM indicates we need to return from the NIF.
- `{bytes_per_iter, N}` where N >= 0 - Backwards compatible option that is converted into the `bytes_per_red` value.
jiffy:encode/1,2
- `jiffy:encode(EJSON)`
- `jiffy:encode(EJSON, Options)`
where EJSON is a valid representation of JSON in Erlang according to the table below.
The options for encode are:
- `uescape` - Escapes UTF-8 sequences to produce a 7-bit clean output
- `pretty` - Produce JSON using two-space indentation
- `force_utf8` - Force strings to encode as UTF-8 by fixing broken surrogate pairs and/or using the replacement character to remove broken UTF-8 sequences in data.
- `use_nil` - Encodes the atom `nil` as `null`.
- `escape_forward_slashes` - Escapes the `/` character which can be useful when encoding URLs in some cases.
- `{bytes_per_red, N}` - Refer to the decode options
- `{bytes_per_iter, N}` - Refer to the decode options
Data Format
N.B. The last entry in this table is only valid for VMs that support the `maps` data type (i.e., 17.0 and newer) and client code must pass the `return_maps` option to `jiffy:decode/2`.
Improvements over EEP0018
Jiffy should be in all ways an improvement over EEP0018. It no longer imposes limits on the nesting depth. It is capable of encoding and decoding large numbers and it does quite a bit more validation of UTF-8 in strings.