lemire / simdjson
- Friday, February 22, 2019, at 00:17:17
C++
Parsing gigabytes of JSON per second
JSON documents are everywhere on the Internet. Servers spend a lot of time parsing these documents. We want to accelerate the parsing of JSON per se using commonly available SIMD instructions as much as possible while doing full validation (including character encoding).
We can use a quarter or fewer instructions than a state-of-the-art parser like RapidJSON, and half as many as sajson. To our knowledge, simdjson is the first fully-validating JSON parser to run at gigabytes per second on commodity processors.
On a Skylake processor, the parsing speeds (in GB/s) of various parsers on the twitter.json file are as follows.
parser | GB/s |
---|---|
simdjson | 2.2 |
RapidJSON encoding-validation | 0.51 |
RapidJSON encoding-validation, insitu | 0.71 |
sajson (insitu, dynamic) | 0.70 |
sajson (insitu, static) | 0.97 |
dropbox | 0.14 |
fastjson | 0.26 |
gason | 0.85 |
ultrajson | 0.42 |
jsmn | 0.28 |
cJSON | 0.34 |
This code is made available under the Apache License 2.0.
Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.
#include "simdjson/jsonparser.h"
// ...
const char * filename = ... //
// use whatever means you want to get a string of your JSON document
std::string_view p = get_corpus(filename);
ParsedJson pj;
pj.allocateCapacity(p.size()); // allocate memory for parsing up to p.size() bytes
bool is_ok = json_parse(p, pj); // do the parsing, return false on error
// parsing is done!
// You can safely delete the string content
aligned_free((void*)p.data());
// the ParsedJson document can be used here
// pj can be reused with other json_parse calls.
It is also possible to use a simpler API if you do not mind having the overhead of memory allocation with each new JSON document:
#include "simdjson/jsonparser.h"
// ...
const char * filename = ... //
std::string_view p = get_corpus(filename);
ParsedJson pj = build_parsed_json(p); // do the parsing
// you no longer need p at this point, can do aligned_free((void*)p.data())
if( ! pj.isValid() ) {
  // something went wrong
}
See the "singleheader" repository for a single header version. See the included file "amalgamation_demo.cpp" for usage. This requires no specific build system: just copy the files into your project, within your include path. You can then include them quite simply:
#include <iostream>
#include "simdjson.h"
#include "simdjson.cpp"
int main(int argc, char *argv[]) {
  const char * filename = argv[1];
  std::string_view p = get_corpus(filename);
  ParsedJson pj = build_parsed_json(p); // do the parsing
  if( ! pj.isValid() ) {
    std::cout << "not valid" << std::endl;
  } else {
    std::cout << "valid" << std::endl;
  }
  return EXIT_SUCCESS;
}
Note: In some settings, it might be desirable to precompile simdjson.cpp instead of including it.
Requirements: recent clang or gcc, and make. We recommend at least GNU GCC/G++ 7 or LLVM clang 6. A system like Linux or macOS is expected.
To test:
make
make test
To run benchmarks:
make parse
./parse jsonexamples/twitter.json
Under Linux, the parse command gives a detailed analysis of the performance counters.
To run comparative benchmarks (with other parsers):
make benchmark
Requirements: We require a recent version of cmake. On macOS, the easiest way to install cmake might be to use brew and then type
brew install cmake
There is an equivalent brew on Linux which works the same way as well.
You need a recent compiler like clang or gcc. We recommend at least GNU GCC/G++ 7 or LLVM clang 6. For example, you can install a recent compiler with brew:
brew install gcc@8
Optional: You need to tell cmake which compiler you wish to use by setting the CC and CXX variables. Under bash, you can do so with commands such as export CC=gcc-7 and export CXX=g++-7.
Building: While in the project repository, do the following:
mkdir build
cd build
cmake ..
make
make test
CMake will build a library. By default, it builds a shared library (e.g., libsimdjson.so on Linux).
You can build a static library:
mkdir buildstatic
cd buildstatic
cmake -DSIMDJSON_BUILD_STATIC=ON ..
make
make test
In some cases, you may want to specify your compiler, especially if the default compiler on your system is too old. You may proceed as follows:
brew install gcc@8
mkdir build
cd build
export CXX=g++-8 CC=gcc-8
cmake ..
make
make test
We are assuming that you have a common Windows PC with at least Visual Studio 2017, and an x64 processor with AVX2 support (2013 Haswell or later).
- Install cmake and make sure that cmake is made available from the command line. Please choose a recent version of cmake.
- Create a subdirectory within the project, such as VisualStudio, and go to it using a shell.
- Type cmake -DCMAKE_GENERATOR_PLATFORM=x64 .. in the shell while in the VisualStudio directory. (Alternatively, if you want to build a DLL, you may use the command line cmake -DCMAKE_GENERATOR_PLATFORM=x64 -DSIMDJSON_BUILD_STATIC=OFF .. instead.)
- This command creates a Visual Studio solution file (e.g., simdjson.sln). Open this file in Visual Studio. You should now be able to build the project and run the tests. For example, in the Solution Explorer window (available from the View menu), right-click ALL_BUILD and select Build. To test the code, still in the Solution Explorer window, select RUN_TESTS and select Build.

json2json mydoc.json parses the document, constructs a model and then dumps back the result to standard output.
json2json -d mydoc.json parses the document, constructs a model and then dumps the model (as a tape) to standard output. The tape format is described in the accompanying file tape.md.
minify mydoc.json minifies the JSON document, outputting the result to standard output. Minifying means removing the unneeded white space characters.

We provide a fast parser. It fully validates the input according to the various specifications. The parser builds a useful immutable (read-only) DOM (document-object model) which can be later accessed.
To simplify the engineering, we make some assumptions.
We do not aim to provide a general-purpose JSON library. A library like RapidJSON offers much more than just parsing: it helps you generate JSON and offers various other convenient functions. We merely parse the document.
... long or a C/C++ long long. Among the parsers that differentiate between integers and floating-point numbers, not all support 64-bit integers. (For example, sajson rejects JSON files with integers larger than or equal to 2147483648. RapidJSON will parse a file containing an overly long integer like 18446744073709551616 as a floating-point number.) When we cannot represent exactly an integer as a signed 64-bit value, we reject the JSON document.
... [0e+] as valid JSON.)
The parser works in three stages:
Here is a code sample to dump back the parsed JSON to a string:
ParsedJson::iterator pjh(pj);
if (!pjh.isOk()) {
  std::cerr << " Could not iterate parsed result. " << std::endl;
  return EXIT_FAILURE;
}
compute_dump(pjh); // pass the iterator, not the ParsedJson itself
//
// where compute_dump is :
void compute_dump(ParsedJson::iterator &pjh) {
  if (pjh.is_object()) {
    std::cout << "{";
    if (pjh.down()) {
      pjh.print(std::cout); // must be a string
      std::cout << ":";
      pjh.next();
      compute_dump(pjh); // let us recurse
      while (pjh.next()) {
        std::cout << ",";
        pjh.print(std::cout);
        std::cout << ":";
        pjh.next();
        compute_dump(pjh); // let us recurse
      }
      pjh.up();
    }
    std::cout << "}";
  } else if (pjh.is_array()) {
    std::cout << "[";
    if (pjh.down()) {
      compute_dump(pjh); // let us recurse
      while (pjh.next()) {
        std::cout << ",";
        compute_dump(pjh); // let us recurse
      }
      pjh.up();
    }
    std::cout << "]";
  } else {
    pjh.print(std::cout); // just print the lone value
  }
}
The following function will find all user.id integers:
void simdjson_traverse(std::vector<int64_t> &answer, ParsedJson::iterator &i) {
  switch (i.get_type()) {
  case '{':
    if (i.down()) {
      do {
        bool founduser = equals(i.get_string(), "user");
        i.next(); // move to value
        if (i.is_object()) {
          if (founduser && i.move_to_key("id")) {
            if (i.is_integer()) {
              answer.push_back(i.get_integer());
            }
            i.up();
          }
          simdjson_traverse(answer, i);
        } else if (i.is_array()) {
          simdjson_traverse(answer, i);
        }
      } while (i.next());
      i.up();
    }
    break;
  case '[':
    if (i.down()) {
      do {
        if (i.is_object_or_array()) {
          simdjson_traverse(answer, i);
        }
      } while (i.next());
      i.up();
    }
    break;
  case 'l':
  case 'd':
  case 'n':
  case 't':
  case 'f':
  default:
    break;
  }
}
If you want to see how a wide range of parsers validate a given JSON file:
make allparserscheckfile
./allparserscheckfile myfile.json
For performance comparisons:
make parsingcompetition
./parsingcompetition myfile.json
For broader comparisons:
make allparsingcompetition
./allparsingcompetition myfile.json
Inspiring links:
Validating UTF-8 takes no more than 0.7 cycles per byte:
A JSON parser transforms a JSON text into another representation. A JSON parser MUST accept all texts that conform to the JSON grammar. A JSON parser MAY accept non-JSON forms or extensions. An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.
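The permission to "set limits on the range and precision of numbers" is what allows the rule stated earlier: an integer that cannot be represented exactly as a signed 64-bit value causes the document to be rejected. The following is a minimal scalar sketch of such a range check. The helper name parse_int64 is hypothetical and is not part of the simdjson API, and the sketch checks only the range, not the full JSON number grammar (for instance, it does not reject leading zeros).

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical helper, for illustration only (not a simdjson function).
// Parses an optional '-' followed by decimal digits. Returns std::nullopt
// when the text is malformed or the value does not fit in a signed 64-bit
// integer.
std::optional<int64_t> parse_int64(std::string_view text) {
  const bool negative = !text.empty() && text.front() == '-';
  if (negative) text.remove_prefix(1);
  if (text.empty()) return std::nullopt;

  // Accumulate as a non-positive number so that INT64_MIN is reachable.
  const int64_t min = std::numeric_limits<int64_t>::min();
  int64_t acc = 0;
  for (char c : text) {
    if (c < '0' || c > '9') return std::nullopt;
    const int digit = c - '0';
    // acc * 10 - digit would fall below INT64_MIN: out of range.
    if (acc < (min + digit) / 10) return std::nullopt;
    acc = acc * 10 - digit;
  }
  if (negative) return acc;
  // The magnitude of INT64_MIN has no positive counterpart.
  if (acc == min) return std::nullopt;
  return -acc;
}
```

With this sketch, parse_int64("2147483648") yields a value, while parse_int64("18446744073709551616") yields std::nullopt, which a caller would treat as a reason to reject the document.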
All JSON is Javascript but NOT all Javascript is JSON. So {property:1} is invalid because property does not have double quotes around it. {'property':1} is also invalid, because it's single quoted while the only thing that can placate the JSON specification is double quoting. JSON is even fussy enough that {"property":.1} is invalid too, because you should have of course written {"property":0.1}. Also, don't even think about having comments or semicolons, you guessed it: they're invalid. (credit:https://github.com/elzr/vim-json)
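To make the number rules concrete, here is a small scalar sketch of a validator for the JSON number grammar (optional minus sign, integer part without leading zeros, optional fraction, optional exponent). The function name is_valid_json_number is an assumption made for this illustration; it is not how simdjson validates numbers internally.

```cpp
#include <cstddef>
#include <string_view>

// Illustrative only: checks that the whole string matches the JSON grammar
//   -? (0 | [1-9][0-9]*) (\. [0-9]+)? ([eE] [+-]? [0-9]+)?
bool is_valid_json_number(std::string_view s) {
  std::size_t i = 0;
  auto digit_at = [&](std::size_t k) {
    return k < s.size() && s[k] >= '0' && s[k] <= '9';
  };

  if (i < s.size() && s[i] == '-') ++i;

  // Integer part: a lone 0, or a nonzero digit followed by more digits.
  if (!digit_at(i)) return false;
  if (s[i] == '0') {
    ++i;
  } else {
    while (digit_at(i)) ++i;
  }

  // Optional fraction: '.' must be followed by at least one digit.
  if (i < s.size() && s[i] == '.') {
    ++i;
    if (!digit_at(i)) return false;
    while (digit_at(i)) ++i;
  }

  // Optional exponent: an optional sign, then at least one digit.
  if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
    ++i;
    if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;
    if (!digit_at(i)) return false;
    while (digit_at(i)) ++i;
  }

  // Anything left over (for example a second digit after a leading 0) is invalid.
  return i == s.size();
}
```

Under this sketch, "0.1" is accepted while ".1" and "0e+" are rejected, matching the examples given in the text.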
The structural characters are:
begin-array = [ left square bracket
begin-object = { left curly bracket
end-array = ] right square bracket
end-object = } right curly bracket
name-separator = : colon
value-separator = , comma
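As an illustration of the six structural characters listed above, the following scalar sketch marks their positions in a block of up to 64 bytes as bits of a 64-bit word. The names is_structural and structural_mask are hypothetical, and this is not the simdjson implementation: it is a plain loop rather than SIMD code, and it deliberately ignores the fact that a real parser must not treat characters inside quoted strings as structural.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// True for the six JSON structural characters: [ { ] } : ,
bool is_structural(char c) {
  switch (c) {
  case '[':
  case '{':
  case ']':
  case '}':
  case ':':
  case ',':
    return true;
  default:
    return false;
  }
}

// Illustrative helper: bit i of the result is set when block[i] is a
// structural character. Only the first 64 bytes are examined, and quoted
// strings are NOT excluded (a simplification made for this sketch).
uint64_t structural_mask(std::string_view block) {
  const std::size_t n = block.size() < 64 ? block.size() : 64;
  uint64_t mask = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (is_structural(block[i])) {
      mask |= uint64_t{1} << i;
    }
  }
  return mask;
}
```

For the input [1,2] the structural characters sit at positions 0, 2 and 4, so the mask has exactly those three bits set.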
A character is pseudo-structural if and only if:
This helps as we redefine some new characters as pseudo-structural such as the characters 1, 1, G, n in the following:
{ "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }