Log-based transactional graph engine
LemonGraph is a log-based transactional graph (nodes/edges/properties) database engine that is backed by a single file. The primary use case is to support streaming seed set expansion. The core of the graph library is written in C, and the Python (2.x) layer adds friendly bindings, a query language, and a REST service. LemonGraph rides on top of (and inherits a lot of awesome from) Symas LMDB - a transactional key/value store that the OpenLDAP project developed to replace BerkeleyDB.
Using PyPy 4.0 on physical hardware (3 GHz X5670, plenty of RAM), under a single transaction:

| Seconds | Storage (MB) | Speed (per second) | Operation(s) |
|---------|--------------|--------------------|--------------|
| 4.5     | 96           | 222k nodes         | insert 1 million nodes |
| 11      | 204          | 153k properties    | above, plus setting a single property on each node |
| 410     | 1856         | 25k edges          | above, plus adding 10 million random edges |
Symas LMDB provides transactions, multi-process MVCC abilities (single writer, multiple non-blocking readers), nested write transactions, and rapid non-blocking binary snapshots. We also inherit some caveats:
The graph library on top supports:
The Python wrapper provides a friendly interface:
Note that the REST service cannot run on CentOS 6's Python 2.6, as we rely on the new
RHEL/CentOS:

```shell
yum install -y gcc gcc-c++ make libffi-devel zlib-devel python-devel python-setuptools
```

Debian/Ubuntu:

```shell
apt-get install libffi-dev zlib1g-dev python-dev python-cffi
```

Bootstrap setuptools if needed:

```shell
curl https://bootstrap.pypa.io/ez_setup.py | python -
curl https://bootstrap.pypa.io/ez_setup.py | pypy -
```

Then install:

```shell
python setup.py install
```

(or, you know, pypy)

Or, to run without a proper installation, manually install the dependencies:

```shell
easy_install lazy pysigset msgpack-python python-dateutil ujson
easy_install lazy pysigset msgpack-python python-dateutil
```
```python
import LemonGraph
import os
import sys
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# open graph object
g = LemonGraph.Graph(path)

# enter a write transaction
with g.transaction(write=True) as txn:
    # create a couple of nodes
    node1 = txn.node(type='foo', value='bar')
    node2 = txn.node(type='foo', value='baz')

    # create an edge between them
    edge1 = txn.edge(src=node1, tgt=node2, type='foo', value='foobar')

    # set some properties
    node1['prop1'] = 'propval1'
    node2['prop2'] = 'propval2'
    node2['prop3'] = 'propval3'
    edge1['prop4'] = 'propval4'

    # set a graph property
    txn['thing1'] = 'thing2'

    # print out nodes and edges
    for n in txn.nodes():
        print n
    for e in txn.edges():
        print e

    b4_delete = txn.lastID

    # delete a property
    del node1['prop1']

    # delete a node - cascades to edges and properties
    node2.delete()

print

with g.transaction(write=False) as txn:
    # run an ad-hoc query before delete
    print "ad-hoc query: nodes before deletions"
    for chain in txn.query('n()', stop=b4_delete):
        print chain
    print

    # run an ad-hoc query
    print "ad-hoc query: nodes after deletions"
    for chain in txn.query('n()'):
        print chain
    print

    # run streaming queries
    print "streaming query: nodes/edges"
    for q, chain in txn.mquery(['n(prop3)', 'e()', 'n()->n()'], start=1):
        print q, "=>", chain
    print

    # dump the internal graph log to stdout
    print "dump:"
    txn.dump()

# delete graph artifacts from disk
g.delete()
```
Query a (potentially active) graph interactively:

```shell
python -mLemonGraph.dbtool path/to/foo.db
```

Snapshot a (potentially active) database:

```shell
python -mLemonGraph.snapshot path/to/foo.db > copy.db
```
Run the REST service (use `-h` to list options):

```shell
python -mLemonGraph.server <options>
```

The REST service maintains an indexed cache of basic graph information. Every N milliseconds (`-d` option), it will check for and flush updated graphs to disk.

For greater throughput, use `-s` at the expense of possible data loss in the event of an OS/hardware/power-related failure.

By default, it will run (`-w`) worker processes to handle HTTP requests.
This query language is designed to query for arbitrarily long node-edge-node chains in the graph, and supports two querying styles:
Note that in streaming mode, nodes and edges are captured as soon as they match the required filters. Any additional properties that are added later will not be present in the results.
A query pattern must specify at least one node or edge, but many may be chained together:
Results are returned as chains (lists) of the requested nodes and/or edges.
Arrow heads may be added to perform directed queries, and nodes or edges may be inferred. Inferred components are not included in the result set. To manually omit non-inferred components, prepend an 'at' sign to the node or edge token. These two queries are identical:
By default, nodes and edges must be unique in a returned chain. To suppress that for a given slot in the query pattern, upper-case the node/edge token:
To filter (and potentially accelerate) queries, a list of property filters may be provided inside the parentheses for nodes and edges in the query:
Property filters must specify at least a key, and optionally an operator with value[s].
Simple keys (consisting entirely of 'word' characters, not starting with a digit) may be specified as barewords; otherwise they must be quoted. Nested keys for objects are supported - separate key components with unquoted dots.
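Nested key lookup can be pictured as walking a property's object value by dot-separated key components. The `resolve` helper below is purely illustrative (it is not part of LemonGraph's API), but it sketches the idea:

```python
def resolve(prop, key):
    """Walk a nested property value by dot-separated key components.
    Hypothetical helper - illustrates the concept, not LemonGraph internals."""
    value = prop
    for part in key.split('.'):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

# a property whose value is a nested object
prop = {'geo': {'country': 'US', 'city': 'Baltimore'}}

print(resolve(prop, 'geo.city'))     # nested key 'geo.city'
print(resolve(prop, 'geo.missing'))  # absent component
```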
Comparison operators (such as `>=`) translate directly into Python comparisons. For the equality operator (`=`), values can be:
For the regex operator (`~`), patterns are Python-flavored, in Perl-style notation. Supported flags:

- `i` - case insensitive
- `s` - dot matches newlines
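Since the patterns are Python-flavored, the `i` and `s` flags presumably correspond to Python's `re.IGNORECASE` and `re.DOTALL` - a small illustration using the standard `re` module:

```python
import re

# 'i' - case insensitive
assert re.search('foo', 'FOOBAR', re.IGNORECASE)

# 's' - dot matches newlines
assert re.search('a.b', 'a\nb', re.DOTALL)

# without 's', '.' stops at the newline
assert not re.search('a.b', 'a\nb')
```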
- `number` - any numeric type
- `n()` - select all nodes
- `e()` - select all edges
- `n()-e()-n()` - select all node/edge/node chains
- `n()-n()` - select node-to-node chains without the edge
- `n()->n()<-n()` - directed queries
One or more additional filters may optionally be appended to a query pattern. The contents of each filter are effectively merged with the target node[s]/edge[s].
Indexing is one-based:
`n(type="foo")-n(), 1(value!="bar"), 2(blah)`
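Conceptually, an appended filter like `2(blah)` merges its property filters into the one-based pattern slot it targets. A hypothetical sketch of that merge (the dict representation and `merge_filters` helper are illustrative, not LemonGraph internals):

```python
def merge_filters(pattern_slots, extra):
    """Merge additional property filters into one-based pattern slots.
    Hypothetical illustration of filter merging, not LemonGraph internals."""
    merged = [dict(slot) for slot in pattern_slots]
    for index, filters in extra.items():
        merged[index - 1].update(filters)  # indexing is one-based
    return merged

# n(type="foo")-n(), 1(value!="bar"), 2(blah)
slots = [{'type=': 'foo'}, {}]
extra = {1: {'value!=': 'bar'}, 2: {'blah': None}}
print(merge_filters(slots, extra))
```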
Bareword aliases are supported as well - an alias can refer to one or more nodes or edges, and additional filters can be attached via the alias:
Internally, aliases are case-folded, but if either the alias or the additional filter in the query contains upper case characters, it is an error for the other not to be present.
- `n:blah()` (alias is ignored)
- `n(), blah()` (filter is ignored)
- `n:blah(), blah()` (filter is applied to aliased node)
- `n:Blah(), BLAH()` (filter is applied to aliased node)
- `n(), Blah()` (missing alias)
LMDB is an MVCC transactional key/value store (btree) and supports the concept of sub-tables. We use several tables to accomplish our goals:
Together, these provide an on-disk hash table for byte sequences - records in these tables are never deleted or updated:
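The append-only scalar tables behave like an intern table: a byte sequence maps to a stable integer ID, and the same bytes always yield the same ID. A minimal in-memory sketch of that idea (LemonGraph does this on disk via LMDB sub-tables; the class below is illustrative only):

```python
class ScalarTable(object):
    """In-memory sketch of an intern table: bytes -> stable ID, never deleted."""
    def __init__(self):
        self._by_value = {}
        self._by_id = []

    def intern(self, value):
        try:
            return self._by_value[value]
        except KeyError:
            self._by_id.append(value)
            new_id = len(self._by_id)  # IDs start at 1
            self._by_value[value] = new_id
            return new_id

    def lookup(self, scalar_id):
        return self._by_id[scalar_id - 1]

table = ScalarTable()
print(table.intern(b'foo'))  # first insert
print(table.intern(b'foo'))  # same bytes, same ID
print(table.intern(b'bar'))  # new bytes, new ID
```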
Actual graph data is stored in a log - records are never deleted, but a numeric field is updated in the event of deletion or property value update:
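The log can be pictured as an append-only list of records, where "deleting" means updating a single numeric field on the old record to point at the log entry that superseded it - a hedged sketch of the concept (record layout here is invented for illustration):

```python
class GraphLog(object):
    """Sketch of an append-only log; deletion only updates a numeric field."""
    def __init__(self):
        self.records = []

    def append(self, record):
        record['ID'] = len(self.records) + 1
        record['deleted_by'] = 0  # 0 => live
        self.records.append(record)
        return record['ID']

    def delete(self, rec_id):
        # append a deletion entry; the old record remains, only its
        # numeric 'deleted_by' field is updated to reference it
        marker = self.append({'op': 'delete', 'target': rec_id})
        self.records[rec_id - 1]['deleted_by'] = marker

log = GraphLog()
node = log.append({'op': 'node', 'type': 'foo', 'value': 'bar'})
log.delete(node)
print(log.records[0]['deleted_by'])  # points at the deletion entry; record kept
```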
Track stats for write transactions that grow the log:
To provide speedy operations, graph data is indexed several ways (we reserve the right to add more) - records are never deleted or updated:
We also provide a non-logged domain/key => value storage interface. Domains are mapped to IDs using the above scalar storage, and keys and values can be as well. Mapping keys and/or values can result in less overall storage, but anything mapped never truly goes away. For non-mapped keys, maximum key length is limited to about 500 bytes. Unlike the others, deletes may be performed against this table:
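A hedged in-memory sketch of the domain/key => value idea: domains are interned to IDs (and, once mapped, never removed), while lookups and deletes operate on (domain ID, key) pairs. Class and method names are illustrative, not LemonGraph's API:

```python
class DomainKV(object):
    """Illustrative domain-scoped key/value store; deletes are allowed here,
    unlike in the append-only graph log tables."""
    def __init__(self):
        self._domains = {}  # domain name -> domain ID (interned, never removed)
        self._data = {}     # (domain ID, key) -> value

    def _domain_id(self, domain):
        return self._domains.setdefault(domain, len(self._domains) + 1)

    def put(self, domain, key, value):
        self._data[(self._domain_id(domain), key)] = value

    def get(self, domain, key):
        return self._data.get((self._domain_id(domain), key))

    def delete(self, domain, key):
        del self._data[(self._domain_id(domain), key)]

kv = DomainKV()
kv.put('jobs', 'status', 'running')
print(kv.get('jobs', 'status'))
kv.delete('jobs', 'status')
print(kv.get('jobs', 'status'))  # None after delete; the domain ID persists
```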