jbellis / jvector
- Saturday, October 7, 2023, 00:00:03
JVector: the most advanced embedded vector search engine
JVector is a pure Java, zero dependency, embedded vector search engine, used by DataStax Astra DB and (soon) Apache Cassandra.
What is JVector?
JVector vs Lucene searching the Deep100M dataset (about 35GB of vectors and 25GB index):
JVector scales updates linearly to at least 32 threads:
Adding to your project. Replace ${latest-version} with the latest version from Maven Central, e.g. <version>1.0.1</version>:
<dependency>
    <groupId>io.github.jbellis</groupId>
    <artifactId>jvector</artifactId>
    <!-- Use the latest version from https://central.sonatype.com/artifact/io.github.jbellis/jvector -->
    <version>${latest-version}</version>
</dependency>
Building the index (a short build sketch in Java follows this list):
- GraphIndexBuilder is the entry point for building a graph. You will need to implement RandomAccessVectorValues to provide vectors to the builder; ListRandomAccessVectorValues is a good starting point.
- Call build() and it will parallelize the build across all available cores. Otherwise you can call addGraphNode as you add vectors; this is non-blocking and can be called concurrently from multiple threads.
- Call GraphIndexBuilder.complete when you are done adding vectors. This will optimize the index and make it ready to write to disk. (Graphs that are in the process of being built can be searched at any time; you do not have to call complete first.)
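A minimal build sketch, assuming the 1.0-era API described above; the package locations, the toy data, and the constructor parameters (degree, beam width, neighbor overflow, alpha) are illustrative assumptions, so check the GraphIndexBuilder javadoc for your version:

import io.github.jbellis.jvector.graph.GraphIndexBuilder;
import io.github.jbellis.jvector.graph.ListRandomAccessVectorValues;
import io.github.jbellis.jvector.vector.VectorEncoding;
import io.github.jbellis.jvector.vector.VectorSimilarityFunction;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy data: 1,000 random 4-dimensional vectors.
var random = new Random(42);
List<float[]> vectors = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    var v = new float[4];
    for (int j = 0; j < 4; j++) v[j] = random.nextFloat();
    vectors.add(v);
}
var ravv = new ListRandomAccessVectorValues(vectors, 4);

// Degree (16), beam width (100), neighbor overflow (1.2), and alpha (1.4) are illustrative values.
var builder = new GraphIndexBuilder<>(ravv, VectorEncoding.FLOAT32,
        VectorSimilarityFunction.COSINE, 16, 100, 1.2f, 1.4f);

// build() parallelizes across all available cores; alternatively, call
// addGraphNode per vector from multiple threads and complete() when finished.
var graph = builder.build();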
Searching the index (a search sketch follows this list):
- GraphSearcher is the entry point for searching. Results come back as a SearchResult object that contains node IDs and scores, in descending order of similarity to the query vector. GraphSearcher objects are re-usable, so unless you have a very simple use case you should use GraphSearcher.Builder to create them; GraphSearcher::search is also available with simple defaults, but calling it will instantiate a new GraphSearcher every time, so performance will be worse.
- The returned node IDs are the ordinals of the vectors in the RandomAccessVectorValues you provided. You can get the original vector back with GraphIndex.getVector, if necessary, but since this is a disk-backed index you should design your application to avoid doing so if possible.
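Continuing the build sketch above, a minimal search, again assuming the 1.0-era API (the exact search(...) parameters and the SearchResult accessors may differ between versions):

import io.github.jbellis.jvector.graph.GraphSearcher;
import io.github.jbellis.jvector.graph.NeighborSimilarity;
import io.github.jbellis.jvector.vector.VectorSimilarityFunction;

var queryVector = new float[] {0.1f, 0.2f, 0.3f, 0.4f};

// Reusable searcher over a view of the graph built above.
var searcher = new GraphSearcher.Builder<>(graph.getView()).build();

// Exact scores computed against the original vectors in ravv.
NeighborSimilarity.ExactScoreFunction scoreFunction =
        node -> VectorSimilarityFunction.COSINE.compare(queryVector, ravv.vectorValue(node));

// topK = 10, no re-ranker, no ordinal filter.
var result = searcher.search(scoreFunction, null, 10, null);
for (var nodeScore : result.getNodes()) {
    System.out.println(nodeScore.node + " -> " + nodeScore.score);
}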
JVector implements DiskANN-style search, meaning that vectors can be compressed using product quantization so that searches can be performed using the compressed representation that is kept in memory. You can enable this with the following steps (a compression sketch follows the list):
1. Create a ProductQuantization object with your vectors using ProductQuantization.compute. This will take some time to compute the codebooks.
2. Use ProductQuantization::encode or encodeAll to encode your vectors.
3. Create a CompressedVectors object from the encoded vectors.
4. Create a NeighborSimilarity.ApproximateScoreFunction for your query that uses the ProductQuantization object and CompressedVectors to compute scores, and pass this to the GraphSearcher.search method.
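A rough compression sketch continuing the example, with the 1.0-era API assumed throughout; the subspace count is arbitrary, and approximateScoreFunctionFor is an assumed convenience that may be named differently (or be absent) in your version, in which case implement NeighborSimilarity.ApproximateScoreFunction yourself on top of the ProductQuantization and CompressedVectors objects:

import io.github.jbellis.jvector.pq.CompressedVectors;
import io.github.jbellis.jvector.pq.ProductQuantization;

// Train the codebooks and encode the original vectors (ravv and vectors from above).
var pq = ProductQuantization.compute(ravv, 2, true); // 2 subspaces, globally centered
var encoded = pq.encodeAll(vectors);
var cv = new CompressedVectors(pq, encoded);

// Approximate scores are computed from the compressed representation held in memory.
// approximateScoreFunctionFor is an assumed helper; see the note above.
NeighborSimilarity.ApproximateScoreFunction approximateScores =
        cv.approximateScoreFunctionFor(queryVector, VectorSimilarityFunction.COSINE);
var approximateResult = searcher.search(approximateScores, null, 10, null);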
OnDiskGraphIndex and CompressedVectors have write() methods to save state to disk. They initialize from disk using their constructor and load() methods, respectively. Writing just requires a DataOutput, but reading requires an implementation of RandomAccessReader and the related ReaderSupplier to wrap your preferred i/o class for best performance; see SimpleMappedReader and SimpleMappedReaderSupplier for an example.
You can wrap OnDiskGraphIndex in a CachingGraphIndex to keep the most commonly accessed nodes (the ones nearest to the graph entry point) in memory.
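Writing the example's index and compressed vectors to disk might look like the following; this is a sketch, and the static OnDiskGraphIndex.write form with its argument order is an assumption based on the 1.0-era API:

import io.github.jbellis.jvector.disk.OnDiskGraphIndex;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Any DataOutput works for writing; a DataOutputStream over a file is the simplest choice.
Path indexPath = Path.of("example.index");
Path pqPath = Path.of("example.pq");
try (var out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(indexPath)))) {
    OnDiskGraphIndex.write(graph, ravv, out); // assumed signature: graph, its vectors, DataOutput
}
try (var out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(pqPath)))) {
    cv.write(out); // CompressedVectors.write takes a DataOutput
}
// To read them back, wrap your preferred i/o in a RandomAccessReader/ReaderSupplier
// (see SimpleMappedReader and SimpleMappedReaderSupplier), construct the OnDiskGraphIndex
// via its constructor and the CompressedVectors via load(), and optionally wrap the
// index in a CachingGraphIndex.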
The SiftSmall class demonstrates how to put all of the above together to index and search the "small" SIFT dataset of 10,000 vectors.
The Bench class performs a grid search across the GraphIndexBuilder parameter space to find the best tradeoffs between recall and throughput. You can use plot_output.py to graph the pareto-optimal points found by Bench.
Some sample KNN datasets for testing, based on ada-002 embeddings generated on Wikipedia data, are available in ivec/fvec format at:
aws s3 ls s3://astra-vector/wikipedia/ --no-sign-request
PRE 100k/
PRE 1M/
PRE 4M/
Download them with the aws s3 CLI as follows:
aws s3 sync s3://astra-vector/wikipedia/100k ./ --no-sign-request
This project is organized as a multimodule Maven build. The intent is to produce a multirelease jar suitable for use as
a dependency from any Java 11 code. When run on a Java 20+ JVM with the Vector module enabled, optimized vector
providers will be used. In general, the project is structured to be built with JDK 20+, but when JAVA_HOME is set to Java 11 through Java 19, certain build features will still be available.
Base code is in jvector-base and will be built for Java 11 releases, restricting language features and APIs appropriately. Code in jvector-twenty will be compiled for Java 20 language features/APIs and included in the final multirelease jar targeting supported JVMs. jvector-multirelease packages jvector-base and jvector-twenty as a multirelease jar for release. jvector-examples is an additional sibling module that uses the reactor-representation of jvector-base/jvector-twenty to run example code.
You can run SiftSmall and Bench directly to get an idea of what all is going on here. Bench requires some datasets to be downloaded from https://github.com/erikbern/ann-benchmarks. The files used by SiftSmall can be found in the siftsmall directory in the project root.
To run either class, you can use the Maven exec-plugin via the following incantations:
mvn compile exec:exec@bench
or for Sift:
mvn compile exec:exec@sift
To run Sift/Bench without the JVM vector module available, you can use the following invocations:
mvn -Pjdk11 compile exec:exec@bench
mvn -Pjdk11 compile exec:exec@sift
The ... -Pjdk11 invocations will also work with JAVA_HOME pointing at a Java 11 installation.
To release, configure ~/.m2/settings.xml to point to OSSRH and run mvn -Prelease clean deploy.