lacuna / bifurcan
- среда, 10 октября 2018 г. в 00:22:08
Java
impure functional data structures
<dependency>
<groupId>io.lacuna</groupId>
<artifactId>bifurcan</artifactId>
<version>0.1.0-alpha6</version>
</dependency>
This library provides high-quality Java implementations of mutable and immutable data structures, each sharing a common API and these design principles:
Rather than using the existing collection interfaces in java.util
such as List
or Map
, it provides its own interfaces (IList
, IMap
, ISet
) that provide functional semantics - each update to a collection returns a reference to a new collection. Each interface provides a method (toList
, toMap
, toSet
) for coercing the collection to a read-only version of the standard Java interfaces.
An in-depth comparison of Bifurcan to similar libraries on the JVM can be found here.
java.util.HashMap
for larger collections.merge
, union
, difference
, intersection
) significantly faster.String
, it uses UTF-8 encoding and can efficiently index via both full code points and Java's preferred UTF-16 code units.Full documentation can be found here.
If we pass a mutable data structure into a function, we have to know what that function intends to do with it. If it updates the data structure, we cannot safely read from it afterwards. If it stores the data structure, we cannot safely write to it afterwards. In other words, to use a mutable data structure safely we must ensure it has a single owner. Enforcing this may require us to hold a large amount of code in our head at once.
Immutable data structures free us from having to care. Functions can update or hold onto data without any risks. We can reason locally about the flow of data, without any care for what the rest of the code is doing. This can be enormously freeing.
This freedom, however, comes at a cost. Updates to immutable data structures require a subset of the structure to be copied, which is much more expensive than simply overwriting existing memory.
If a data structure is referenced in multiple places, this is usually a cost worth paying. However, in this case it's just wasteful:
Set<Long> set = new Set<>();
for (int i = 0; i < 1000; i++) {
set = set.add(i);
}
This will create 999 intermediate copies of the set, none of which we care about. There is only a single reference to these copies, and each is discarded as soon as add()
is called. The dataflow of these calls form a simple, linear chain. To have more than one reference, the dataflow must diverge.
Where the dataflow is linear, we can safely use mutable collections. Where it is not, we prefer to use immutable collections. Since this linear flow is a local property, we would also like mutability to be a local property:
Set<Long> set = new Set<>().linear();
for (int i = 0; i < 1000; i++) {
set.add(i);
}
set = set.forked();
The call to linear()
indicates that the collection has a single owner, and may be updated in-place. The call to forked()
indicates that this is no longer true. By allowing temporary mutability, we gain huge performance benefits. There is still a cost, however, relative to purely mutable data structures. For this reason, Bifurcan provides permanently linear variants of each collection:
LinearSet<Long> set = new LinearSet<>();
for (int i = 0; i < 1000; i++) {
set.add(i);
}
If we call forked()
on this collection, it will be wrapped in an immutable facade which tracks changes without touching the underlying collection. This facade provides fast reads, marginally slower writes, and does not support efficient set operations. All variants share the same API, allowing us to tweak the performance/safety tradeoffs of our code by changing only a few lines.
Most libraries for "functional programming" provide a lazy list or stream construct. However, this is less a data structure than a mechanism for control flow. While useful, such a mechanism is out of scope for this library.
Many modern data structure libraries also provide "parallel" collections, which make it easy to use multiple cores to process a single data structure. However, these collections are simply normal data structures with an execution model bolted on, without any obvious way to disentangle the two.
Rather than provide its own execution model, Bifurcan allows any collection to split into sub-collections using split(k)
, which will return 1 to k
pieces depending on its size. The sub-collections can be processed and then merged using methods such as concat
, merge
, union
, intersection
, or difference
.
This separation of concerns provides greater flexibility, but also requires more work to get up and running.
Copyright © 2016-2018 Zachary Tellman
Distributed under the MIT License