artagnon / rhine
- понедельник, 21 марта 2016 г. в 02:13:43
C++
A typed Elixir-inspired language on llvm
rhine is designed to be a fast language utilizing the LLVM JIT featuring N-d tensors, first-class functions, and type inference; specifying argument types is enough. It has a full blown AST into which it embeds a UseDef graph.
rhine started off as rhine-ml, and rhine-ml was called rhine earlier.
def bar(arithFn ~Function(Int -> Int -> Int)) do
println $ arithFn 2 4
end
def addCandidate(A ~Int, B ~Int) do
ret $ A + B
end
def subCandidate(C ~Int, D ~Int) do
ret $ C - D
end
def main() do
if false do
bar addCandidate
else
bar subCandidate
end
A = {{2}, {3}}
println A[1][0]
end
~Int
is a type annotation, and only argument types need to be annotated,
return type is inferred. ~Function(Int -> Int -> Int)
is a function that takes
two integers and returns one integer, mixing in some Haskell syntax. $
is
again from Haskell, which is basically like putting the RHS in parens.
rhine-ml, in contrast, has arrays, first-class functions, closures, variadic arguments, macros. It's also much less buggy.
rhine uses a handwritten recursive-descent parser, which is faster and reports better errors, than the former Bison one. You will need to use a one-token lookahead atleast, if you want to keep the code simple. This gives you one level of:
parseSymbol(); // Oops, the lexed token indicates that we're not in the right
// function
parseInstruction(); // Ask it to use an existing token, not lex a new one
Another minor consideration is that newlines must be handled explicitly if you want to substitute ; with a newline in the language.
void Parser::getTok() {
LastTok = CurTok;
CurTok = Driver->Lexx->lex(&CurSema, &CurLoc);
LastTokWasNewlineTerminated = false;
while (CurTok == NEWLINE) {
LastTokWasNewlineTerminated = true;
CurTok = Driver->Lexx->lex(&CurSema, &CurLoc);
}
}
The AST is heavily inspired by LLVM IR, although it has some higher-level
concepts like Tensor
. It's an SSA and has a UseDef graph embedded in it,
making analysis and transformation easy.
The main classes are Type
and Value
. All types like IntType
, FloatType
inherit from Type
, most of the others inherit from Value
. A BasicBlock
is
a Value
, and so is ConstantInt
.
A BasicBlock
is a vector of Instruction
, and this is how the AST is an SSA:
assignments are handled as a StoreInst
; there is no real LHS, just RHS
references.
StoreInst::StoreInst(Value *MallocedValue, Value *NewValue);
Value
is uniquified using LLVM's FoldingSet
, and Use
wraps it, so we can
replace one Value
with another.
/// A Use is basically a linked list of Value wrappers
class Use {
Value *Val;
Use *Prev;
Use *Next;
// Laid out in memory as [User] - [Use1] - [Use2]. Use2 has DistToUser 2
unsigned DistToUser;
};
An Instruction
is a User
. User
and its Use
values are laid out
sequentially in memory, so it's possible to reach all the Use
values from the
User
. It's also possible to reach the User
from any Use
, using
DistToUser
.
class User : public Value {
protected:
unsigned NumOperands;
};
class Instruction : User;
The User
has a custom new
to allocate memory for the Use
instances
as well.
void *User::operator new(size_t Size, unsigned Us) {
void *Storage = ::operator new (Us * sizeof(Use) + Size);
auto Start = static_cast<Use *>(Storage);
auto End = Start + Us;
for (unsigned Iter = 0; Iter < Us; Iter++) {
new (Start + Iter) Use(Us - Iter);
}
auto Obj = reinterpret_cast<User *>(End);
return Obj;
}
};
The Context is a somewhat large object that keeps the uniqified Type
and
Value
instances. It also keeps track of Externals
, the external C functions
that are provided as part of a "standard library". Unique llvm::Builder
and
llvm::Context
objects, as well as the DiagnosticPrinter
are exposed member
variables. Finally, it is necessary for symbol resolution, and keeps the
ResolutionMap
.
src/Transform/Resolve is an example of something that utilizes the UseDef embedded in the AST.
B = A + 2
creates one UnresolvedValue
, A
, an AddInst
, and a MallocInst
,
which takes the string "B" and AddInst
as operands.
The transform basically goes over all the Instruction
in the BasicBlock
,
resolves UnresolvedValue
instances, and sets the Use
to the resolved value.
It hence replaces the Value
underneath the Use
, and since the Instruction
is referencing Use
instances, there are no dangling references.
if (auto S = K->Map.get(V, Block)) {
/// %S = 2;
/// ^
/// Came from here (MallocInst, Argument, or Prototype)
///
/// Foo(%S);
/// ^
/// UnresolvedValue; replace with %Replacement
if (auto M = dyn_cast<MallocInst>(S)) {
if (dyn_cast<StoreInst>(U->getUser()))
U.set(M);
}
}
Type Inference is too simple. One visit
function is overloaded for all
possible Value
classes.
Type *TypeInfer::visit(MallocInst *V) {
V->setType(visit(V->getVal()));
assert(!V->isUnTyped() && "unable to type infer MallocInst");
return VoidType::get(K);
}
$ git submodule update --init
$ mkdir llvm-build
$ cd llvm-build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ../llvm
$ ninja
$ mkdir ../rhine-build
$ cd ../rhine-build
$ cmake -GNinja ../rhine
$ ninja
An inefficient untyped language is easy to implement. println
taking 23 and
"twenty three" as arguments is a simple matter of switching on
type-when-unboxed. There's no need to rewrite the value in IR, and certainly no
need to come up with an overloading scheme.
Crystal made a good decision to start with Ruby. If your idea is to self-host, then the original language's efficiency does not matter. All you need is good generated assembly (which LLVM makes easy).