Briefer Course on LLVM-based Compiler Theory - Writing a Compiler Made Easier

Blanche
Engineer

Posted time: Sep 29, 2016 15:26
Since the start of the 21st century, new programming languages have sprung up like mushrooms. Demand was, of course, an important driving force, but improvements in the tool chain also played a major role.
In 2000, under the direction of Chris Lattner at UIUC, development began on a suite of compiler libraries called LLVM (Low Level Virtual Machine). Later, the scope of LLVM grew so wide that "Low Level Virtual Machine" no longer described the whole project, so LLVM became the official name rather than an acronym. LLVM can serve as a traditional compiler, a JIT compiler, an assembler, a debugger, a static analysis tool, and in other roles related to programming languages.
Afterwards, also under the direction of Chris Lattner, Clang was developed as a front-end for C/C++/Objective-C. This compiler directly challenged GCC's dominant position. After it became the main compiler of Apple's development platform, more and more Android modules began switching to Clang as well.
In 2012, LLVM won the ACM Software System Award, joining classic systems such as UNIX, WWW, TCP/IP, TeX, and Java.
Let us talk about the main author and architect of LLVM, Chris Lattner. He was born in 1978.
In 2005, Chris Lattner joined Apple. Since Apple was dissatisfied with GCC's inadequate support for Objective-C, LLVM and Clang became Apple's weapons of choice for replacing GCC.
In 2010, again under Chris Lattner's direction, development of the Swift language began.
OK, let's get back on track. First of all, unlike the impression of complex steps given by thick academic books, writing a simple compiler with LLVM is easy, because LLVM has already done a lot of the work for us.
It Is Easy to Make a Simple Compiler with LLVM
Let us look at the diagram of how a programming language is implemented using the LLVM tools:


What we need to do entirely by hand, or with tools such as lex and yacc, is the lexical analysis from source code to tokens and the syntactic analysis from tokens to an AST. In other words, we, as the definers of the language, need to implement the major parts of the front-end. In LLVM introductory books, the sections describing the front-end are only small ones, so we can take it easy.
In the Kaleidoscope example from the LLVM tutorial, the lexical and syntactic analysis, comments included, is only about 400 lines. Even so, if you find that complicated, I will show you some simpler examples that implement part of the functionality first and then develop iteratively. There is no need to learn full compiler theory for only a few hundred lines of code.
Clang, for instance, is a fully realized front-end for C/C++/Objective-C.
When going from the AST to LLVM, the latter already provides a series of tools to help us develop rapidly. From IR (intermediate representation) to DAG (directed acyclic graph) to machine instructions, LLVM has a complete back-end for all commonly used platforms. That is to say, once we complete the first step of lowering to IR, we enjoy the same productivity as Clang for all the follow-up work.
If you think anecdotal claims aren't enough, let me show you an example. This function converts a binary-expression AST node into IR:
Value *BinaryExprAST::codegen() {
...
  switch (Op) {
  case '+':
    return Builder.CreateFAdd(L, R, "addtmp");
  case '-':
    return Builder.CreateFSub(L, R, "subtmp");
  case '*':
    return Builder.CreateFMul(L, R, "multmp");
  case '<':
    L = Builder.CreateFCmpULT(L, R, "cmptmp");
    // Convert bool 0/1 to double 0.0 or 1.0
    return Builder.CreateUIToFP(L, Type::getDoubleTy(TheContext), "booltmp");
  default:
...
  }
}


At this stage, we don't need to concern ourselves with how the IR operations are generated; LLVM generates the corresponding code for us.
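To make this concrete: for a toy two-argument function whose body is x+y, the CreateFAdd call above would emit textual IR roughly like the following (a hand-written sketch of the expected output, not verbatim compiler output; the function name add is my own):

```llvm
define double @add(double %x, double %y) {
entry:
  %addtmp = fadd double %x, %y    ; produced by Builder.CreateFAdd(L, R, "addtmp")
  ret double %addtmp
}
```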
Next, let's look at code generation for a function prototype:
Function *PrototypeAST::codegen() {
  // Make the function type:  double(double,double) etc.
  std::vector<Type *> Doubles(Args.size(), Type::getDoubleTy(TheContext));
  FunctionType *FT =
      FunctionType::get(Type::getDoubleTy(TheContext), Doubles, false);

  Function *F =
      Function::Create(FT, Function::ExternalLinkage, Name, TheModule.get());

  // Set names for all arguments.
  unsigned Idx = 0;
  for (auto &Arg : F->args())
    Arg.setName(Args[Idx++]);

  return F;
}


Lexical analysis
Today, when using regular expressions is a basic skill, there is no real barrier to completing lexical analysis. Under normal circumstances, we only need to write a group of regular expressions, or a simple state machine.
Lexical analysis parses source code into tokens. Tokens are small units with a type and a value, such as Keyword, Number, Identifier, etc. At this stage, it is not necessary to consider how they combine or how they are used.
For instance, if a token's type is Number and its value is 3, that is sufficient information; as for how this 3 is used, it will be placed in a suitable position when the AST is assembled later.
As for context-free grammars, deterministic and non-deterministic finite automata, and so on, we do not need to know about them for the moment.
Syntactic analysis
Syntactic analysis is indeed more complicated than lexical analysis. Fortunately, for most statements and expressions, no profound knowledge is needed. Although shift-reduce parsing is a good technique, it will not be required for quite a long stretch of the learning process.
The output of syntactic analysis is the abstract syntax tree (AST). Since it is a tree, recursion matches its structure naturally. Therefore, for most statements, we only need the method of recursive descent.
For expressions, recursive descent alone is insufficient, because operators have precedence. So, for expressions, we also need operator-precedence parsing. There is no need for SLR, LALR or LR for the moment.
Syntax-directed translation and intermediate code generation
From the simple examples above, we can see that for this part, calling the IR construction tools that LLVM provides is mostly enough. For the constructs we can think of while designing the language, such as code blocks, function calls and control structures, LLVM has already prepared the machinery for us.
Optimization
LLVM provides Pass-based optimizations for us to choose and combine. We will discuss optimization in detail in the IR and machine-code phases; this may be the part we are really interested in.
Lexical Analysis Is Very Simple
Let us look at an official example. First, define the token types, enumerating every kind your language knows. Extending this list later is mere manual labor.
enum Token {
  tok_eof = -1,
...
  // primary
  tok_identifier = -4,
  tok_number = -5
};


Then parse with what amounts to regular expressions; no special technique is needed at all.
I have trimmed the official version a little to make it clearer:
static std::string IdentifierStr; // Filled in if tok_identifier
static double NumVal;             // Filled in if tok_number

/// gettok - Return the next token from standard input.
static int gettok() {
  static int LastChar = ' ';

  // Skip any whitespace.
  while (isspace(LastChar))
    LastChar = getchar();

  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    IdentifierStr = LastChar;
    while (isalnum((LastChar = getchar())))
      IdentifierStr += LastChar;
...
    return tok_identifier;
  }

  if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = getchar();
    } while (isdigit(LastChar) || LastChar == '.');

    NumVal = strtod(NumStr.c_str(), nullptr);
    return tok_number;
  }
 ...
  // Check for end of file.  Don't eat the EOF.
  if (LastChar == EOF)
    return tok_eof;
...
}


If you do not want to write this by hand, there are plenty of tools such as lex and flex. All of them determine the token type from regular expressions; you just save the value corresponding to the token type.
If a token has many types, it is just a matter of writing the regular expressions block by block. All of this work is manual labor rather than rocket science.
Good-Enough Syntactic Analysis Is Also Very Simple
As introduced above, we construct the abstract syntax tree in a top-down manner.
First of all, define a root type:
/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
  virtual ~ExprAST() {}
  virtual Value *codegen() = 0;
};


For simplicity, start by representing a number. This is easy: it is just a node that stores a numeric value.
/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
  double Val;

public:
  NumberExprAST(double Val) : Val(Val) {}
  Value *codegen() override;
};


One more example: a variable, which for now is just the variable's name. Assignment comes in a later step.
/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
  std::string Name;

public:
  VariableExprAST(const std::string &Name) : Name(Name) {}
  Value *codegen() override;
};


Function prototype:

/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its name, and its argument names (thus implicitly the number
/// of arguments the function takes).
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;

public:
  PrototypeAST(const std::string &Name, std::vector<std::string> Args)
      : Name(Name), Args(std::move(Args)) {}
  Function *codegen();
  const std::string &getName() const { return Name; }
};


Then let us see how to construct the AST node for a numeric value from a token. The value was saved during lexical analysis, so we just take it out and use it.
/// numberexpr ::= number
static std::unique_ptr<ExprAST> ParseNumberExpr() {
  auto Result = llvm::make_unique<NumberExprAST>(NumVal);
  getNextToken(); // consume the number
  return std::move(Result);
}


Again, look at the example of a function declaration:
/// prototype
///   ::= id '(' id* ')'
static std::unique_ptr<PrototypeAST> ParsePrototype() {
  if (CurTok != tok_identifier)
    return LogErrorP("Expected function name in prototype");

  std::string FnName = IdentifierStr;
  getNextToken();

  if (CurTok != '(')
    return LogErrorP("Expected '(' in prototype");

  std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier)
    ArgNames.push_back(IdentifierStr);
  if (CurTok != ')')
    return LogErrorP("Expected ')' in prototype");

  // success.
  getNextToken(); // eat ')'.

  return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
}


First read the function name, then find the left parenthesis, then the parameter list, and finally handle the right parenthesis. You know what? It needs no special technique at all...
In all the examples above, there is no nesting, so there is no need for recursive descent or operator precedence; those appear when dealing with binary expressions and the like. We can learn the easier parts first, to make sure we can assemble these easy components into a compiler that really goes from source code to machine instructions, even if the language it implements is incomplete.
All the examples above are taken from the official Kaleidoscope examples. The official tutorials are good, but slightly complicated, and it still takes a while before you get a workable compiler. I am going to lower the learning curve, so that we construct a playable compiler little by little through constant iteration and then gradually extend its functionality.
Hello, LLVM
Downloading LLVM
1. First, download LLVM:
svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
2. Under the LLVM tools directory, download Clang (optional but recommended):
cd llvm/tools
svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
3. Under the LLVM projects directory, download compiler-rt, openmp, libcxx and libcxxabi (all optional; anyway, I downloaded all of them):
svn co http://llvm.org/svn/llvm-project/compiler-rt/trunk compiler-rt
svn co http://llvm.org/svn/llvm-project/openmp/trunk openmp
svn co http://llvm.org/svn/llvm-project/libcxx/trunk libcxx
svn co http://llvm.org/svn/llvm-project/libcxxabi/trunk libcxxabi
Compiling LLVM
Since the official documentation says that many LLVM developers use Ninja, we will just follow suit.
As I am using a Mac, I use Homebrew to install CMake and Ninja. Linux is similar; users with old GCC versions, please install these tools on your own. I have not tried Windows, so that part will be updated later.
1. Under the LLVM source directory, create a build directory
2. cd build
3. cmake -G Ninja ..
4. ninja
Write a "Hello, World" program on LLVM
When transforming the AST into IR, we need to use the LLVM libraries. First, let us write a simple program to learn how to compile an LLVM-based program:
#include "llvm/ADT/STLExtras.h" // for llvm::make_unique
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include <cstdlib>
#include <map>
#include <memory>
#include <string>
#include <vector>

static llvm::LLVMContext TheContext;
static llvm::IRBuilder<> Builder(TheContext);
static std::unique_ptr<llvm::Module> TheModule;
static std::map<std::string, llvm::Value *> NameValues;

int main() {
  TheModule = llvm::make_unique<llvm::Module>("hello,llvm", TheContext);

  TheModule->dump();
  return 0;
}


The output is as follows:
; ModuleID = 'hello,llvm'
source_filename = "hello,llvm"


How to Link the LLVM Library
If the LLVM library is used, lots of parameters are needed.
The following are parameters on my computer:
-I/Users/ziyingliuziying/lusing/llvm/llvm/include -I/Users/ziyingliuziying/lusing/llvm/llvm/build/include  -fPIC -fvisibility-inlines-hidden -Wall -W -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Werror=date-time -std=c++11 -fcolor-diagnostics   -fno-exceptions -fno-rtti -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
-L/Users/ziyingliuziying/lusing/llvm/llvm/build/lib -Wl,-search_paths_first -Wl,-headerpad_max_install_names
-lLLVMCore -lLLVMSupport
-lcurses -lz -lm


It would be tiresome to write all of this out every time. For that reason, LLVM provides the llvm-config tool. The strings I just showed you were generated with the command line below:
llvm-config --cxxflags --ldflags --system-libs --libs core

A complete compile command can be written in this way:
clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core` -o toy