diff --git a/notes/2.md b/notes/2.md index 75fa7f6..aa76b09 100644 --- a/notes/2.md +++ b/notes/2.md @@ -1,5 +1,4 @@ -Lexical Analysis -=== +# Lexical Analysis 포트란은 모든 whitespace를 지움. diff --git a/notes/3.md b/notes/3.md index bb27d84..255e510 100644 --- a/notes/3.md +++ b/notes/3.md @@ -1,30 +1,72 @@ # Syntax Analysis +## Specification -## Context-Free Grammars (CFG) +### Context-Free Grammars (CFG) -## Parse Tree +There are four main compoennts of CFG -A tree representation of the derivation +* Terminal Symbols +* Non-terminal Symbols +* Start Symbol $S$ +* Production + +Language generated by a CFG is a set of strings of terminals by repeatedly applying productions to the non-terminals: $L(G)$ indicates a language generated by the grammar $G$. + +We can use CFGs to express the syntax of the target programming languages: Parser detects that the token stream from lexer is valid or invalid. + +**We cannot rely on regex to sepcify the syntax:** Because regex is not expressive enough to describe valid syntax. (e.g. nested parenthesis) + +## Recognition + +### Parse Tree + +A tree representation of the derivation. parse tree has `terminals` at the leaves, `non-terminals` at the interior. An in-order traversal of the leaves is the original input. +We can appply productions for the non-terminals in any order: * leftmost derivation * rightmost derivation -### Ambiguity +#### Ambiguity -should be removed +A grammar $G$ is ambigous if it produces different parse tree depening on the order. -for example: `A + B * C` should be resolved +It should be **resolved** to construct a useful parser. **removing ambiguity** +Example of Ambiguity. + +1. Precedence: +The production at higher levels will have operators with lower priorities (and vice versa). +we can insert non-terminals to enforce precendence. + +2. Associativity: +We should determine where to place recursion depending on the associativity. + * left associative: place the recursion on the left + ```txt + S -> S - T | T + T -> id + ``` + * right associative: place the recursion on the right + ```txt + S -> T ^ S | T + T -> id + ``` + * non associative: do not use recursion + ```txt + S -> A < A | A + A -> id + ``` + ### AST (Abstract Syntax Tree) +AST discards unneeded information for syntax analysis: removes non-terminals from parse tree. ### Error Handling @@ -32,24 +74,179 @@ One of the purposes of the compiler is error handling. - to detect non-valid programs - and to translate the non-valid to the valid -## Parsing +so therefore, error handler should: +* report errors accurately and clearly +* recover quickly from an error +* not slow down compilation of valid code -* Top-down Parsing +so Why? +* back in the day, the compiler was extremely slow. +* but nowadays, we do not need complex error handling procedure + * Quick recompilatio + * Users tend to correct few errors at once + * Does not need a complex error recovery procedure -**Recursive Descent Parsing** +## Automation -by using backtracking +How to generate parse tree from CFG? -* Predictive Parsing +### Top-Down Parsing +Construct a leftmost derivation of string while reading the token stream. +e.g. +```txt +S -> E + S | E +E -> num | (S) +``` -Parsing Table: no need to backtrack. +We can implment it as **recursive descent parsing**. +Recursive descent parsing is try out rules in order and backtrack if the production does not generate proper token. + +#### Predictive Parsing and LL(1) + +But it needs **backtracking**. +So we introduce **predictive parsing**. + +Predictive Parsing applies a single production without "backtracking". LL(1) grammar can apply **"at most a single production"**, which actually eliminates the multiple matches in top-down parsing(recursive-descent). + +### Parser Implementation + +#### Recursive Descent Parser by LL(1) + +```text +S -> ES' +S' -> +ES' | e +E -> num | (S) +``` + +We can use the table to implement parsers. + +| * | num | + | ( | ) | $(EOF) | +| --- | --- | --- | --- | --- | ------ | +| S | ES_ | | ES_ | | | +| S_ | | + S | | e | e | +| E | num | | (S) | | | + +```c +void parse_S() { + switch(token) { + case num: parse_E(); parse_S_(); return; + case '(': parse_E(); parse_S_(); return; + default: ParseError(); + } +} +void parse_S_() { + switch(token) { + case '+': token=input.next(); parse_S_(); return; + case ')': return; + case EOF: return; + default: ParseError(); + } +} + +void parse_E() { + switch(token) { + case num: token = input.next(); return; + case '(': token = input.next(); parse_S(); if(token != ')') ParseError(); token = input.next(); return; + default: ParseError(); + } +} +``` + +#### Parsing Tables + +And then How to Construct a Parsing Tables? There are three important traits to gen parse tables. +* $x$ is nullable if it can derive an empty string. +* $\text{First}(x)$ is a set of terminals that can derived in the first position. +* $\text{Follow}(x)$ is a set of terminals that can appears after $\alpha$ in at least one of the derivations. + +Computing Nullable +1. Easy + +Computing First +1. $\text{First}(t) = \set{t}$ +2. For a production $x \to A_1 A_2 \dots A_n$ where $A_1 \dots A_{i-1}$ are nullable, then $\text{First}(x) += \text{First}(A_{i})$ + +Computing Follow +1. $\text{Follow}(S) = \set{ \$ }$, ($S$ is the start symbol) +2. For a production $x \to A_1 A_2 \dots A_n$ where $A_{i+1}\dots A_{n}$ are nullable, then $\text{Follow}(A_{i}) += \text{Follow}(x)$ +3. For a production $x \to A_1 A_2 \dots A_n$ where $A_{i+1}A_{i+2}\dots A_{j-1}$ are nullable, $\text{Follow}(A_{i}) += \text{Follow}(A_{j})$ + +So Combine Them Together: + +```py +S = symbols.start +for x in symbols: + First(x) = {}; Follow(x) = {}, Nullable(x) = false +Follow(S) = {$} +for t in terminals: + First(t) = {t} +Nullable(e) = True +while not (First.is_changed or Follow.is_changed or Nullable.is_changed): + for X, A in productions: + if all(A) is Nullable: Nullalbe(X) = True + if A[1..i-1] is Nullable: First(X) += First(A[i]) + if A[i+1..n] is Nullable: Follow(A[i]) += Follow(X) + if A[i+1..j-1] is Nullable: Follow(A[i]) += First(A[j]) +``` + +We can use the tables with `First`, `Follow`, `Nullable` to make actual **Parsing Tables** by combining. + +### Bottom-Up Parsing + +Bottom-up Parsing is more efficient than Top-down parsing by using **LR grammars**, which means **L**eft-recursive grammars, and **R**ight-most derivation. +It relies on **Shift-reduce Parsers**. + +Shift-Reducing Parsing: + +Bottom-up parser traces a rightmost derivation in reverse, we should scan the input terminals in a left-to-right manner. +So the parser splits a string into two parts: sequence of symbols to reduce(**stack**), remaining tokens(**input**). +And shift-reduce parsing requires two actions: **Reduce**, **Shift**. + +What's the important challenge is "Action Selection Problem": +As there can be conflicts: For a given state(stack + input) there can be multiple possible actions: +* shift-reduce conflict +* reduce-reduce conflict + +### LR Grammars + +* LR(k): left-toright scanning, right most derivation and $k$ symbol lookahead +* LR(0) Grammar +LR(0) indicates grammars that can determine actions without any lookahead. +There are **no reduce-reduce and shift-reduce conflicts**, because it should be determined by stacks. -### Parser Implement +### NFA Representations +We can represent shift-reduce parsing using an **NFA**, whose states are production with separator '`.`' on RHS. And we have an additional dummy production `S' -> S$` for a start and end state. +There are two types of transitions between the states: +* **shift transition**: transition by the shift actions. +* **$\epsilon$ transition**: that is the parser expand the expected list not consuming any input tokens. (transition to LHS of the production is equal to next of the current position) +#### NFA to DFA +NFA can be fully converted to DFA. by using DFA, the parser determine whether to shift or reduce: by using symbols in the stack to **traverse the state** and **determine whether to shift or reduce by destination state**. +### Parsing Table And LR(0) +DFA Traversal is implemented by simplifying DFA to **LR Parsing Table**. + +Store the states along with the symbols in the stack: <`symbol`, `state`> + +There are two different types of tables: `goto`, `action`. +* `goto`: determine the next state using the top state and an **input non-terminal**. +* `action`: determine the action using the top state and an input terminal. + +And table consists of four different actions: +* `shift x`: push on the stack (a is current input and x is a state) +* `reduce x -> a`: pop a from the stack and push <`x`, `goto[curr_state, x]`> +* accept(`S' -> S$.`) / Error + +DFA states are converted to index of each rows. + +### SLR(1) Parsing + +A simple extension of LR(0). + +For each reduction `X -> b`, look at the next symbol `c` and then apply reduction only if `c` is in `Follow(X)`.