2025-02-Compiler/2.md

Lexical Analysis
===

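Fortran (in its classic fixed-form source) ignores all whitespace, so blanks cannot be relied on to separate tokens. The two statements below differ by a single character, yet they lex completely differently.

```fortran
do 5 I = 1.25
```

With the blanks removed this is `do5I=1.25`: an assignment of `1.25` to a variable named `do5I`.

```fortran
do 5 I = 1,25
```

Here the comma makes it a `do` loop: the statements up to label `5` are repeated with `I` running from 1 to 25. The lexer cannot decide whether `do` is a keyword until it has read ahead to the `.` or `,`.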
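The lookahead needed to tell these two statements apart can be sketched in Python. This is a toy, not a real Fortran lexer: the function name `classify_do` and both regular expressions are assumptions made for illustration, and only statements of exactly this shape are handled.

```python
import re

def classify_do(stmt: str) -> str:
    """Classify a fixed-form Fortran statement that starts with 'do'.

    Toy sketch: blanks are dropped first, then the text after '=' is
    checked for a comma, which separates a loop header from an assignment.
    """
    s = stmt.replace(" ", "").lower()
    # do <label> <var> = <expr> , <expr>  -> loop header
    if re.fullmatch(r"do\d+[a-z][a-z0-9]*=[^,]+,.+", s):
        return "do-loop"
    # <identifier> = <expr>  -> assignment to a variable such as do5i
    if re.fullmatch(r"[a-z][a-z0-9]*=.+", s):
        return "assignment"
    return "unknown"

print(classify_do("do 5 I = 1.25"))  # assignment
print(classify_do("do 5 I = 1,25"))  # do-loop
```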


## Tokens

Typical token classes:

* Identifiers
* Keywords
* Integers
* Floating-point numbers
* Symbols
* Strings

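Three things are needed to turn source text into tokens:

* Specification

The token patterns must be specified precisely.

* Recognition

Pattern matching is done with a DFA.

* Automation

The DFA has to be generated from the REs.

The tool Lex is used for this.

Still, the algorithms it relies on internally, Thompson's construction (RE -> NFA) and subset construction (NFA -> DFA), should be understood as well.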


## Specification

**Regular Expression**

* Used in many tools: `grep`, `find`, `sed`, `awk`
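As a small illustration, token classes can be written down as regular expressions with Python's `re` module. The names `TOKEN_SPEC` and `classes_of` and the concrete patterns are assumptions chosen for this example, not the specification of any particular language.

```python
import re

# Hypothetical token specifications, one RE per token class.
TOKEN_SPEC = {
    "Integer": r"[0-9]+",
    "Float": r"[0-9]+\.[0-9]+",
    "Identifier": r"[A-Za-z_][A-Za-z0-9_]*",
    "String": r'"[^"\n]*"',
}

def classes_of(lexeme: str) -> list[str]:
    """Return every token class whose RE matches the whole lexeme."""
    return [name for name, pat in TOKEN_SPEC.items() if re.fullmatch(pat, lexeme)]

print(classes_of("42"))      # ['Integer']
print(classes_of("3.14"))    # ['Float']
print(classes_of("count1"))  # ['Identifier']
```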


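**Multiple Matches**

The code `elsex = 0` can be split in two ways:

`else / x / = / 0`

or

`elsex / = / 0`

Exactly one of them must be chosen, and the rule is that the longest token wins.

* `elsex` is longer than `else`, so `elsex` is chosen.

If two patterns match a lexeme of the same length, the choice is made by the priority of the token classes.

* `Keyword` has higher priority than `Identifier`.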
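Both rules, longest match first and then priority, can be sketched as a small tokenizer. The rule list `RULES`, its patterns, and the function `tokenize` are assumptions for illustration; a generated lexer would use a DFA rather than trying each pattern in turn.

```python
import re

# Rules in priority order: earlier rules win ties. Patterns are assumptions.
RULES = [
    ("Keyword", re.compile(r"else|if|while")),
    ("Identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("Integer", re.compile(r"[0-9]+")),
    ("Symbol", re.compile(r"[=+\-*/;]")),
]

def tokenize(text: str) -> list[tuple[str, str]]:
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier (higher-priority) rule on a tie.
            if m and m.end() - pos > best_len:
                best_kind, best_len = kind, m.end() - pos
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("elsex = 0"))
# [('Identifier', 'elsex'), ('Symbol', '='), ('Integer', '0')]
print(tokenize("else x = 0"))
# [('Keyword', 'else'), ('Identifier', 'x'), ('Symbol', '='), ('Integer', '0')]
```

In the second call `else` matches both `Keyword` and `Identifier` with the same length, so the rule listed first decides.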

## Recognition

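Recognition uses finite state automata (FSA).

DFAs and NFAs have the same expressive power, but a DFA has the advantage of being easy to implement.
An NFA has the advantage of being easy to derive from an RE.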

**Lexical Analysis**

`Lexical Spec -> RE -> NFA -> DFA -> Table`
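The final `Table` stage can be pictured as a transition table driven by a short loop. The sketch below hand-writes the table for an assumed identifier pattern `[a-z][a-z0-9]*`; the names `TABLE`, `ACCEPTING`, `char_class`, and `accepts` are hypothetical, and a real generator would emit a denser array instead of a dictionary.

```python
# Transition table for a DFA accepting [a-z][a-z0-9]* (an assumed pattern).
# States: 0 = start, 1 = inside an identifier. A missing entry means "reject".
TABLE = {
    (0, "letter"): 1,
    (1, "letter"): 1,
    (1, "digit"): 1,
}
ACCEPTING = {1}

def char_class(c: str) -> str:
    """Map a character to the column it selects in the table."""
    if "a" <= c <= "z":
        return "letter"
    if "0" <= c <= "9":
        return "digit"
    return "other"

def accepts(s: str) -> bool:
    state = 0
    for c in s:
        state = TABLE.get((state, char_class(c)))
        if state is None:  # no transition: the DFA is stuck
            return False
    return state in ACCEPTING

print(accepts("x1"))  # True
print(accepts("1x"))  # False
print(accepts(""))    # False
```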

## Automation

* `Lex` (`Flex`: a faster implementation of Lex)
* `Bison` (a parser generator, commonly used together with Flex)

### Lex/Flex

* Definition Section
  * can declare variables and enumerations or include headers, using C code placed between `%{` and `%}`
  * provides named sub-rules for complex patterns used in the **rules**

* Rules Section
  * lexical patterns, each paired with an action

* User Function Section
  * copied verbatim into the generated Lex program

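```c
/* example.l */
%{
    #include <stdio.h>
    int num_lines = 0;
%}
%option noyywrap
%%
[ \t]   { /* skip blanks and tabs */ }
a |
an |
the     { printf("%s: is an article\n", yytext); }
[a-z]+  { printf("%s: ???\n", yytext); }
%%
int main(void) {
    yylex();   /* run the scanner on stdin until EOF */
    return 0;
}
```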

### By Hand

* Thompson's construction (RE -> NFA)
* Subset construction (NFA -> DFA)
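The two constructions can be sketched together in Python. Several simplifications here are assumptions of this sketch rather than part of the notes: the RE is given in postfix form with `.` for concatenation, concatenation links fragments with an epsilon edge instead of merging states, and all function names are hypothetical.

```python
from itertools import count

EPS = None  # label used for epsilon transitions

def thompson(postfix: str):
    """Build an NFA from a postfix RE with operators '.', '|', '*'.

    Returns (start, accept, edges); edges is a list of (src, label, dst).
    """
    ids, edges, stack = count(), [], []  # stack holds (start, accept) fragments
    for ch in postfix:
        if ch == ".":
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            edges.append((a1, EPS, s2))
            stack.append((s1, a2))
        elif ch == "|":
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            s, a = next(ids), next(ids)
            edges += [(s, EPS, s1), (s, EPS, s2), (a1, EPS, a), (a2, EPS, a)]
            stack.append((s, a))
        elif ch == "*":
            s1, a1 = stack.pop()
            s, a = next(ids), next(ids)
            edges += [(s, EPS, s1), (s, EPS, a), (a1, EPS, s1), (a1, EPS, a)]
            stack.append((s, a))
        else:  # ordinary input symbol
            s, a = next(ids), next(ids)
            edges.append((s, ch, a))
            stack.append((s, a))
    start, accept = stack.pop()
    return start, accept, edges

def eps_closure(states, edges):
    """All NFA states reachable from `states` through epsilon edges only."""
    closure, work = set(states), list(states)
    while work:
        q = work.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in closure:
                closure.add(dst)
                work.append(dst)
    return frozenset(closure)

def subset_construction(start, accept, edges):
    """Return (dfa_start, dfa_accepting, dfa_table); DFA states are frozensets."""
    alphabet = {label for _, label, _ in edges if label is not EPS}
    d0 = eps_closure({start}, edges)
    table, accepting, work, seen = {}, set(), [d0], {d0}
    while work:
        S = work.pop()
        if accept in S:
            accepting.add(S)
        for sym in alphabet:
            move = {dst for src, label, dst in edges if src in S and label == sym}
            if not move:
                continue
            T = eps_closure(move, edges)
            table[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    return d0, accepting, table

def dfa_accepts(dfa, s: str) -> bool:
    state, accepting, table = dfa
    for c in s:
        state = table.get((state, c))
        if state is None:
            return False
    return state in accepting

# (a|b)*abb written in postfix form
dfa = subset_construction(*thompson("ab|*a.b.b."))
print(dfa_accepts(dfa, "abb"))   # True
print(dfa_accepts(dfa, "babb"))  # True
print(dfa_accepts(dfa, "ab"))    # False
```

Each DFA state is the set of NFA states the input could have reached, which is why the resulting automaton needs no backtracking.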