[1] Python Internals - Tokens
ℹ️ References:
- Python’s source code repo
- Keynote - Demystifying Python Internals - Diving into CPython by implementing an operator
- Internals of CPython - Louie Lu
1. What are tokens?
When executing any Python code, the interpreter must understand what each of the characters/words written means and how they have to be handled. For this task, Python breaks the code into tokens, using mainly spaces as separators (or by identifying a special character with a pre-defined function). Let’s use the code below as an example:
#test_tokens.py
x = 1 + 2
print(x)
The tokenize package
Python has a built-in library called tokenize that helps us understand how the language identifies tokens.
We can use it from the CLI:
$ python3 -m tokenize test_tokens.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,1: NAME 'x'
1,2-1,3: OP '='
1,4-1,5: NUMBER '1'
1,6-1,7: OP '+'
1,8-1,9: NUMBER '2'
1,9-1,10: NEWLINE '\n'
2,0-2,5: NAME 'print'
2,5-2,6: OP '('
2,6-2,7: NAME 'x'
2,7-2,8: OP ')'
2,8-2,9: NEWLINE ''
3,0-3,0: ENDMARKER ''
The first line declares the encoding Python is using to create the tokens (UTF-8 is the default). The pair at the beginning of each line gives the coordinates (line, column) where that token starts and ends, followed by its classification in what we call the grammar of the Python language (NAME, OP, NUMBER, …).
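The same token stream can be produced programmatically. Here is a small sketch using the standard-library tokenize module (note that generate_tokens works on text and emits no ENCODING token; only tokenize.tokenize, which reads bytes, does):

```python
import io
import tokenize

source = "x = 1 + 2\nprint(x)\n"

# generate_tokens takes a readline callable over a text stream and
# yields TokenInfo tuples: (type, string, start, end, line)
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok.start, tok.end, tokenize.tok_name[tok.type], repr(tok.string))
```

tok_name maps the numeric token type back to the name we saw in the CLI output (NAME, OP, NUMBER, …).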
The Python Grammar
So, how does Python know that x is a NAME, but (, ) or = are of OP type?
The operators are declared in /Grammar/Tokens in the source code. There we can find a mapping between token names and the characters they relate to. Example:
GREATER '>'
EQUAL '='
DOT '.'
PERCENT '%'
LBRACE '{'
RBRACE '}'
EQEQUAL '=='
NOTEQUAL '!='
LESSEQUAL '<='
GREATEREQUAL '>='
If any changes are made to this file, the token references must be rebuilt by running make regen-token from the root of the repo. It refers to this Makefile.
Update: ./configure must be executed before running the make command.
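The names declared in Grammar/Tokens end up as numeric constants in the generated Lib/token.py, so you can explore the mapping from a Python shell. EXACT_TOKEN_TYPES, available in the token module since Python 3.8, maps the operator text back to its token id:

```python
import token

# tok_name maps a numeric token id back to its name from Grammar/Tokens
print(token.tok_name[token.EQEQUAL])   # EQEQUAL

# EXACT_TOKEN_TYPES goes the other way: operator text -> token id
print(token.EXACT_TOKEN_TYPES["=="] == token.EQEQUAL)        # True
print(token.EXACT_TOKEN_TYPES[">="] == token.GREATEREQUAL)   # True
```

The numeric values themselves may change between Python versions, which is exactly why these files are regenerated rather than edited by hand.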
Appendix (A): Dissecting the command make regen-token
(Refers to this Makefile)
This command refers to this piece of the code:
.PHONY: regen-token
regen-token:
	# Regenerate Doc/library/token-list.inc from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py rst \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Doc/library/token-list.inc \
		$(srcdir)/Doc/library/token.rst
	# Regenerate Include/internal/pycore_token.h from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py h \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Include/internal/pycore_token.h
	# Regenerate Parser/token.c from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py c \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Parser/token.c
	# Regenerate Lib/token.py from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py py \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Lib/token.py
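One consequence of keeping these generated files in sync: in the CLI output earlier, every operator was labeled with the generic OP type, but each TokenInfo also carries an exact_type attribute that resolves to the specific name declared in Grammar/Tokens:

```python
import io
import token
import tokenize

source = "x = 1 + 2\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == token.OP:
        # .type is the generic OP; .exact_type is the specific token id
        print(tok.string, token.tok_name[tok.exact_type])
```

This prints "= EQUAL" and "+ PLUS", matching the EQUAL and PLUS entries in Grammar/Tokens.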