[1] Python Internals - Tokens
ℹ️ References:
- Python’s source code repo
- Keynote - Demystifying Python Internals - Diving into CPython by implementing an operator
- Internals of CPython - Louie Lu
1. What are tokens?
When executing any Python code, the interpreter must understand what each of the characters/words written means and how they have to be handled. For this task, Python breaks the code into tokens, using mainly spaces as separators (or by identifying a special character with a pre-defined function). Let’s use the code below as an example:
#test_tokens.py
x = 1 + 2
print(x)
The tokenize package
Python has a built-in library called tokenize that helps us understand how the language identifies tokens.
We can use it from the CLI:
$ python3 -m tokenize test_tokens.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,1: NAME 'x'
1,2-1,3: OP '='
1,4-1,5: NUMBER '1'
1,6-1,7: OP '+'
1,8-1,9: NUMBER '2'
1,9-1,10: NEWLINE '\n'
2,0-2,5: NAME 'print'
2,5-2,6: OP '('
2,6-2,7: NAME 'x'
2,7-2,8: OP ')'
2,8-2,9: NEWLINE ''
3,0-3,0: ENDMARKER ''
The first line declares the encoding Python is using to create the tokens (UTF-8 is the default). The pair at the beginning of each line gives the coordinates (line, column) where that token starts and ends, followed by its classification in what we call the grammar of the Python language (NAME, OP, NUMBER, …).
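The same token stream can be produced programmatically. Here is a small sketch using the standard-library tokenize module (note that generate_tokens works on text and emits no ENCODING token; only tokenize.tokenize, which reads bytes, does):

```python
import io
import tokenize

source = "x = 1 + 2\nprint(x)\n"

# generate_tokens takes a readline callable over a text stream and
# yields TokenInfo tuples: (type, string, start, end, line)
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok.start, tok.end, tokenize.tok_name[tok.type], repr(tok.string))
```

tok_name maps the numeric token type back to the name we saw in the CLI output (NAME, OP, NUMBER, …).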
The Python Grammar
So, how does Python know that x is a NAME, but (, ) or = are of OP type?
The operators are declared in /Grammar/Tokens in the source code. There we can find a mapping between token names and the characters they relate to. Example:
GREATER '>'
EQUAL '='
DOT '.'
PERCENT '%'
LBRACE '{'
RBRACE '}'
EQEQUAL '=='
NOTEQUAL '!='
LESSEQUAL '<='
GREATEREQUAL '>='
If any changes are made to this file, the token references must be rebuilt by running make regen-token from the root of the repo. It refers to this Makefile.
Update: ./configure must be executed before running the make command.
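The names declared in Grammar/Tokens end up as numeric constants in the generated Lib/token.py, so you can explore the mapping from a Python shell. EXACT_TOKEN_TYPES, available in the token module since Python 3.8, maps the operator text back to its token id:

```python
import token

# tok_name maps a numeric token id back to its name from Grammar/Tokens
print(token.tok_name[token.EQEQUAL])   # EQEQUAL

# EXACT_TOKEN_TYPES goes the other way: operator text -> token id
print(token.EXACT_TOKEN_TYPES["=="] == token.EQEQUAL)        # True
print(token.EXACT_TOKEN_TYPES[">="] == token.GREATEREQUAL)   # True
```

The numeric values themselves may change between Python versions, which is exactly why these files are regenerated rather than edited by hand.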
Appendix (A): Dissecting the command make regen-token
(Refers to this Makefile)
This command refers to this piece of the code:
.PHONY: regen-token
regen-token:
	# Regenerate Doc/library/token-list.inc from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py rst \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Doc/library/token-list.inc \
		$(srcdir)/Doc/library/token.rst
	# Regenerate Include/internal/pycore_token.h from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py h \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Include/internal/pycore_token.h
	# Regenerate Parser/token.c from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py c \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Parser/token.c
	# Regenerate Lib/token.py from Grammar/Tokens
	# using Tools/build/generate_token.py
	$(PYTHON_FOR_REGEN) $(srcdir)/Tools/build/generate_token.py py \
		$(srcdir)/Grammar/Tokens \
		$(srcdir)/Lib/token.py
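One consequence of keeping these generated files in sync: in the CLI output earlier, every operator was labeled with the generic OP type, but each TokenInfo also carries an exact_type attribute that resolves to the specific name declared in Grammar/Tokens:

```python
import io
import token
import tokenize

source = "x = 1 + 2\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == token.OP:
        # .type is the generic OP; .exact_type is the specific token id
        print(tok.string, token.tok_name[tok.exact_type])
```

This prints "= EQUAL" and "+ PLUS", matching the EQUAL and PLUS entries in Grammar/Tokens.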