GH-116380: Speed up glob.glob() by removing some system calls #116392

Merged
barneygale merged 101 commits into python:main from barneygale:gh-116380 on Feb 28, 2025

barneygale commented Mar 5, 2024

Speed up glob.glob() and glob.iglob() by reducing the number of system calls made.

This unifies the implementations of globbing in the glob and pathlib modules.

Depends on

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters the resulting paths through a regex based on "**/*.py", with no further filesystem access needed.

This solves #104269 as a side-effect.
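
As a rough illustration of the walk-then-filter idea, here is a minimal sketch. It is not the code from this PR: the helper names (walk, filtered_recursive_glob) and the hand-written regex PY_TAIL are assumptions made for the example, and hidden-file handling is ignored.

```python
import os
import re
import tempfile

# Hand-written regex standing in for the tail pattern "**/*.py" (an assumption:
# a real translation would also handle hidden files and platform separators).
PY_TAIL = re.compile(r"(?:.*/)?[^/]*\.py")


def walk(root):
    """Yield every path below root using os.scandir() and an explicit stack."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                yield entry.path
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)


def filtered_recursive_glob(root, tail_regex):
    """Walk root once, then filter on the part of each path below root."""
    skip = len(root) + 1  # length of root plus one separator
    for path in walk(root):
        tail = path[skip:].replace(os.sep, "/")
        if tail_regex.fullmatch(tail):
            yield path


# Build a small throwaway tree: foo/a.py, foo/sub/b.py, foo/sub/c.txt
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "foo")
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.py", os.path.join("sub", "b.py"), os.path.join("sub", "c.txt")):
        open(os.path.join(root, rel), "w").close()
    matches = sorted(os.path.relpath(p, tmp)
                     for p in filtered_recursive_glob(root, PY_TAIL))
    print(matches)
```

The only filesystem access is the os.scandir() call per directory; deciding whether a path matches the trailing segments is pure string work.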

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

  • Certain special pattern segments ("", "." and "..") leave the flag unchanged
  • Literal pattern segments (e.g. foo/bar) set the flag to false (because they are joined onto the path without consulting the filesystem)
  • Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
  • Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.
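
The flag bookkeeping can be sketched roughly as follows. This models only literal and wildcard segments, and every name here (select_literal, select_wildcard, finalize) is hypothetical rather than taken from the PR. The starting directory is assumed to exist, so its flag begins as true.

```python
import os
import tempfile


def select_literal(candidates, name):
    """Literal segment: join without touching the filesystem, so existence is unknown."""
    for path, _exists in candidates:
        yield os.path.join(path, name), False


def select_wildcard(candidates, predicate):
    """Wildcard segment: children come from os.scandir(), so they are known to exist."""
    for path, _exists in candidates:
        try:
            with os.scandir(path) as entries:
                names = sorted(entry.name for entry in entries)
        except OSError:
            continue  # missing or unreadable directory yields nothing
        for name in names:
            if predicate(name):
                yield os.path.join(path, name), True


def finalize(candidates):
    """Only paths whose flag is false need an lstat() (via os.path.lexists)."""
    for path, exists in candidates:
        if exists or os.path.lexists(path):
            yield path


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "pkg"))
    open(os.path.join(tmp, "pkg", "a.py"), "w").close()
    start = [(tmp, True)]

    # Pattern "pkg/*.py": ends on a wildcard, so no final lstat() is needed.
    wild = list(select_wildcard(select_literal(start, "pkg"),
                                lambda name: name.endswith(".py")))
    wild_flags = [exists for _path, exists in wild]

    # Patterns "pkg/a.py" and "pkg/missing.py": end on a literal, so lstat() decides.
    present = list(finalize(select_literal(select_literal(start, "pkg"), "a.py")))
    absent = list(finalize(select_literal(select_literal(start, "pkg"), "missing.py")))
    found = [os.path.relpath(p, tmp) for p in present]
    print(wild_flags, found, absent)
```

In this sketch a pattern that ends on a wildcard never reaches lstat(), while a purely literal pattern pays for exactly one lstat() per candidate at the end.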

Minor speed-ups

We:

  • Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
  • Use a stack rather than recursion to implement recursive wildcards.
  • Pre-compile regular expressions and pre-join literal pattern segments.
  • Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
  • Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
  • Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.
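
To illustrate the last two points about regexes, here is a small sketch of pre-compiling a wildcard segment once and matching case-insensitively instead of normalizing each name. It uses fnmatch.translate() as a stand-in for the pattern translation, which is an assumption for the example; compile_segment is a hypothetical helper, not a function from the PR.

```python
import fnmatch
import re


def compile_segment(pattern, case_sensitive):
    """Compile one wildcard segment once; ignore case via a regex flag."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(fnmatch.translate(pattern), flags)


names = ["setup.py", "README.md", "Tool.Py"]

insensitive = compile_segment("*.PY", case_sensitive=False)
sensitive = compile_segment("*.PY", case_sensitive=True)

# Names are matched as-is, so the results keep their original spelling.
matched = [name for name in names if insensitive.match(name)]
matched_strict = [name for name in names if sensitive.match(name)]
print(matched, matched_strict)
```

Because the names themselves are never rewritten, the matched results retain their on-disk case, and no per-name normalization call is needed.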

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

  1. Support for dir_fd
  2. Support for include_hidden
  3. Support for generating paths relative to root_dir

Results

Speedups measured with python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)", run from the CPython source directory on Linux:

pattern                 speedup
Lib/*                   1.87x
Lib/*/                  1.85x
Lib/*.py                1.3x
Lib/**                  5.62x
Lib/**/                 1.23x
Lib/**/*                1.92x
Lib/**/**               17x
Lib/**/*/               2.15x
Lib/**/*.py             1.79x
Lib/**/__init__.py      1.03x
Lib/**/*/*.py           2.41x
Lib/**/*/__init__.py    1.76x

Labels

performance Performance or resource usage
