Bug report
Most static strings are interned during Python initialization in _PyUnicode_InitStaticStrings. However, the _Py_LATIN1_CHR characters (code points 0-255) are static but not interned. They may be interned later while Python is running, for various reasons, including calls to sys.intern.
This isn't thread-safe: it modifies the hashtable _PyRuntime.cached_objects.interned_strings, which is shared across threads and interpreters, without any synchronization.
It can also break the interning identity invariant: a non-static, interned 1-character string can later be shadowed by the global interning of the static 1-character string.
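To make the shadowing scenario concrete, here is a minimal Python model of it. This is not CPython code: the `Str` and `InternModel` classes are hypothetical stand-ins, and the lookup order (shared table first, then static strings, then a per-interpreter table) is an assumption made for illustration.

```python
class Str:
    """Stand-in for a unicode object: identity matters, not only the value."""

    def __init__(self, value, static=False):
        self.value = value
        self.static = static      # models a statically allocated string
        self.interned = False


class InternModel:
    """Simplified model of the lookup order (an assumption, not CPython code)."""

    def __init__(self):
        self.global_table = {}    # models _PyRuntime.cached_objects.interned_strings
        self.per_interp = {}      # models a per-interpreter interned table

    def intern(self, s):
        # An entry in the shared table always wins.
        hit = self.global_table.get(s.value)
        if hit is not None:
            return hit
        if s.static:
            # Late interning of a static string mutates the shared table.
            self.global_table[s.value] = s
            s.interned = True
            return s
        # Otherwise fall back to the per-interpreter table.
        hit = self.per_interp.setdefault(s.value, s)
        hit.interned = True
        return hit


model = InternModel()
heap_a = Str("a")                  # non-static 1-character string
static_a = Str("a", static=True)   # models the _Py_LATIN1_CHR entry for "a"

first = model.intern(heap_a)       # interned in the per-interpreter table
model.intern(static_a)             # static "a" is interned later, globally
second = model.intern(Str("a"))    # now resolves to the static object

# Two distinct objects are both marked interned for the same value.
print(first is second, first.interned, second.interned)
```

In this model, `first` and `second` have equal values and are both marked interned, yet they are different objects, which is the invariant violation described above.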
Suggestions
- The _PyRuntime.cached_objects.interned_strings table should be immutable. We should not modify it after Py_Initialize() until shutdown (i.e., _PyUnicode_ClearInterned called from finalize_interp_types()).
- The 1-character latin1 strings should be interned. This can be done either by explicitly interning them during startup, or by handling 1-character strings specially in intern_common.
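The two suggestions above could be modeled roughly as follows. This is a sketch in Python rather than the C implementation: `Str`, `LATIN1`, and `FixedInternModel` are hypothetical names, and it picks the "intern during startup" variant of the second suggestion. The shared table is populated once and then exposed read-only.

```python
from types import MappingProxyType


class Str:
    """Stand-in for a unicode object: identity matters, not only the value."""

    def __init__(self, value, static=False):
        self.value = value
        self.static = static
        self.interned = False


# Hypothetical stand-in for the _Py_LATIN1_CHR singletons (code points 0-255).
LATIN1 = [Str(chr(i), static=True) for i in range(256)]


class FixedInternModel:
    def __init__(self, static_strings=()):
        table = {}
        # Intern every static string, including the latin1 characters,
        # once during "startup".
        for s in list(static_strings) + LATIN1:
            s.interned = True
            table.setdefault(s.value, s)
        # Read-only view: the shared table cannot be modified afterwards.
        self.global_table = MappingProxyType(table)
        self.per_interp = {}

    def intern(self, s):
        # Shared, immutable table first.
        hit = self.global_table.get(s.value)
        if hit is not None:
            return hit
        # Everything else goes into the per-interpreter table.
        hit = self.per_interp.setdefault(s.value, s)
        hit.interned = True
        return hit


fixed = FixedInternModel()
a1 = fixed.intern(Str("a"))
a2 = fixed.intern(Str("a"))
ab = fixed.intern(Str("ab"))

# Both lookups of "a" resolve to the single static object.
print(a1 is a2, a1 is LATIN1[ord("a")])
```

Because the shared table is never written after construction in this model, reads need no synchronization, and a non-static 1-character string can never be interned ahead of its static counterpart.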
cc @encukou @ericsnowcurrently
Linked PRs