Enhance URI scheme validation for Windows paths#3161

Open
AlgoDeveloper400 wants to merge 1 commit into apache:main from AlgoDeveloper400:main

@AlgoDeveloper400

fix: handle Windows drive letters in parse_location

Rationale for this change

When a Windows user passes a local file path such as C:\Users\file.avro to PyArrowFileIO,
Python's urlparse interprets the drive letter C as a URL scheme (like s3 or http) and returns it lowercased as 'c'.

Because 'c' is not a known filesystem scheme, PyIceberg fails with:

Unrecognized filesystem type in URI: 'c'
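
The parsing behaviour behind this error can be reproduced with the standard library alone. The snippet below is a minimal sketch, not PyIceberg code; the sample paths and variable names are illustrative assumptions.

```python
from urllib.parse import urlparse

# Illustrative inputs (not taken from the PyIceberg test suite).
windows = urlparse("C:\\Users\\file.avro")  # Windows drive-letter path
remote = urlparse("s3://bucket/file.avro")  # genuine URI with a scheme
posix = urlparse("/tmp/file.avro")          # POSIX path, no scheme

# urlparse treats the text before the first ':' as the scheme and
# lowercases it, so the drive letter "C" comes back as "c".
print(repr(windows.scheme))
print(repr(remote.scheme))
print(repr(posix.scheme))
```

Only the POSIX path yields an empty scheme, which is why the original `if not uri.scheme` check never took the local-file branch for drive-letter paths.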

The Fix

Before ❌ (Original Code):

uri = urlparse(location)

if not uri.scheme:
    default_scheme = properties.get("DEFAULT_SCHEME", "file")
    default_netloc = properties.get("DEFAULT_NETLOC", "")
    return default_scheme, default_netloc, os.path.abspath(location)

After ✅ (Fixed Code):

uri = urlparse(location)

if not uri.scheme or (len(uri.scheme) == 1 and uri.scheme.isalpha()):
    # len == 1 and isalpha() catches Windows drive letters like C:\ D:\
    default_scheme = properties.get("DEFAULT_SCHEME", "file")
    default_netloc = properties.get("DEFAULT_NETLOC", "")
    return default_scheme, default_netloc, os.path.abspath(location)

The only change:

# Before ❌
if not uri.scheme:

# After ✅
if not uri.scheme or (len(uri.scheme) == 1 and uri.scheme.isalpha()):

The added condition checks whether the parsed scheme is a single alphabetic character (e.g. c, d, e).
If so, the location is treated as a Windows drive-letter path and falls back to the default scheme instead of being handled as a URI.
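
To see which inputs the new condition sends down the local-file branch, the check can be isolated in a small helper. This is a sketch under assumptions: `falls_back_to_default_scheme` is a hypothetical name used only for illustration and does not exist in PyIceberg, and the sample locations are made up.

```python
from urllib.parse import urlparse


def falls_back_to_default_scheme(location: str) -> bool:
    """Hypothetical helper mirroring the condition in this PR.

    Returns True when the location has no scheme, or when the "scheme"
    is a single letter and is therefore assumed to be a Windows drive.
    """
    scheme = urlparse(location).scheme
    return not scheme or (len(scheme) == 1 and scheme.isalpha())


# Illustrative locations: drive-letter paths, a POSIX path, and real URIs.
samples = [
    "C:\\Users\\test\\file.avro",
    "D:/data/table",
    "/tmp/file.avro",
    "s3://bucket/key",
    "file:///tmp/file.avro",
]
for location in samples:
    print(location, falls_back_to_default_scheme(location))
```

One consequence of this design is that any location whose scheme is a single letter is treated as a local path, so a genuine one-letter URI scheme would no longer be recognised as remote.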


Example

from pyiceberg.io.pyarrow import PyArrowFileIO

io = PyArrowFileIO()

# Before fix - crashed with: Unrecognized filesystem type in URI: 'c'
# After fix - works correctly
scheme, netloc, path = io.parse_location("C:\\Users\\test\\file.avro")

# Expected output when run on Windows:
print(scheme)  # file
print(netloc)  # (empty string)
print(path)    # C:\Users\test\file.avro

Impact

This fix affects local file operations on Windows that use drive-letter paths, including:

  • Reading local Iceberg tables
  • Writing local Iceberg tables
  • Any local Avro/Parquet file operations

Are these changes tested?

Yes - existing tests now pass on Windows.

tests/test_avro_sanitization.py

python -m pytest tests/test_avro_sanitization.py -v
tests/test_avro_sanitization.py::test_comprehensive_field_name_sanitization  PASSED
tests/test_avro_sanitization.py::test_comprehensive_avro_compatibility        PASSED
tests/test_avro_sanitization.py::test_emoji_field_name_sanitization           PASSED

tests/io/test_pyarrow.py

python -m pytest tests/io/test_pyarrow.py::test_pyarrow_infer_local_fs_from_path -v
tests/io/test_pyarrow.py::test_pyarrow_infer_local_fs_from_path               PASSED

Are there any user-facing changes?

Yes - fixes local file access on Windows for all PyIceberg users.
