Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non unicode path error on Linux when scanning a dir with unicode as input #86

Open
pombredanne opened this issue Aug 16, 2017 · 3 comments

Comments

@pombredanne
Copy link

With Python 2.7.13, on Linux (fs.encoding is UTF-8) I get this error:

$ touch foo$'\261'bar
$ touch plain
$ python2 -c "import scandir;print list(scandir.scandir('.'))"
[<DirEntry 'foo\xb1bar'>, <DirEntry 'plain'>]

$ python2 -c "import scandir;print list(scandir.scandir(u'.'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/venv/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 5: invalid start byte

FWIW, os.listdir is completely out of whackon Python 2 returning a mix of bytes or unicode:

$ python2 -c "import os;print os.listdir('.');print os.listdir(u'.')"
['foo\xb1bar', 'plain']
['foo\xb1bar', u'plain']

While Python 3 uses surrogate escape for decoding to unicode:

$ python3 -c "import os;print(os.listdir('.'))"
['foo\udcb1bar', 'plain']
``

I was hoping that scandir would not have the shortcomings of os.listdir on Python 2....
@benhoyt
Copy link
Owner

benhoyt commented Aug 16, 2017

Thanks @pombredanne. On Python 2, scandir with a Unicode path (scandir(u'.')) should return unicode path names in DirEntry.name and DirEntry.path. Given your traceback here, I'm wondering if the UnicodeDecodeError you've shown is actually due to the printing on the console?

@pombredanne
Copy link
Author

@benhoyt Thank you for having looked into this!

The unicode error happens also when not printing to the console with Python 2:

$ mkdir testdir
$ cd testdir
$ touch foo$'\261'bar
$ python
[...]
>>> import scandir;l=list(scandir.scandir('.'))
>>> l
[<DirEntry 'foo\xb1bar'>]
>>> import scandir;l=list(scandir.scandir(u'.'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 5: invalid start byte
>>> 

@pombredanne
Copy link
Author

In the end the issue is the lack of a surrogateescape handler for Python2.
On Python3, everything uses the os.fsencode/fsdecode at the boundaries using surrogate pairs when needed (such as on Linux)
I am not sure how this is handled on your Python3 scandir code though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants