Handling invalid UTF-8 bytes #38

Open
sunfishcode opened this issue Jan 8, 2020 · 7 comments

@sunfishcode
Contributor

I'm looking at using vte for a use case where I want to translate invalid UTF-8 bytes into Unicode replacement characters. However, vte seems to silently swallow some invalid UTF-8 bytes. For example, if I feed it input consisting of the byte 0x90, it produces no events.

Would it make sense to add Execute rules to the Ground table for 0x90 and other formerly special C1 codes?

Would it make sense to introduce something like an InvalidUtf8 action, to fill in the Ground table in general?

@sunfishcode changed the title from "Handling invalid UTF_8 bytes" to "Handling invalid UTF-8 bytes" on Jan 8, 2020

@chrisduerr
Member

Non-utf8 8-bit C1 escapes should be passed to execute, so you should be able to handle C1 codes if that's your issue?

@sunfishcode
Contributor Author

Here's a more specific testcase:

$ echo -e '\x90' > test.txt
$ target/debug/examples/parselog < test.txt
[execute] 0a
$

The 0x90 byte is silently dropped with no execute or any other action.

@chrisduerr
Member

\x90 is an escape introducer, which is stripped for security based on my understanding of the code.

So escapes like \x85 will emit an execute, but the DCS(x90)/CSI(x9b)/OSC(x9d) 8-bit escapes are ignored.
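
As a rough illustration of the split described here, the following Rust sketch separates the three 8-bit introducers named above from the rest of the C1 range (0x80 to 0x9f). The function names are hypothetical and not part of vte; the sketch only encodes what this comment states and makes no claim about how vte treats C1 bytes that are not mentioned.

```rust
/// True for any byte in the C1 control range, 0x80..=0x9f.
fn is_c1(byte: u8) -> bool {
    (0x80..=0x9f).contains(&byte)
}

/// True for the 8-bit introducers this comment says are ignored:
/// DCS (0x90), CSI (0x9b) and OSC (0x9d).
/// Hypothetical helper for illustration, not a vte function.
fn is_ignored_introducer(byte: u8) -> bool {
    matches!(byte, 0x90 | 0x9b | 0x9d)
}
```

Under this reading, 0x85 (NEL) is a C1 byte that is not one of the ignored introducers, while 0x90 is both a C1 byte and an ignored introducer.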

@sunfishcode
Contributor Author

I don't actually want to interpret C1 controls in my use case; I want to replace all non-UTF-8 bytes with replacement characters.

Right now, vte doesn't support that, either for bytes like 0x90 which are C1 controls, or bytes like 0xfd which are not. Is this a use case vte is interested in supporting?
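
For comparison, the replacement behavior being asked for can be seen outside of vte in Rust's standard library. This is a minimal sketch, assuming plain byte input with no escape-sequence parsing at all, so it is not a substitute for vte; `replace_invalid` is a hypothetical helper name.

```rust
/// Decode `bytes` as UTF-8, substituting U+FFFD for every invalid sequence.
/// Uses `String::from_utf8_lossy` from the standard library; no terminal
/// control handling happens here.
fn replace_invalid(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

With this helper, both a C1 byte such as 0x90 and a non-C1 byte such as 0xfd come out as a replacement character, which is the outcome wanted from a vte-based pipeline as well.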

@chrisduerr
Member

> Is this a use case vte is interested in supporting?

I'm not sure if it's possible to support that without removing existing functionality.

Take things like the NEL non-utf8 8-bit C1 escape \x85. We trigger the execute function for that with this byte attached. So it's a valid escape that we propagate upstream for handling. So it's not actually invalid at all.

You could just handle C1 escapes in your application by printing the missing glyph symbol, would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.

@sunfishcode
Contributor Author

> You could just handle C1 escapes in your application by printing the missing glyph symbol, would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.

Yes, that's what I want to do. It's ok if vte reports these bytes through execute or a new invalid hook or some other hook. I just want to know when these bytes happen so that I know when to emit replacement characters.

Specifically, I want to do this for both C1 codes like 0x90, and non-C1 codes like 0xfd. I can cope if these two cases are reported differently, and it's even ok if the API doesn't tell me what the actual bytes are, as long as it provides indications that such bytes were processed.
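
To make the request concrete, here is a minimal sketch of what an "invalid" notification could look like. Everything in it is hypothetical: the `Sink` trait, its `invalid` method, and the `feed` driver are illustration only and are not vte's API. The hook deliberately carries no byte value, matching the point above that an indication alone would be enough.

```rust
/// Hypothetical callback interface (not vte's `Perform` trait).
trait Sink {
    fn print(&mut self, c: char);
    /// Called once per invalid sequence; the offending bytes are not passed.
    fn invalid(&mut self);
}

/// Walk `input`, printing valid characters and signalling invalid sequences.
fn feed<S: Sink>(mut input: &[u8], sink: &mut S) {
    loop {
        match std::str::from_utf8(input) {
            Ok(s) => {
                for c in s.chars() {
                    sink.print(c);
                }
                return;
            }
            Err(e) => {
                let (valid, rest) = input.split_at(e.valid_up_to());
                // `valid` is guaranteed to be well-formed UTF-8 by `valid_up_to`.
                for c in std::str::from_utf8(valid).unwrap().chars() {
                    sink.print(c);
                }
                sink.invalid();
                match e.error_len() {
                    // Skip the invalid sequence and keep going.
                    Some(n) => input = &rest[n..],
                    // Truncated sequence at the end of the input.
                    None => return,
                }
            }
        }
    }
}

/// Example consumer that emits U+FFFD for each invalid notification.
struct Replacer {
    out: String,
}

impl Sink for Replacer {
    fn print(&mut self, c: char) {
        self.out.push(c);
    }
    fn invalid(&mut self) {
        self.out.push('\u{FFFD}');
    }
}
```

Feeding the bytes `a`, 0x90, `b`, 0xfd, `c` through `feed` with a `Replacer` yields two replacement characters between the letters, one for the C1 byte and one for the non-C1 byte.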

@chrisduerr
Member

For actually invalid UTF-8, we already print error glyphs (see echo -e "\xc2\xc2"). So as far as I can tell we'd probably just need to make sure that bytes that are ignored right now are somehow propagated (like C1 DCS/CSI/OSC).

For these specific bytes it would be possible to propagate them to the execute function without actually handling them, though I'm not sure about other things like 0xfd, I'd have to look into that myself.
