Opened 3 years ago

Last modified 2 years ago

#63007 new update

par @1.52_2: update to 1.53 to fix crashes with utf-8

Reported by: grr Owned by:
Priority: Normal Milestone:
Component: ports Version:
Keywords: Cc: qbarnes@…
Port: par

Description

the current port version, 1.52_2, always crashes when used with utf-8 data, but version 1.53, which was released several months ago, fixes it.

https://bitbucket.org/amc-nicemice/par/src/master/releasenotes

Attachments (1)

patch-i18n.diff.gz (9.6 KB) - added by casr (Chris Rawnsley) 2 years ago.
patch-i18n-1.53.diff

Download all attachments as: .zip

Change History (14)

comment:1 Changed 3 years ago by qbarnes (Quentin Barnes)

Can you tell me how to reproduce the crash you're seeing? I tried export LANG=en_US.UTF-8 and running par, but couldn't make it crash.

I looked into the 1.53.0 release and what it would take to update the Macports version. I've got the Portfile updated. Most of the existing patches have been integrated, so they can be deprecated, but not every patch has been integrated. The monster one for i18n multibyte support isn't there, and its huge and complex (1400+ lines).

I wrote the author about the i18n patch. (He didn't write it originally.) I haven't heard back yet. It's been over a year since 1.53.0 was released. I would think if he was planning to update the patch he would have done it by now. I could try to update the patch myself, but I'm not sure if I want to put in that effort (yet).

I could see about integrating my porting changes nowish to get 1.53.0 out, but dropping the +i18n variant, or waiting until when or if that patch gets updated. It would help a lot if I knew how much that patch was used by the community.

comment:2 Changed 3 years ago by grr

Just realized that the UTF-8 data I was using might be invalid, so the issue probably is that it crashes on invalid encoded data. The example I was using:

lynx -dump https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html | par

This results in:

par(69110,0x1026bbd40) malloc: Incorrect checksum for freed object 0x159705e78: probably modified after being freed.
Corrupt value: 0xe2200000e34
par(69110,0x1026bbd40) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

and sometimes: Segmentation fault: 11

This also fails:

lynx -dump --assume_charset=utf-8 --display_charset=utf-8 https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html | par

But this works:

lynx -dump --assume_charset=utf-8 https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html | par

Version 1.53 of par doesn't fail on any of these cases.

comment:3 Changed 3 years ago by grr

Just tried reinstalling @1.52_2 with -i18n and that does fix the problem.

comment:4 in reply to:  3 ; Changed 3 years ago by qbarnes (Quentin Barnes)

Replying to grr:

Just tried reinstalling @1.52_2 with -i18n and that does fix the problem.

That variant is default off. You'd need to build with +i18n to see if it changed anything.

comment:5 Changed 3 years ago by qbarnes (Quentin Barnes)

I wrote the author about the i18n patch. [...]

I heard back from the author. He verified that all the previous non-i18n patches had been folded in (as I had thought).

He also let me know this about the i18n patch:

I don't know of anyone working on updating the i18n patch. I have started working on a shim layer that will let Par compile under either C89 without multibyte charset support or under C95 with multibyte charset support built on wchar.h and wctype.h, but I have little spare time and cannot predict how long that will take. If anyone were to try updating the i18n patch, they'd probably beat me.

I don't use the +i18n variant. Sounds like grr doesn't either. Given that 1.52_2 has a reported problem against it (this ticket), maybe we should just proceed with 1.53.0 without its i18n patch, and then let them file a ticket asking to update the i18n patch for 1.53.0 if they use it. Comments?

comment:6 in reply to:  4 Changed 3 years ago by grr

Replying to qbarnes:

Replying to grr:

Just tried reinstalling @1.52_2 with -i18n and that does fix the problem.

That variant is default off. You'd need to build with +i18n to see if it changed anything.

that variant is default on:

from the Portfile: default_variants +i18n

comment:7 Changed 3 years ago by qbarnes (Quentin Barnes)

that variant is default on:

You're absolutely correct! When skimming the file my eyes just skipped over the default_variants line right into the variants line. Also, I didn't do the change this the Portfile that added that line making i18n default (ticket #48551). I only approved the change.

The variant being on complicates matters. I wouldn't want to lose default functionality with an update. I could look into the crash or I could look into porting the i18n patch. Not sure which is best at the moment for the short-term.

Grr, did you try reproducing your crash yet with the i18n variant as off?

comment:8 Changed 3 years ago by qbarnes (Quentin Barnes)

Grr, did you try reproducing your crash yet with the i18n variant as off?

Nevermind, I just found your previous comment:

Just tried reinstalling @1.52_2 with -i18n and that does fix the problem.

Is that a short-term acceptable work-around for you?

comment:9 Changed 3 years ago by qbarnes (Quentin Barnes)

Given that this crash doesn't exist in either 1.52 (when built without its i18n patch) or in 1.53.0 (which doesn't have the i18n patch), I'd bet that 1.53.0 would have the crash problem return when an i18n patch is ported to it. So I don't consider it a good plan to just wait for the patch to become available.

In the mean time, I tore into 1.52+i18n to see if I could debug it. I've made good progress so far. I've gotten the crash to reproduce under Address Sanitizer (ASAN) where a call to memcpy() overwrites memory due to a bad pointer. I've also got the problem down to where I can reproduce it with just a one-line file containing only 3 (wide) characters.

comment:10 Changed 2 years ago by casr (Chris Rawnsley)

Would you be able to share your small repro?

I've updated the patch for wide char support that applies cleanly against 1.53. It seems to be fine against the text from that webpage where the same patch on 1.52 crashes. However, the new test suite that 1.53 introduces now has some failures so it isn't perfect. Perhaps we can use it as a starting point?

Changed 2 years ago by casr (Chris Rawnsley)

Attachment: patch-i18n.diff.gz added

patch-i18n-1.53.diff

comment:11 Changed 2 years ago by casr (Chris Rawnsley)

By the way, I originally added the i18n patch because many characters would be replaced with ? when they were reformatted. 1.53 does not seem as susceptible to this although with the patch is still better on the example given above.

comment:12 Changed 2 years ago by casr (Chris Rawnsley)

Okay, I jumped the gun on this a little. With that patch applies it *sometimes* crashes when trying to reformat the text. That is fun...

Here is the original text https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt so we can have a more consistent base to test on (won't need to be translated by Lynx or whatever).

comment:13 in reply to:  10 Changed 2 years ago by qbarnes (Quentin Barnes)

Replying to casr:

Would you be able to share your small repro?

I've updated the patch for wide char support that applies cleanly against 1.53. It seems to be fine against the text from that webpage where the same patch on 1.52 crashes. However, the new test suite that 1.53 introduces now has some failures so it isn't perfect. Perhaps we can use it as a starting point?

I've had trouble locating my work on 1.53.0. It wasn't that far along. You've probably gotten further than I had.

I did find all of my 1.52 work with test cases I was using. I didn't have the work in a repo, so I converted it to one and uploaded it to https://github.com/qbarnes/Par. I got close, but still had crashes with Thai with characters of zero width and Japanese with characters of two width.

Note the branches in my Par repo. The one you want to use to reproduce my crashes and debugging output when using ASAN is wcwidth0-debugging.

I put my testing results in https://gist.github.com/qbarnes/4445673340b52ec4c8698c295e4f0085. Files lynxa.out, lynx7.out, and japanese.out are the input files to par. The files asan-test11.out and japanese-asan-test1.out show the buffer corruption crashes that ASAN detected. The latter two files also have the shell commands embedded starting with $ in them I was using for reproducing the output. If you don't get functionally equivalent output I captured, please let me know.

The 3 input files above are samples I took from https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt. I just reduced them from their source.

One thing I didn't remember was if there was a reason why a small patch from my earlier work got dropped from the work under branch wcwidth0-debugging. The original patch was under branch wcwidth0:

--- a/par.c
+++ b/par.c
@@ -422,9 +422,10 @@ static wchar_t **readlines(
         }
         continue;
       }
-      // Exclude non-breaking space from the class of space chars
-      if (isspace(c) && isascii(c)) ch = ' ';
-      else blank = 0;
+      if (isspace(c)) 
+        ch = ' ';
+      else 
+        blank = 0;
       additem(cbuf, &ch, errmsg);
       if (*errmsg)
         goto rlcleanup;

Simple regression or a problem I later found with it?

I'll send you the emails I exchanged with Adam this summer if you send me your email address. (Mine is in the cc: for this ticket.) They go into detail with explaining my changes to him and his thoughts on them.

Note: See TracTickets for help on using tickets.