
#4726 closed defect (fixed)

RSB decode exception stops build

Reported by: Chris Johns Owned by: Chris Johns
Priority: normal Milestone: 6.1
Component: admin Version: 6
Severity: normal Keywords:
Cc: Blocked By:
Blocking:

Description (last modified by Chris Johns)

Building in a Rocky VM on FB 12 with 5 I got:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/work/chris/rtems/rsb/rtems-source-builder.git/source-builder/sb/execute.py", line 204, in _readthread
    data = data.decode(sys.stdout.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 4095: unexpected end of data

If the data is corrupted or broken, the build stops. The fix is to attempt to keep going.

This issue also affects rtems-tools.

Change History (12)

comment:1 Changed on 09/29/22 at 09:18:56 by Frank Kuehndel

I have seen similar errors in special build environments. These two are from March 2022, RSB for RTEMS 6 (note the slight difference in the error message in the last line):

  GIT BUILD HEAD: 4cdec141b1320a1e5a04b898e13e04c43ec233c3 ubuntu-nios2
  DevTools: Build #124 ubuntu-nios2 (Mar 9, 2022, 12:38:59
  building: nios2-rtems6-gcc-0f001dd-newlib-85f2dca-x86_64-linux-gnu-1
  21:08:10  Exception in thread _stderr[]:
  21:08:10  Traceback (most recent call last):
  21:08:10    File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  21:08:10      self.run()
  21:08:10    File "/usr/lib/python3.8/threading.py", line 870, in run
  21:08:10      self._target(*self._args, **self._kwargs)
  21:08:10    File "/home/minna/src/rtems-source-builder/source-builder/sb/execute.py", line 204, in _readthread
  21:08:10      data = data.decode(sys.stdout.encoding)
  21:08:10  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 4095: unexpected end of data
  GIT_HEAD="6fe98f91d94bbf965bc0e78015585ff8823d17bd
  BUILD_ACTUAL_RSB_OPTIONS="--with_ada --with_cxx --with_objc --jobs=12"
  building: v850-rtems6-gcc-0f001dd-newlib-d88cbd0-x86_64-linux-gnu-1
  18:55:36  Exception in thread _stderr[]:
  18:55:36  Traceback (most recent call last):
  18:55:36    File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  18:55:36      self.run()
  18:55:36    File "/usr/lib/python3.8/threading.py", line 870, in run
  18:55:36      self._target(*self._args, **self._kwargs)
  18:55:36    File "/home/minna/src/rtems-source-builder/source-builder/sb/execute.py", line 204, in _readthread
  18:55:36      data = data.decode(sys.stdout.encoding)
  18:55:36  UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 4094-4095: unexpected end of data

Setting the locale to any single-byte encoding, like LANG=en_US.iso885915, when invoking the source-builder avoids the error.

There is most likely no error in the input data; the loop decoding it probably has a bug. Unicode code points encoded in UTF-8 can be up to 4 bytes long. The input data is read by source-builder/sb/execute.py in blocks of 4096 bytes. I guess a block boundary cuts through a multi-byte code point, leaving the first byte(s) at the end of the current block and the remaining bytes at the beginning of the next block. Note that the error always points to the end of the block: position 4095 or positions 4094-4095. Moreover, byte 0xe2 cannot be the last byte of a code point (0xe2 is the first byte of a 3-byte sequence).
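The reasoning about lead bytes can be checked with a small Python 3 sketch. The helper names `utf8_sequence_length` and `incomplete_tail` are made up for illustration and are not part of the RSB; the byte ranges follow the UTF-8 encoding rules.

```python
def utf8_sequence_length(lead_byte):
    """Length of the UTF-8 sequence announced by a lead byte.

    Returns 0 for a continuation byte or a byte that can never start
    a sequence.
    """
    if lead_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead_byte <= 0xEF:
        return 3  # 1110xxxx, 0xe2 falls in this range
    if 0xF0 <= lead_byte <= 0xF4:
        return 4  # 11110xxx
    return 0


def incomplete_tail(block):
    """Number of trailing bytes in block that start an unfinished sequence."""
    # A sequence is at most 4 bytes, so an unfinished one has at most
    # 3 bytes at the end of the block.
    for back in range(1, min(3, len(block)) + 1):
        need = utf8_sequence_length(block[-back])
        if need == 0:
            continue  # continuation byte, keep looking for the lead byte
        return back if need > back else 0
    return 0


# A 4096-byte block whose last byte is 0xe2 ends in the middle of a
# 3-byte sequence, which matches the error reported at position 4095.
block = b" " * 4095 + b"\xe2"
print(utf8_sequence_length(0xE2), incomplete_tail(block))
```

A block that ends with the first two bytes of a 4-byte sequence reports a tail of 2, which corresponds to the "position 4094-4095" variant of the error.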

I know that on Linux there exists a GNU Unicode decoding function that keeps state between consecutive calls. It is intended for cases where the complete text cannot be decoded at once. I guess there is such a function for Python too. An alternative may be to read and decode the input in one large piece.

comment:2 Changed on 09/29/22 at 09:26:56 by Chris Johns

Thanks for the input and feedback. I agree with what you are saying.

The input can be hundreds of megabytes, so it is not practical to hold it until the end. This code is a performance hot spot, so what happens needs to be kept as simple as possible.

Would you be able to make a buffer of data I could test a solution with?

comment:3 Changed on 09/29/22 at 09:34:21 by Chris Johns

It looks like using Python codecs with streaming is the way to handle this.

Again, thank you for the insights into how the data is handled.

comment:4 Changed on 09/29/22 at 10:00:33 by Frank Kuehndel

In case this helps, this script writes

1) 4093 spaces (note the field width is 4094)
2) a single "A" character
3) a 4-byte Unicode character
4) a newline

env LANG=en_US.UTF-8 printf '%4094c\U0001F0A1\n' A >file.txt

You can visualize it:

$ wc file.txt 
   1    1 4099 file.txt
$ hexdump -C file.txt
00000000  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
00000ff0  20 20 20 20 20 20 20 20  20 20 20 20 20 41 f0 9f  |             A..|
00001000  82 a1 0a                                          |...|
00001003
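The same bytes can also be built in Python 3 without relying on the shell's printf, as a rough sketch. The names `payload`, `first_block` and `try_decode` are made up for illustration; the 4096-byte block size is the one mentioned in comment 1.

```python
# Same layout as file.txt: 4093 spaces, "A", U+1F0A1 (4 bytes), newline.
payload = b" " * 4093 + b"A" + "\U0001F0A1".encode("utf-8") + b"\n"

# The first 4096-byte block ends after only two of the four bytes of
# U+1F0A1 (f0 9f); the remaining 82 a1 0a land in the next block.
first_block = payload[:4096]


def try_decode(block, encoding="utf-8"):
    """Decode one block on its own and return (text, error)."""
    try:
        return block.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = try_decode(first_block)
print(len(payload), first_block[-2:], error)
```

Decoding the first block on its own raises UnicodeDecodeError for the truncated sequence at the end of the block, while decoding the whole payload in one call succeeds.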
Last edited on 09/29/22 at 10:05:44 by Frank Kuehndel (previous) (diff)

comment:5 Changed on 09/29/22 at 10:22:52 by Chris Johns

Thanks. I have a patch I am testing. It uses the codecs.getincrementaldecoder() factory function to create a decoder that holds state across decodes.

comment:6 Changed on 09/29/22 at 10:23:27 by Chris Johns

Description: modified (diff)

comment:7 Changed on 09/29/22 at 10:46:08 by Chris Johns

I am testing with this code:

import codecs
import sys
bad = True
if bad:
    data = ''
    with open('file.txt', 'rb') as f:
        while True:
            block = f.read(4096)
            data += block.decode(sys.stdout.encoding)
            if len(block) < 4096:
                break
    print(len(data), len(block))
else:
    decoder = codecs.getincrementaldecoder(sys.stdout.encoding)()
    data = ''
    with open('file.txt', 'rb') as f:
        while True:
            block = f.read(4096)
            data += decoder.decode(block)
            if len(block) < 4096:
                break
    print(len(data), len(block))

and I cannot generate a failure with bad = True.

comment:8 Changed on 09/29/22 at 11:05:33 by Frank Kuehndel

When I paste your code in f.py with bad = True I get the error:

$ python f.py
Traceback (most recent call last):
  File "f.py", line 9, in <module>
    data += block.decode(sys.stdout.encoding)
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4094-4095: unexpected end of data

with bad = False it works:

$ python f.py
(4096, 3)

I tested on openSUSE 15.4 with file.txt produced by the command from comment 4. locale prints:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=POSIX
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Last edited on 09/29/22 at 11:06:55 by Frank Kuehndel (previous) (diff)

comment:9 Changed on 09/29/22 at 11:12:41 by Chris Johns

Thanks for testing. I think the patch I posted to the devel list should fix this. If you could be so kind as to review the patch, I would appreciate it.

comment:10 Changed on 09/29/22 at 20:56:03 by Chris Johns <chrisj@…>

In [changeset:"d7fb57fa9fae3071793a57d92ff8e0f4adb8b819/rtems-source-builder" d7fb57f/rtems-source-builder]:

sb/execute: Use a decoder that maintains state aross blocks

Update #4726

comment:11 Changed on 09/30/22 at 02:27:15 by Chris Johns

The change has broken dry-run. There is no stdout and so no encoding, and the incremental decoder needs an encoding.
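One possible way to handle a missing stdout encoding is sketched below in Python 3. This is not the code from the changeset: the helper name `make_decoder`, the "utf-8" fallback and the use of errors="replace" are all assumptions made for illustration.

```python
import codecs


def make_decoder(stream, fallback="utf-8", errors="replace"):
    """Create an incremental decoder for stream's encoding.

    Hypothetical helper: if stream is None or has no encoding (as can
    happen when there is no stdout), use the assumed fallback encoding.
    errors="replace" is also an assumption; it turns undecodable bytes
    into U+FFFD so decoding keeps going instead of raising.
    """
    encoding = getattr(stream, "encoding", None) or fallback
    return codecs.getincrementaldecoder(encoding)(errors=errors)


# No stream at all: the decoder is still created and keeps state
# across blocks.
decoder = make_decoder(None)
first = decoder.decode(b"A\xf0\x9f")   # incomplete tail is buffered
second = decoder.decode(b"\x82\xa1\n")  # sequence completes here
print(repr(first), repr(second))
```

Passing sys.stdout as the stream picks up its encoding when one is set, so the fallback only matters when there is none.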

comment:12 Changed on 09/30/22 at 21:07:04 by Chris Johns <chrisj@…>

Resolution: fixed
Status: assigned → closed

In [changeset:"e04c84191b790b7bddd179bc67337e4205b61f8e/rtems-source-builder" e04c841/rtems-source-builder]:

sb/execute: Fix incremental decoder with --dry-run

Closes #4726
