changeset 805:a43c5dc251dc

doc/User-phone-tools: new sms-pdu-decode backslash escapes
author Mychaela Falconia <falcon@freecalypso.org>
date Thu, 25 Mar 2021 05:10:43 +0000
parents 30fbaa652ea5
children 843850c526b7
files doc/User-phone-tools
diffstat 1 files changed, 36 insertions(+), 4 deletions(-)
--- a/doc/User-phone-tools	Thu Mar 25 03:26:23 2021 +0000
+++ b/doc/User-phone-tools	Thu Mar 25 05:10:43 2021 +0000
@@ -200,8 +200,7 @@
 
 By default, sms-pdu-decode only emits 7-bit ASCII characters in its output; any
 GSM7 or UCS-2 characters which fall outside of this plain ASCII repertoire are
-displayed as the '?' error character and the presence of such decoding errors
-is indicated in the Length: header.  This conservative default behaviour can be
+converted into backslash escapes.  This conservative default behaviour can be
 modified as follows:
 
 -e option extends the potential output character repertoire from 7-bit ASCII to
@@ -209,8 +208,41 @@
 i.e., are NOT encoded in UTF-8 - this option is intended for non-UTF-8
 environments.
 
--u option extends the potential output character repertoire to the entire Basic
-Multilingual Plane of Unicode, and changes the output encoding to UTF-8.
+-u option extends the potential output character repertoire to all of Unicode,
+and changes the output encoding to UTF-8.
+
+Regardless of whether the source message character set is GSM7 or UCS-2 and
+irrespective of -e or -u options, any backslash characters are always escaped
+as \\, and any CR characters are represented as \r.  Additional backslash
+escape encodings depend on the source message character set:
+
+* If the source message character set is GSM7, the following additional
+  backslash escapes can be emitted:
+
+  - In the absence of -u option, the Euro currency symbol is converted to \E;
+
+  - Any GSM7 escape characters (0x1B) that aren't part of a valid escape
+    sequence for [\]^ or {|}~ or \E are represented as \e;
+
+  - Any GSM7 characters that either can't be represented in the output character
+    set (ASCII or ISO 8859-1) or are outright invalid per GSM 03.38 are
+    represented as \xX, where xX is the original GSM7 code point in 2-digit
+    hexadecimal form between 00 and 7F;
+
+  - Invalid GSM7 escape sequences are emitted as \e\xX.
+
+* If the source message character set is UCS-2, the following additional
+  backslash escapes can be emitted:
+
+  - Invalid UCS-2 characters falling onto control character code points are
+    emitted as \u00XX;
+
+  - UCS-2 characters that can't be represented in ASCII or ISO 8859-1 (when
+    running without -u option) are emitted as \uXXXX;
+
+  - If UTF-16 surrogate pairs are detected in the input, the encoded high-plane
+    Unicode character is reconstructed and emitted as \UXXXXXX in the absence
+    of -u option, or as the appropriate UTF-8 byte sequence with -u.
 
 -h option causes the user data portion of every message to be displayed as a
 raw hex dump; in the case of GSM7-encoded messages, this hex dump shows the