diff SIM-data-formats @ 38:ec184dad4877

SIM-data-formats article written
author Mychaela Falconia <falcon@freecalypso.org>
date Fri, 12 Feb 2021 08:42:11 +0000
parents
children ce044aa49baf
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/SIM-data-formats	Fri Feb 12 08:42:11 2021 +0000
@@ -0,0 +1,181 @@
+FreeCalypso is developing a family of several different tools that operate on
+SIM cards and user data (primarily phonebooks) stored in them, accessing the
+same underlying data through various mechanisms:
+
+* Our current fc-simtool utility operates on SIM cards inserted into a smart
+  card "reader" device, without going through any kind of phone or other GSM
+  device; this is the most direct way to manipulate SIM user data content.
+
+* We have plans to develop a companion utility (tentatively named fc-simint)
+  that will operate on SIM cards inserted into Calypso phones or FC modem
+  boards, working on the same principle as fc-loadtool (suspending and bypassing
+  the Calypso device's regular operational firmware), but operating on the
+  device's SIM interface rather than its flash.  This companion utility is
+  planned to replicate the end-user-oriented functionality of fc-simtool.
+
+* We have a FreeCalypso User Phone Tools suite that communicates with FC modem
+  boards and the future FC phone handset via AT commands.  We have plans to add
+  phonebook manipulation commands to this suite (based on AT+CPBR and AT+CPBW),
+  reading and writing phonebook data files in the same format as fc-simtool.
+
+Because we have several different tools (some already written, others only
+planned) that will need to read and write exactly the same data formats, and
+because these tools will have to live in different source repositories (totally
+different underlying hardware and system library requirements), the data format
+specification needs to be global and independent of any particular tool.  The
+present document is that specification.
+
+GSM 03.38 / 23.038 string representation
+========================================
+
+The world of GSM does not use ASCII - in all places where ASCII strings would
+appear in the world of ordinary computing, GSM uses its own different 7-bit
+character set instead, defined in GSM TS 03.38 or 3GPP TS 23.038.  Many SIM card
+data files (including phonebooks) contain so-called alpha fields in which GSM
+03.38 (not ASCII!) characters are packed into 8-bit bytes, with the high bit
+zeroed.  (These alpha fields also allow alternative UCS-2 encodings,
+distinguished by the high bit being set - but we handle this case separately.)
+Some other SIM card data files (EF_PNN for example) contain GSM 03.38 7-bit text
+strings packed into bytes like in SMS.
+
+However, when we store text strings (such as phonebook contact names) that have
+been read out of a SIM (or are intended to be written to a SIM) in UNIX text
+files, or pass them around in command line arguments, we need an ASCII-based
+representation of these text strings that are encoded in GSM7 in the actual
+GSM/SIM world.  Furthermore, our ASCII representation needs to be 100% lossless
+and well-defined.
+
+Our function for lossless conversion of GSM 03.38 strings to ASCII operates as
+follows:
+
+* The output is always enclosed in double-quote characters, as in "text string".
+
+* All GSM7 code points that map to characters that are also present in ASCII
+  translate to these ASCII characters: for example, GSM7 code 0x00 becomes '@',
+  and GSM7 code 0x02 becomes '$'.
+
+* Any double-quote characters in the data are escaped with a backslash,
+  becoming \"
+
+* GSM7 escape sequences for ASCII characters [\]^ and {|}~ are recognized and
+  converted to these ASCII characters; \ is then escaped in the output as \\
+
+* GSM7 code points corresponding to CR and LF are represented as \r and \n
+
+* GSM7 escape characters that are not part of a valid sequence for [\]^ or {|}~
+  are represented as \e
+
+* All other GSM7 characters that cannot be represented in ASCII in any other
+  way are represented as \XX escapes, where XX is a two-digit hexadecimal number
+  in the range between 00 and 7F, inclusive.
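The output rules listed above can be sketched in Python as follows.  This is
only an illustration, not the FreeCalypso implementation: the function and
table names are hypothetical, the input is assumed to be already unpacked to
one GSM7 code point per byte (high bit clear), and the use of uppercase hex
digits in the two-digit escapes is an assumption.

```python
# GSM7 code points whose character also exists in ASCII (GSM 03.38 default
# alphabet).  '@', '$' and '_' sit at different code points than in ASCII.
GSM7_TO_ASCII = {0x00: "@", 0x02: "$", 0x11: "_"}
for code in range(0x20, 0x7B):
    ch = chr(code)
    # 0x20-0x3F match ASCII except 0x24 (currency sign); letters match too.
    if ch.isalnum() or (code <= 0x3F and code != 0x24):
        GSM7_TO_ASCII[code] = ch

# Second byte of the GSM7 escape sequences (0x1B + code) for [\]^ and {|}~
GSM7_EXT_TO_ASCII = {0x14: "^", 0x28: "{", 0x29: "}", 0x2F: "\\",
                     0x3C: "[", 0x3D: "~", 0x3E: "]", 0x40: "|"}


def gsm7_to_quoted(data: bytes) -> str:
    """Render unpacked GSM7 code points as a quoted printable-ASCII string."""
    out = []
    i = 0
    while i < len(data):
        code = data[i]
        i += 1
        if code > 0x7F:
            raise ValueError("not a GSM7 code point: 0x%02X" % code)
        if code == 0x1B:
            nxt = data[i] if i < len(data) else None
            if nxt in GSM7_EXT_TO_ASCII:
                ch = GSM7_EXT_TO_ASCII[nxt]
                i += 1
                # A decoded backslash is itself escaped in the output.
                out.append("\\\\" if ch == "\\" else ch)
            else:
                out.append("\\e")      # escape not part of a known sequence
        elif code == 0x0A:
            out.append("\\n")
        elif code == 0x0D:
            out.append("\\r")
        elif code == 0x22:
            out.append('\\"')
        elif code in GSM7_TO_ASCII:
            out.append(GSM7_TO_ASCII[code])
        else:
            out.append("\\%02X" % code)  # two hex digits (case is assumed)
    return '"' + "".join(out) + '"'
```

Under these assumptions, the code points 4D 69 63 68 04 6C 65 come out as
"Mich\04le", with only the 0x04 code point needing a hex escape.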
+
+The result of these rules is as follows:
+
+* If the text item consists entirely of characters that exist in ASCII (the most
+  common use case), it will appear naturally in ASCII, even if it contains
+  characters like '@' and '$' that have different code points in GSM7, or
+  characters in the [\]^ and {|}~ sets that require escaping in GSM7.
+
+* Any text item containing weird characters will still be converted losslessly,
+  so it can be written back into the SIM or decoded manually by a GSM7-knowing
+  user, and the representation in data files and command output is always
+  printable ASCII, nothing else.
+
+* In cases where an occasional weird character appears in an otherwise ASCII-
+  dominated string, it is easy to both mentally decode and manually enter such
+  characters when necessary.  For example, if one of your SIM contacts is a lady
+  named Michele who spells her name in the French way, with an accent grave on
+  the first 'e' (non-ASCII character U+00E8), her name shall be entered as
+  "Mich\04le", nicely preserving the needed non-ASCII character whose GSM 03.38
+  code point is 0x04.
+
+When a string argument that is destined for conversion to GSM7 is parsed, our
+input parser always interprets any backslash (\) characters as escapes; it
+understands all of the same escape sequences that we emit in output:
+
+\"	literal "
+\\	literal \ (encoded in GSM 03.38 as another form of escape)
+\e	GSM 03.38 escape character 0x1B
+\n	GSM 03.38 LF character 0x0A
+\r	GSM 03.38 CR character 0x0D
+\XX	GSM 03.38 code point XX (two hex digits), passed through literally
+
+If the input contains ASCII characters which do not exist in GSM7 (` and all
+control characters except \n and \r), it is an error.
+
+If our ASCII-to-GSM7 conversion functions are given 8-bit input, such input is
+interpreted as ISO 8859-1: any 8859-1 high characters that have GSM7
+counterparts will be translated accordingly.  (Non-GSM7-mappable high characters
+are an error just like non-GSM7-mappable ASCII chars.)  However, our output is
+always 7-bit ASCII only, using \XX escapes for GSM 03.38 characters that fall
+outside of ASCII.
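The input direction can be sketched along the same lines.  Again this is a
hypothetical illustration rather than the actual FreeCalypso parser: the names
are made up, the argument is the string body without its enclosing quotes, and
8-bit input is assumed to have been decoded as ISO 8859-1 by the caller, so
every character has a code below 0x100.

```python
import string

# ASCII characters present in the GSM7 default alphabet -> GSM7 bytes.
CHAR_TO_GSM7 = {"@": b"\x00", "$": b"\x02", "_": b"\x11"}
for code in range(0x20, 0x7B):
    ch = chr(code)
    if ch.isalnum() or (code <= 0x3F and code != 0x24):
        CHAR_TO_GSM7[ch] = bytes([code])
# ASCII characters that need a GSM7 escape sequence (0x1B prefix).
for ch, code in {"^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
                 "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40}.items():
    CHAR_TO_GSM7[ch] = bytes([0x1B, code])
# ISO 8859-1 high characters that have GSM7 counterparts.
LATIN1_TO_GSM7 = {
    0xA3: 0x01, 0xA5: 0x03, 0xE8: 0x04, 0xE9: 0x05, 0xF9: 0x06, 0xEC: 0x07,
    0xF2: 0x08, 0xC7: 0x09, 0xD8: 0x0B, 0xF8: 0x0C, 0xC5: 0x0E, 0xE5: 0x0F,
    0xC6: 0x1C, 0xE6: 0x1D, 0xDF: 0x1E, 0xC9: 0x1F, 0xA4: 0x24, 0xA1: 0x40,
    0xC4: 0x5B, 0xD6: 0x5C, 0xD1: 0x5D, 0xDC: 0x5E, 0xA7: 0x5F, 0xBF: 0x60,
    0xE4: 0x7B, 0xF6: 0x7C, 0xF1: 0x7D, 0xFC: 0x7E, 0xE0: 0x7F,
}
for latin1, gsm in LATIN1_TO_GSM7.items():
    CHAR_TO_GSM7[chr(latin1)] = bytes([gsm])

SIMPLE_ESCAPES = {'"': b"\x22", "\\": b"\x1b\x2f", "e": b"\x1b",
                  "n": b"\x0a", "r": b"\x0d"}


def ascii_to_gsm7(text: str) -> bytes:
    """Convert an escaped ASCII / ISO 8859-1 string body to GSM7 code points."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch == "\\":
            if i < len(text) and text[i] in SIMPLE_ESCAPES:
                out += SIMPLE_ESCAPES[text[i]]
                i += 1
            elif len(text) - i >= 2 and all(
                    c in string.hexdigits for c in text[i:i + 2]):
                code = int(text[i:i + 2], 16)
                if code > 0x7F:
                    raise ValueError("hex escape outside 00-7F")
                out.append(code)       # passed through literally
                i += 2
            else:
                raise ValueError("invalid backslash escape")
        elif ch == "\n":
            out.append(0x0A)
        elif ch == "\r":
            out.append(0x0D)
        elif ch in CHAR_TO_GSM7:
            out += CHAR_TO_GSM7[ch]
        else:
            raise ValueError("character %r does not exist in GSM7" % ch)
    return bytes(out)
```

With this sketch, the escaped form Mich\04le and the same name typed with a
literal ISO 8859-1 byte 0xE8 both yield identical GSM7 bytes, while a backtick
or a tab character is rejected.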
+
+Phonebook file format
+=====================
+
+The fc-simtool pb-dump command displays SIM phonebook content on the terminal
+or saves it in a file in the format defined here, and other tools such as the
+fc-simtool pb-update command need to be able to read back the same format
+losslessly.  The phonebook file format is shown here by way of example:
+
+#1: #646#,0x81 "Check Minutes"
+#2: #674#,0x81 "Check Text Usage"
+#3: #225#,0x81 "Check Balance"
+#4: 8675309,0x81 "Jenny"
+#5: 88211016401,0x91 "sysmoUSIM-SJS1 MSISDN"
+#6: 44444,0x81 HEX 810B0893BEC03ABEBC209A9FA1A1
+#7: *123#,0x81 ""
+#8: 5551234,0x81 "HEX magic spells by Mich\04le"
+
+The rules are as follows:
+
+* Each line in the file format represents one phonebook record.
+
+* The decimal number between the initial '#' and the following ':' is the
+  record number in the phonebook, between 1 and 255 as in the SIM protocol
+  READ RECORD and UPDATE RECORD commands.
+
+* The phone number is always given without quotes, and consists only of digits
+  and '*' and '#' characters - no '+' international symbol is allowed in this
+  file format.
+
+* The TON/NPI byte is required, is always given in hex as 0xXX (no other form
+  allowed in this file format), and is separated from the phone number digit
+  string by a comma.  Note how this byte usually equals 0x91 for international
+  numbers (those entered with a '+' in typical UIs) or 0x81 otherwise.
+
+* Either a quoted-string or a hex-string is always present at the end of each
+  record, giving the alpha tag for the phonebook entry.  This field is
+  mandatory in the file format; if there is no alpha tag (really meaning empty
+  alpha tag), the line ends with an empty quoted-string "".
+
+* Quoted-strings for the alpha tag are used for either empty/null or
+  GSM7-encoded alpha tags; hex-strings are used for UCS2-encoded alpha tags.
+
+* The format of hex-string alpha tags is as shown in entry #6 in the example
+  above - this example gives a contact name in Russian.  (Full decoding of this
+  contact name is left as an exercise for adventurous readers - see
+  ETSI TS 102 221 Annex A and the Cyrillic block of Unicode.)
+
+* Hex-strings can be used for any arbitrary bytes in the alpha tag, but are only
+  needed for UCS-2 encodings.  Every possible GSM7 string can be represented in
+  our quoted-string notation.
+
+* The quoted-string (GSM 03.38) form of the alpha tag must always be quoted,
+  even if quotes seem optional like in the "Jenny" example above (record #4).
+  The absence of quotes is what allows the HEX keyword to be distinguished:
+  compare and contrast records #6 and #8 in the example.
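As a rough illustration of the rules above, a reader for the basic line format
(without the rarely used extra fields) could be sketched in Python like this.
The regular expression, the names, and the choices to require at least one
character in the phone number and to accept any run of whitespace between
fields are assumptions of this sketch, not taken from the FreeCalypso sources.

```python
import re
from typing import NamedTuple


class PbEntry(NamedTuple):
    recno: int        # record number, 1..255
    number: str       # digits, '*' and '#' only
    ton_npi: int      # TON/NPI byte
    alpha_kind: str   # "quoted" (GSM7 or empty) or "hex" (e.g. UCS-2)
    alpha: str        # quoted body (escapes left intact) or hex digits


PB_LINE = re.compile(
    r'#(\d+):\s+'                    # record number
    r'([0-9*#]+),0x([0-9A-Fa-f]{2})\s+'   # phone number, TON/NPI byte
    r'(?:HEX\s+([0-9A-Fa-f]+)'       # unquoted HEX keyword + hex-string
    r'|"((?:[^"\\]|\\.)*)")'         # or a quoted-string with escapes
)


def parse_pb_line(line: str) -> PbEntry:
    m = PB_LINE.fullmatch(line.strip())
    if m is None:
        raise ValueError("malformed phonebook line")
    recno = int(m.group(1))
    if not 1 <= recno <= 255:
        raise ValueError("record number out of range")
    if m.group(4) is not None:
        if len(m.group(4)) % 2:
            raise ValueError("odd number of hex digits")
        kind, alpha = "hex", m.group(4)
    else:
        kind, alpha = "quoted", m.group(5)
    return PbEntry(recno, m.group(2), int(m.group(3), 16), kind, alpha)
```

Because the HEX keyword is only recognized outside quotes, record #8 in the
example parses as a quoted alpha tag that merely starts with the letters HEX,
while record #6 parses as a hex-string.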
+
+The above format applies when the almost-never-used CCP and EXT bytes in the
+phonebook record both equal 0xFF, meaning not used.  In the unlikely case when
+these fields are used, the following extra fields are added to the line-based
+representation:
+
+* If CCP != 0xFF, a "CCP=%u " field is inserted between the phone number and
+  the alpha tag.
+
+* If EXT != 0xFF, an "EXT=%u " field is inserted between the phone number and
+  the alpha tag.
+
+* If both CCP and EXT are present, the CCP= field appears before the EXT= field,
+  same order as in the SIM binary record.
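For the writing direction, the placement of the optional fields can be
sketched as follows.  The function name and signature are hypothetical, the
alpha argument is assumed to be already rendered (either a quoted-string or
the HEX keyword followed by hex digits), and the use of uppercase hex digits
for the TON/NPI byte is an assumption.

```python
import re


def format_pb_line(recno: int, number: str, ton_npi: int, alpha: str,
                   ccp: int = 0xFF, ext: int = 0xFF) -> str:
    """Build one phonebook file line; CCP/EXT appear only when != 0xFF."""
    if not 1 <= recno <= 255:
        raise ValueError("record number out of range")
    if re.fullmatch(r"[0-9*#]+", number) is None:
        raise ValueError("phone number may contain only digits, '*' and '#'")
    fields = ["#%d:" % recno, "%s,0x%02X" % (number, ton_npi)]
    if ccp != 0xFF:
        fields.append("CCP=%d" % ccp)   # decimal, as with %u
    if ext != 0xFF:
        fields.append("EXT=%d" % ext)   # always after CCP= when both exist
    fields.append(alpha)
    return " ".join(fields)
```

Called with the default CCP and EXT values, this reproduces the basic lines
shown in the example listing; the extra fields only appear for records that
actually use them.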