view SIM-data-formats @ 94:7aaed576fa26

SIM-data-formats: fix Russian UCS-2 string example
author Mychaela Falconia <falcon@freecalypso.org>
date Tue, 10 May 2022 06:33:43 +0000
parents 7609ff4be49f

FreeCalypso is developing a family of several different tools that operate on
SIM cards and user data (primarily phonebooks) stored in them, accessing the
same underlying data through various mechanisms:

* fc-simtool in our FC SIM tools suite operates on SIM cards inserted into a
  smart card "reader" device, without going through any kind of phone or other
  GSM device - this is the most direct way to manipulate SIM user data content.

* Our FC host tools suite features a new utility called fc-simint - it is a
  front end to fc-simtool that operates on SIM cards inserted into Calypso
  phones or FC modem boards, working on the same principle as fc-loadtool
  (suspending and bypassing the Calypso device's regular operational firmware),
  but operating on the device's SIM interface rather than its flash.

* We have a FreeCalypso User Phone Tools suite that communicates with FC modem
  boards and the future FC phone handset via AT commands.  We have plans to add
  phonebook manipulation commands to this suite (based on AT+CPBR and AT+CPBW),
  reading and writing phonebook data files in the same format as fc-simtool.

Because we have several different tools (some already written, others only
planned) that will need to read and write exactly the same data formats, and
because these tools will have to live in different source repositories (totally
different underlying hardware and system library requirements), the data format
specification needs to be global and independent of any particular tool.  The
present document is that specification.

GSM 03.38 / 23.038 string representation
========================================

The world of GSM does not use ASCII - in all places where ASCII strings would
appear in the world of ordinary computing, GSM uses its own different 7-bit
character set instead, defined in GSM TS 03.38 or 3GPP TS 23.038.  Many SIM card
data files (including phonebooks) contain so-called alpha fields in which GSM
03.38 (not ASCII!) characters are packed into 8-bit bytes, with the high bit
zeroed.  (These alpha fields also allow alternative UCS-2 encodings,
distinguished by the high bit being set - but we handle this case separately.)
Some other SIM card data files (EF_PNN for example) contain GSM 03.38 7-bit text
strings packed into bytes like in SMS.

However, when we store text strings (such as phonebook contact names) that have
been read out of a SIM (or are intended to be written to a SIM) in UNIX text
files, or pass them around in command line arguments, we need an ASCII-based
representation of these text strings that are encoded in GSM7 in the actual
GSM/SIM world.  Furthermore, our ASCII representation needs to be 100% lossless
and well-defined.

Our function for lossless conversion of GSM 03.38 strings to ASCII operates as
follows:

* The output is always enclosed in double-quote characters, as in "text string".

* All GSM7 code points that map to characters that are also present in ASCII
  translate to these ASCII characters: for example, GSM7 code 0x00 becomes '@',
  and GSM7 code 0x02 becomes '$'.

* Any double-quote characters in the data are escaped with a backslash,
  becoming \"

* GSM7 escape sequences for ASCII characters [\]^ and {|}~ are recognized and
  converted to these ASCII characters; \ is then escaped in the output as \\

* GSM7 escape sequence for the Euro currency symbol is recognized and converted
  to \E

* GSM7 code points corresponding to CR and LF are represented as \r and \n

* GSM7 escape characters that are not part of a valid sequence for [\]^ or {|}~
  (or for \E) are represented as \e

* All other GSM7 characters that cannot be represented in ASCII in any other
  way are represented as \XX escapes, where XX is a two-digit hexadecimal number
  in the range between 00 and 7F, inclusive.
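
The rules above can be sketched in Python as follows.  This is only an
illustration under stated assumptions, not the code used by our tools: the
function and table names are made up for this example, the input is assumed to
be one GSM7 code point per byte (as in SIM alpha fields), and the use of
uppercase hex digits in \XX escapes is an assumption.

```python
# Illustrative sketch of the GSM7 -> quoted ASCII rules listed above.
# All names are hypothetical; nothing here is taken from FreeCalypso sources.

# GSM7 code points whose characters also exist in ASCII.
GSM7_TO_ASCII = {0x00: "@", 0x02: "$", 0x11: "_"}
for _code in range(0x20, 0x7B):
    # Skip positions where GSM7 differs from ASCII: 0x24, 0x40,
    # 0x5B-0x5F and 0x60 hold non-ASCII characters in GSM7.
    if chr(_code) not in "$@[\\]^_`":
        GSM7_TO_ASCII[_code] = chr(_code)

# Second byte of the two-byte GSM7 escape sequences (0x1B prefix).
GSM7_EXT_TO_ASCII = {
    0x14: "^", 0x28: "{", 0x29: "}", 0x2F: "\\",
    0x3C: "[", 0x3D: "~", 0x3E: "]", 0x40: "|",
}
GSM7_EXT_EURO = 0x65


def gsm7_to_quoted(data: bytes) -> str:
    """Convert unpacked GSM7 code points to the quoted ASCII form."""
    out = ['"']
    i = 0
    while i < len(data):
        code = data[i]
        i += 1
        if code > 0x7F:
            raise ValueError("not a GSM7 code point: 0x%02X" % code)
        if code == 0x1B:
            nxt = data[i] if i < len(data) else None
            if nxt in GSM7_EXT_TO_ASCII:
                ch = GSM7_EXT_TO_ASCII[nxt]
                out.append("\\\\" if ch == "\\" else ch)
                i += 1
            elif nxt == GSM7_EXT_EURO:
                out.append("\\E")
                i += 1
            else:
                # Escape character outside of any valid sequence.
                out.append("\\e")
        elif code == 0x0A:
            out.append("\\n")
        elif code == 0x0D:
            out.append("\\r")
        elif code == 0x22:
            out.append('\\"')
        elif code in GSM7_TO_ASCII:
            out.append(GSM7_TO_ASCII[code])
        else:
            # Assumption: uppercase hex digits in the \XX escape.
            out.append("\\%02X" % code)
    out.append('"')
    return "".join(out)


print(gsm7_to_quoted(b"Mich\x04le"))    # prints "Mich\04le"
```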

The result of these rules is as follows:

* If the text item consists entirely of characters that exist in ASCII (the most
  common use case), it will appear naturally in ASCII, even if it contains
  characters like '@' and '$' that have different code points in GSM7, or
  characters in the [\]^ and {|}~ sets that require escaping in GSM7.

* Any text item containing weird characters will still be converted losslessly,
  so it can be written back into the SIM or decoded manually by a GSM7-knowing
  user, and the representation in data files and command output is always
  printable ASCII, nothing else.

* In cases where an occasional weird character appears in an otherwise ASCII-
  dominated string, it is easy to both mentally decode and manually enter such
  characters when necessary.  For example, if one of your SIM contacts is a lady
  named Michele who spells her name in the French way, with an accent grave on
  the first 'e' (non-ASCII character U+00E8), her name shall be entered as
  "Mich\04le", nicely preserving the needed non-ASCII character whose GSM 03.38
  code point is 0x04.

When a string argument that is destined for conversion to GSM7 is parsed, our
input parser always interprets any backslash (\) characters as escapes; it
understands all of the same escape sequences which we emit in output:

\"	literal "
\\	literal \ (encoded in GSM 03.38 as another form of escape)
\E	Euro currency symbol (ditto)
\e	GSM 03.38 escape character 0x1B
\n	GSM 03.38 LF character 0x0A
\r	GSM 03.38 CR character 0x0D
\XX	GSM 03.38 code point XX (two hex digits), passed through literally

If the input contains ASCII characters which do not exist in GSM7 (` and all
control characters except \n and \r), it is an error.
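
The input direction can be sketched in the same spirit.  Again this is only an
illustration with made-up names, not our actual parser: it takes the string
body without the enclosing double quotes, it accepts raw LF and CR characters
as well as their \n and \r escapes (an assumption), and it does not implement
the ISO 8859-1 handling of 8-bit input.

```python
# Illustrative sketch of parsing the ASCII representation into GSM7.
# All names are hypothetical; 8-bit (ISO 8859-1) input is not handled.

ASCII_TO_GSM7 = {"@": 0x00, "$": 0x02, "_": 0x11, "\n": 0x0A, "\r": 0x0D}
for _code in range(0x20, 0x7B):
    # Same-position characters only; see the GSM7 default alphabet table.
    if chr(_code) not in "$@[\\]^_`":
        ASCII_TO_GSM7[chr(_code)] = _code

# ASCII characters that need a two-byte GSM7 escape sequence (0x1B prefix).
ASCII_TO_GSM7_EXT = {
    "^": 0x14, "{": 0x28, "}": 0x29,
    "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40,
}

# Backslash escapes other than the two-hex-digit form.
SIMPLE_ESCAPES = {
    '"': b"\x22",
    "\\": b"\x1b\x2f",
    "E": b"\x1b\x65",
    "e": b"\x1b",
    "n": b"\x0a",
    "r": b"\x0d",
}

HEX_DIGITS = "0123456789abcdefABCDEF"


def ascii_to_gsm7(text: str) -> bytes:
    """Convert a quoted-string body to GSM7 code points, one per byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch == "\\":
            esc = text[i:i + 1]
            if esc in SIMPLE_ESCAPES:
                out += SIMPLE_ESCAPES[esc]
                i += 1
                continue
            digits = text[i:i + 2]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("invalid backslash escape")
            value = int(digits, 16)
            if value > 0x7F:
                raise ValueError("escape outside of the 00-7F range")
            out.append(value)
            i += 2
        elif ch in ASCII_TO_GSM7:
            out.append(ASCII_TO_GSM7[ch])
        elif ch in ASCII_TO_GSM7_EXT:
            out += bytes([0x1B, ASCII_TO_GSM7_EXT[ch]])
        else:
            raise ValueError("character %r does not exist in GSM7" % ch)
    return bytes(out)


print(ascii_to_gsm7("Mich\\04le").hex())    # prints 4d696368046c65
```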

If our ASCII-to-GSM7 conversion functions are given 8-bit input, such input is
interpreted as ISO 8859-1: any 8859-1 high characters that have GSM7
counterparts will be translated accordingly.  (Non-GSM7-mappable high characters
are an error just like non-GSM7-mappable ASCII chars.)  However, our output is
always 7-bit ASCII only, using \XX escapes for GSM 03.38 characters that fall
outside of ASCII.

Phonebook file format
=====================

The fc-simtool pb-dump command displays SIM phonebook content on the terminal
or saves it in a file in the format defined here, and other tools such as the
fc-simtool pb-restore and pb-update commands need to be able to read back the
same format losslessly.  The phonebook file format is shown here by way of
example:

#1: #646#,0x81 "Check Minutes"
#2: #674#,0x81 "Check Text Usage"
#3: #225#,0x81 "Check Balance"
#4: 8675309,0x81 "Jenny"
#5: 88211016401,0x91 "sysmoUSIM-SJS1 MSISDN"
#6: 44444,0x81 HEX 810B0893BEC0BABEBC209A9FA1A1
#7: *123#,0x81 ""
#8: 5551234,0x81 "HEX magic spells by Mich\04le"

The rules are as follows:

* Each line in the file format represents one phonebook record.

* The decimal number between the initial '#' and the following ':' is the
  record number in the phonebook, between 1 and 255 as in the SIM protocol
  READ RECORD and UPDATE RECORD commands.

* The phone number is always given without quotes, and consists only of digits
  and '*' and '#' characters - no '+' international symbol is allowed in this
  file format.

* The TON/NPI byte is required, is always given in hex as 0xXX (no other form
  allowed in this file format), and is separated from the phone number digit
  string by a comma.  Note how this byte usually equals 0x91 for international
  numbers (those entered with a '+' in typical UIs) or 0x81 otherwise.

* Either a quoted-string or a hex-string is always present at the end of each
  record, giving the alpha tag for the phonebook entry.  This field is
  mandatory in the file format; if there is no alpha tag (really meaning an
  empty alpha tag), the line ends with an empty quoted-string "".

* Quoted-strings for the alpha tag are used for either empty/null or
  GSM7-encoded alpha tags; hex-strings are used for UCS2-encoded alpha tags.

* The format of hex-string alpha tags is as shown in entry #6 in the example
  above - this example gives a contact name in Russian.  (Full decoding of this
  contact name is left as an exercise for adventurous readers - see
  ETSI TS 102 221 Annex A and the Cyrillic block of Unicode.)

* Hex-strings can be used for any arbitrary bytes in the alpha tag, but are only
  needed for UCS-2 encodings.  Every possible GSM7 string can be represented in
  our quoted-string notation.

* The quoted-string (GSM 03.38) form of the alpha tag must always be quoted,
  even if quotes seem optional like in the "Jenny" example above (record #4).
  The absence of quotes is what allows the HEX keyword to be distinguished:
  compare and contrast records #6 and #8 in the example.
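
For readers who want to check their manual decoding of a hex-string alpha tag
such as the one in record #6, the following Python sketch decodes the 0x81
coding scheme of ETSI TS 102 221 Annex A.  It is an illustration only: the
names are made up for this example, only the 0x81 scheme is covered (not 0x80
or 0x82), and GSM7 escape sequences inside the field are not interpreted.

```python
# Illustrative decoder for alpha fields in the 0x81 coding scheme.
# All names are hypothetical; the 0x80 and 0x82 schemes are not handled.

# GSM 03.38 default alphabet, indexed by code point (0x1B is the escape).
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./"
    "0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNO"
    "PQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmno"
    "pqrstuvwxyzäöñüà"
)
assert len(GSM7_BASIC) == 128


def decode_alpha_81(field: bytes) -> str:
    """Decode a 0x81-coded alpha field into a Unicode string."""
    if len(field) < 3 or field[0] != 0x81:
        raise ValueError("not a 0x81-coded alpha field")
    count = field[1]            # number of characters that follow
    base = field[2] << 7        # bits 15..8 of the UCS-2 base pointer
    body = field[3:3 + count]
    if len(body) != count:
        raise ValueError("alpha field is shorter than its length byte")
    chars = []
    for octet in body:
        if octet & 0x80:
            # High bit set: low 7 bits are an offset from the base pointer.
            chars.append(chr(base + (octet & 0x7F)))
        else:
            # High bit clear: a GSM7 default alphabet character.
            chars.append(GSM7_BASIC[octet])
    return "".join(chars)


# The hex-string from record #6 of the example can be decoded like this:
name = decode_alpha_81(bytes.fromhex("810B0893BEC0BABEBC209A9FA1A1"))
```

In the third byte of that record, 0x08 shifted left by 7 bits gives a base
pointer of 0x0400, the start of the Cyrillic block of Unicode.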

The above format applies when the almost-never-used CCP and EXT bytes in the
phonebook record both equal 0xFF, meaning not used.  In the unlikely case when
these fields are used, the following extra fields are added to the line-based
representation:

* If CCP != 0xFF, a "CCP=%u " field is inserted between the phone number and
  the alpha tag.

* If EXT != 0xFF, an "EXT=%u " field is inserted between the phone number and
  the alpha tag.

* If both CCP and EXT are present, the CCP= field appears before the EXT= field,
  same order as in the SIM binary record.
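
A reader for this line format could be sketched in Python as shown below.  This
is an illustration under stated assumptions, not the parser used by our tools:
all names are made up for this example, the CCP= and EXT= fields are assumed to
sit after the TON/NPI byte and before the alpha tag, at least one phone number
digit is assumed to be present, and the quoted-string body is returned with its
backslash escapes still in place.

```python
import re
from typing import NamedTuple, Optional


class PbRecord(NamedTuple):
    """One parsed phonebook line (hypothetical structure)."""
    index: int
    number: str
    ton_npi: int
    ccp: Optional[int]
    ext: Optional[int]
    alpha_is_hex: bool
    alpha: str      # hex digits, or quoted-string body with escapes kept


# Assumption: fields are separated by whitespace, and the optional
# CCP=/EXT= fields come after the TON/NPI byte.
PB_LINE_RE = re.compile(
    r'#(\d+):\s+'
    r'([0-9*#]+),0x([0-9A-Fa-f]{2})\s+'
    r'(?:CCP=(\d+)\s+)?'
    r'(?:EXT=(\d+)\s+)?'
    r'(?:HEX\s+((?:[0-9A-Fa-f]{2})+)|"((?:[^"\\]|\\.)*)")\s*'
)


def parse_pb_line(line: str) -> PbRecord:
    """Parse one phonebook line, raising ValueError if it is malformed."""
    m = PB_LINE_RE.fullmatch(line)
    if m is None:
        raise ValueError("malformed phonebook line")
    index = int(m.group(1))
    if not 1 <= index <= 255:
        raise ValueError("record number out of range")
    hex_alpha = m.group(6)
    return PbRecord(
        index=index,
        number=m.group(2),
        ton_npi=int(m.group(3), 16),
        ccp=int(m.group(4)) if m.group(4) is not None else None,
        ext=int(m.group(5)) if m.group(5) is not None else None,
        alpha_is_hex=hex_alpha is not None,
        alpha=hex_alpha if hex_alpha is not None else m.group(7),
    )


rec6 = parse_pb_line('#6: 44444,0x81 HEX 810B0893BEC0BABEBC209A9FA1A1')
rec8 = parse_pb_line('#8: 5551234,0x81 "HEX magic spells by Mich\\04le"')
print(rec6.alpha_is_hex, rec8.alpha_is_hex)     # prints True False
```

Records #6 and #8 of the example show why the quotes matter: only the unquoted
HEX keyword selects the hex-string form.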