diff doc/HR-codec-Rx-logic @ 632:7fc57e2a6784

beginning of GSM-HR documentation
author Mychaela Falconia <falcon@freecalypso.org>
date Thu, 19 Mar 2026 04:13:45 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/HR-codec-Rx-logic	Thu Mar 19 04:13:45 2026 +0000
@@ -0,0 +1,447 @@
+Rx DTX handler logic for GSM-HR speech codec
+============================================
+
+With all 3 classic GSM speech codecs (FR, HR and EFR), as TCH UL Rx traffic on
+the network side passes from the BTS to the TRAU, the first processing step
+performed by the TRAU prior to actual speech decoding is an Rx DTX handler.
+(For TCH DL Rx on the mobile side, exactly the same processing steps happen in
+total, but because everything is integrated into a single device, interfaces
+between steps may be implemented more loosely.)
+
+For GSM-HR codec the 3 controlling specs for different parts of Rx DTX handler
+logic are GSM 06.21, GSM 06.22 and GSM 06.41 - however, for the full details
+these specs defer to the reference C code in GSM 06.06.  This article explains
+this logic from all aspects which we find important: what the Rx DTX logic was
+in the original reference code from ETSI and how we adapted it in libgsmhr1,
+both for the full speech decoder and for our implementation of TFO transform.
+
+Normative vs freely changeable aspects
+======================================
+
+In the case of error-free transmission, such that the receiver never encounters
+a frame with BFI or UFI set except during continuation of a DTX pause (after
+receiving a valid SID that begins comfort noise insertion) and is never asked
+to begin CN insertion with an invalid SID, the full behaviour of the speech
+decoder to the final linear PCM output is required to be bit-exact and gets
+exercised by test sequences.  This bit-exact behaviour includes non-error-
+handling aspects of the Rx DTX handler and comfort noise generation, complete
+with interpolation for periodic CN updates via subsequent SID frames.
+
+However, the reference C implementation becomes a non-normative example
+(allowing changes in logic without violating spec requirements) in the
+following aspects:
+
+* Handling of BFI and UFI outside of DTX pauses previously entered via a valid
+  SID, including most aspects of error concealment;
+
+* Exact manner of comfort noise muting when expected SID updates fail to arrive;
+
+* The exact logic to be applied when a CN insertion period begins with an
+  invalid SID frame.
+
+Almost-modular nature of GSM-HR Rx DTX handler
+==============================================
+
+An Rx DTX handler can be considered fully modular if its output (which is then
+passed as input to the main body of the speech decoder) is a potentially
+modified set of speech parameters that can be packed into a new speech frame
+and transmitted through a second radio leg with no change in the final output
+of the speech decoder.  The Rx DTX handler implemented in the reference code
+from ETSI (both spec-normative and "example" aspects as broken down above)
+_almost_ meets this modularity criterion, but not fully.  The following aspects
+are non-modular:
+
+* The interpolation of R0 and LPC parameters during comfort noise insertion
+  (bit-exact implementation considered normative) happens after expansion of
+  transmitted parameter bits into linear form.  In the general case one cannot
+  produce a new set of encoded parameters (that can be transmitted through a
+  second radio leg) that will produce the same bit-exact result upon final
+  decoding.
+
+* Handling of speech frames (not SID, outside of DTX pause state) that are
+  marked with BFI=0 and UFI=1 (unreliable frames) has both a modular and a
+  non-modular aspect.  If R0 increment is either small enough to not trigger
+  any mitigation or large enough that UFI is converted into BFI, the applied
+  handling is fully modular.  However, if R0 increment falls into the narrow
+  window between the two thresholds, the applied handling (output signal
+  concealment per GSM 06.21 section 5.1.2) is non-modular: it happens deep in
+  the guts of the speech decoder and cannot be represented via a modified set
+  of speech parameters.
+
+TFO transform derived from the reference Rx DTX handler
+=======================================================
+
+If one extracts the reference Rx DTX handler from GSM 06.06 code and removes
+the two non-modular aspects detailed above, leaving only fully modular logic,
+the result can be used as a TFO transform that implements the functions of
+TS 28.062 section C.3.2.1.1, specifically Case 1 in which UL may have DTX, but
+DL is required to consist of speech frames only.
+
+How does one address the two non-modular aspects of the standard GSM-HR Rx DTX
+handler that are not possible in TFO?  The simplest implementation is to remove
+them altogether:
+
+* Comfort noise parameters are not interpolated, instead an abrupt change in R0
+  and LPC parameters occurs every 240 ms when a new SID frame arrives.
+
+* UFI is simply dropped in the case when the standard decoder would apply output
+  signal concealment, i.e., the latter feature is given up.
+
+Obviously this approach constitutes functional regression relative to the
+standard speech decoder - thus we were initially hesitant to adopt it.  However,
+experiments with a real historical TRAU that supports TFO (Nokia TCSM2) reveal
+that Nokia implemented exactly the same approach (minimal complexity at the
+price of slight functional degradation) in their TRAU DSP firmware.  Seeing
+that a major classic vendor of GSM infrastructure implemented this simplistic
+approach, we are now comfortable with doing the same - especially considering
+the work scope limits explained in HR-codec-limits article.
+
+In Themyscira libgsmhr1 implementation, a component has been factored out which
+we call the Rx front end (RxFE).  This RxFE is our cleaned-up reimplementation
+of those parts of the original Rx DTX handler that are fully modular (including
+the speech ECU and all CN parameters that aren't interpolated), plus some
+additional internal flag inputs and outputs.  Out of the latter internal flags,
+some are used only by the full speech decoder, while others are used only by
+the TFO transform.  RxFE state, which also serves as the API-visible TFO
+transform state, is a subset of full speech decoder state.  However, the core
+RxFE function is not exported directly as API; instead the TFO transform API
+function is a TFO-specific wrapper around the RxFE.
+
+Detailed RxFE logic and its evolution
+=====================================
+
+Now that we have covered the background of the previous sections, we can
+properly examine the actual logic of our RxFE, the follow-up logic for CN
+interpolation that exists only in the full decoder, and their origins in the
+reference GSM 06.06 code.
+
+Unless noted otherwise, all logic described in the following sections is the
+same between ETSI original and the present Themyscira implementation.  The
+internal representation and code structure may be different, but the behavioral
+logic remains the same unless explicitly called out otherwise.
+
+Input frame classification
+--------------------------
+
+As the very first processing step for every incoming frame, BFI, UFI and SID
+flags are combined per GSM 06.41 Table 1 to classify the frame as good speech,
+valid SID, invalid SID or unusable for DTX purposes.  Note that UFI turns valid
+SID into invalid just like BFI, and for DTX purposes all non-SID frames marked
+with UFI are considered "unusable".  But as we shall see shortly, this
+"unusable" classification matters only for DTX and not for speech ECU logic,
+which is separate.
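+
+The classification step can be sketched in C as follows.  This is only a
+minimal illustration, not the libgsmhr1 code: the enum and function names are
+hypothetical, and it assumes that the SID flag takes the values 0 (not a SID
+frame), 1 (partial SID codeword match) and 2 (accepted SID codeword match).
+
```c
/* Hypothetical names for the four classes described in the text. */
enum rx_frame_class {
	FRAME_GOOD_SPEECH,
	FRAME_VALID_SID,
	FRAME_INVALID_SID,
	FRAME_UNUSABLE
};

/*
 * Assumption: sid is 0 for a non-SID frame, 1 for a partial SID codeword
 * match and 2 for an accepted SID codeword match.
 */
static enum rx_frame_class classify_frame(int bfi, int ufi, int sid)
{
	/* For DTX purposes UFI damages a frame just like BFI does. */
	int damaged = (bfi != 0) || (ufi != 0);

	if (sid == 0)
		return damaged ? FRAME_UNUSABLE : FRAME_GOOD_SPEECH;
	if (sid == 2 && !damaged)
		return FRAME_VALID_SID;
	return FRAME_INVALID_SID;
}
```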
+
+Speech vs CNI state
+-------------------
+
+RxFE state that carries from one frame to the next includes one very important
+two-state flag: either speech or CNI (comfort noise insertion) mode.  By
+combining the 4 possible frame classifications from GSM 06.41 Table 1 (see
+above) with these two possible carry-over states, we get 4 possible ways in
+which the current frame may be handled:
+
+Input frame class	Previously speech	Previously CNI
+--------------------------------------------------------------
+SID (valid or invalid)	CNIFIRSTSID		CNICONT
+Good speech		SPEECH			SPEECH
+Unusable		SPEECH			CNIBFI
+
+Here we can see that unless we enter DTX/CNI state, neither BFI nor UFI moves
+RxFE logic out of SPEECH handling.  This SPEECH handling mode includes the ECU
+and handles both good and bad speech frames.  However, once DTX/CNI state has
+been entered, then only a (BFI==0 && UFI==0 && SID==0) good speech frame can
+effect exit from this state!
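+
+The table above maps directly onto a small dispatch function.  The following
+C sketch uses hypothetical enum and function names (only the four handling
+mode names come from the table) and represents the carry-over state as a
+plain integer flag.
+
```c
/* Frame classes as described in the previous section (hypothetical names). */
enum rx_frame_class {
	FRAME_GOOD_SPEECH,
	FRAME_VALID_SID,
	FRAME_INVALID_SID,
	FRAME_UNUSABLE
};

/* The four handling modes from the table. */
enum rx_handling {
	HANDLE_SPEECH,
	HANDLE_CNIFIRSTSID,
	HANDLE_CNICONT,
	HANDLE_CNIBFI
};

/* prev_cni is nonzero if the previous frame left us in DTX/CNI state. */
static enum rx_handling select_handling(enum rx_frame_class cls, int prev_cni)
{
	switch (cls) {
	case FRAME_VALID_SID:
	case FRAME_INVALID_SID:
		return prev_cni ? HANDLE_CNICONT : HANDLE_CNIFIRSTSID;
	case FRAME_GOOD_SPEECH:
		return HANDLE_SPEECH;
	case FRAME_UNUSABLE:
	default:
		return prev_cni ? HANDLE_CNIBFI : HANDLE_SPEECH;
	}
}

/* Every handling mode other than SPEECH leaves us in DTX/CNI state. */
static int next_state_is_cni(enum rx_handling mode)
{
	return mode != HANDLE_SPEECH;
}
```
+
+In this sketch an unusable frame received in speech state maps to
+HANDLE_SPEECH, so the carry-over state stays "speech" and the ECU deals with
+the frame.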
+
+Speech ECU logic
+================
+
+The frame-to-frame persistent state for the ECU consists of the state counter
+variable (range [0,7]) described in GSM 06.21 section 6.3 and a saved copy of
+the last good speech frame.  The just-referenced spec section describes the
+logic quite well, but a few additional notes are in order:
+
+* The last good speech frame that gets regurgitated in substitution/muting
+  states of the ECU is not exactly the same as the actual last good speech frame
+  that went through:
+
+  + GSP0 parameters for the first 3 subframes are replaced with GSP0 parameter
+    for the last subframe;
+
+  + If the frame is voiced, LTP lag parameters are modified - read the code for
+    the details.
+
+  In the original ETSI implementation, these modifications are applied at the
+  time of substitution/muting output; in our implementation, they are applied
+  at the time when a good speech frame is saved.  Our implementation approach
+  makes it clearer what state is actually retained, but the functional behaviour
+  is exactly the same.
+
+* When that last good speech frame gets regurgitated during bad frame handling,
+  codevector parameters may be taken either from that saved last good speech
+  frame or from the current bad frame.  Use of codevector parameters from the
+  current bad frame is possible only when the current bad frame and the saved
+  last good speech frame have the same voiced vs unvoiced mode.  If this mode
+  matches for one frame and bad-frame codevector parameters get passed on, but
+  the next bad frame has incompatible mode, the saved last good speech frame
+  gets used in its entirety once again, subject only to the modifications
+  described above.
+
+* Our Themyscira version features an extension: if BFI equals 2 instead of 1,
+  indicating BFI without payload bits, then there are no bad-frame codevector
+  parameters and the saved last good speech frame is used in its entirety,
+  just as if the bad frame had a mismatched voiced vs unvoiced mode.
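+
+The choice of codevector source can be sketched as below.  This is a
+simplified illustration under stated assumptions, not the actual libgsmhr1
+code: all names are hypothetical, a frame is assumed to be an array of 18
+parameters (R0, 3 LPC words, Int, Mode, then 3 parameters per subframe with
+GSP0 last), Mode 0 is assumed to mean unvoiced, and the "codevector
+parameters" are assumed to be the two code words of an unvoiced subframe or
+the single code word of a voiced one.  R0 muting and the saved-frame
+modifications listed above are left out.
+
```c
#include <stdint.h>
#include <string.h>

#define HR_NUM_PARAMS	18	/* R0, LPC1-3, Int, Mode, 4 x 3 subframe params */
#define HR_MODE_IDX	5	/* assumed position of the Mode parameter */
#define HR_SUBFR_BASE	6	/* assumed index of first subframe parameter */

/* Assumption: Mode 0 is unvoiced, any other Mode value is voiced. */
static int hr_is_voiced(const int16_t *prm)
{
	return prm[HR_MODE_IDX] != 0;
}

/*
 * Fill out[] with the substitution frame: the saved last good speech frame,
 * optionally carrying codevector parameters from the current bad frame.
 * bfi is 1 for a bad frame with payload bits, 2 for one without.
 */
static void ecu_substitute(const int16_t *saved, const int16_t *bad, int bfi,
			   int16_t *out)
{
	int sub, base;

	memcpy(out, saved, HR_NUM_PARAMS * sizeof(int16_t));
	if (bfi == 2)
		return;		/* no payload bits: nothing to borrow */
	if (hr_is_voiced(bad) != hr_is_voiced(saved))
		return;		/* incompatible mode: saved frame only */
	for (sub = 0; sub < 4; sub++) {
		base = HR_SUBFR_BASE + sub * 3;
		if (!hr_is_voiced(saved))
			out[base] = bad[base];	/* unvoiced: first code */
		out[base + 1] = bad[base + 1];	/* code word in both layouts */
	}
}
```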
+
+BFI out of reset
+================
+
+What happens if the very first input frame in reset state (after external reset
+or after a decoder homing frame) is a bad frame per BFI, or per UFI treated as
+BFI - what is the default "last" good speech frame?  In ETSI original code it
+is a frame of all zero parameters, but this oddity is not readily visible - the
+final output of linear PCM is also all zeros, and all is well.  In Themyscira
+implementation, the output of our RxFE may be visible externally if it is used
+as a TFO transform - hence more attention was given to this issue.
+
+If we feed all zeros as PCM input to a homed standard GSM-HR speech encoder, we
+get this frame, repeating endlessly as long as all-zeros PCM input continues:
+
+R0=00 LPC=164,171,cb Int=0 Mode=0
+s1=00,00,00 s2=00,00,00 s3=00,00,00 s4=00,00,00
+
+This frame differs from all-zero params only in the LPC set, and this sane-LPC
+silence frame is the one we have adopted as our reset-default fallback frame.
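+
+For illustration, the same fallback frame can be written as a C parameter
+array.  The array name and the 18-parameter layout (R0, 3 LPC words, Int,
+Mode, then 4 subframes of 3 parameters) are assumptions made for this sketch,
+as is the reading of the values printed above as hexadecimal.
+
```c
#include <stdint.h>

#define HR_NUM_PARAMS	18

/* Hypothetical name; values copied from the frame dump in the text. */
static const int16_t hr_reset_fallback_frame[HR_NUM_PARAMS] = {
	0x00,			/* R0 */
	0x164, 0x171, 0xcb,	/* LPC */
	0,			/* Int */
	0,			/* Mode */
	0, 0, 0,		/* s1 */
	0, 0, 0,		/* s2 */
	0, 0, 0,		/* s3 */
	0, 0, 0			/* s4 */
};
```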
+
+When libgsmhr1 full speech decoder engine is used, as opposed to TFO transform,
+there is an additional check.  If the current state is the special home state
+(logic required for spec-mandated EHF output with repeated DHF input) and the
+input frame has BFI flag set (no other flags are considered in this case), the
+PCM output is set to all zero samples without leaving the home state.  However,
+the regular speech ECU and its last good frame default can still be reached if
+BFI is clear, UFI is set and R0 is high.
+
+Comfort noise logic in RxFE
+===========================
+
+GSM 06.22 spec treats the required bit-exact CN generator as a single entity -
+however, in our implementation it is split between the RxFE and the main body
+of the full speech decoder.  The bit-exact result in the case of full speech
+decoding remains the same, but our arrangement allows non-interpolated CN
+generation in the TFO transform as well.
+
+When our RxFE is used as a TFO transform with DTXd=0 (the mode that includes CN
+generation), CN output from the transform matches GSM 06.22 Table 2, with the
+exception of R0 and LPC parameters.  These R0 and LPC parameters will be filled
+as follows:
+
+* If CN insertion period begins with a valid SID, R0 and LPC are taken from
+  that SID.
+
+* If CN insertion period begins with an invalid SID, R0 and LPC are taken from
+  the last good speech frame, the one used by the speech ECU.  Directly out of
+  reset (or after a DHF), these parameters are as shown above:
+
+  R0=00 LPC=164,171,cb
+
+* Any time a new valid SID frame arrives during a CN insertion period, R0 and
+  LPC parameters change to this new SID.
+
+* Any time the input during CN insertion is either an unusable frame or an
+  invalid SID, R0 and LPC parameters remain unchanged from the most recently
+  received valid SID, or from the last good speech frame if only invalid SID
+  frames have been received in the entire CN insertion period so far.
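+
+The four rules above amount to a very small state update, sketched here in C.
+The struct, enum and function names are hypothetical, and CN muting (covered
+in the next section) is deliberately left out.
+
```c
#include <stdint.h>

/* Frame classes as described earlier (hypothetical names). */
enum rx_frame_class {
	FRAME_GOOD_SPEECH,
	FRAME_VALID_SID,
	FRAME_INVALID_SID,
	FRAME_UNUSABLE
};

/* The two parameter groups that Table 2 of GSM 06.22 does not fix. */
struct cn_params {
	int16_t r0;
	int16_t lpc[3];
};

/* First SID of a CN insertion period (CNIFIRSTSID handling). */
static void cn_begin(struct cn_params *cn, enum rx_frame_class cls,
		     const struct cn_params *sid,
		     const struct cn_params *last_good_speech)
{
	*cn = (cls == FRAME_VALID_SID) ? *sid : *last_good_speech;
}

/* Any later frame inside the same period (CNICONT or CNIBFI handling). */
static void cn_update(struct cn_params *cn, enum rx_frame_class cls,
		      const struct cn_params *sid)
{
	if (cls == FRAME_VALID_SID)
		*cn = *sid;
	/* invalid SID or unusable frame: keep current parameters */
}
```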
+
+Comfort noise muting
+====================
+
+Per GSM 06.21 sections 5.2.3 and 5.2.4, when SID frames fail to arrive for 3
+consecutive TAF positions, generated comfort noise needs to be muted.  We
+implement this logic in our RxFE, and the actual logic is unchanged from ETSI
+reference code - it is described in GSM 06.21 section 6.4.
+
+This SID aging and CN muting logic works by counting unusable frames received
+in between SID updates.  In the original GSM 06.06 code the criterion to start
+CN muting is:
+
+	TAF == 1 && CNIBFI_count >= 25
+
+In our version we changed it to:
+
+	CNIBFI_count >= (TAF ? 25 : 36)
+
+When TAF is indicated correctly, once every 12 frames and with the flag always
+present at least in BFI frames (consider GSM 08.61 TRAU-8k format), our extended
+criterion is equivalent to the original; however, our version will also produce
+eventual CN muting if TAF is missing.
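+
+The equivalence claim can be checked with a small simulation.  The sketch
+below is illustrative only: the function names are hypothetical, and it
+assumes that the counter is incremented once per unusable frame before the
+criterion is evaluated, that no SID arrives during the run, and that TAF
+(when present) is flagged every 12th frame starting at count value first_taf.
+
```c
/* Original GSM 06.06 criterion as quoted in the text. */
static int mute_orig(int taf, int count)
{
	return taf == 1 && count >= 25;
}

/* Extended criterion as quoted in the text. */
static int mute_ext(int taf, int count)
{
	return count >= (taf ? 25 : 36);
}

/*
 * Return the counter value at which muting first triggers, or -1 if it
 * does not trigger within limit frames.  first_taf (1..12) is the counter
 * value at the first TAF position; taf_present is 0 if TAF never arrives.
 */
static int first_mute_count(int (*crit)(int, int), int first_taf,
			    int taf_present, int limit)
{
	int count, taf;

	for (count = 1; count <= limit; count++) {
		taf = taf_present && count >= first_taf &&
			(count - first_taf) % 12 == 0;
		if (crit(taf, count))
			return count;
	}
	return -1;
}
```
+
+Under these assumptions both criteria trigger on the same frame for every TAF
+phase from 1 to 12, because the third TAF position always falls at a count
+between 25 and 36.  With TAF absent, only the extended criterion triggers.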
+
+For the purpose of this logic, invalid SID is as good as valid: while it is
+treated just like unusable frames (CNIBFI) for the purpose of R0 and LPC
+parameters and their interpolation (see next section), for the purpose of SID
+aging and CN muting, invalid SID resets the count of unusable frames, and if
+muting already started previously, it is halted at the current (partially muted)
+R0 value.
+
+Comfort noise interpolation
+===========================
+
+When our RxFE is invoked internally by our full speech decoder, the RxFE passes
+some additional flags to the main body of the decoder.  One of these flags
+controls interpolation of R0 and LPC parameters for CNI, a function that is
+required by the specs with bit-exact stipulation, but which cannot be
+implemented at the level of speech parameters.
+
+The only case in which the behaviour of our libgsmhr1 full speech decoder
+differs from ETSI original is when an invalid SID frame arrives immediately out
+of reset, not preceded by any good speech, valid SID or even unusable frames.
+In this case the original GSM 06.06 code uses initialized all-zero state of
+pswOldFrmKsDec[] array, which cannot happen in any other case.  In our
+implementation we use LPC=164,171,cb instead, as already explained.
+
+Outside of this corner case, invalid SID frames are handled as follows
+(unchanged between ETSI original and our version):
+
+* If CN insertion period begins with an invalid SID, R0 and LPC are taken from
+  the last good speech frame, the one used by the speech ECU.  These R0 and LPC
+  params are then fed into the prescribed bit-exact interpolation mechanism as
+  if CN insertion started with a valid SID frame with these parameters.
+
+* Any invalid SID frames that occur in the middle of a CN insertion period are
+  treated just like unusable frames for the purpose of interpolation.
+
+Return from CN insertion to speech state
+========================================
+
+Exit from DTX/CNI state happens upon receipt of a good speech frame, i.e., a
+frame that meets this criterion:
+
+	BFI == 0 && UFI == 0 && SID == 0
+
+However, the original implementation in GSM 06.06 reference code exhibits this
+flaw: if the speech ECU is in state 6 (see GSM 06.21 section 6.3) and then an
+accepted SID frame (valid or invalid) puts us into DTX state, the first good
+speech frame after this DTX pause will be dropped and replaced with fully muted
+form of the last good speech frame from before the CN insertion period.  This
+effect happens no matter how long that DTX pause was - thus the last good speech
+frame being regurgitated (with R0 reduced to 0) may be indefinitely old and out
+of place.  Furthermore, if the CNI-exiting good speech frame that is dropped
+here is followed by BFI unusable frames, the ECU will return to state 6 and the
+parameters (other than muted R0) of the last good speech frame from before the
+DTX pause will continue being reused indefinitely.
+
+In our libgsmhr1 version, the state counter for the speech ECU is reset to 7
+(the initial home state) whenever our RxFE passes through DTX/CNI state.  Since
+only a good speech frame with BFI=0 and UFI=0 can cause exit from CN insertion
+state, this reset of ECU state ensures that this good speech frame will pass
+through, and then the ECU will be in state 0 after this talkspurt-opening good
+speech frame.
+
+Fully muted state after unusable frames in input
+================================================
+
+If the input to the speech decoder or TFO transform becomes nothing but BFI
+unusable frames, what is the final fully muted or "decayed" output at the level
+of modified speech parameters?  In GSM-FR codec there is a special silence frame
+defined in GSM 06.11 Table 1, and the final decayed state is a continuous output
+of these fixed silence frames - irrespective of whether the Rx DTX handler got
+to this fully decayed state from speech or CN muting.
+
+However, no equivalent fully decayed state with fixed output is defined for
+GSM-HR.  While this aspect is a non-normative "example" implementation detail,
+in both GSM 06.06 reference code and Themyscira libgsmhr1 the fundamental state
+of speech vs CNI persists indefinitely even when fully muted:
+
+* If an indefinitely long string of unusable frames occurs in speech state,
+  the speech ECU will be in state 6, and the output from the RxFE (externally
+  visible in the case of TFO) will endlessly repeat parameters of the last good
+  speech frame, except for R0 reduced to 0.
+
+* If an indefinitely long string of unusable frames occurs in DTX/CNI state,
+  the output form shown in GSM 06.22 Table 2, complete with bit-exact
+  pseudorandom sequence in unvoiced codevector parameters, will likewise
+  continue indefinitely.  LPC parameters will remain from the most recently
+  received valid SID frame (or from the last good speech frame if CNI period
+  began with invalid SID and no valid SID was received afterward), but R0 will
+  be reduced to 0 by the CN muting logic.
+
+Because R0 is reduced to 0 in both cases, the above details are generally
+invisible with full endpoint speech decoding.  However, they become fully
+visible in the case of TFO transform with DTXd=0.
+
+TFO transform with DTXd=1
+=========================
+
+The internal RxFE block that emits CN parameters during DTX/CNI state is correct
+for the full endpoint speech decoder application and for TFO transform with
+DTXd=0.  The case of TFO transform with DTXd=1 is implemented by calling the
+same RxFE block, then applying this simple modification to its output: if the
+current frame was processed in DTX/CNI mode, the frame of CN parameters is
+transformed into a downlink SID frame by replacing all speech parameters beyond
+R0 and LPC with all-ones SID codeword.
+
+The internal RxFE block tells the TFO wrapper when this just-described
+modification should be applied by way of an internal flag.  This flag is set
+in two cases:
+
+1) When the current frame was processed in DTX/CNI mode, or
+
+2) When the speech ECU applied substitution/muting handling to the current
+   frame, and the ECU state was 6 or 7 at the beginning of current frame
+   processing.
+
+The effects of this logic are as follows:
+
+1) DTX pauses in UL pass through into DTX pauses in DL, with unusable frames
+   and invalid SID replaced with the most recent valid SID, or with R0+LPC from
+   the last good speech frame in the case of initial invalid SID.  The
+   spec-compliant Rx DTX handler in the destination MS can then produce the
+   most correct form of comfort noise, including interpolation of R0 and LPC
+   parameters.
+
+2) When the input to TFO transform is nothing but unusable frames, the downlink
+   radio leg should go into DTXd state in order to produce the desired reduction
+   in radio interference and BTS power consumption.  This effect should happen
+   irrespective of whether the "fully decayed" state of RxFE is DTX/CNI muting
+   or speech ECU, as covered in the previous section.  Our logic of turning
+   "fully decayed" ECU state into DTXd SID achieves the desired effect.
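+
+The DTXd=1 output modification can be sketched as follows.  This is a hedged
+illustration rather than the libgsmhr1 wrapper: the names are hypothetical,
+the frame is assumed to be an array of 18 parameters, and the field widths of
+the parameters after R0 and LPC are assumed to follow the voiced-mode layout
+of GSM 06.20 (Int 1 bit, Mode 2 bits, first lag 8 bits, later lags 4 bits,
+code 9 bits, GSP0 5 bits).  These widths should be checked against the spec
+before use.
+
```c
#include <stdint.h>

#define HR_NUM_PARAMS	18
#define HR_SID_FIRST	4	/* first parameter after R0 and LPC */
#define HR_SID_FIELDS	(HR_NUM_PARAMS - HR_SID_FIRST)

/* Assumed bit widths: Int, Mode, then (lag, code, GSP0) for 4 subframes. */
static const uint8_t sid_field_bits[HR_SID_FIELDS] = {
	1, 2,
	8, 9, 5,
	4, 9, 5,
	4, 9, 5,
	4, 9, 5
};

/* Overwrite everything beyond R0 and LPC with all-ones fields. */
static void make_dl_sid(int16_t *params)
{
	int i;

	for (i = 0; i < HR_SID_FIELDS; i++)
		params[HR_SID_FIRST + i] =
			(int16_t)((1 << sid_field_bits[i]) - 1);
}

/* Total number of bits set to 1 by make_dl_sid(). */
static int sid_codeword_bits(void)
{
	int i, sum = 0;

	for (i = 0; i < HR_SID_FIELDS; i++)
		sum += sid_field_bits[i];
	return sum;
}
```
+
+In this sketch the TFO wrapper would call make_dl_sid() on the RxFE output
+only when the internal flag described above is set; R0 and LPC pass through
+untouched.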
+
+Finally, there is one more modification applied only in the case of TFO
+transform with DTXd=1 and not in other cases: muting of comfort noise.  In the
+case of full endpoint speech decoding or TFO transform with DTXd=0, when the
+criterion for CN muting is first reached, the muting proceeds by decrementing
+R0 by 2 on every frame, i.e., gradually.  (See GSM 06.21 section 6.4.)  However,
+in the case of TFO transform with DTXd=1, CN muting is effected by reducing R0
+to 0 immediately as soon as CN muting criterion is reached.  The rationale is
+as follows:
+
+* A TRAU (or TRAU-emulating MGW that feeds Abis to a BTS) has no way of knowing
+  exactly which of its continuously emitted DL SID frames will actually get
+  transmitted on the air and seen by the MS.  Therefore, a muting process that
+  gradually decrements R0 with every emitted SID frame would make no sense.
+
+* If the destination MS receives a SID update with R0=0 subsequent to whatever
+  previous SID it received with non-zero R0, the spec-required CN interpolation
+  logic in that MS will produce the desired effect of gradual muting over 240 ms
+  - not too far from the 320 ms muting time called for in GSM 06.21 section
+  5.2.4.
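+
+The two muting behaviours can be compared with a short C sketch.  The
+function names are hypothetical, and the sketch assumes that R0 is a 5-bit
+value (0 to 31), that it is clamped at 0 and that one frame lasts 20 ms.
+
```c
/* One muting step once the CN muting criterion has been reached. */
static int mute_r0_step(int r0, int dtxd)
{
	if (dtxd)
		return 0;		/* DTXd=1: mute at once */
	return r0 > 2 ? r0 - 2 : 0;	/* otherwise: decrement by 2 */
}

/* Number of frames needed to bring R0 down to 0. */
static int frames_to_zero(int r0, int dtxd)
{
	int n = 0;

	while (r0 > 0) {
		r0 = mute_r0_step(r0, dtxd);
		n++;
	}
	return n;
}
```
+
+Under these assumptions, gradual muting from the maximum R0 of 31 takes 16
+frames, i.e., 16 * 20 ms = 320 ms, whereas the DTXd=1 variant reaches R0=0 in
+a single step and leaves the gradual decay to the interpolation logic in the
+destination MS.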
+
+TFO transform homing
+====================
+
+ThemWi implementation of TFO transform includes the feature of in-band homing:
+if the input to the transform is the spec-defined decoder homing frame (DHF),
+this DHF is passed through to the output just like any other good speech frame,
+but the internal state is reset to the initial "home" state.
+
+The check for DHF (all bits must match, plus (BFI == 0 && SID == 0) criterion)
+and the resulting state reset happen at the end of frame processing, after the
+output for the current frame has been generated.  In the case of ThemWi TFO
+transform for GSM-HR, there are two corner cases in which an incoming DHF may
+be acted upon (produce state reset), but not appear in the output:
+
+1) The overall state of RxFE was speech (as opposed to DTX/CNI) and the speech
+   ECU state was 6 - the state in which the first received good speech frame
+   gets dropped.
+
+2) The overall state of RxFE was DTX/CNI and the incoming DHF is marked with
+   UFI=1.  UFI is not a criterion for DHF detection, only BFI is, but UFI in
+   DTX/CNI state will cause current frame processing to treat the frame as
+   unusable.