Rx DTX handler logic for GSM-HR speech codec
============================================

With all 3 classic GSM speech codecs (FR, HR and EFR), as TCH UL Rx traffic on
the network side passes from the BTS to the TRAU, the first processing step
performed by the TRAU prior to actual speech decoding is an Rx DTX handler.
(For TCH DL Rx on the mobile side, exactly the same processing steps happen in
total, but because everything is integrated into a single device, interfaces
between steps may be implemented more loosely.)

For GSM-HR codec the 3 controlling specs for different parts of Rx DTX handler
logic are GSM 06.21, GSM 06.22 and GSM 06.41 - however, for the full details
these specs defer to the reference C code in GSM 06.06.  This article explains
this logic from all aspects which we find important: what the Rx DTX logic was
in the original reference code from ETSI and how we adapted it in libgsmhr1,
both for the full speech decoder and for our implementation of TFO transform.

Normative vs freely changeable aspects
======================================

In the case of error-free transmission, such that the receiver never encounters
a frame with BFI or UFI set except during continuation of a DTX pause (after
receiving a valid SID that begins comfort noise insertion) and is never asked
to begin CN insertion with an invalid SID, the full behaviour of the speech
decoder to the final linear PCM output is required to be bit-exact and gets
exercised by test sequences.  This bit-exact behaviour includes non-error-
handling aspects of the Rx DTX handler and comfort noise generation, complete
with interpolation for periodic CN updates via subsequent SID frames.

However, the reference C implementation becomes a non-normative example
(allowing changes in logic without violating spec requirements) in the
following aspects:

* Handling of BFI and UFI outside of DTX pauses previously entered via a valid
  SID, including most aspects of error concealment;

* Exact manner of comfort noise muting when expected SID updates fail to arrive;

* The exact logic to be applied when a CN insertion period begins with an
  invalid SID frame.

Almost-modular nature of GSM-HR Rx DTX handler
==============================================

An Rx DTX handler can be considered fully modular if its output (which is then
passed as input to the main body of the speech decoder) is a potentially
modified set of speech parameters that can be packed into a new speech frame
and transmitted through a second radio leg with no change in the final output
of the speech decoder.  The Rx DTX handler implemented in the reference code
from ETSI (both spec-normative and "example" aspects as broken down above)
_almost_ meets this modularity criterion, but not fully.  The following aspects
are non-modular:

* The interpolation of R0 and LPC parameters during comfort noise insertion
  (bit-exact implementation considered normative) happens after expansion of
  transmitted parameter bits into linear form.  In the general case one cannot
  produce a new set of encoded parameters (that can be transmitted through a
  second radio leg) that will produce the same bit-exact result upon final
  decoding.

* Handling of speech frames (not SID, outside of DTX pause state) that are
  marked with BFI=0 and UFI=1 (unreliable frames) has both a modular and a
  non-modular aspect.  If R0 increment is either small enough to not trigger
  any mitigation or large enough that UFI is converted into BFI, the applied
  handling is fully modular.  However, if R0 increment falls into the narrow
  window between the two thresholds, the applied handling (output signal
  concealment per GSM 06.21 section 5.1.2) is non-modular: it happens deep in
  the guts of the speech decoder and cannot be represented via a modified set
  of speech parameters.

TFO transform derived from the reference Rx DTX handler
=======================================================

If one extracts the reference Rx DTX handler from GSM 06.06 code and removes
the two non-modular aspects detailed above, leaving only fully modular logic,
the result can be used as a TFO transform that implements the functions of
TS 28.062 section C.3.2.1.1, specifically Case 1 in which UL may have DTX, but
DL is required to consist of speech frames only.

How does one address the two non-modular aspects of the standard GSM-HR Rx DTX
handler that are not possible in TFO?  The simplest implementation is to remove
them altogether:

* Comfort noise parameters are not interpolated, instead an abrupt change in R0
  and LPC parameters occurs every 240 ms when a new SID frame arrives.

* UFI is simply dropped in the case when the standard decoder would apply output
  signal concealment, i.e., the latter feature is given up.

Obviously this approach constitutes functional regression relative to the
standard speech decoder - thus we were initially hesitant to adopt it.  However,
experiments with a real historical TRAU that supports TFO (Nokia TCSM2) reveal
that Nokia implemented exactly the same approach (minimal complexity at the
price of slight functional degradation) in their TRAU DSP firmware.  Seeing
that a major classic vendor of GSM infrastructure implemented this simplistic
approach, we are now comfortable with doing the same - especially considering
the work scope limits explained in HR-codec-limits article.

In Themyscira libgsmhr1 implementation, a component has been factored out which
we call the Rx front end (RxFE).  This RxFE is our cleaned-up reimplementation
of those parts of the original Rx DTX handler that are fully modular (including
the speech ECU and all CN parameters that aren't interpolated), plus some
additional internal flag inputs and outputs.  Out of the latter internal flags,
some are used only by the full speech decoder, while others are used only by
the TFO transform.  RxFE state, which also serves as the API-visible TFO
transform state, is a subset of full speech decoder state.  However, the core
RxFE function is not exported directly as API; instead the TFO transform API
function is a TFO-specific wrapper around the RxFE.

Detailed RxFE logic and its evolution
=====================================

Now that we have covered the background of the previous sections, we can
properly examine the actual logic of our RxFE, the follow-up logic for CN
interpolation that exists only in the full decoder, and their origins in the
reference GSM 06.06 code.

Unless noted otherwise, all logic described in the following sections is the
same between ETSI original and the present Themyscira implementation.  The
internal representation and code structure may be different, but the behavioral
logic remains the same unless explicitly called out otherwise.

Input frame classification
--------------------------

As the very first processing step for every incoming frame, BFI, UFI and SID
flags are combined per GSM 06.41 Table 1 to classify the frame as good speech,
valid SID, invalid SID or unusable for DTX purposes.  Note that UFI turns valid
SID into invalid just like BFI, and for DTX purposes all non-SID frames marked
with UFI are considered "unusable".  But as we shall see shortly, this
"unusable" classification matters only for DTX and not for speech ECU logic,
which is separate.
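
The classification just described can be sketched in C as follows.  This is a
minimal illustration, not the actual libgsmhr1 code: the enum and function
names are hypothetical, and we assume the 3-valued SID flag convention (0 for
non-SID, 1 for invalid SID, 2 for valid SID) on the decoder input.  Any
nonzero BFI value is treated as bad.

```c
#include <stdbool.h>

/* Hypothetical names for the 4 frame classes of GSM 06.41 Table 1 */
enum frame_class { GOOD_SPEECH, VALID_SID, INVALID_SID, UNUSABLE };

/*
 * ASSUMPTION: sid is 0 for a non-SID frame, 1 for an invalid SID and
 * 2 for a valid SID, as computed by the channel decoder.
 */
static enum frame_class classify_frame(int bfi, int ufi, int sid)
{
	/* for DTX purposes UFI counts as bad just like BFI */
	bool bad = (bfi != 0) || (ufi != 0);

	if (sid == 0)
		return bad ? UNUSABLE : GOOD_SPEECH;
	/* a valid SID marked with BFI or UFI is demoted to invalid */
	if (sid == 2 && !bad)
		return VALID_SID;
	return INVALID_SID;
}
```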

Speech vs CNI state
-------------------

RxFE state that carries from one frame to the next includes one very important
two-state flag: either speech or CNI (comfort noise insertion) mode.  By
combining the 4 possible frame classifications from GSM 06.41 Table 1 (see
above) with these two possible carry-over states, we get 4 possible ways in
which the current frame may be handled:

Input frame class	Previously speech	Previously CNI
--------------------------------------------------------------
SID (valid or invalid)	CNIFIRSTSID		CNICONT
Good speech		SPEECH			SPEECH
Unusable		SPEECH			CNIBFI

Here we can see that unless we enter DTX/CNI state, neither BFI nor UFI moves
RxFE logic out of SPEECH handling.  This SPEECH handling mode includes the ECU
and handles both good and bad speech frames.  However, once DTX/CNI state has
been entered, then only a (BFI==0 && UFI==0 && SID==0) good speech frame can
effect exit from this state!
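
The table above can be expressed as a small dispatch function.  The following
is only a sketch under our own naming: the enum names and the in_cni state
variable are hypothetical and do not reflect the internal representation used
in either ETSI or Themyscira code.

```c
#include <stdbool.h>

/* Hypothetical names, matching the table rows and cells above */
enum frame_class { GOOD_SPEECH, VALID_SID, INVALID_SID, UNUSABLE };
enum rx_mode { SPEECH, CNIFIRSTSID, CNICONT, CNIBFI };

/*
 * Select the handling mode for the current frame and update the
 * carry-over state flag (*in_cni) for the next frame.
 */
static enum rx_mode select_mode(enum frame_class fc, bool *in_cni)
{
	switch (fc) {
	case VALID_SID:
	case INVALID_SID:
		if (*in_cni)
			return CNICONT;
		*in_cni = true;		/* enter DTX/CNI state */
		return CNIFIRSTSID;
	case GOOD_SPEECH:
		*in_cni = false;	/* the only way out of CNI */
		return SPEECH;
	case UNUSABLE:
	default:
		/* unusable frames never change the state */
		return *in_cni ? CNIBFI : SPEECH;
	}
}
```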

Speech ECU logic
================

The frame-to-frame persistent state for the ECU consists of the state counter
variable (range [0,7]) described in GSM 06.21 section 6.3 and a saved copy of
the last good speech frame.  The just-referenced spec section describes the
logic quite well, but a few additional notes are in order:

* The last good speech frame that gets regurgitated in substitution/muting
  states of the ECU is not exactly the same as the actual last good speech frame
  that went through:

  + GSP0 parameters for the first 3 subframes are replaced with GSP0 parameter
    for the last subframe;

  + If the frame is voiced, LTP lag parameters are modified - read the code for
    the details.

  In the original ETSI implementation, these modifications are applied at the
  time of substitution/muting output; in our implementation, they are applied
  at the time when a good speech frame is saved.  Our implementation approach
  makes it clearer what state is actually retained, but the functional behaviour
  is exactly the same.

* When that last good speech frame gets regurgitated during bad frame handling,
  codevector parameters may be taken either from that saved last good speech
  frame or from the current bad frame.  Use of codevector parameters from the
  current bad frame is possible only when the current bad frame and the saved
  last good speech frame have the same voiced vs unvoiced mode.  If this mode
  matches for one frame and bad-frame codevector parameters get passed on, but
  the next bad frame has incompatible mode, the saved last good speech frame
  gets used in its entirety once again, subject only to the modifications
  described above.

* Our Themyscira version features an extension: if BFI equals 2 instead of 1,
  indicating BFI without payload bits, then there are no bad-frame codevector
  parameters and the saved last good speech frame is used in its entirety,
  just as if BFI frames always have the wrong voiced vs unvoiced mode.
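
The choice of codevector source described in the last two notes can be
sketched as a single predicate.  This is an illustration only: the function
name is hypothetical, and we assume that Mode value 0 means unvoiced while any
nonzero Mode value means voiced.

```c
#include <stdbool.h>

/*
 * Decide whether codevector parameters of the current bad frame may be
 * used in place of those from the saved last good speech frame.
 *
 * ASSUMPTION: mode 0 is unvoiced, any nonzero mode is voiced.
 */
static bool use_bad_frame_codevectors(int bfi, int bad_mode, int saved_mode)
{
	/* Themyscira extension: BFI=2 means no payload bits at all */
	if (bfi == 2)
		return false;
	/* both voiced or both unvoiced */
	return (bad_mode != 0) == (saved_mode != 0);
}
```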

BFI out of reset
================

What happens if the very first input frame in reset state (after external reset
or after a decoder homing frame) is a bad frame per BFI, or per UFI treated as
BFI - what is the default "last" good speech frame?  In ETSI original code it
is a frame of all zero parameters, but this oddity is not readily visible - the
final output of linear PCM is also all zeros, and all is well.  In Themyscira
implementation, the output of our RxFE may be visible externally if it is used
as a TFO transform - hence more attention was given to this issue.

If we feed all zeros as PCM input to a homed standard GSM-HR speech encoder, we
get this frame, repeating endlessly as long as all-zeros PCM input continues:

R0=00 LPC=164,171,cb Int=0 Mode=0
s1=00,00,00 s2=00,00,00 s3=00,00,00 s4=00,00,00

This frame differs from all-zero params only in the LPC set, and this sane-LPC
silence frame is the one we have adopted as our reset-default fallback frame.
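
For illustration, this fallback frame could be written as a constant array of
speech parameters.  The array name is hypothetical, and the layout of 18
parameter words in the order R0, LPC1-3, Int, Mode, then 3 parameters for each
of 4 subframes is our assumption here, not a statement about libgsmhr1
internals:

```c
#include <stdint.h>

#define HR_NUM_PARAMS	18	/* ASSUMED count of speech parameters */

/* Hypothetical constant: the sane-LPC silence frame shown above */
static const int16_t reset_default_frame[HR_NUM_PARAMS] = {
	0x00,			/* R0 */
	0x164, 0x171, 0xcb,	/* LPC */
	0,			/* Int */
	0,			/* Mode */
	/* s1 through s4, 3 parameters each: implicitly all zeros */
};
```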

When libgsmhr1 full speech decoder engine is used, as opposed to TFO transform,
there is an additional check.  If the current state is the special home state
(logic required for spec-mandated EHF output with repeated DHF input) and the
input frame has BFI flag set (no other flags are considered in this case), the
PCM output is set to all zero samples without leaving the home state.  However,
the regular speech ECU and its last good frame default can still be reached if
BFI is clear, UFI is set and R0 is high.

Comfort noise logic in RxFE
===========================

GSM 06.22 spec treats the required bit-exact CN generator as a single entity -
however, in our implementation it is split between the RxFE and the main body
of the full speech decoder.  The bit-exact result in the case of full speech
decoding remains the same, but our arrangement allows non-interpolated CN
generation in the TFO transform as well.

When our RxFE is used as a TFO transform with DTXd=0 (the mode that includes CN
generation), CN output from the transform matches GSM 06.22 Table 2, with the
exception of R0 and LPC parameters.  These R0 and LPC parameters will be filled
as follows:

* If CN insertion period begins with a valid SID, R0 and LPC are taken from
  that SID.

* If CN insertion period begins with an invalid SID, R0 and LPC are taken from
  the last good speech frame, the one used by the speech ECU.  Directly out of
  reset (or after a DHF), these parameters are as shown above:

  R0=00 LPC=164,171,cb

* Any time a new valid SID frame arrives during a CN insertion period, R0 and
  LPC parameters change to this new SID.

* Any time the input during CN insertion is either an unusable frame or an
  invalid SID, R0 and LPC parameters remain unchanged from the most recently
  received valid SID, or from the last good speech frame if only invalid SID
  frames have been received in the entire CN insertion period so far.
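
The four rules above amount to the following update logic, shown here as a
sketch with hypothetical names (struct r0_lpc, update_cn_params) rather than
actual libgsmhr1 code.  The effect of CN muting on R0 is a separate step and
is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two parameter groups of interest */
struct r0_lpc {
	int16_t r0;
	int16_t lpc[3];
};

enum rx_mode { SPEECH, CNIFIRSTSID, CNICONT, CNIBFI };

/*
 * Update the R0 and LPC parameters used for non-interpolated CN output.
 * sid_params is meaningful only when sid_valid is true.
 */
static void update_cn_params(struct r0_lpc *cn, enum rx_mode mode,
			     bool sid_valid,
			     const struct r0_lpc *sid_params,
			     const struct r0_lpc *last_good_speech)
{
	switch (mode) {
	case CNIFIRSTSID:
		/* valid SID: take its params; invalid: fall back to speech */
		*cn = sid_valid ? *sid_params : *last_good_speech;
		break;
	case CNICONT:
		/* only a valid SID update changes the parameters */
		if (sid_valid)
			*cn = *sid_params;
		break;
	default:
		/* CNIBFI: parameters remain unchanged */
		break;
	}
}
```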

Comfort noise muting
====================

Per GSM 06.21 sections 5.2.3 and 5.2.4, when SID frames fail to arrive for 3
consecutive TAF positions, generated comfort noise needs to be muted.  We
implement this logic in our RxFE, and the actual logic is unchanged from ETSI
reference code - it is described in GSM 06.21 section 6.4.

This SID aging and CN muting logic works by counting unusable frames received
in between SID updates.  In the original GSM 06.06 code the criterion to start
CN muting is:

	TAF == 1 && CNIBFI_count >= 25

In our version we changed it to:

	CNIBFI_count >= (TAF ? 25 : 36)

When TAF is indicated correctly, once every 12 frames and with the flag always
present at least in BFI frames (consider GSM 08.61 TRAU-8k format), our extended
criterion is equivalent to the original; however, our version will also produce
eventual CN muting if TAF is missing.
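
The equivalence claim can be checked with a small simulation.  The sketch
below is not part of libgsmhr1; it assumes a simplified model in which the
count of unusable frames starts at 1 on the first such frame and the criterion
is evaluated on every frame, with TAF (when present) set on every 12th frame
at some phase offset.

```c
#include <stdbool.h>

/*
 * Return the unusable frame count at which CN muting would begin,
 * or -1 if it never begins within 100 frames.
 *
 * phase:       offset of TAF positions relative to the count (0..11)
 * taf_present: false models an input stream that never indicates TAF
 * extended:    true selects our extended criterion, false the original
 */
static int first_muting_frame(int phase, bool taf_present, bool extended)
{
	for (int count = 1; count <= 100; count++) {
		bool taf = taf_present && ((count + phase) % 12 == 0);
		bool due = extended ? (count >= (taf ? 25 : 36))
				    : (taf && count >= 25);

		if (due)
			return count;
	}
	return -1;
}
```

With TAF present, both criteria fire on the same frame for every phase; with
TAF absent, only the extended criterion fires, at count 36.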

For the purpose of this logic, invalid SID is as good as valid: while it is
treated just like unusable frames (CNIBFI) for the purpose of R0 and LPC
parameters and their interpolation (see next section), for the purpose of SID
aging and CN muting, invalid SID resets the count of unusable frames, and if
muting already started previously, it is halted at the current (partially muted)
R0 value.

Comfort noise interpolation
===========================

When our RxFE is invoked internally by our full speech decoder, the RxFE passes
some additional flags to the main body of the decoder.  One of these flags
controls interpolation of R0 and LPC parameters for CNI, a function that is
required by the specs with bit-exact stipulation, but which cannot be
implemented at the level of speech parameters.

The only case in which the behaviour of our libgsmhr1 full speech decoder
differs from ETSI original is when an invalid SID frame arrives immediately out
of reset, not preceded by any good speech, valid SID or even unusable frames.
In this case the original GSM 06.06 code uses initialized all-zero state of
pswOldFrmKsDec[] array, which cannot happen in any other case.  In our
implementation we use LPC=164,171,cb instead, as already explained.

Outside of this corner case, invalid SID frames are handled as follows
(unchanged between ETSI original and our version):

* If CN insertion period begins with an invalid SID, R0 and LPC are taken from
  the last good speech frame, the one used by the speech ECU.  These R0 and LPC
  params are then fed into the prescribed bit-exact interpolation mechanism as
  if CN insertion started with a valid SID frame with these parameters.

* Any invalid SID frames that occur in the middle of a CN insertion period are
  treated just like unusable frames for the purpose of interpolation.

Return from CN insertion to speech state
========================================

Exit from DTX/CNI state happens upon receipt of a good speech frame, i.e., a
frame that meets this criterion:

	BFI == 0 && UFI == 0 && SID == 0

However, the original implementation in GSM 06.06 reference code exhibits this
flaw: if the speech ECU is in state 6 (see GSM 06.21 section 6.3) and then an
accepted SID frame (valid or invalid) puts us into DTX state, the first good
speech frame after this DTX pause will be dropped and replaced with fully muted
form of the last good speech frame from before the CN insertion period.  This
effect happens no matter how long that DTX pause was - thus the last good speech
frame being regurgitated (with R0 reduced to 0) may be indefinitely old and out
of place.  Furthermore, if the CNI-exiting good speech frame that is dropped
here is followed by BFI unusable frames, the ECU will return to state 6 and the
parameters (other than muted R0) of the last good speech frame from before the
DTX pause will continue being reused indefinitely.

In our libgsmhr1 version, the state counter for the speech ECU is reset to 7
(the initial home state) whenever our RxFE passes through DTX/CNI state.  Since
only a good speech frame with BFI=0 and UFI=0 can cause exit from CN insertion
state, this reset of ECU state ensures that this good speech frame will pass
through, and then the ECU will be in state 0 after this talkspurt-opening good
speech frame.

Fully muted state after unusable frames in input
================================================

If the input to the speech decoder or TFO transform becomes nothing but BFI
unusable frames, what is the final fully muted or "decayed" output at the level
of modified speech parameters?  In GSM-FR codec there is a special silence frame
defined in GSM 06.11 Table 1, and the final decayed state is a continuous output
of these fixed silence frames - irrespective of whether the Rx DTX handler got
to this fully decayed state from speech or CN muting.

However, no equivalent fully decayed state with fixed output is defined for
GSM-HR.  While this aspect is a non-normative "example" implementation detail,
in both GSM 06.06 reference code and Themyscira libgsmhr1 the fundamental state
of speech vs CNI persists indefinitely even when fully muted:

* If an indefinitely long string of unusable frames occurs in speech state,
  the speech ECU will be in state 6, and the output from the RxFE (externally
  visible in the case of TFO) will endlessly repeat parameters of the last good
  speech frame, except for R0 reduced to 0.

* If an indefinitely long string of unusable frames occurs in DTX/CNI state,
  the output form shown in GSM 06.22 Table 2, complete with bit-exact
  pseudorandom sequence in unvoiced codevector parameters, will likewise
  continue indefinitely.  LPC parameters will remain from the most recently
  received valid SID frame (or from the last good speech frame if CNI period
  began with invalid SID and no valid SID was received afterward), but R0 will
  be reduced to 0 by the CN muting logic.

Because R0 is reduced to 0 in both cases, the above details are generally
invisible with full endpoint speech decoding.  However, they become fully
visible in the case of TFO transform with DTXd=0.

TFO transform with DTXd=1
=========================

The internal RxFE block that emits CN parameters during DTX/CNI state is correct
for the full endpoint speech decoder application and for TFO transform with
DTXd=0.  The case of TFO transform with DTXd=1 is implemented by calling the
same RxFE block, then applying this simple modification to its output: if the
current frame was processed in DTX/CNI mode, the frame of CN parameters is
transformed into a downlink SID frame by replacing all speech parameters beyond
R0 and LPC with all-ones SID codeword.

The internal RxFE block tells the TFO wrapper when this just-described
modification should be applied by way of an internal flag.  This flag is set
in two cases:

1) When the current frame was processed in DTX/CNI mode, or

2) When the speech ECU applied substitution/muting handling to the current
   frame, and the ECU state was 6 or 7 at the beginning of current frame
   processing.
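
Expressed as code, the two cases look like this.  The function and parameter
names are hypothetical and serve only to restate the rule; they are not the
actual internal flag names.

```c
#include <stdbool.h>

/*
 * Should the TFO wrapper (DTXd=1) turn the current output frame
 * into a downlink SID frame?
 */
static bool emit_as_dl_sid(bool cni_mode, bool ecu_substituted,
			   int ecu_state_at_start)
{
	/* case 1: frame was processed in DTX/CNI mode */
	if (cni_mode)
		return true;
	/* case 2: ECU substitution/muting from "fully decayed" state */
	return ecu_substituted &&
		(ecu_state_at_start == 6 || ecu_state_at_start == 7);
}
```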

The effects of this logic are as follows:

1) DTX pauses in UL pass through into DTX pauses in DL, with unusable frames
   and invalid SID replaced with the most recent valid SID, or with R0+LPC from
   the last good speech frame in the case of initial invalid SID.  The
   spec-compliant Rx DTX handler in the destination MS can then produce the
   most correct form of comfort noise, including interpolation of R0 and LPC
   parameters.

2) When the input to TFO transform is nothing but unusable frames, the downlink
   radio leg should go into DTXd state in order to produce the desired reduction
   in radio interference and BTS power consumption.  This effect should happen
   irrespective of whether the "fully decayed" state of RxFE is DTX/CNI muting
   or speech ECU, as covered in the previous section.  Our logic of turning
   "fully decayed" ECU state into DTXd SID achieves the desired effect.

Finally, there is one more modification applied only in the case of TFO
transform with DTXd=1 and not in other cases: muting of comfort noise.  In the
case of full endpoint speech decoding or TFO transform with DTXd=0, when the
criterion for CN muting is first reached, the muting proceeds by decrementing
R0 by 2 on every frame, i.e., gradually.  (See GSM 06.21 section 6.4.)  However,
in the case of TFO transform with DTXd=1, CN muting is effected by reducing R0
to 0 immediately as soon as CN muting criterion is reached.  The rationale is
as follows:

* A TRAU (or TRAU-emulating MGW that feeds Abis to a BTS) has no way of knowing
  exactly which of its continuously emitted DL SID frames will actually get
  transmitted on the air and seen by the MS.  Therefore, a muting process that
  gradually decrements R0 with every emitted SID frame would make no sense.

* If the destination MS receives a SID update with R0=0 subsequent to whatever
  previous SID it received with non-zero R0, the spec-required CN interpolation
  logic in that MS will produce the desired effect of gradual muting over 240 ms
  - not too far from the 320 ms muting time called for in GSM 06.21 section
  5.2.4.
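
The difference between the two muting styles can be illustrated with a short
sketch.  The function names are hypothetical, and we assume that the largest
possible R0 value is 31; the per-frame decrement of 2 is as stated above.

```c
/*
 * One muting step applied to R0 once the CN muting criterion has
 * been reached.  dtxd nonzero selects TFO transform with DTXd=1.
 */
static int mute_r0_step(int r0, int dtxd)
{
	if (dtxd)
		return 0;		/* immediate muting */
	return r0 > 2 ? r0 - 2 : 0;	/* gradual, clamped at 0 */
}

/* Count how many 20 ms frames it takes for R0 to reach 0 */
static int frames_until_silent(int r0, int dtxd)
{
	int n = 0;

	while (r0 > 0) {
		r0 = mute_r0_step(r0, dtxd);
		n++;
	}
	return n;
}
```

Starting from R0=31, gradual muting takes 16 frames, i.e., 320 ms at 20 ms per
frame, whereas in DTXd=1 mode R0 reaches 0 in a single step.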

TFO transform homing
====================

ThemWi implementation of TFO transform includes the feature of in-band homing:
if the input to the transform is the spec-defined decoder homing frame (DHF),
this DHF is passed through to the output just like any other good speech frame,
but the internal state is reset to the initial "home" state.

The check for DHF (all bits must match, plus (BFI == 0 && SID == 0) criterion)
and the resulting state reset happen at the end of frame processing, after the
output for the current frame has been generated.  In the case of ThemWi TFO
transform for GSM-HR, there are two corner cases in which an incoming DHF may
be acted upon (produce state reset), but not appear in the output:

1) The overall state of RxFE was speech (as opposed to DTX/CNI) and the speech
   ECU state was 6 - the state in which the first received good speech frame
   gets dropped.

2) The overall state of RxFE was DTX/CNI and the incoming DHF is marked with
   UFI=1.  UFI is not a criterion for DHF detection, only BFI is, but UFI in
   DTX/CNI state will cause current frame processing to treat the frame as
   unusable.
author	Mychaela Falconia <falcon@freecalypso.org>
date	Fri, 27 Mar 2026 00:13:07 +0000
parents	7fc57e2a6784