diff doc/HR-codec-Rx-logic @ 632:7fc57e2a6784
beginning of GSM-HR documentation
| author | Mychaela Falconia <falcon@freecalypso.org> |
|---|---|
| date | Thu, 19 Mar 2026 04:13:45 +0000 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/HR-codec-Rx-logic Thu Mar 19 04:13:45 2026 +0000
@@ -0,0 +1,447 @@
+Rx DTX handler logic for GSM-HR speech codec
+============================================
+
+With all 3 classic GSM speech codecs (FR, HR and EFR), as TCH UL Rx traffic on
+the network side passes from the BTS to the TRAU, the first processing step
+performed by the TRAU prior to actual speech decoding is an Rx DTX handler.
+(For TCH DL Rx on the mobile side, exactly the same processing steps happen in
+total, but because everything is integrated into a single device, interfaces
+between steps may be implemented more loosely.)
+
+For GSM-HR codec the 3 controlling specs for different parts of Rx DTX handler
+logic are GSM 06.21, GSM 06.22 and GSM 06.41 - however, for the full details
+these specs defer to the reference C code in GSM 06.06. This article explains
+this logic from all aspects which we find important: what the Rx DTX logic was
+in the original reference code from ETSI and how we adapted it in libgsmhr1,
+both for the full speech decoder and for our implementation of TFO transform.
+
+Normative vs freely changeable aspects
+======================================
+
+In the case of error-free transmission, such that the receiver never encounters
+a frame with BFI or UFI set except during continuation of a DTX pause (after
+receiving a valid SID that begins comfort noise insertion) and is never asked
+to begin CN insertion with an invalid SID, the full behaviour of the speech
+decoder to the final linear PCM output is required to be bit-exact and gets
+exercised by test sequences. This bit-exact behaviour includes non-error-
+handling aspects of the Rx DTX handler and comfort noise generation, complete
+with interpolation for periodic CN updates via subsequent SID frames.
+
+However, the reference C implementation becomes a non-normative example
+(allowing changes in logic without violating spec requirements) in the
+following aspects:
+
+* Handling of BFI and UFI outside of DTX pauses previously entered via a valid
+  SID, including most aspects of error concealment;
+
+* Exact manner of comfort noise muting when expected SID updates fail to arrive;
+
+* The exact logic to be applied when a CN insertion period begins with an
+  invalid SID frame.
+
+Almost-modular nature of GSM-HR Rx DTX handler
+==============================================
+
+An Rx DTX handler can be considered fully modular if its output (which is then
+passed as input to the main body of the speech decoder) is a potentially
+modified set of speech parameters that can be packed into a new speech frame
+and transmitted through a second radio leg with no change in the final output
+of the speech decoder. The Rx DTX handler implemented in the reference code
+from ETSI (both spec-normative and "example" aspects as broken down above)
+_almost_ meets this modularity criterion, but not fully. The following aspects
+are non-modular:
+
+* The interpolation of R0 and LPC parameters during comfort noise insertion
+  (bit-exact implementation considered normative) happens after expansion of
+  transmitted parameter bits into linear form. In the general case one cannot
+  produce a new set of encoded parameters (that can be transmitted through a
+  second radio leg) that will produce the same bit-exact result upon final
+  decoding.
+
+* Handling of speech frames (not SID, outside of DTX pause state) that are
+  marked with BFI=0 and UFI=1 (unreliable frames) has both a modular and a
+  non-modular aspect. If R0 increment is either small enough to not trigger
+  any mitigation or large enough to where UFI is converted into BFI, the applied
+  handling is fully modular.
+  However, if R0 increment falls into the narrow
+  window between the two thresholds, the applied handling (output signal
+  concealment per GSM 06.21 section 5.1.2) is non-modular: it happens deep in
+  the guts of the speech decoder and cannot be represented via a modified set
+  of speech parameters.
+
+TFO transform derived from the reference Rx DTX handler
+=======================================================
+
+If one extracts the reference Rx DTX handler from GSM 06.06 code and removes
+the two non-modular aspects detailed above, leaving only fully modular logic,
+the result can be used as a TFO transform that implements the functions of
+TS 28.062 section C.3.2.1.1, specifically Case 1 in which UL may have DTX, but
+DL is required to consist of speech frames only.
+
+How does one address the two non-modular aspects of the standard GSM-HR Rx DTX
+handler that are not possible in TFO? The simplest implementation is to remove
+them altogether:
+
+* Comfort noise parameters are not interpolated, instead an abrupt change in R0
+  and LPC parameters occurs every 240 ms when a new SID frame arrives.
+
+* UFI is simply dropped in the case when the standard decoder would apply output
+  signal concealment, i.e., the latter feature is given up.
+
+Obviously this approach constitutes functional regression relative to the
+standard speech decoder - thus we were initially hesitant to adopt it. However,
+experiments with a real historical TRAU that supports TFO (Nokia TCSM2) reveal
+that Nokia implemented exactly the same approach (minimal complexity at the
+price of slight functional degradation) in their TRAU DSP firmware. Seeing
+that a major classic vendor of GSM infrastructure implemented this simplistic
+approach, we are now comfortable with doing the same - especially considering
+the work scope limits explained in HR-codec-limits article.
+
+In Themyscira libgsmhr1 implementation, a component has been factored out which
+we call the Rx front end (RxFE).
+This RxFE is our cleaned-up reimplementation
+of those parts of the original Rx DTX handler that are fully modular (including
+the speech ECU and all CN parameters that aren't interpolated), plus some
+additional internal flag inputs and outputs. Out of the latter internal flags,
+some are used only by the full speech decoder, while others are used only by
+the TFO transform. RxFE state, which also serves as the API-visible TFO
+transform state, is a subset of full speech decoder state. However, the core
+RxFE function is not exported directly as API; instead the TFO transform API
+function is a TFO-specific wrapper around the RxFE.
+
+Detailed RxFE logic and its evolution
+=====================================
+
+Now that we have covered the background of the previous sections, we can
+properly examine the actual logic of our RxFE, the follow-up logic for CN
+interpolation that exists only in the full decoder, and their origins in the
+reference GSM 06.06 code.
+
+Unless noted otherwise, all logic described in the following sections is the
+same between ETSI original and the present Themyscira implementation. The
+internal representation and code structure may be different, but the behavioral
+logic remains the same unless explicitly called out otherwise.
+
+Input frame classification
+--------------------------
+
+As the very first processing step for every incoming frame, BFI, UFI and SID
+flags are combined per GSM 06.41 Table 1 to classify the frame as good speech,
+valid SID, invalid SID or unusable for DTX purposes. Note that UFI turns valid
+SID into invalid just like BFI, and for DTX purposes all non-SID frames marked
+with UFI are considered "unusable". But as we shall see shortly, this
+"unusable" classification matters only for DTX and not for speech ECU logic,
+which is separate.
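
The classification step described above can be illustrated with a minimal C sketch. It is not taken from libgsmhr1 or from the GSM 06.06 code: the enum, the function name and the three-valued SID encoding (0 = no SID codeword, 1 = damaged SID codeword, 2 = intact SID codeword) are assumptions made for illustration, as is the rule that SID=1 always yields an invalid SID regardless of BFI and UFI. The authoritative mapping is GSM 06.41 Table 1.

```c
#include <assert.h>

/* Frame classes for DTX purposes; names are illustrative, not libgsmhr1 API. */
enum rx_frame_class {
	FRAME_GOOD_SPEECH,
	FRAME_VALID_SID,
	FRAME_INVALID_SID,
	FRAME_UNUSABLE
};

/*
 * bfi, ufi: zero or nonzero.
 * sid: assumed 3-valued (0 = not SID, 1 = damaged SID codeword,
 * 2 = intact SID codeword); this encoding is an assumption of the sketch.
 */
static enum rx_frame_class classify_frame(int bfi, int ufi, int sid)
{
	assert(sid >= 0 && sid <= 2);

	if (sid == 0) {
		/* Non-SID frame: either error flag makes it unusable for DTX. */
		return (bfi || ufi) ? FRAME_UNUSABLE : FRAME_GOOD_SPEECH;
	}
	/* A SID is valid only when intact and free of both error flags. */
	if (sid == 2 && !bfi && !ufi)
		return FRAME_VALID_SID;
	/* UFI or BFI turns a valid SID into an invalid one. */
	return FRAME_INVALID_SID;
}
```

In this sketch a frame with UFI=1 and SID=0 comes out as unusable, which matters only for the DTX state decision; the speech ECU looks at BFI and UFI separately.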
+
+Speech vs CNI state
+-------------------
+
+RxFE state that carries from one frame to the next includes one very important
+two-state flag: either speech or CNI (comfort noise insertion) mode. By
+combining the 4 possible frame classifications from GSM 06.41 Table 1 (see
+above) with these two possible carry-over states, we get 4 possible ways in
+which the current frame may be handled:
+
+Input frame class         Previously speech    Previously CNI
+--------------------------------------------------------------
+SID (valid or invalid)    CNIFIRSTSID          CNICONT
+Good speech               SPEECH               SPEECH
+Unusable                  SPEECH               CNIBFI
+
+Here we can see that unless we enter DTX/CNI state, neither BFI nor UFI moves
+RxFE logic out of SPEECH handling. This SPEECH handling mode includes the ECU
+and handles both good and bad speech frames. However, once DTX/CNI state has
+been entered, then only a (BFI==0 && UFI==0 && SID==0) good speech frame can
+effect exit from this state!
+
+Speech ECU logic
+================
+
+The frame-to-frame persistent state for the ECU consists of the state counter
+variable (range [0,7]) described in GSM 06.21 section 6.3 and a saved copy of
+the last good speech frame. The just-referenced spec section describes the
+logic quite well, but a few additional notes are in order:
+
+* The last good speech frame that gets regurgitated in substitution/muting
+  states of the ECU is not exactly the same as the actual last good speech frame
+  that went through:
+
+  + GSP0 parameters for the first 3 subframes are replaced with GSP0 parameter
+    for the last subframe;
+
+  + If the frame is voiced, LTP lag parameters are modified - read the code for
+    the details.
+
+  In the original ETSI implementation, these modifications are applied at the
+  time of substitution/muting output; in our implementation, they are applied
+  at the time when a good speech frame is saved.
+  Our implementation approach
+  makes it clearer what state is actually retained, but the functional behaviour
+  is exactly the same.
+
+* When that last good speech frame gets regurgitated during bad frame handling,
+  codevector parameters may be taken either from that saved last good speech
+  frame or from the current bad frame. Use of codevector parameters from the
+  current bad frame is possible only when the current bad frame and the saved
+  last good speech frame have the same voiced vs unvoiced mode. If this mode
+  matches for one frame and bad-frame codevector parameters get passed on, but
+  the next bad frame has incompatible mode, the saved last good speech frame
+  gets used in its entirety once again, subject only to the modifications
+  described above.
+
+* Our Themyscira version features an extension: if BFI equals 2 instead of 1,
+  indicating BFI without payload bits, then there are no bad-frame codevector
+  parameters and the saved last good speech frame is used in its entirety,
+  just as if BFI frames always have the wrong voiced vs unvoiced mode.
+
+BFI out of reset
+================
+
+What happens if the very first input frame in reset state (after external reset
+or after a decoder homing frame) is a bad frame per BFI, or per UFI treated as
+BFI - what is the default "last" good speech frame? In ETSI original code it
+is a frame of all zero parameters, but this oddity is not readily visible - the
+final output of linear PCM is also all zeros, and all is well. In Themyscira
+implementation, the output of our RxFE may be visible externally if it is used
+as a TFO transform - hence more attention was given to this issue.
+
+If we feed all zeros as PCM input to a homed standard GSM-HR speech encoder, we
+get this frame, repeating endlessly as long as all-zeros PCM input continues:
+
+R0=00 LPC=164,171,cb Int=0 Mode=0
+s1=00,00,00 s2=00,00,00 s3=00,00,00 s4=00,00,00
+
+This frame differs from all-zero params only in the LPC set, and this sane-LPC
+silence frame is the one we have adopted as our reset-default fallback frame.
+
+When libgsmhr1 full speech decoder engine is used, as opposed to TFO transform,
+there is an additional check. If the current state is the special home state
+(logic required for spec-mandated EHF output with repeated DHF input) and the
+input frame has BFI flag set (no other flags are considered in this case), the
+PCM output is set to all zero samples without leaving the home state. However,
+the regular speech ECU and its last good frame default can still be reached if
+BFI is clear, UFI is set and R0 is high.
+
+Comfort noise logic in RxFE
+===========================
+
+GSM 06.22 spec treats the required bit-exact CN generator as a single entity -
+however, in our implementation it is split between the RxFE and the main body
+of the full speech decoder. The bit-exact result in the case of full speech
+decoding remains the same, but our arrangement allows non-interpolated CN
+generation in the TFO transform as well.
+
+When our RxFE is used as a TFO transform with DTXd=0 (the mode that includes CN
+generation), CN output from the transform matches GSM 06.22 Table 2, with the
+exception of R0 and LPC parameters. These R0 and LPC parameters will be filled
+as follows:
+
+* If CN insertion period begins with a valid SID, R0 and LPC are taken from
+  that SID.
+
+* If CN insertion period begins with an invalid SID, R0 and LPC are taken from
+  the last good speech frame, the one used by the speech ECU.
+  Directly out of
+  reset (or after a DHF), these parameters are as shown above:
+
+  R0=00 LPC=164,171,cb
+
+* Any time a new valid SID frame arrives during a CN insertion period, R0 and
+  LPC parameters change to this new SID.
+
+* Any time the input during CN insertion is either an unusable frame or an
+  invalid SID, R0 and LPC parameters remain unchanged from the most recently
+  received valid SID, or from the last good speech frame if only invalid SID
+  frames have been received in the entire CN insertion period so far.
+
+Comfort noise muting
+====================
+
+Per GSM 06.21 sections 5.2.3 and 5.2.4, when SID frames fail to arrive for 3
+consecutive TAF positions, generated comfort noise needs to be muted. We
+implement this logic in our RxFE, and the actual logic is unchanged from ETSI
+reference code - it is described in GSM 06.21 section 6.4.
+
+This SID aging and CN muting logic works by counting unusable frames received
+in between SID updates. In the original GSM 06.06 code the criterion to start
+CN muting is:
+
+    TAF == 1 && CNIBFI_count >= 25
+
+In our version we changed it to:
+
+    CNIBFI_count >= (TAF ? 25 : 36)
+
+When TAF is indicated correctly, once every 12 frames and with the flag always
+present at least in BFI frames (consider GSM 08.61 TRAU-8k format), our extended
+criterion is equivalent to the original; however, our version will also produce
+eventual CN muting if TAF is missing.
+
+For the purpose of this logic, invalid SID is as good as valid: while it is
+treated just like unusable frames (CNIBFI) for the purpose of R0 and LPC
+parameters and their interpolation (see next section), for the purpose of SID
+aging and CN muting, invalid SID resets the count of unusable frames, and if
+muting already started previously, it is halted at the current (partially muted)
+R0 value.
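
The equivalence claim for the two muting criteria can be checked with a small C sketch. The two predicates restate the expressions quoted above. Everything else is hypothetical scaffolding: the helper name, the convention that the counter equals n on the n-th consecutive unusable frame, and the modelling of TAF as a flag set once every 12 frames at some phase. Whether the real code tests the counter before or after incrementing it is not stated in the text, so the absolute frame numbers below hold only for this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Criterion as quoted from the original GSM 06.06 code. */
static bool cn_mute_original(bool taf, unsigned cnibfi_count)
{
	return taf && cnibfi_count >= 25;
}

/* Extended criterion as quoted for the Themyscira version. */
static bool cn_mute_extended(bool taf, unsigned cnibfi_count)
{
	return cnibfi_count >= (taf ? 25u : 36u);
}

/*
 * Hypothetical test helper: feed a run of unusable frames numbered from 1,
 * with the counter equal to the frame number.  If taf_present is true, TAF
 * is set on frames where n % 12 == phase; otherwise TAF is never set.
 * Returns the first frame number at which the criterion fires, or 0 if it
 * never fires within max_frames.
 */
static unsigned first_mute_frame(bool (*criterion)(bool, unsigned),
				 bool taf_present, unsigned phase,
				 unsigned max_frames)
{
	unsigned n;

	assert(phase < 12);
	for (n = 1; n <= max_frames; n++) {
		bool taf = taf_present && (n % 12 == phase);

		if (criterion(taf, n))
			return n;
	}
	return 0;
}
```

With TAF present at any phase, both criteria fire on the same frame in this model, because some TAF position always falls within counts 25 through 36. With TAF absent, only the extended criterion fires, at count 36.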
+
+Comfort noise interpolation
+===========================
+
+When our RxFE is invoked internally by our full speech decoder, the RxFE passes
+some additional flags to the main body of the decoder. One of these flags
+controls interpolation of R0 and LPC parameters for CNI, a function that is
+required by the specs with bit-exact stipulation, but which cannot be
+implemented at the level of speech parameters.
+
+The only case in which the behaviour of our libgsmhr1 full speech decoder
+differs from ETSI original is when an invalid SID frame arrives immediately out
+of reset, not preceded by any good speech, valid SID or even unusable frames.
+In this case the original GSM 06.06 code uses initialized all-zero state of
+pswOldFrmKsDec[] array, which cannot happen in any other case. In our
+implementation we use LPC=164,171,cb instead, as already explained.
+
+Outside of this corner case, invalid SID frames are handled as follows
+(unchanged between ETSI original and our version):
+
+* If CN insertion period begins with an invalid SID, R0 and LPC are taken from
+  the last good speech frame, the one used by the speech ECU. These R0 and LPC
+  params are then fed into the prescribed bit-exact interpolation mechanism as
+  if CN insertion started with a valid SID frame with these parameters.
+
+* Any invalid SID frames that occur in the middle of a CN insertion period are
+  treated just like unusable frames for the purpose of interpolation.
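
The rules for choosing the R0 and LPC source during a CN insertion period can be summarised in a short C sketch. It is illustrative only: the struct and function names are hypothetical and do not reflect libgsmhr1 internals, and the interpolation itself is not modelled. The sketch only tracks which parameter set is current, the set that would be emitted directly by the TFO transform or fed into interpolation by the full decoder. The reset default R0=00 LPC=164,171,cb is the one given in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* R0 plus the three LPC codewords; layout is an assumption of this sketch. */
struct r0_lpc {
	unsigned r0;
	unsigned lpc[3];
};

struct cn_select_state {
	struct r0_lpc last_good;	/* from the last good speech frame (ECU copy) */
	struct r0_lpc cn;		/* parameter set currently used for CN */
};

/* Reset-default fallback parameters quoted in the text (hex values). */
static const struct r0_lpc reset_default = { 0x00, { 0x164, 0x171, 0xcb } };

static void cn_select_reset(struct cn_select_state *st)
{
	st->last_good = reset_default;
	st->cn = reset_default;
}

/*
 * Called once per frame handled in DTX/CNI mode.
 * first_sid_of_pause: this frame is the SID that opens the CN period.
 * sid_valid: the frame classified as a valid SID.
 * sid_params: R0/LPC of the frame; only read when sid_valid is true.
 * Unusable frames are passed as (false, false, NULL).
 */
static void cn_select_update(struct cn_select_state *st,
			     bool first_sid_of_pause, bool sid_valid,
			     const struct r0_lpc *sid_params)
{
	if (sid_valid) {
		assert(sid_params != NULL);
		st->cn = *sid_params;		/* new valid SID always wins */
	} else if (first_sid_of_pause) {
		st->cn = st->last_good;		/* pause opened by invalid SID */
	}
	/* Otherwise (mid-pause invalid SID or unusable frame): unchanged. */
}
```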
+
+Return from CN insertion to speech state
+========================================
+
+Exit from DTX/CNI state happens upon receipt of a good speech frame, i.e., a
+frame that meets this criterion:
+
+    BFI == 0 && UFI == 0 && SID == 0
+
+However, the original implementation in GSM 06.06 reference code exhibits this
+flaw: if the speech ECU is in state 6 (see GSM 06.21 section 6.3) and then an
+accepted SID frame (valid or invalid) puts us into DTX state, the first good
+speech frame after this DTX pause will be dropped and replaced with fully muted
+form of the last good speech frame from before the CN insertion period. This
+effect happens no matter how long that DTX pause was - thus the last good speech
+frame being regurgitated (with R0 reduced to 0) may be indefinitely old and out
+of place. Furthermore, if the CNI-exiting good speech frame that is dropped
+here is followed by BFI unusable frames, the ECU will return to state 6 and the
+parameters (other than muted R0) of the last good speech frame from before the
+DTX pause will continue being reused indefinitely.
+
+In our libgsmhr1 version, the state counter for the speech ECU is reset to 7
+(the initial home state) whenever our RxFE passes through DTX/CNI state. Since
+only a good speech frame with BFI=0 and UFI=0 can make exit from CN insertion
+state, this reset of ECU state ensures that this good speech frame will pass
+through, and then the ECU will be in state 0 after this talkspurt-opening good
+speech frame.
+
+Fully muted state after unusable frames in input
+================================================
+
+If the input to the speech decoder or TFO transform becomes nothing but BFI
+unusable frames, what is the final fully muted or "decayed" output at the level
+of modified speech parameters?
+In GSM-FR codec there is a special silence frame
+defined in GSM 06.11 Table 1, and the final decayed state is a continuous output
+of these fixed silence frames - irrespective of whether the Rx DTX handler got
+to this fully decayed state from speech or CN muting.
+
+However, no equivalent fully decayed state with fixed output is defined for
+GSM-HR. While this aspect is a non-normative "example" implementation detail,
+in both GSM 06.06 reference code and Themyscira libgsmhr1 the fundamental state
+of speech vs CNI persists indefinitely even when fully muted:
+
+* If an indefinitely long string of unusable frames occurs in speech state,
+  the speech ECU will be in state 6, and the output from the RxFE (externally
+  visible in the case of TFO) will endlessly repeat parameters of the last good
+  speech frame, except for R0 reduced to 0.
+
+* If an indefinitely long string of unusable frames occurs in DTX/CNI state,
+  the output form shown in GSM 06.22 Table 2, complete with bit-exact
+  pseudorandom sequence in unvoiced codevector parameters, will likewise
+  continue indefinitely. LPC parameters will remain from the most recently
+  received valid SID frame (or from the last good speech frame if CNI period
+  began with invalid SID and no valid SID was received afterward), but R0 will
+  be reduced to 0 by the CN muting logic.
+
+Because R0 is reduced to 0 in both cases, the above details are generally
+invisible with full endpoint speech decoding. However, they become fully
+visible in the case of TFO transform with DTXd=0.
+
+TFO transform with DTXd=1
+=========================
+
+The internal RxFE block that emits CN parameters during DTX/CNI state is correct
+for the full endpoint speech decoder application and for TFO transform with
+DTXd=0.
+The case of TFO transform with DTXd=1 is implemented by calling the
+same RxFE block, then applying this simple modification to its output: if the
+current frame was processed in DTX/CNI mode, the frame of CN parameters is
+transformed into a downlink SID frame by replacing all speech parameters beyond
+R0 and LPC with all-ones SID codeword.
+
+The internal RxFE block tells the TFO wrapper when this just-described
+modification should be applied by way of an internal flag. This flag is set
+in two cases:
+
+1) When the current frame was processed in DTX/CNI mode, or
+
+2) When the speech ECU applied substitution/muting handling to the current
+   frame, and the ECU state was 6 or 7 at the beginning of current frame
+   processing.
+
+The effects of this logic are as follows:
+
+1) DTX pauses in UL pass through into DTX pauses in DL, with unusable frames
+   and invalid SID replaced with the most recent valid SID, or with R0+LPC from
+   the last good speech frame in the case of initial invalid SID. The
+   spec-compliant Rx DTX handler in the destination MS can then produce the
+   most correct form of comfort noise, including interpolation of R0 and LPC
+   parameters.
+
+2) When the input to TFO transform is nothing but unusable frames, the downlink
+   radio leg should go into DTXd state in order to produce the desired reduction
+   in radio interference and BTS power consumption. This effect should happen
+   irrespective of whether the "fully decayed" state of RxFE is DTX/CNI muting
+   or speech ECU, as covered in the previous section. Our logic of turning
+   "fully decayed" ECU state into DTXd SID achieves the desired effect.
+
+Finally, there is one more modification applied only in the case of TFO
+transform with DTXd=1 and not in other cases: muting of comfort noise. In the
+case of full endpoint speech decoding or TFO transform with DTXd=0, when the
+criterion for CN muting is first reached, the muting proceeds by decrementing
+R0 by 2 on every frame, i.e., gradually.
+(See GSM 06.21 section 6.4.) However,
+in the case of TFO transform with DTXd=1, CN muting is effected by reducing R0
+to 0 immediately as soon as CN muting criterion is reached. The rationale is
+as follows:
+
+* A TRAU (or TRAU-emulating MGW that feeds Abis to a BTS) has no way of knowing
+  exactly which of its continuously emitted DL SID frames will actually get
+  transmitted on the air and seen by the MS. Therefore, a muting process that
+  gradually decrements R0 with every emitted SID frame would make no sense.
+
+* If the destination MS receives a SID update with R0=0 subsequent to whatever
+  previous SID it received with non-zero R0, the spec-required CN interpolation
+  logic in that MS will produce the desired effect of gradual muting over 240 ms
+  - not too far from the 320 ms muting time called for in GSM 06.21 section
+  5.2.4.
+
+TFO transform homing
+====================
+
+ThemWi implementation of TFO transform includes the feature of in-band homing:
+if the input to the transform is the spec-defined decoder homing frame (DHF),
+this DHF is passed through to the output just like any other good speech frame,
+but the internal state is reset to the initial "home" state.
+
+The check for DHF (all bits must match, plus (BFI == 0 && SID == 0) criterion)
+and the resulting state reset happen at the end of frame processing, after the
+output for the current frame has been generated. In the case of ThemWi TFO
+transform for GSM-HR, there are two corner cases in which an incoming DHF may
+be acted upon (produce state reset), but not appear in the output:
+
+1) The overall state of RxFE was speech (as opposed to DTX/CNI) and the speech
+   ECU state was 6 - the state in which the first received good speech frame
+   gets dropped.
+
+2) The overall state of RxFE was DTX/CNI and the incoming DHF is marked with
+   UFI=1. UFI is not a criterion for DHF detection, only BFI is, but UFI in
+   DTX/CNI state will cause current frame processing to treat the frame as
+   unusable.
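
The DHF detection rule and its two corner cases can be restated as a small C decision function. This is a hedged sketch, not the ThemWi code: the struct, the function name and the bits_match_dhf input (standing in for the full bit comparison against the spec-defined DHF, which is not reproduced here) are hypothetical. The function encodes only the two corner cases named above and assumes that in every other case a detected DHF does appear in the output.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of DHF handling for one input frame (illustrative type). */
struct dhf_result {
	bool state_reset;	/* transform state returns to "home" */
	bool dhf_in_output;	/* the DHF itself is emitted as output */
};

/*
 * bits_match_dhf: all payload bits equal the spec-defined DHF (comparison
 *                 not shown; this flag is an assumption of the sketch).
 * in_cni_state, ecu_state: RxFE state at the start of frame processing.
 */
static struct dhf_result dhf_outcome(bool bits_match_dhf, int bfi, int ufi,
				     int sid, bool in_cni_state, int ecu_state)
{
	struct dhf_result r = { false, false };

	assert(ecu_state >= 0 && ecu_state <= 7);

	/* Detection uses BFI and SID only; UFI is deliberately ignored. */
	if (!(bits_match_dhf && bfi == 0 && sid == 0))
		return r;
	r.state_reset = true;

	if (!in_cni_state && ecu_state == 6)
		r.dhf_in_output = false;	/* corner case 1: frame dropped by ECU */
	else if (in_cni_state && ufi)
		r.dhf_in_output = false;	/* corner case 2: treated as unusable */
	else
		r.dhf_in_output = true;		/* assumed for all other cases */
	return r;
}
```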
