comparison doc/HR-codec-Rx-logic @ 632:7fc57e2a6784

beginning of GSM-HR documentation
author Mychaela Falconia <falcon@freecalypso.org>
date Thu, 19 Mar 2026 04:13:45 +0000
1 Rx DTX handler logic for GSM-HR speech codec
2 ============================================
3
4 With all 3 classic GSM speech codecs (FR, HR and EFR), as TCH UL Rx traffic on
5 the network side passes from the BTS to the TRAU, the first processing step
6 performed by the TRAU prior to actual speech decoding is an Rx DTX handler.
7 (For TCH DL Rx on the mobile side, exactly the same processing steps happen in
8 total, but because everything is integrated into a single device, interfaces
9 between steps may be implemented more loosely.)
10
11 For GSM-HR codec the 3 controlling specs for different parts of Rx DTX handler
12 logic are GSM 06.21, GSM 06.22 and GSM 06.41 - however, for the full details
13 these specs defer to the reference C code in GSM 06.06. This article explains
14 this logic from all aspects which we find important: what the Rx DTX logic was
15 in the original reference code from ETSI and how we adapted it in libgsmhr1,
16 both for the full speech decoder and for our implementation of TFO transform.
17
18 Normative vs freely changeable aspects
19 ======================================
20
21 In the case of error-free transmission, such that the receiver never encounters
22 a frame with BFI or UFI set except during continuation of a DTX pause (after
23 receiving a valid SID that begins comfort noise insertion) and is never asked
24 to begin CN insertion with an invalid SID, the full behaviour of the speech
25 decoder to the final linear PCM output is required to be bit-exact and gets
26 exercised by test sequences. This bit-exact behaviour includes non-error-
27 handling aspects of the Rx DTX handler and comfort noise generation, complete
28 with interpolation for periodic CN updates via subsequent SID frames.
29
30 However, the reference C implementation becomes a non-normative example
31 (allowing changes in logic without violating spec requirements) in the
32 following aspects:
33
34 * Handling of BFI and UFI outside of DTX pauses previously entered via a valid
35 SID, including most aspects of error concealment;
36
37 * Exact manner of comfort noise muting when expected SID updates fail to arrive;
38
39 * The exact logic to be applied when a CN insertion period begins with an
40 invalid SID frame.
41
42 Almost-modular nature of GSM-HR Rx DTX handler
43 ==============================================
44
45 An Rx DTX handler can be considered fully modular if its output (which is then
46 passed as input to the main body of the speech decoder) is a potentially
47 modified set of speech parameters that can be packed into a new speech frame
48 and transmitted through a second radio leg with no change in the final output
49 of the speech decoder. The Rx DTX handler implemented in the reference code
50 from ETSI (both spec-normative and "example" aspects as broken down above)
51 _almost_ meets this modularity criterion, but not fully. The following aspects
52 are non-modular:
53
54 * The interpolation of R0 and LPC parameters during comfort noise insertion
55 (bit-exact implementation considered normative) happens after expansion of
56 transmitted parameter bits into linear form. In the general case one cannot
57 produce a new set of encoded parameters (that can be transmitted through a
58 second radio leg) that will produce the same bit-exact result upon final
59 decoding.
60
61 * Handling of speech frames (not SID, outside of DTX pause state) that are
62 marked with BFI=0 and UFI=1 (unreliable frames) has both a modular and a
63 non-modular aspect. If R0 increment is either small enough to not trigger
64 any mitigation or large enough to where UFI is converted into BFI, the applied
65 handling is fully modular. However, if R0 increment falls into the narrow
66 window between the two thresholds, the applied handling (output signal
67 concealment per GSM 06.21 section 5.1.2) is non-modular: it happens deep in
68 the guts of the speech decoder and cannot be represented via a modified set
69 of speech parameters.
70
71 TFO transform derived from the reference Rx DTX handler
72 =======================================================
73
74 If one extracts the reference Rx DTX handler from GSM 06.06 code and removes
75 the two non-modular aspects detailed above, leaving only fully modular logic,
76 the result can be used as a TFO transform that implements the functions of
77 TS 28.062 section C.3.2.1.1, specifically Case 1 in which UL may have DTX, but
78 DL is required to consist of speech frames only.
79
80 How does one address the two non-modular aspects of the standard GSM-HR Rx DTX
81 handler that are not possible in TFO? The simplest implementation is to remove
82 them altogether:
83
84 * Comfort noise parameters are not interpolated, instead an abrupt change in R0
85 and LPC parameters occurs every 240 ms when a new SID frame arrives.
86
87 * UFI is simply dropped in the case when the standard decoder would apply output
88 signal concealment, i.e., the latter feature is given up.
89
90 Obviously this approach constitutes functional regression relative to the
91 standard speech decoder - thus we were initially hesitant to adopt it. However,
92 experiments with a real historical TRAU that supports TFO (Nokia TCSM2) reveal
93 that Nokia implemented exactly the same approach (minimal complexity at the
94 price of slight functional degradation) in their TRAU DSP firmware. Seeing
95 that a major classic vendor of GSM infrastructure implemented this simplistic
96 approach, we are now comfortable with doing the same - especially considering
97 the work scope limits explained in HR-codec-limits article.
98
99 In Themyscira libgsmhr1 implementation, a component has been factored out which
100 we call the Rx front end (RxFE). This RxFE is our cleaned-up reimplementation
101 of those parts of the original Rx DTX handler that are fully modular (including
102 the speech ECU and all CN parameters that aren't interpolated), plus some
103 additional internal flag inputs and outputs. Out of the latter internal flags,
104 some are used only by the full speech decoder, while others are used only by
105 the TFO transform. RxFE state, which also serves as the API-visible TFO
106 transform state, is a subset of full speech decoder state. However, the core
107 RxFE function is not exported directly as API; instead the TFO transform API
108 function is a TFO-specific wrapper around the RxFE.
109
110 Detailed RxFE logic and its evolution
111 =====================================
112
113 Now that we have covered the background of the previous sections, we can
114 properly examine the actual logic of our RxFE, the follow-up logic for CN
115 interpolation that exists only in the full decoder, and their origins in the
116 reference GSM 06.06 code.
117
118 Unless noted otherwise, all logic described in the following sections is the
119 same between ETSI original and the present Themyscira implementation. The
120 internal representation and code structure may be different, but the behavioral
121 logic remains the same unless explicitly called out otherwise.
122
123 Input frame classification
124 --------------------------
125
126 As the very first processing step for every incoming frame, BFI, UFI and SID
127 flags are combined per GSM 06.41 Table 1 to classify the frame as good speech,
128 valid SID, invalid SID or unusable for DTX purposes. Note that UFI turns valid
129 SID into invalid just like BFI, and for DTX purposes all non-SID frames marked
130 with UFI are considered "unusable". But as we shall see shortly, this
131 "unusable" classification matters only for DTX and not for speech ECU logic,
132 which is separate.
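
The classification step can be sketched as a small pure function. This is only an illustration, not the actual libgsmhr1 code: the names are hypothetical, and the SID field encoding (0 = not SID, 1 = invalid SID, 2 = valid SID) is an assumption of the sketch rather than something stated in this article.

```c
/* Hypothetical names.  Assumption: the sid input is encoded as
 * 0 = not SID, 1 = invalid SID, 2 = valid SID. */
enum frame_class {
	FC_GOOD_SPEECH,
	FC_VALID_SID,
	FC_INVALID_SID,
	FC_UNUSABLE
};

enum frame_class classify_frame(int bfi, int ufi, int sid)
{
	if (sid != 0) {
		/* either BFI or UFI turns a valid SID into an invalid one */
		if (sid == 2 && !bfi && !ufi)
			return FC_VALID_SID;
		return FC_INVALID_SID;
	}
	/* non-SID frame: BFI or UFI makes it unusable for DTX purposes */
	if (bfi || ufi)
		return FC_UNUSABLE;
	return FC_GOOD_SPEECH;
}
```

Note that FC_UNUSABLE here is strictly the DTX-level classification; the speech ECU makes its own decisions from BFI and UFI.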
133
134 Speech vs CNI state
135 -------------------
136
137 RxFE state that carries from one frame to the next includes one very important
138 two-state flag: either speech or CNI (comfort noise insertion) mode. By
139 combining the 4 possible frame classifications from GSM 06.41 Table 1 (see
140 above) with these two possible carry-over states, we get 4 possible ways in
141 which the current frame may be handled:
142
143 Input frame class          Previously speech    Previously CNI
144 --------------------------------------------------------------
145 SID (valid or invalid)     CNIFIRSTSID          CNICONT
146 Good speech                SPEECH               SPEECH
147 Unusable                   SPEECH               CNIBFI
148
149 Here we can see that unless we enter DTX/CNI state, neither BFI nor UFI moves
150 RxFE logic out of SPEECH handling. This SPEECH handling mode includes the ECU
151 and handles both good and bad speech frames. However, once DTX/CNI state has
152 been entered, then only a (BFI==0 && UFI==0 && SID==0) good speech frame can
153 effect exit from this state!
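
The same table can be written as code. The following is a minimal sketch with hypothetical enum and function names; only the four handling modes and the two carry-over states come from the table above.

```c
enum frame_class {
	FC_GOOD_SPEECH,
	FC_VALID_SID,
	FC_INVALID_SID,
	FC_UNUSABLE
};

enum rx_state { RX_SPEECH, RX_CNI };

enum handling { H_SPEECH, H_CNIFIRSTSID, H_CNICONT, H_CNIBFI };

/* Pick the handling mode for the current frame from its class
 * and the state carried over from the previous frame. */
enum handling select_handling(enum rx_state prev, enum frame_class fc)
{
	switch (fc) {
	case FC_VALID_SID:
	case FC_INVALID_SID:
		return prev == RX_SPEECH ? H_CNIFIRSTSID : H_CNICONT;
	case FC_GOOD_SPEECH:
		return H_SPEECH;
	default:	/* FC_UNUSABLE */
		return prev == RX_SPEECH ? H_SPEECH : H_CNIBFI;
	}
}

/* The state carried to the next frame follows from the handling mode. */
enum rx_state next_state(enum handling h)
{
	return h == H_SPEECH ? RX_SPEECH : RX_CNI;
}
```

With this formulation the only path from RX_CNI back to RX_SPEECH is a frame classified as good speech.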
154
155 Speech ECU logic
156 ================
157
158 The frame-to-frame persistent state for the ECU consists of the state counter
159 variable (range [0,7]) described in GSM 06.21 section 6.3 and a saved copy of
160 the last good speech frame. The just-referenced spec section describes the
161 logic quite well, but a few additional notes are in order:
162
163 * The last good speech frame that gets regurgitated in substitution/muting
164 states of the ECU is not exactly the same as the actual last good speech frame
165 that went through:
166
167 + GSP0 parameters for the first 3 subframes are replaced with GSP0 parameter
168 for the last subframe;
169
170 + If the frame is voiced, LTP lag parameters are modified - read the code for
171 the details.
172
173 In the original ETSI implementation, these modifications are applied at the
174 time of substitution/muting output; in our implementation, they are applied
175 at the time when a good speech frame is saved. Our implementation approach
176 makes it clearer what state is actually retained, but the functional behaviour
177 is exactly the same.
178
179 * When that last good speech frame gets regurgitated during bad frame handling,
180 codevector parameters may be taken either from that saved last good speech
181 frame or from the current bad frame. Use of codevector parameters from the
182 current bad frame is possible only when the current bad frame and the saved
183 last good speech frame have the same voiced vs unvoiced mode. If this mode
184 matches for one frame and bad-frame codevector parameters get passed on, but
185 the next bad frame has incompatible mode, the saved last good speech frame
186 gets used in its entirety once again, subject only to the modifications
187 described above.
188
189 * Our Themyscira version features an extension: if BFI equals 2 instead of 1,
190 indicating BFI without payload bits, then there are no bad-frame codevector
191 parameters and the saved last good speech frame is used in its entirety,
192 just as if BFI frames always have the wrong voiced vs unvoiced mode.
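
The choice of codevector source described in the last two notes can be sketched as follows. The function name is hypothetical, and the sketch assumes that Mode 0 means unvoiced while Modes 1 through 3 mean voiced.

```c
#include <stdbool.h>

/* Returns true if codevector parameters may be taken from the
 * current bad frame, false if the saved last good speech frame
 * has to be used in its entirety.
 * Assumption: Mode 0 is unvoiced, Modes 1-3 are voiced. */
bool use_bad_frame_codevectors(int bfi, int bad_mode, int saved_mode)
{
	if (bfi == 2)	/* Themyscira extension: BFI without payload bits */
		return false;
	/* both frames must be voiced, or both unvoiced */
	return (bad_mode != 0) == (saved_mode != 0);
}
```

The decision is made anew for every bad frame, so a run of bad frames may alternate between the two sources.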
193
194 BFI out of reset
195 ================
196
197 What happens if the very first input frame in reset state (after external reset
198 or after a decoder homing frame) is a bad frame per BFI, or per UFI treated as
199 BFI - what is the default "last" good speech frame? In ETSI original code it
200 is a frame of all zero parameters, but this oddity is not readily visible - the
201 final output of linear PCM is also all zeros, and all is well. In Themyscira
202 implementation, the output of our RxFE may be visible externally if it is used
203 as a TFO transform - hence more attention was given to this issue.
204
205 If we feed all zeros as PCM input to a homed standard GSM-HR speech encoder, we
206 get this frame, repeating endlessly as long as all-zeros PCM input continues:
207
208 R0=00 LPC=164,171,cb Int=0 Mode=0
209 s1=00,00,00 s2=00,00,00 s3=00,00,00 s4=00,00,00
210
211 This frame differs from all-zero params only in the LPC set, and this sane-LPC
212 silence frame is the one we have adopted as our reset-default fallback frame.
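
For illustration, the frame shown above could be held as a constant array of decoded parameters. The array layout used here (18 parameters in the order R0, 3 LPC words, Int, Mode, then 3 parameters for each of the 4 subframes) and the identifier names are assumptions of this sketch; the values are read as hexadecimal, as printed above.

```c
#include <stdint.h>

#define HR_NUM_PARAMS	18	/* assumed size of the parameter array */

/* Sane-LPC silence frame used as the reset-default "last good" frame */
const int16_t reset_default_frame[HR_NUM_PARAMS] = {
	0x00,			/* R0 */
	0x164, 0x171, 0xcb,	/* LPC */
	0,			/* Int */
	0,			/* Mode */
	0, 0, 0,		/* s1 */
	0, 0, 0,		/* s2 */
	0, 0, 0,		/* s3 */
	0, 0, 0			/* s4 */
};
```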
213
214 When libgsmhr1 full speech decoder engine is used, as opposed to TFO transform,
215 there is an additional check. If the current state is the special home state
216 (logic required for spec-mandated EHF output with repeated DHF input) and the
217 input frame has BFI flag set (no other flags are considered in this case), the
218 PCM output is set to all zero samples without leaving the home state. However,
219 the regular speech ECU and its last good frame default can still be reached if
220 BFI is clear, UFI is set and R0 is high.
221
222 Comfort noise logic in RxFE
223 ===========================
224
225 GSM 06.22 spec treats the required bit-exact CN generator as a single entity -
226 however, in our implementation it is split between the RxFE and the main body
227 of the full speech decoder. The bit-exact result in the case of full speech
228 decoding remains the same, but our arrangement allows non-interpolated CN
229 generation in the TFO transform as well.
230
231 When our RxFE is used as a TFO transform with DTXd=0 (the mode that includes CN
232 generation), CN output from the transform matches GSM 06.22 Table 2, with the
233 exception of R0 and LPC parameters. These R0 and LPC parameters will be filled
234 as follows:
235
236 * If CN insertion period begins with a valid SID, R0 and LPC are taken from
237 that SID.
238
239 * If CN insertion period begins with an invalid SID, R0 and LPC are taken from
240 the last good speech frame, the one used by the speech ECU. Directly out of
241 reset (or after a DHF), these parameters are as shown above:
242
243 R0=00 LPC=164,171,cb
244
245 * Any time a new valid SID frame arrives during a CN insertion period, R0 and
246 LPC parameters change to this new SID.
247
248 * Any time the input during CN insertion is either an unusable frame or an
249 invalid SID, R0 and LPC parameters remain unchanged from the most recently
250 received valid SID, or from the last good speech frame if only invalid SID
251 frames have been received in the entire CN insertion period so far.
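
The four rules above reduce to a very small update function. The sketch below uses hypothetical structure and function names, and it deliberately leaves out CN muting, which additionally modifies R0.

```c
#include <stdint.h>

struct r0_lpc {
	int16_t r0;
	int16_t lpc[3];
};

struct cn_state {
	struct r0_lpc last_good_speech;	/* same frame as used by the ECU */
	struct r0_lpc cn;		/* R0+LPC currently emitted as CN */
};

enum cn_input { CN_IN_VALID_SID, CN_IN_INVALID_SID, CN_IN_UNUSABLE };

/* Update emitted R0+LPC for one frame handled in DTX/CNI mode.
 * first_in_period is nonzero for the SID that begins CN insertion;
 * sid_params is consulted only for a valid SID. */
void cn_update(struct cn_state *st, enum cn_input in, int first_in_period,
	       const struct r0_lpc *sid_params)
{
	if (in == CN_IN_VALID_SID)
		st->cn = *sid_params;		/* new or initial valid SID */
	else if (first_in_period)
		st->cn = st->last_good_speech;	/* initial invalid SID */
	/* otherwise: unusable frame or later invalid SID, no change */
}
```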
252
253 Comfort noise muting
254 ====================
255
256 Per GSM 06.21 sections 5.2.3 and 5.2.4, when SID frames fail to arrive for 3
257 consecutive TAF positions, generated comfort noise needs to be muted. We
258 implement this logic in our RxFE, and the actual logic is unchanged from ETSI
259 reference code - it is described in GSM 06.21 section 6.4.
260
261 This SID aging and CN muting logic works by counting unusable frames received
262 in between SID updates. In the original GSM 06.06 code the criterion to start
263 CN muting is:
264
265 TAF == 1 && CNIBFI_count >= 25
266
267 In our version we changed it to:
268
269 CNIBFI_count >= (TAF ? 25 : 36)
270
271 When TAF is indicated correctly, once every 12 frames and with the flag always
272 present at least in BFI frames (consider GSM 08.61 TRAU-8k format), our extended
273 criterion is equivalent to the original; however, our version will also produce
274 eventual CN muting if TAF is missing.
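
The equivalence claim can be checked with a small model. In the sketch below the two predicates are transcribed from the criteria above, while the helper that walks through a run of unusable frames is hypothetical: it assumes that the counter holds 1 on the first unusable frame and that TAF, when indicated, is set whenever the count modulo 12 equals a chosen phase.

```c
#include <stdbool.h>

/* Criterion from the original GSM 06.06 code */
bool mute_start_orig(int taf, int cnibfi_count)
{
	return taf == 1 && cnibfi_count >= 25;
}

/* Extended criterion used in our version */
bool mute_start_ext(int taf, int cnibfi_count)
{
	return cnibfi_count >= (taf ? 25 : 36);
}

/* Walk through a run of unusable frames and return the count at
 * which the given criterion first fires, or -1 if it never does.
 * taf_phase < 0 models a feed in which TAF is never indicated. */
int first_mute_count(bool (*crit)(int, int), int taf_phase, int max_count)
{
	for (int count = 1; count <= max_count; count++) {
		int taf = taf_phase >= 0 && count % 12 == taf_phase;

		if (crit(taf, count))
			return count;
	}
	return -1;
}
```

Under these assumptions both criteria fire at the same count for every TAF phase from 0 to 11, because any window of 12 consecutive counts starting at 25 contains exactly one TAF position. With TAF absent, only the extended criterion fires, at count 36.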
275
276 For the purpose of this logic, invalid SID is as good as valid: while it is
277 treated just like unusable frames (CNIBFI) for the purpose of R0 and LPC
278 parameters and their interpolation (see next section), for the purpose of SID
279 aging and CN muting, invalid SID resets the count of unusable frames, and if
280 muting already started previously, it is halted at the current (partially muted)
281 R0 value.
282
283 Comfort noise interpolation
284 ===========================
285
286 When our RxFE is invoked internally by our full speech decoder, the RxFE passes
287 some additional flags to the main body of the decoder. One of these flags
288 controls interpolation of R0 and LPC parameters for CNI, a function that is
289 required by the specs with bit-exact stipulation, but which cannot be
290 implemented at the level of speech parameters.
291
292 The only case in which the behaviour of our libgsmhr1 full speech decoder
293 differs from ETSI original is when an invalid SID frame arrives immediately out
294 of reset, not preceded by any good speech, valid SID or even unusable frames.
295 In this case the original GSM 06.06 code uses initialized all-zero state of
296 pswOldFrmKsDec[] array, which cannot happen in any other case. In our
297 implementation we use LPC=164,171,cb instead, as already explained.
298
299 Outside of this corner case, invalid SID frames are handled as follows
300 (unchanged between ETSI original and our version):
301
302 * If CN insertion period begins with an invalid SID, R0 and LPC are taken from
303 the last good speech frame, the one used by the speech ECU. These R0 and LPC
304 params are then fed into the prescribed bit-exact interpolation mechanism as
305 if CN insertion started with a valid SID frame with these parameters.
306
307 * Any invalid SID frames that occur in the middle of a CN insertion period are
308 treated just like unusable frames for the purpose of interpolation.
309
310 Return from CN insertion to speech state
311 ========================================
312
313 Exit from DTX/CNI state happens upon receipt of a good speech frame, i.e., a
314 frame that meets this criterion:
315
316 BFI == 0 && UFI == 0 && SID == 0
317
318 However, the original implementation in GSM 06.06 reference code exhibits this
319 flaw: if the speech ECU is in state 6 (see GSM 06.21 section 6.3) and then an
320 accepted SID frame (valid or invalid) puts us into DTX state, the first good
321 speech frame after this DTX pause will be dropped and replaced with fully muted
322 form of the last good speech frame from before the CN insertion period. This
323 effect happens no matter how long that DTX pause was - thus the last good speech
324 frame being regurgitated (with R0 reduced to 0) may be indefinitely old and out
325 of place. Furthermore, if the CNI-exiting good speech frame that is dropped
326 here is followed by BFI unusable frames, the ECU will return to state 6 and the
327 parameters (other than muted R0) of the last good speech frame from before the
328 DTX pause will continue being reused indefinitely.
329
330 In our libgsmhr1 version, the state counter for the speech ECU is reset to 7
331 (the initial home state) whenever our RxFE passes through DTX/CNI state. Since
332 only a good speech frame with BFI=0 and UFI=0 can effect exit from CN insertion
333 state, this reset of ECU state ensures that this good speech frame will pass
334 through, and then the ECU will be in state 0 after this talkspurt-opening good
335 speech frame.
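
The difference between the two behaviours can be modelled in a few lines. This sketch uses hypothetical names and models only the two facts stated in this article: ECU state 6 drops the first good speech frame, and our version resets the counter to 7 on every frame handled in DTX/CNI mode. The rest of the ECU state machine of GSM 06.21 section 6.3 is not modelled.

```c
#include <stdbool.h>

#define ECU_STATE_MUTED	6	/* first good speech frame gets dropped */
#define ECU_STATE_HOME	7	/* initial home state */

/* ECU state counter after a frame handled in DTX/CNI mode.
 * reset_on_cni: true models libgsmhr1, false the ETSI original. */
int ecu_state_after_cni_frame(int ecu_state, bool reset_on_cni)
{
	return reset_on_cni ? ECU_STATE_HOME : ecu_state;
}

/* Does a good speech frame pass through the ECU unmodified? */
bool good_frame_passes(int ecu_state)
{
	return ecu_state != ECU_STATE_MUTED;
}
```

In this model, an ECU that was in state 6 before the DTX pause still drops the talkspurt-opening frame when reset_on_cni is false, and lets it pass when reset_on_cni is true.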
336
337 Fully muted state after unusable frames in input
338 ================================================
339
340 If the input to the speech decoder or TFO transform becomes nothing but BFI
341 unusable frames, what is the final fully muted or "decayed" output at the level
342 of modified speech parameters? In GSM-FR codec there is a special silence frame
343 defined in GSM 06.11 Table 1, and the final decayed state is a continuous output
344 of these fixed silence frames - irrespective of whether the Rx DTX handler got
345 to this fully decayed state from speech or CN muting.
346
347 However, no equivalent fully decayed state with fixed output is defined for
348 GSM-HR. While this aspect is a non-normative "example" implementation detail,
349 in both GSM 06.06 reference code and Themyscira libgsmhr1 the fundamental state
350 of speech vs CNI persists indefinitely even when fully muted:
351
352 * If an indefinitely long string of unusable frames occurs in speech state,
353 the speech ECU will be in state 6, and the output from the RxFE (externally
354 visible in the case of TFO) will endlessly repeat parameters of the last good
355 speech frame, except for R0 reduced to 0.
356
357 * If an indefinitely long string of unusable frames occurs in DTX/CNI state,
358 the output form shown in GSM 06.22 Table 2, complete with bit-exact
359 pseudorandom sequence in unvoiced codevector parameters, will likewise
360 continue indefinitely. LPC parameters will remain from the most recently
361 received valid SID frame (or from the last good speech frame if CNI period
362 began with invalid SID and no valid SID was received afterward), but R0 will
363 be reduced to 0 by the CN muting logic.
364
365 Because R0 is reduced to 0 in both cases, the above details are generally
366 invisible with full endpoint speech decoding. However, they become fully
367 visible in the case of TFO transform with DTXd=0.
368
369 TFO transform with DTXd=1
370 =========================
371
372 The internal RxFE block that emits CN parameters during DTX/CNI state is correct
373 for the full endpoint speech decoder application and for TFO transform with
374 DTXd=0. The case of TFO transform with DTXd=1 is implemented by calling the
375 same RxFE block, then applying this simple modification to its output: if the
376 current frame was processed in DTX/CNI mode, the frame of CN parameters is
377 transformed into a downlink SID frame by replacing all speech parameters beyond
378 R0 and LPC with all-ones SID codeword.
379
380 The internal RxFE block tells the TFO wrapper when this just-described
381 modification should be applied by way of an internal flag. This flag is set
382 in two cases:
383
384 1) When the current frame was processed in DTX/CNI mode, or
385
386 2) When the speech ECU applied substitution/muting handling to the current
387 frame, and the ECU state was 6 or 7 at the beginning of current frame
388 processing.
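
Expressed as code, the condition for setting this flag could look like the sketch below. The function and parameter names are hypothetical; only the two cases themselves come from the list above.

```c
#include <stdbool.h>

/* Should the TFO wrapper (DTXd=1) turn the current output frame
 * into a downlink SID frame?
 * cni_mode: the frame was processed in DTX/CNI mode
 * ecu_substituted: the ECU applied substitution/muting handling
 * ecu_state_at_start: ECU state counter before this frame */
bool emit_dl_sid(bool cni_mode, bool ecu_substituted, int ecu_state_at_start)
{
	if (cni_mode)
		return true;
	return ecu_substituted &&
		(ecu_state_at_start == 6 || ecu_state_at_start == 7);
}
```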
389
390 The effects of this logic are as follows:
391
392 1) DTX pauses in UL pass through into DTX pauses in DL, with unusable frames
393 and invalid SID replaced with the most recent valid SID, or with R0+LPC from
394 the last good speech frame in the case of initial invalid SID. The
395 spec-compliant Rx DTX handler in the destination MS can then produce the
396 most correct form of comfort noise, including interpolation of R0 and LPC
397 parameters.
398
399 2) When the input to TFO transform is nothing but unusable frames, the downlink
400 radio leg should go into DTXd state in order to produce the desired reduction
401 in radio interference and BTS power consumption. This effect should happen
402 irrespective of whether the "fully decayed" state of RxFE is DTX/CNI muting
403 or speech ECU, as covered in the previous section. Our logic of turning
404 "fully decayed" ECU state into DTXd SID achieves the desired effect.
405
406 Finally, there is one more modification applied only in the case of TFO
407 transform with DTXd=1 and not in other cases: muting of comfort noise. In the
408 case of full endpoint speech decoding or TFO transform with DTXd=0, when the
409 criterion for CN muting is first reached, the muting proceeds by decrementing
410 R0 by 2 on every frame, i.e., gradually. (See GSM 06.21 section 6.4.) However,
411 in the case of TFO transform with DTXd=1, CN muting is effected by reducing R0
412 to 0 immediately as soon as CN muting criterion is reached. The rationale is
413 as follows:
414
415 * A TRAU (or TRAU-emulating MGW that feeds Abis to a BTS) has no way of knowing
416 exactly which of its continuously emitted DL SID frames will actually get
417 transmitted on the air and seen by the MS. Therefore, a muting process that
418 gradually decrements R0 with every emitted SID frame would make no sense.
419
420 * If the destination MS receives a SID update with R0=0 subsequent to whatever
421 previous SID it received with non-zero R0, the spec-required CN interpolation
422 logic in that MS will produce the desired effect of gradual muting over 240 ms
423 - not too far from the 320 ms muting time called for in GSM 06.21 section
424 5.2.4.
425
426 TFO transform homing
427 ====================
428
429 ThemWi implementation of TFO transform includes the feature of in-band homing:
430 if the input to the transform is the spec-defined decoder homing frame (DHF),
431 this DHF is passed through to the output just like any other good speech frame,
432 but the internal state is reset to the initial "home" state.
433
434 The check for DHF (all bits must match, plus (BFI == 0 && SID == 0) criterion)
435 and the resulting state reset happen at the end of frame processing, after the
436 output for the current frame has been generated. In the case of ThemWi TFO
437 transform for GSM-HR, there are two corner cases in which an incoming DHF may
438 be acted upon (produce state reset), but not appear in the output:
439
440 1) The overall state of RxFE was speech (as opposed to DTX/CNI) and the speech
441 ECU state was 6 - the state in which the first received good speech frame
442 gets dropped.
443
444 2) The overall state of RxFE was DTX/CNI and the incoming DHF is marked with
445 UFI=1. UFI is not a criterion for DHF detection, only BFI is, but UFI in
446 DTX/CNI state will cause current frame processing to treat the frame as
447 unusable.