3GPP TS 28.062 section C.3.2.1.1 for EFR: seeking help

Fri Mar 31 18:59:38 UTC 2023

Hello GSM community,

I realize that most of you over in Osmocom land would much rather see
me submit Gerrit patches than write lengthy ML posts, but right now I
really need some help with the algorithmic logic of a feature before I
can develop patches implementing said feature - so please bear with
me.

The fundamental question is: what is the most correct way for a GSM
network (let's ignore divisions between network elements for the
moment) to construct the DL speech frame stream for call leg B if it
is coming from the UL of call leg A?  I am talking about call scenarios
where call leg A and call leg B use the same codec, thus no transcoding
is done (TrFO), and let me also further restrict this question to
old-style FR/HR/EFR codecs, as opposed to AMR.

At first the answer may seem so obvious that many people will probably
wonder why I am asking such a silly question: just take the speech
frame stream from call leg A UL, feed it to call leg B DL and be done
with it, right?  But the question is not so simple.  What should the
UL-to-DL mapper do when the UL stream hits a BFI instead of a valid
speech frame?  What should this mapper do if call leg A does DTXu but
there is no DTXd on call leg B?

The only place in 3GPP specs where I could find an answer to this
question is TS 28.062 section C.3.2.1.1.  Yes, I know that it's the
spec for in-band TFO within G.711, a feature which probably no one
other than me cares about, but that particular section - I am
talking about section C.3.2.1.1 specifically, you can ignore the rest
of TFO for the purpose of this question - seems to me like it should
apply to _any_ scenario where an FR/HR/EFR frame stream is directly
passed from call leg A to call leg B without transcoding, including
scenarios like a self-contained Osmocom network with OsmoMSC switching
from one MS to another without any external MNCC.

Let us first consider the case of FR1 codec, which is the simplest.
Suppose call leg A has DTXu but call leg B has no DTXd - one can't do
DTXd on C0, so if 200 kHz of spectrum is all you got, operating a BTS
with just C0, then no one can do DTXd.  When Alice on call leg A is
silent, her MS will send a SID every 480 ms and have its Tx off the
rest of the time, and the frame stream from the BTS serving her call
leg will exhibit a SID frame in every 24th position and BFI placemarkers
in all other positions.
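
To make that pattern concrete, here is a minimal Python sketch of the
frame stream described above.  The names (FrameKind, silent_ul_stream,
first_sid_at) are made up for illustration, and the model is
deliberately simplified: it ignores the hangover period and the first
SID sent at the end of a talkspurt, showing only the steady-state
silence pattern.

```python
from enum import Enum


class FrameKind(Enum):
    SPEECH = "speech"
    SID = "sid"
    BFI = "bfi"


FRAME_MS = 20          # one TCH speech frame covers 20 ms
SID_PERIOD_MS = 480    # SID update interval during silence
SID_PERIOD_FRAMES = SID_PERIOD_MS // FRAME_MS   # 480 / 20 = 24


def silent_ul_stream(num_frames, first_sid_at=0):
    """Model what the BTS emits for call leg A while Alice is silent.

    Every SID_PERIOD_FRAMES-th position carries a SID; every other
    position is a BFI placemarker because the MS had its Tx off.
    first_sid_at is a hypothetical knob for the phase of the SID slot.
    """
    return [
        FrameKind.SID
        if (i - first_sid_at) % SID_PERIOD_FRAMES == 0
        else FrameKind.BFI
        for i in range(num_frames)
    ]
```

Calling silent_ul_stream(48) yields two SIDs (positions 0 and 24) and
46 BFI placemarkers.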

So what should the DL frame stream going to Bob look like in this
scenario?  My reading of section C.3.2.1.1 (second paragraph from the
top is the one that covers this scenario) tells me that the *network*
(set aside the question of which element) is supposed to turn that
stream of BFIs with occasional interspersed SIDs into a stream of
valid *speech* frames going to Bob, a stream of valid speech frames
representing comfort noise as produced by a network-located CN
generator.  The spec says in that paragraph: "The Downlink TRAU Frames
shall not contain the SID codeword, but parameters that allow a direct
decoding."
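
As a rough illustration of the control flow I have in mind (and
nothing more than that), here is a Python sketch of such a mapper.
Everything in it is hypothetical: the UlFrame type, the UlToDlMapper
class and above all the cn_generator callable, which stands in for
the network-located CN generator and is exactly the part that still
needs to be worked out per codec.  Repeating the last speech frame on
a BFI outside of a speech pause is only a placeholder, not the
substitution and muting behavior the specs describe.

```python
from dataclasses import dataclass


@dataclass
class UlFrame:
    kind: str              # "speech", "sid" or "bfi"
    payload: bytes = b""   # codec frame bits; empty for a BFI


class UlToDlMapper:
    """Turn a DTXu-style UL stream into a DL stream with no SID frames."""

    def __init__(self, cn_generator, fallback=b""):
        # cn_generator: hypothetical callable taking the latest SID
        # payload and returning a *speech* frame representing comfort
        # noise.  fallback: what to send if nothing was received yet.
        self._cn = cn_generator
        self._last_sid = None
        self._last_speech = fallback

    def map(self, frame):
        if frame.kind == "speech":
            # A good speech frame ends any comfort noise period.
            self._last_sid = None
            self._last_speech = frame.payload
            return frame.payload
        if frame.kind == "sid":
            self._last_sid = frame.payload
        if self._last_sid is not None:
            # SID, or BFI during a speech pause: emit generated CN
            # as a speech frame, never the SID codeword itself.
            return self._cn(self._last_sid)
        # BFI while in speech state: placeholder concealment only.
        return self._last_speech
```

With a stub generator such as lambda sid: b"CN:" + sid, feeding the
mapper speech, SID, BFI, speech, BFI returns the speech payloads
unchanged and a generated CN frame for the SID and the BFI after it.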

Needless to say, there is no code anywhere in Osmocom currently that
does the above, thus current Osmocom is not able to produce the fancy
TrFO behavior which the spec(s) seem to call for.  (I said "spec(s)"
vaguely because I only found a spec for TFO, not for TrFO, but I don't
see any reason why this aspect of TFO spec shouldn't also apply to
TrFO when the actual problem at hand is exactly the same.)

But no no no guys, I am *not* bashing Osmocom here, I am seeking to
improve it!  As it happens, fully implementing the complete set of
TS 28.062 section C.3.2.1.1 rules (I shall hereafter call them C3211
rules for short) for the original FR1 codec would be quite easy, and I
already have a code implementation which I am eyeing to integrate into
Osmocom.  Themyscira libgsmfrp is a FLOSS library that implements a
complete, spec-compliant Rx DTX handler for FR1, and it is 100% my own
original work, not based on ETSI or TI or any other sources, thus no
silly license issues - and I am eyeing the idea of integrating the
same functions, appropriately renamed, repackaged and re-API-ed, into
libosmocodec, and then invoking that functionality in OsmoBTS, in the
code path that goes from RTP Rx to feeding TCH DL to PHY layers.

But while FR1 is easy, doing the same for EFR is where the real
difficulty lies, and this is the part where I come to the community
for help.  The key diff between FR1 and EFR that matters here is how
their respective Rx DTX handlers are defined in the specs: for FR1 the
Rx DTX handler is a separate piece, with the interface from this Rx
DTX handler to the main body of the decoder being another 260-bit FR1
frame (this time without possibility of SID or BFI), and the specs for
DTX (06.31 plus 06.11 and 06.12) define and describe the needed Rx DTX
handler in terms of emitting that secondary 260-bit FR1 frame.  Thus
implementing this functionality in Themyscira libgsmfrp was a simple
matter of taking the logic described in the specs and turning it into
code.

But for EFR the specs do not define the Rx DTX handler as a separate
piece, instead it is integrated into the guts of the full decoder.
There is a decoder, presented as published C source from ETSI, that
takes a 244-bit EFR frame, which can be either speech or SID, *plus* a
BFI flag as input, and emits a block of 160 PCM samples as output -
all Rx DTX logic is buried inside, intertwined with the actual speech
decoder operation, which is naturally quite complex.

I've already spent a lot of time looking at the reference C
implementation of EFR from ETSI - I kinda had to, as I did the rather
substantial work of turning it into a usable function library, with
state structures and a well-defined interface instead of global vars
and namespace pollution - the result is Themyscira libgsmefr - but I
am still no closer to being able to implement C3211 functionality
for this codec.

The problem is this: starting with an EFR SID frame and previous history
of a few speech frames (the hangover period), how would one produce
output EFR speech frames (not SID) that represent comfort noise, as
C3211 says is required?  We can all easily look at ETSI's original
code that generates CN as part of the standard decoder: but that code
generates linear PCM output, not secondary EFR speech frames that
represent CN.  There is the main body of the speech decoder, and there
are conditions throughout that slightly modify this decoder logic in
subtle ways for CN generation and/or for ECU-style substitution/muting
- but no guidance for how one could construct "valid speech" EFR
frames that would produce a similar result when fed to the standard
decoder in the MS after crossing radio leg B.

This is where I could really use some input from more senior and more
knowledgeable GSM-ers: does anyone know how mainstream commercial GSM
infra vendors (particularly "ancient" ones of pure T1/E1 TDM kind)
have solved this problem?  What do _they_ do in the scenario of call
leg A with DTXu turning into call leg B without DTXd?

Given that those specs were written in the happy and glorious days
when everyone used 2G, when GSM operators had lots of spectrum, and
when most networks operated large multi-ARFCN BTSes with frequency
hopping, I figure that almost everyone probably ran with DTXd enabled
when that spec section was written - hence I wonder if the authors
of the TFO spec failed to appreciate the magnitude of what they were
asking implementors to do when they stipulated that a UL-to-DL mapping
from DTXu-on to DTXd-off "shall" emit no-SID speech frames that
represent TFO-TRAU-generated CN.  And I wonder if the actual
implementors ignored that stipulation even Back In The Day...

Here is one way we might be able to "cheat" - what if we implement
a sort of fake DTXd in OsmoBTS for times when real DTXd is not possible
because we only have C0?  Here is what I mean: suppose the stream of
TCH frames about to be sent to the PHY layer (perhaps the output of my
proposed, to-be-implemented UL-to-DL mapper) is the kind that would be
intended for DTXd-enabled DL in the original GSM architecture, with
all speech pauses filled with repeated SIDs, every 20 ms without fail.
A traditional DTXd BTS is supposed to transmit only those SIDs that
either immediately follow a speech frame or fall in the SACCH-aligned
always-Tx position, and turn the Tx off at other times.  We can't
actually turn off Tx at those "other" times when we are C0 - but what
if we create a "fake DTXd" effect by transmitting a dummy FACCH
containing an L2 fill frame at exactly the same times when we would do
real DTXd if we could?  The end effect will be that the spec-based Rx
DTX handler in the MS will "see" the same "thing" as with real DTXd:
receiving FACCH in all those "empty" 20 ms frame windows will cause
that spec-based Rx DTX handler to get BFI=1, exactly the same as if
radio Tx were truly off and the MS were listening to radio noise.
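
The per-frame decision for this "fake DTXd" idea can be sketched in a
few lines of Python.  This is only a sketch under assumptions: the
function name and the string labels are invented, and the test for
the SACCH-aligned always-Tx position is left to a caller-supplied
predicate, since in a real BTS it would have to be derived from the
TDMA frame number rather than from a simple list index.

```python
def fake_dtxd_schedule(frames, is_always_tx_position):
    """Decide what to put on the air for each 20 ms frame window.

    frames: list of "speech" or "sid", i.e. a DTXd-style stream in
        which every speech pause is filled with repeated SIDs.
    is_always_tx_position: hypothetical predicate on the frame index
        that is true for the SACCH-aligned always-Tx position.
    Returns a list of "speech", "sid" or "facch_fill".
    """
    out = []
    prev_was_speech = False
    for i, kind in enumerate(frames):
        if kind == "speech":
            out.append("speech")
        elif prev_was_speech or is_always_tx_position(i):
            # These are the SIDs a real DTXd BTS would transmit.
            out.append("sid")
        else:
            # Real DTXd would turn Tx off here; on C0 we instead send
            # a dummy FACCH with an L2 fill frame, which the MS speech
            # path should see as BFI=1.
            out.append("facch_fill")
        prev_was_speech = (kind == "speech")
    return out
```

For example, with frames = ["speech", "sid", "sid", "sid"] and a
predicate that is true only for index 3, the result is
["speech", "sid", "facch_fill", "sid"].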

Anyway, I would love to hear other people's thoughts on these ideas,
especially if someone happens to know how traditional GSM infra vendors
handled those pesky requirements of TS 28.062 section C.3.2.1.1 for
UL-to-DL mapping.

Sincerely,
Your GSM-obsessed Mother Mychaela