view doc/Loadtools-performance @ 618:6824c4d55848

doc/Loadtools-performance: program-m0 slowness documented
author Mychaela Falconia <falcon@freecalypso.org>
date Tue, 25 Feb 2020 18:40:00 +0000

Here are the expected run times for the flash dump2bin operation of dumping the
entire flash content of a Calypso GSM device:

Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud:
12m53s

The same 4 MiB flash dump at 812500 baud: 1m50s

Dump of 8 MiB flash (e.g., Mot C155/156) at 812500 baud: 3m40s

Because of the architecture of fc-loadtool and its loadagent back-end, the run
time of a flash dump operation depends only on the serial baud rate and the
size of the flash area to be dumped; it should not depend on the USB-serial
adapter type or any host system properties, as long as the host system and
serial adapter combination supports the desired baud rate.  In contrast, the
run times of flash programming and fc-xram loading operations do depend on the
host system and on the USB-serial adapter or other serial port hardware; this
dependency exists because of the way these operations are implemented in our
architecture.
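
As a quick consistency check on the dump times listed above, the following
Python sketch computes the effective dump throughput for each case and compares
the baud rate ratio with the run time ratio.  It uses only the sizes and times
quoted in this document; the helper names (seconds, throughput) are made up for
this illustration and are not part of loadtools.

```python
# Effective flash dump throughput, using only the figures quoted above.
MIB = 1024 * 1024

def seconds(minutes, secs):
    """Convert a time written as XmYYs into seconds."""
    return minutes * 60 + secs

def throughput(size_bytes, run_time_s):
    """Flash bytes dumped per second of wall-clock time."""
    return size_bytes / run_time_s

# (description, flash size in bytes, baud rate, run time in seconds)
dumps = [
    ("4 MiB at 115200 baud", 4 * MIB, 115200, seconds(12, 53)),
    ("4 MiB at 812500 baud", 4 * MIB, 812500, seconds(1, 50)),
    ("8 MiB at 812500 baud", 8 * MIB, 812500, seconds(3, 40)),
]

for label, size, baud, run_time in dumps:
    print(f"{label}: {throughput(size, run_time):.0f} bytes/s")

# If dump time depends only on baud rate and size, these two ratios
# should come out nearly equal.
baud_ratio = 812500 / 115200
time_ratio = dumps[0][3] / dumps[1][3]
print(f"baud ratio {baud_ratio:.2f}, run time ratio {time_ratio:.2f}")
```

The two ratios agree to within about half a percent, and the 8 MiB dump takes
exactly twice as long as the 4 MiB dump at the same baud rate, which fits the
statement that dump time depends only on baud rate and flash size.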

Here are some examples of expected flash programming times, all obtained on the
Mother's Slackware 14.2 host system, using the flash program-bin command as
opposed to program-m0 or program-srec:

Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware
image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s

Flashing the same OM GTA02 modem with the same fw image, using a CP2102
USB-serial cable at 812500 baud: 1m52s

Flashing a Magnetite hybrid fw image (2378084 bytes) into an FCDEV3B board
(S71PL129N flash chip) via an FT2232D adapter at 812500 baud: 2m11s

These times are just for the flash program-bin operation, not counting the
flash erase which must be done first.  Flash erase times are determined
entirely by physical processes inside the flash chip and are not affected by
software design or the serial link.  For each sector to be erased, fc-loadtool
issues the sector erase command to the flash chip and then polls the chip for
operation completion status.  This polling is done over the serial link and
thus may seem very slow, but the latency added by the finite polling speed is
negligible compared to the time of the actual sector erase operation inside
the flash chip.  In contrast, the execution time of a flash program-bin
operation is the sum of three components:

* The time it takes for the bits to be transferred over the serial link;
* The time it takes for the flash programming operation to complete on the
  target (physics inside the flash chip);
* The overhead of command-response exchanges between fc-loadtool and loadagent.
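
Because only the first of these three components scales with the baud rate,
flash programming does not speed up as much as a flash dump does when the baud
rate is raised.  The Python sketch below illustrates this using the two GTA02
examples quoted earlier.  Note that the two runs also used different USB-serial
adapters (PL2303 and CP2102), so the comparison is only a rough illustration,
not a controlled measurement; the variable names are made up for this example.

```python
# Compare the baud rate increase with the observed program-bin speedup
# for the two GTA02 flashing examples quoted above (same 2376448-byte image).
def seconds(minutes, secs):
    """Convert a time written as XmYYs into seconds."""
    return minutes * 60 + secs

IMAGE_BYTES = 2376448

slow = {"baud": 115200, "time": seconds(7, 35)}   # PL2303 cable
fast = {"baud": 812500, "time": seconds(1, 52)}   # CP2102 cable

baud_ratio = fast["baud"] / slow["baud"]
speedup = slow["time"] / fast["time"]

for run in (slow, fast):
    rate = IMAGE_BYTES / run["time"]
    print(f"{run['baud']} baud: {rate:.0f} image bytes/s")

print(f"baud rate ratio: {baud_ratio:.2f}")
print(f"observed program-bin speedup: {speedup:.2f}")
```

The observed speedup is noticeably smaller than the baud rate ratio, as one
would expect when part of the total time does not depend on the serial bit
rate.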

If you are starting out with a firmware image in m0 format, converting it to
binary with mokosrec2bin (like our FC Magnetite build system always does) and
then flashing via program-bin is faster than flashing the original m0 image
directly via program-m0.  Following the last example above of flashing a
Magnetite hybrid fw image into an FCDEV3B, the flashing operation via
program-bin took 2m11s; flashing the same image via program-m0 took 3m54s.

Flashing via program-bin is faster than program-m0 or program-srec because the
program-bin operation uses a larger unit size internally.  fc-loadtool
implements all flash programming operations by sending AMFW or INFW commands to
loadagent; each AMFW or INFW command carries a string of 16-bit words to be
programmed.  Our program-bin operation programs 256 bytes at a time, i.e.,
sends one AMFW or INFW command per 256 bytes of image payload; our program-m0
and program-srec operations program one S-record at a time, i.e., each S-record
in the source image turns into its own AMFW or INFW command to loadagent.  In
the case of m0 images produced by TI's hex470 post-linker, each S-record carries
30 bytes of payload, thus flashing that m0 image directly with program-m0 will
proceed in 30-byte units, whereas converting it to binary and then flashing with
program-bin will proceed in 256-byte units.  The smaller unit size slows down
the overall operation by increasing the overhead of command-response exchanges.
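
To get a feel for the numbers, the Python sketch below counts how many
AMFW/INFW commands each approach needs for the 2378084-byte Magnetite image
from the example above, and derives a rough per-exchange cost from the
difference between the two measured times.  It rests on simplifying
assumptions that the text above does not state: that the image is one
contiguous block, that every S-record carries exactly 30 bytes, and that the
entire time difference is due to the extra command-response exchanges.  The
result is therefore only a ballpark figure.

```python
# Rough estimate of command counts and per-exchange overhead.
# Assumptions (not from loadtools itself): contiguous image, exactly
# 30 bytes per S-record, all extra time spent in extra exchanges.
IMAGE_BYTES = 2378084
BIN_UNIT = 256      # bytes per command with program-bin
SREC_UNIT = 30      # bytes per S-record from TI's hex470

def command_count(image_bytes, unit_bytes):
    """Number of commands needed, rounding the last partial unit up."""
    return (image_bytes + unit_bytes - 1) // unit_bytes

bin_commands = command_count(IMAGE_BYTES, BIN_UNIT)
m0_commands = command_count(IMAGE_BYTES, SREC_UNIT)

bin_time = 2 * 60 + 11   # program-bin: 2m11s
m0_time = 3 * 60 + 54    # program-m0:  3m54s

extra_commands = m0_commands - bin_commands
extra_time = m0_time - bin_time
per_exchange_ms = extra_time / extra_commands * 1000

print(f"program-bin: {bin_commands} commands")
print(f"program-m0:  {m0_commands} commands")
print(f"extra time per additional exchange: about {per_exchange_ms:.2f} ms")
```

Under these assumptions program-m0 issues roughly 8.5 times as many commands
as program-bin for the same image, and each additional exchange costs on the
order of a millisecond or two on this particular host and adapter.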

XRAM loading via fc-xram is similar to flash program-m0 and program-srec in that
fc-xram sends a separate ML command to loadagent for each S-record, thus the
total XRAM image loading time is not only the serial bit transfer time, but also
the overhead of command-response exchanges between fc-xram and loadagent.  Going
back to the same FC Magnetite fw image that can be flashed into an FCDEV3B in
2m11s via program-bin or in 3m54s via program-m0, doing an fc-xram load of that
same fw image (built as ramimage.srec) into the same FCDEV3B via the same
FT2232D adapter at 812500 baud takes 2m54s.  Thus fc-xram loading is faster
than flash program-m0 or program-srec, but slower than flash program-bin.

Why does XRAM loading take longer than flashing with program-bin?  Shouldn't
it be faster because the flash programming step on the target is replaced with
a simple memcpy()?  Answer: fc-xram is currently slower than flash program-bin
because the latter sends 256 bytes at a time to loadagent, whereas fc-xram
sends one S-record at a time.  The division of the image into S-records is
determined by the tool that generates the SREC image, and TI's hex470
post-linker generates images with 30 bytes of payload per S-record.  Having
the operation proceed in smaller chunks increases the overhead of
command-response exchanges and thus increases the overall time.