view doc/Loadtools-performance @ 630:8c6e7b7e701c

doc/Loadtools-performance: updates for new program-m0 and setserial
author Mychaela Falconia <falcon@freecalypso.org>
date Sat, 29 Feb 2020 21:22:27 +0000

Dumping and programming flash
=============================

Here are the expected run times for dumping the entire flash content of a
Calypso GSM device with the flash dump2bin operation:

Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud:
12m53s

The same 4 MiB flash dump at 812500 baud: 1m50s

Dump of 8 MiB flash (e.g., Mot C155/156) at 812500 baud: 3m40s

Because of the architecture of fc-loadtool and its loadagent back-end, the run
time of a flash dump operation depends only on the serial baud rate and the
size of the flash area to be dumped; it should not depend on the USB-serial
adapter type or any host system properties, as long as the host system and
serial adapter combination supports the desired baud rate.  In contrast, the
run times of flash programming and fc-xram loading operations do depend on the
host system and on the USB-serial adapter or other serial port hardware.  This
dependency exists because these operations proceed as a series of
command-response exchanges between the host tool and loadagent, so the
turnaround latency of the host and its serial hardware adds to the total.
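
The dump times listed above can be cross-checked against the claim that they
depend only on baud rate and flash size.  The following is a minimal sketch,
not part of loadtools; it assumes 8N1 framing (10 bit times per byte on the
wire), which this document does not state, and the function names are made up
for illustration:

```python
# Measured full-flash dump times quoted above: (flash MiB, baud, seconds).
MEASURED_DUMPS = [
    (4, 115200, 12 * 60 + 53),
    (4, 812500, 1 * 60 + 50),
    (8, 812500, 3 * 60 + 40),
]

# Assumption (not stated in this document): 8N1 framing, i.e. 10 bit times
# on the wire per transmitted byte.
BIT_TIMES_PER_WIRE_BYTE = 10


def wire_bytes_per_flash_byte(size_mib, baud, seconds):
    """How many serial byte slots elapse per flash byte dumped."""
    flash_bytes = size_mib * 1024 * 1024
    wire_byte_slots = baud / BIT_TIMES_PER_WIRE_BYTE * seconds
    return wire_byte_slots / flash_bytes


def estimate_dump_seconds(size_mib, baud, ratio):
    """Predict a dump time from a ratio measured on another configuration."""
    flash_bytes = size_mib * 1024 * 1024
    return flash_bytes * ratio * BIT_TIMES_PER_WIRE_BYTE / baud


ratios = [wire_bytes_per_flash_byte(*m) for m in MEASURED_DUMPS]
for (size_mib, baud, seconds), ratio in zip(MEASURED_DUMPS, ratios):
    print(f"{size_mib} MiB at {baud} baud: {seconds} s, ratio {ratio:.2f}")
```

Under that framing assumption the ratio comes out between 2.1 and 2.2 for all
three measurements, which fits the statement that dump time scales with flash
size and inversely with baud rate.  A ratio near 2 would be consistent with,
for example, a hexadecimal text encoding on the wire, but that reading is a
guess and is not confirmed by anything in this document.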

Here are some examples of expected flash programming times, all obtained on the
Mother's Slackware 14.2 host system:

Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware
image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s

Flashing the same OM GTA02 modem with the same fw image, using a CP2102
USB-serial cable at 812500 baud: 1m52s

Flashing a Magnetite hybrid fw image (2378084 bytes) into an FCDEV3B board
(S71PL129N flash chip) via an FT2232D adapter at 812500 baud: 2m11s

These times are just for the flash program-bin operation, not counting the
flash erase which must be done first.  Flash erase times are determined
entirely by physical processes inside the flash chip and are not affected by
software design or the serial link.  For each sector to be erased, fc-loadtool
issues the sector erase command to the flash chip and then polls the chip for
operation completion status; the polling is done over the serial link and thus
may seem very slow, but the small amount of latency added by the finite polling
speed is still negligible compared to the time of the actual sector erase
operation inside the flash chip.  In contrast, the execution time of a flash
program-bin operation is the sum of three components:

* The time it takes for the bits to be transferred over the serial link;
* The time it takes for the flash programming operation to complete on the
  target (physics inside the flash chip);
* The overhead of command-response exchanges between fc-loadtool and loadagent.
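
The three components can be written down as a simple additive model.  This is
only an illustrative sketch: the function name, all parameter names and the
numbers in the example call are hypothetical, and the model assumes that the
three components do not overlap in time and that 10 bit times are spent on the
wire per byte (8N1 framing).

```python
import math


def program_bin_seconds(image_bytes, baud, chunk_bytes,
                        flash_s_per_byte, exchange_s_per_command,
                        wire_bytes_per_payload_byte=1.0,
                        bit_times_per_wire_byte=10):
    """Return (serial, flash, exchange) time components in seconds.

    All parameters are modelling assumptions, not measured loadtools values.
    """
    commands = math.ceil(image_bytes / chunk_bytes)
    # Component 1: moving the bits over the serial link.
    serial_s = (image_bytes * wire_bytes_per_payload_byte
                * bit_times_per_wire_byte / baud)
    # Component 2: physics inside the flash chip, proportional to data size.
    flash_s = image_bytes * flash_s_per_byte
    # Component 3: fixed overhead paid once per command-response exchange.
    exchange_s = commands * exchange_s_per_command
    return serial_s, flash_s, exchange_s


# Artificial round numbers, chosen only to make the arithmetic easy to follow.
serial_s, flash_s, exchange_s = program_bin_seconds(
    image_bytes=1000, baud=10000, chunk_bytes=100,
    flash_s_per_byte=0.0001, exchange_s_per_command=0.02)
print(serial_s + flash_s + exchange_s)
```

In this model only the third component changes when the same image is sent in
smaller units, which is why the size of the unit sent per command matters for
the total time.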

Programming flash using program-m0 or program-srec
==================================================

Prior to fc-host-tools-r12, flash programming via the flash program-m0 or
program-srec commands was much slower than flash program-bin.  The reason for
this performance discrepancy was that the original implementation of these
commands from 2013 was very straightforward: they operated in one pass, reading
the S-record image file, and as each individual S-record was read, it was turned
into an AMFW or INFW command to loadagent.  In the case of *.m0 files generated
by TI's hex470 post-linker, each S-record carries 30 bytes of payload, thus the
flashing operation proceeded in 30-byte units, incurring the overhead of a
command-response exchange for every 30 bytes.  In contrast, our current flash
program-bin implementation sends 256 bytes of payload per AMFW or INFW
command; this larger unit size decreases the overhead of command-response
exchanges between fc-loadtool and loadagent.
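
To put numbers on the difference, the following sketch counts the
command-response exchanges needed for the 2376448-byte GTA02 firmware image
mentioned earlier.  It is a simplification: it treats the image as one
contiguous block and assumes every S-record is full, neither of which is
guaranteed for a real *.m0 file, and the helper name is made up.

```python
IMAGE_BYTES = 2376448  # GTA02 firmware image size quoted earlier


def exchanges_needed(image_bytes, payload_per_command):
    """Number of commands needed if each carries payload_per_command bytes."""
    # Integer ceiling division: a final partial unit still costs one exchange.
    return -(-image_bytes // payload_per_command)


old_count = exchanges_needed(IMAGE_BYTES, 30)    # one command per S-record
new_count = exchanges_needed(IMAGE_BYTES, 256)   # program-bin unit size
print(old_count, new_count)  # prints "79215 9283"
```

Under these assumptions the 30-byte unit size requires roughly 8.5 times as
many exchanges as the 256-byte unit size for the same image.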

Why do we need the flash program-m0 and program-srec commands at all?  Why not
simply convert all SREC images to straight binary first and then program with
flash program-bin?  The reason is that S-record images can contain multiple
discontiguous program regions with gaps in between.  All of our current
FreeCalypso firmwares built with TI's TMS470 toolchain contain a few small gaps
in the fwimage.m0 file, filled with 0xFF bytes when converted to straight binary
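with mokosrec2bin, but TI's own firmwares built for 8 MiB flash configurations
often had much bigger gaps in them.  Sending such gaps over the serial link as
0xFF filler would spend time on bytes that leave erased flash unchanged.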

As of fc-host-tools-r12 we finally have a more efficient solution for flashing
discontiguous SREC images: our new implementation of flash program-m0 and
program-srec commands begins with a preliminary pass (pure host operation, no
target interaction) of reading the S-record image file; the payload bits are
written into a temporary binary file (automatically deleted afterward), while
the address and length of each discontiguous region are remembered internally.
Then the actual flash programming operation proceeds just like program-bin,
reading from the temporary binary file and sending 256 bytes of payload at a
time to loadagent, but using the remembered address and length of each
discontiguous region to program the payload at the right flash locations.
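
The two-pass idea described above can be sketched as follows.  This is an
illustration written for this document, not the actual fc-loadtool code: all
function names are hypothetical, the sketch assumes records arrive in ascending
address order, and it ignores any format-specific details of *.m0 files.

```python
import tempfile

ADDR_HEX_DIGITS = {"S1": 4, "S2": 6, "S3": 8}  # S-record data record types
CHUNK = 256  # payload bytes per programming command, per the text above


def preliminary_pass(srec_lines, binfile):
    """Host-only pass: store payload in binfile, return [(addr, length)]."""
    regions = []
    for line in srec_lines:
        line = line.strip()
        rectype = line[:2]
        if rectype not in ADDR_HEX_DIGITS:
            continue  # header and termination records carry no payload
        raw = bytes.fromhex(line[2:])
        if raw[0] != len(raw) - 1:
            raise ValueError("bad S-record byte count")
        if (sum(raw[:-1]) + raw[-1]) & 0xFF != 0xFF:
            raise ValueError("bad S-record checksum")
        alen = ADDR_HEX_DIGITS[rectype] // 2
        addr = int.from_bytes(raw[1:1 + alen], "big")
        payload = raw[1 + alen:-1]
        binfile.write(payload)
        if regions and regions[-1][0] + regions[-1][1] == addr:
            regions[-1][1] += len(payload)   # extends the current region
        else:
            regions.append([addr, len(payload)])  # a gap starts a new region
    return [tuple(r) for r in regions]


def programming_units(regions, binfile):
    """Second pass: yield (flash_addr, payload) units of at most CHUNK bytes."""
    binfile.seek(0)
    for start, length in regions:
        offset = 0
        while offset < length:
            n = min(CHUNK, length - offset)
            yield start + offset, binfile.read(n)
            offset += n


def make_s3(addr, data):
    """Build one S3 record (demo helper only)."""
    body = bytes([len(data) + 5]) + addr.to_bytes(4, "big") + data
    checksum = 0xFF - (sum(body) & 0xFF)
    return "S3" + (body + bytes([checksum])).hex().upper()


# Demo image: 300 bytes at 0x1000 and 60 bytes at 0x2000, 30 bytes per record.
lines = [make_s3(0x1000 + i, bytes(30)) for i in range(0, 300, 30)]
lines += [make_s3(0x2000 + i, bytes(30)) for i in range(0, 60, 30)]
with tempfile.TemporaryFile() as tmp:  # removed automatically when closed
    regions = preliminary_pass(lines, tmp)
    units = [(addr, len(data)) for addr, data in programming_units(regions, tmp)]
print(regions, units)
```

In this demo the twelve 30-byte records collapse into two regions, and the
second pass needs only three programming units (256 + 44 bytes for the first
region, 60 bytes for the second) instead of twelve.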

XRAM loading via fc-xram
========================

Our current fc-xram implementation is similar to the old 2013 implementation of
flash program-m0 and program-srec commands in that fc-xram sends a separate ML
command to loadagent for each S-record; thus the total XRAM image loading time
includes not only the serial bit transfer time but also the overhead of
command-response exchanges between fc-xram and loadagent.  The flash programming times
listed above include flashing an FC Magnetite fw image into an FCDEV3B, which
took 2m11s; doing an fc-xram load of the same FC Magnetite fw image (built as
ramimage.srec) into the same FCDEV3B via the same FT2232D adapter at 812500
baud takes 2m54s.

Why does XRAM loading take longer than flashing?  Shouldn't it be faster because
the flash programming step on the target is replaced with a simple memcpy()?
Answer: fc-xram is currently slower than flash program operations because the
latter send 256 bytes at a time to loadagent, whereas fc-xram sends one
S-record at a time.  The division of the image into S-records is determined by
the tool that generates the SREC image, and TI's hex470 post-linker generates
images with only 30 bytes of payload per S-record.  Having the operation proceed in
smaller chunks increases the overhead of command-response exchanges and thus
increases the overall time.

Additional complication with FTDI adapters and newer Linux kernel versions
==========================================================================

If you are using an FTDI adapter and a Linux kernel version newer than early
2017 (the change was introduced between 4.10 and 4.11), then you have one
additional complication: a change made to the ftdi_sio driver in the Linux
kernel makes many loadtools operations unbearably slow (much slower than the
Slackware 14.2 reference times given above) unless you execute a special
setserial command first.  The affected operations are basically everything
other than flash dumps, which are entirely target-driven.  After you plug in
your FTDI-based USB-serial cable or
connect the USB cable between your PC or laptop and your FTDI adapter board,
causing the corresponding ttyUSBx device to appear, execute the following
command:

setserial /dev/ttyUSBx low_latency

(Obviously change ttyUSBx to your actual ttyUSB number.)  Execute this
setserial command before running fc-loadtool or fc-xram, and then hopefully you
should get performance that is comparable to what I get on classic Slackware.
I say "hopefully" because I am not able to test it myself - I refuse to run any
OS that can be categorized as "modern" - but field reports of performance on
non-Slackware systems running newer Linux kernels (4.11 or later) are welcome.