# HG changeset patch
# User Mychaela Falconia <falcon@freecalypso.org>
# Date 1583011347 0
# Node ID 8c6e7b7e701c56a942b8c28e0c964cc2eaabfb69
# Parent  0f70fe9395c4f40b8988ed4ae1297f72ab50fd56
doc/Loadtools-performance: updates for new program-m0 and setserial

diff -r 0f70fe9395c4 -r 8c6e7b7e701c doc/Loadtools-performance
--- a/doc/Loadtools-performance	Sat Feb 29 09:10:39 2020 +0000
+++ b/doc/Loadtools-performance	Sat Feb 29 21:22:27 2020 +0000
@@ -1,3 +1,6 @@
+Dumping and programming flash
+=============================
+
 Here are the expected run times for the flash dump2bin operation of dumping the
 entire flash content of a Calypso GSM device:
 
@@ -19,8 +22,7 @@
 operations are implemented in our architecture.
 
 Here are some examples of expected flash programming times, all obtained on the
-Mother's Slackware 14.2 host system, using the flash program-bin command as
-opposed to program-m0 or program-srec:
+Mother's Slackware 14.2 host system:
 
 Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware
 image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s
@@ -47,44 +49,84 @@
   target (physics inside the flash chip);
 * The overhead of command-response exchanges between fc-loadtool and loadagent.
 
-If you are starting out with a firmware image in m0 format, converting it to
-binary with mokosrec2bin (like our FC Magnetite build system always does) and
-then flashing via program-bin is faster than flashing the original m0 image
-directly via program-m0.  Following the last example above of flashing a
-Magnetite hybrid fw image into an FCDEV3B, the flashing operation via
-program-bin took 2m11s; flashing the same image via program-m0 took 3m54s.
+Programming flash using program-m0 or program-srec
+==================================================
+
+Prior to fc-host-tools-r12, flash programming via flash program-m0 or
+program-srec commands was much slower than flash program-bin.  The reason for
+this performance discrepancy was that the original implementation of these
+commands from 2013 was very straightforward: they operated in one pass, reading
+the S-record image file, and as each individual S-record was read, it was turned
+into an AMFW or INFW command to loadagent.  In the case of *.m0 files generated
+by TI's hex470 post-linker, each S-record carries 30 bytes of payload, thus the
+flashing operation proceeded in 30-byte units, incurring the overhead of a
+command-response exchange for every 30 bytes.  In contrast, our current flash
+program-bin implementation sends 256 bytes of payload per AMFW or INFW
+command; this larger unit size decreases the overhead of command-response
+exchanges between fc-loadtool and loadagent.
 
-Flashing via program-bin is faster than program-m0 or program-srec because the
-program-bin operation uses a larger unit size internally.  fc-loadtool
-implements all flash programming operations by sending AMFW or INFW commands to
-loadagent; each AMFW or INFW command carries a string of 16-bit words to be
-programmed.  Our program-bin operation programs 256 bytes at a time, i.e.,
-sends one AMFW or INFW command per 256 bytes of image payload; our program-m0
-and program-srec operations program one S-record at a time, i.e., each S-record
-in the source image turns into its own AMFW or INFW command to loadagent.  In
-the case of m0 images produced by TI's hex470 post-linker, each S-record carries
-30 bytes of payload, thus flashing that m0 image directly with program-m0 will
-proceed in 30-byte units, whereas converting it to binary and then flashing with
-program-bin will proceed in 256-byte units.  The smaller unit size slows down
-the overall operation by increasing the overhead of command-response exchanges.
+Why do we need flash program-m0 and program-srec commands at all?  Why not
+simply convert all SREC images to straight binary first and then program with
+flash program-bin?  The reason is that S-record images can contain multiple
+discontiguous program regions with gaps in between.  All of our current
+FreeCalypso firmwares built with TI's TMS470 toolchain contain a few small gaps
+in the fwimage.m0 file, filled with 0xFF bytes when converted to straight binary
+with mokosrec2bin, but TI's own firmwares built for 8 MiB flash configurations
+often had much bigger gaps in them.
 
-XRAM loading via fc-xram is similar to flash program-m0 and program-srec in that
-fc-xram sends a separate ML command to loadagent for each S-record, thus the
-total XRAM image loading time is not only the serial bit transfer time, but also
-the overhead of command-response exchanges between fc-xram and loadagent.  Going
-back to the same FC Magnetite fw image that can be flashed into an FCDEV3B in
-2m11s via program-bin or in 3m54s via program-m0, doing an fc-xram load of that
-same fw image (built as ramimage.srec) into the same FCDEV3B via the same
-FT2232D adapter at 812500 baud takes 2m54s - thus we can see that fc-xram
-loading is faster than flash program-m0 or program-srec, but slower than flash
-program-bin.
+As of fc-host-tools-r12 we finally have a more efficient solution for flashing
+discontiguous SREC images: our new implementation of flash program-m0 and
+program-srec commands begins with a preliminary pass (pure host operation, no
+target interaction) of reading the S-record image file; the payload bits are
+written into a temporary binary file (automatically deleted afterward), while
+the address and length of each discontiguous region are remembered internally.
+Then the actual flash programming operation proceeds just like program-bin,
+reading from the temporary binary file and sending 256 bytes of payload at a
+time to loadagent, but using the remembered knowledge of where the
+discontiguous regions lie.
+
+XRAM loading via fc-xram
+========================
+
+Our current fc-xram implementation is similar to the old 2013 implementation of
+flash program-m0 and program-srec commands in that fc-xram sends a separate ML
+command to loadagent for each S-record, thus the total XRAM image loading time
+is not only the serial bit transfer time, but also the overhead of command-
+response exchanges between fc-xram and loadagent.  The flash programming times
+listed above include flashing an FC Magnetite fw image into an FCDEV3B, which
+took 2m11s; doing an fc-xram load of the same FC Magnetite fw image (built as
+ramimage.srec) into the same FCDEV3B via the same FT2232D adapter at 812500
+baud takes 2m54s.
 
 Why does XRAM loading take longer than flashing?  Shouldn't it be faster because
 the flash programming step on the target is replaced with a simple memcpy()?
-Answer: fc-xram is currently slower than flash program-bin because the latter
-sends 256 bytes at a time to loadagent, whereas fc-xram sends one S-record at a
-time; the division of the image into S-records is determined by the tool that
-generates the SREC image, but TI's hex470 post-linker generates images with 30
-bytes of payload per S-record.  Having the operation proceed in smaller chunks
-increases the overhead of command-response exchanges and thus increases the
-overall time.
+Answer: fc-xram is currently slower than flash program operations because the
+latter send 256 bytes at a time to loadagent, whereas fc-xram sends one
+S-record at a time; the division of the image into S-records is determined by
+the tool that generates the SREC image, but TI's hex470 post-linker generates
+images with 30 bytes of payload per S-record.  Having the operation proceed in
+smaller chunks increases the overhead of command-response exchanges and thus
+increases the overall time.
+
+Additional complication with FTDI adapters and newer Linux kernel versions
+==========================================================================
+
+If you are using an FTDI adapter with Linux kernel 4.11 or later (the change
+was introduced between 4.10 and 4.11, in early 2017), then you have one
+additional complication: a change made to the ftdi_sio driver in the Linux
+kernel makes many loadtools operations (basically everything other than
+flash dumps, which are entirely target-driven) unbearably slow (much slower than
+the Slackware 14.2 reference times given above) unless you execute a special
+setserial command first.  After you plug in your FTDI-based USB-serial cable or
+connect the USB cable between your PC or laptop and your FTDI adapter board,
+causing the corresponding ttyUSBx device to appear, execute the following
+command:
+
+setserial /dev/ttyUSBx low_latency
+
+(Obviously change ttyUSBx to your actual ttyUSB number.)  Execute this
+setserial command before running fc-loadtool or fc-xram, and then hopefully you
+should get performance that is comparable to what I get on classic Slackware.
+I say "hopefully" because I am not able to test it myself - I refuse to run any
+OS that can be categorized as "modern" - but field reports of performance on
+non-Slackware systems running newer Linux kernels (4.11 or later) are welcome.