# HG changeset patch
# User Mychaela Falconia
# Date 1583011347 0
# Node ID 8c6e7b7e701c56a942b8c28e0c964cc2eaabfb69
# Parent 0f70fe9395c4f40b8988ed4ae1297f72ab50fd56
doc/Loadtools-performance: updates for new program-m0 and setserial

diff -r 0f70fe9395c4 -r 8c6e7b7e701c doc/Loadtools-performance
--- a/doc/Loadtools-performance Sat Feb 29 09:10:39 2020 +0000
+++ b/doc/Loadtools-performance Sat Feb 29 21:22:27 2020 +0000
@@ -1,3 +1,6 @@
+Dumping and programming flash
+=============================
+
 Here are the expected run times for the flash dump2bin operation of dumping the
 entire flash content of a Calypso GSM device:
 
@@ -19,8 +22,7 @@
 operations are implemented in our architecture.
 
 Here are some examples of expected flash programming times, all obtained on the
-Mother's Slackware 14.2 host system, using the flash program-bin command as
-opposed to program-m0 or program-srec:
+Mother's Slackware 14.2 host system:
 
 Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware
 image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s
@@ -47,44 +49,84 @@
 target (physics inside the flash chip);
 * The overhead of command-response exchanges between fc-loadtool and loadagent.
 
-If you are starting out with a firmware image in m0 format, converting it to
-binary with mokosrec2bin (like our FC Magnetite build system always does) and
-then flashing via program-bin is faster than flashing the original m0 image
-directly via program-m0. Following the last example above of flashing a
-Magnetite hybrid fw image into an FCDEV3B, the flashing operation via
-program-bin took 2m11s; flashing the same image via program-m0 took 3m54s.
+Programming flash using program-m0 or program-srec
+==================================================
+
+Prior to fc-host-tools-r12 flash programming via flash program-m0 or
+program-srec commands was much slower than flash program-bin. The reason for
+this performance discrepancy was that the original implementation of these
+commands from 2013 was very straightforward: they operated in one pass, reading
+the S-record image file, and as each individual S-record was read, it was turned
+into an AMFW or INFW command to loadagent. In the case of *.m0 files generated
+by TI's hex470 post-linker, each S-record carries 30 bytes of payload, thus the
+flashing operation proceeded in 30-byte units, incurring the overhead of a
+command-response exchange for every 30 bytes. In contrast, our current flash
+program-bin implementation sends 256 bytes of payload per each AMFW or INFW
+command; this larger unit size decreases the overhead of command-response
+exchanges between fc-loadtool and loadagent.
 
-Flashing via program-bin is faster than program-m0 or program-srec because the
-program-bin operation uses a larger unit size internally. fc-loadtool
-implements all flash programming operations by sending AMFW or INFW commands to
-loadagent; each AMFW or INFW command carries a string of 16-bit words to be
-programmed. Our program-bin operation programs 256 bytes at a time, i.e.,
-sends one AMFW or INFW command per 256 bytes of image payload; our program-m0
-and program-srec operations program one S-record at a time, i.e., each S-record
-in the source image turns into its own AMFW or INFW command to loadagent. In
-the case of m0 images produced by TI's hex470 post-linker, each S-record carries
-30 bytes of payload, thus flashing that m0 image directly with program-m0 will
-proceed in 30-byte units, whereas converting it to binary and then flashing with
-program-bin will proceed in 256-byte units. The smaller unit size slows down
-the overall operation by increasing the overhead of command-response exchanges.
+Why do we need flash program-m0 and program-srec commands at all, why not
+simply convert all SREC images to straight binary first and then program with
+flash program-bin? The reason is that S-record images can contain multiple
+discontiguous program regions with gaps in between. All of our current
+FreeCalypso firmwares built with TI's TMS470 toolchain contain a few small gaps
+in the fwimage.m0 file, filled with 0xFF bytes when converted to straight binary
+with mokosrec2bin, but TI's own firmwares built for 8 MiB flash configurations
+often had much bigger gaps in them.
 
-XRAM loading via fc-xram is similar to flash program-m0 and program-srec in that
-fc-xram sends a separate ML command to loadagent for each S-record, thus the
-total XRAM image loading time is not only the serial bit transfer time, but also
-the overhead of command-response exchanges between fc-xram and loadagent. Going
-back to the same FC Magnetite fw image that can be flashed into an FCDEV3B in
-2m11s via program-bin or in 3m54s via program-m0, doing an fc-xram load of that
-same fw image (built as ramimage.srec) into the same FCDEV3B via the same
-FT2232D adapter at 812500 baud takes 2m54s - thus we can see that fc-xram
-loading is faster than flash program-m0 or program-srec, but slower than flash
-program-bin.
+As of fc-host-tools-r12 we finally have a more efficient solution for flashing
+discontiguous SREC images: our new implementation of flash program-m0 and
+program-srec commands begins with a preliminary pass (pure host operation, no
+target interaction) of reading the S-record image file; the payload bits are
+written into a temporary binary file (automatically deleted afterward), while
+the address and length of each discontiguous region are remembered internally.
+Then the actual flash programming operation proceeds just like program-bin,
+reading from the internal binary file and sending 256 bytes of payload at a time
+to loadagent, but using the remembered knowledge of where the discontiguous
+regions lie.
+
+XRAM loading via fc-xram
+========================
+
+Our current fc-xram implementation is similar to the old 2013 implementation of
+flash program-m0 and program-srec commands in that fc-xram sends a separate ML
+command to loadagent for each S-record, thus the total XRAM image loading time
+is not only the serial bit transfer time, but also the overhead of command-
+response exchanges between fc-xram and loadagent. The flash programming times
+listed above include flashing an FC Magnetite fw image into an FCDEV3B, which
+took 2m11s; doing an fc-xram load of the same FC Magnetite fw image (built as
+ramimage.srec) into the same FCDEV3B via the same FT2232D adapter at 812500
+baud takes 2m54s.
 
 Why does XRAM loading take longer than flashing? Shouldn't it be faster because
 the flash programming step on the target is replaced with a simple memcpy()?
-Answer: fc-xram is currently slower than flash program-bin because the latter
-sends 256 bytes at a time to loadagent, whereas fc-xram sends one S-record at a
-time; the division of the image into S-records is determined by the tool that
-generates the SREC image, but TI's hex470 post-linker generates images with 30
-bytes of payload per S-record. Having the operation proceed in smaller chunks
-increases the overhead of command-response exchanges and thus increases the
-overall time.
+Answer: fc-xram is currently slower than flash program operations because the
+latter send 256 bytes at a time to loadagent, whereas fc-xram sends one
+S-record at a time; the division of the image into S-records is determined by
+the tool that generates the SREC image, but TI's hex470 post-linker generates
+images with 30 bytes of payload per S-record. Having the operation proceed in
+smaller chunks increases the overhead of command-response exchanges and thus
+increases the overall time.
+
+Additional complication with FTDI adapters and newer Linux kernel versions
+==========================================================================
+
+If you are using an FTDI adapter and a Linux kernel version newer than early
+2017 (the change was introduced between 4.10 and 4.11), then you have one
+additional complication: a change was made to the ftdi_sio driver in the Linux
+kernel that makes many loadtools operations (basically everything other than
+flash dumps which are entirely target-driven) unbearably slow (much slower than
+the Slackware 14.2 reference times given above) unless you execute a special
+setserial command first. After you plug in your FTDI-based USB-serial cable or
+connect the USB cable between your PC or laptop and your FTDI adapter board,
+causing the corresponding ttyUSBx device to appear, execute the following
+command:
+
+setserial /dev/ttyUSBx low_latency
+
+(Obviously change ttyUSBx to your actual ttyUSB number.) Execute this
+setserial command before running fc-loadtool or fc-xram, and then hopefully you
+should get performance that is comparable to what I get on classic Slackware.
+I say "hopefully" because I am not able to test it myself - I refuse to run any
+OS that can be categorized as "modern" - but field reports of performance on
+non-Slackware systems running newer Linux kernels (4.11 or later) are welcome.
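
The two-pass approach that the patched text describes for program-m0 and program-srec can be illustrated with a rough Python sketch. This is not the fc-loadtool code: the function names `coalesce` and `chunk_commands` are hypothetical, the records are assumed to be already parsed into non-overlapping (address, payload) pairs, and regions are kept in memory rather than in a temporary file. Only the 30-byte record size and the 256-byte unit size come from the text.

```python
# Illustrative sketch only; names and structure are assumptions,
# not taken from the fc-loadtool sources.
CHUNK = 256  # payload bytes per AMFW/INFW command, per the text


def coalesce(records):
    """Pass 1 (host only): merge (address, payload) records into
    contiguous (start_address, bytearray) regions.
    Assumes the records do not overlap."""
    regions = []
    for addr, data in sorted(records, key=lambda r: r[0]):
        if regions and regions[-1][0] + len(regions[-1][1]) == addr:
            # Record continues the previous region: append its payload.
            regions[-1][1].extend(data)
        else:
            # Gap (or first record): start a new region.
            regions.append((addr, bytearray(data)))
    return regions


def chunk_commands(regions, chunk=CHUNK):
    """Pass 2: yield (address, payload) units of at most `chunk` bytes.
    A unit never spans the gap between two regions."""
    for start, data in regions:
        for off in range(0, len(data), chunk):
            yield start + off, bytes(data[off:off + chunk])


if __name__ == "__main__":
    # Two discontiguous regions made of 30-byte records (made-up data).
    recs = [(i * 30, bytes(30)) for i in range(20)]
    recs += [(0x10000 + i * 30, bytes(30)) for i in range(10)]
    regions = coalesce(recs)
    units = list(chunk_commands(regions))
    print("per-record exchanges:", len(recs))
    print("256-byte exchanges:  ", len(units))
```

With these made-up records, the one-record-per-command scheme needs 30 command-response exchanges, while the chunked scheme needs 5, which is the overhead reduction the text attributes to the larger unit size.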