comparison doc/Loadtools-performance @ 630:8c6e7b7e701c
doc/Loadtools-performance: updates for new program-m0 and setserial
| author | Mychaela Falconia <falcon@freecalypso.org> |
|---|---|
| date | Sat, 29 Feb 2020 21:22:27 +0000 |
| parents | 6824c4d55848 |
| children | e66fafeeb377 |
| 629:0f70fe9395c4 | 630:8c6e7b7e701c |
|---|---|
| 1 Dumping and programming flash | |
| 2 ============================= | |
| 3 | |
| 1 Here are the expected run times for the flash dump2bin operation of dumping the | 4 Here are the expected run times for the flash dump2bin operation of dumping the |
| 2 entire flash content of a Calypso GSM device: | 5 entire flash content of a Calypso GSM device: |
| 3 | 6 |
| 4 Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud: | 7 Dump of 4 MiB flash (e.g., Openmoko GTA01/02 or Mot C139/140) at 115200 baud: |
| 5 12m53s | 8 12m53s |
| 17 run times do depend on the host system and USB-serial adapter or other serial | 20 run times do depend on the host system and USB-serial adapter or other serial |
| 18 port hardware - this host system dependency exists because of the way these | 21 port hardware - this host system dependency exists because of the way these |
| 19 operations are implemented in our architecture. | 22 operations are implemented in our architecture. |
| 20 | 23 |
| 21 Here are some examples of expected flash programming times, all obtained on the | 24 Here are some examples of expected flash programming times, all obtained on the |
| 22 Mother's Slackware 14.2 host system, using the flash program-bin command as | 25 Mother's Slackware 14.2 host system: |
| 23 opposed to program-m0 or program-srec: | |
| 24 | 26 |
| 25 Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware | 27 Flashing an Openmoko GTA02 modem (K5A3281CTM flash chip) with a new firmware |
| 26 image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s | 28 image (2376448 bytes), using a PL2303 USB-serial cable at 115200 baud: 7m35s |
| 27 | 29 |
| 28 Flashing the same OM GTA02 modem with the same fw image, using a CP2102 | 30 Flashing the same OM GTA02 modem with the same fw image, using a CP2102 |
| 45 * The time it takes for the bits to be transferred over the serial link; | 47 * The time it takes for the bits to be transferred over the serial link; |
| 46 * The time it takes for the flash programming operation to complete on the | 48 * The time it takes for the flash programming operation to complete on the |
| 47 target (physics inside the flash chip); | 49 target (physics inside the flash chip); |
| 48 * The overhead of command-response exchanges between fc-loadtool and loadagent. | 50 * The overhead of command-response exchanges between fc-loadtool and loadagent. |
| 49 | 51 |
| 50 If you are starting out with a firmware image in m0 format, converting it to | 52 Programming flash using program-m0 or program-srec |
| 51 binary with mokosrec2bin (like our FC Magnetite build system always does) and | 53 ================================================== |
| 52 then flashing via program-bin is faster than flashing the original m0 image | |
| 53 directly via program-m0. Following the last example above of flashing a | |
| 54 Magnetite hybrid fw image into an FCDEV3B, the flashing operation via | |
| 55 program-bin took 2m11s; flashing the same image via program-m0 took 3m54s. | |
| 56 | 54 |
| 57 Flashing via program-bin is faster than program-m0 or program-srec because the | 55 Prior to fc-host-tools-r12, flash programming via flash program-m0 or |
| 58 program-bin operation uses a larger unit size internally. fc-loadtool | 56 program-srec commands was much slower than flash program-bin. The reason for |
| 59 implements all flash programming operations by sending AMFW or INFW commands to | 57 this performance discrepancy was that the original implementation of these |
| 60 loadagent; each AMFW or INFW command carries a string of 16-bit words to be | 58 commands from 2013 was very straightforward: they operated in one pass, reading |
| 61 programmed. Our program-bin operation programs 256 bytes at a time, i.e., | 59 the S-record image file, and as each individual S-record was read, it was turned |
| 62 sends one AMFW or INFW command per 256 bytes of image payload; our program-m0 | 60 into an AMFW or INFW command to loadagent. In the case of *.m0 files generated |
| 63 and program-srec operations program one S-record at a time, i.e., each S-record | 61 by TI's hex470 post-linker, each S-record carries 30 bytes of payload, thus the |
| 64 in the source image turns into its own AMFW or INFW command to loadagent. In | 62 flashing operation proceeded in 30-byte units, incurring the overhead of a |
| 65 the case of m0 images produced by TI's hex470 post-linker, each S-record carries | 63 command-response exchange for every 30 bytes. In contrast, our current flash |
| 66 30 bytes of payload, thus flashing that m0 image directly with program-m0 will | 64 program-bin implementation sends 256 bytes of payload per AMFW or INFW |
| 67 proceed in 30-byte units, whereas converting it to binary and then flashing with | 65 command; this larger unit size decreases the overhead of command-response |
| 68 program-bin will proceed in 256-byte units. The smaller unit size slows down | 66 exchanges between fc-loadtool and loadagent. |
| 69 the overall operation by increasing the overhead of command-response exchanges. | |
| 70 | 67 |
| 71 XRAM loading via fc-xram is similar to flash program-m0 and program-srec in that | 68 Why do we need flash program-m0 and program-srec commands at all? Why not |
| 72 fc-xram sends a separate ML command to loadagent for each S-record, thus the | 69 simply convert all SREC images to straight binary first and then program with |
| 73 total XRAM image loading time is not only the serial bit transfer time, but also | 70 flash program-bin? The reason is that S-record images can contain multiple |
| 74 the overhead of command-response exchanges between fc-xram and loadagent. Going | 71 discontiguous program regions with gaps in between. All of our current |
| 75 back to the same FC Magnetite fw image that can be flashed into an FCDEV3B in | 72 FreeCalypso firmwares built with TI's TMS470 toolchain contain a few small gaps |
| 76 2m11s via program-bin or in 3m54s via program-m0, doing an fc-xram load of that | 73 in the fwimage.m0 file, filled with 0xFF bytes when converted to straight binary |
| 77 same fw image (built as ramimage.srec) into the same FCDEV3B via the same | 74 with mokosrec2bin, but TI's own firmwares built for 8 MiB flash configurations |
| 78 FT2232D adapter at 812500 baud takes 2m54s - thus we can see that fc-xram | 75 often had much bigger gaps in them. |
| 79 loading is faster than flash program-m0 or program-srec, but slower than flash | 76 |
| 80 program-bin. | 77 As of fc-host-tools-r12, we finally have a more efficient solution for flashing |
| 78 discontiguous SREC images: our new implementation of flash program-m0 and | |
| 79 program-srec commands begins with a preliminary pass (pure host operation, no | |
| 80 target interaction) of reading the S-record image file; the payload bits are | |
| 81 written into a temporary binary file (automatically deleted afterward), while | |
| 82 the address and length of each discontiguous region are remembered internally. | |
| 83 Then the actual flash programming operation proceeds just like program-bin, | |
| | 84 reading from the temporary binary file and sending 256 bytes of payload at a time |
| 85 to loadagent, but using the remembered knowledge of where the discontiguous | |
| 86 regions lie. | |
| 87 | |
| 88 XRAM loading via fc-xram | |
| 89 ======================== | |
| 90 | |
| 91 Our current fc-xram implementation is similar to the old 2013 implementation of | |
| 92 flash program-m0 and program-srec commands in that fc-xram sends a separate ML | |
| 93 command to loadagent for each S-record, thus the total XRAM image loading time | |
| 94 is not only the serial bit transfer time, but also the overhead of command- | |
| 95 response exchanges between fc-xram and loadagent. The flash programming times | |
| 96 listed above include flashing an FC Magnetite fw image into an FCDEV3B, which | |
| 97 took 2m11s; doing an fc-xram load of the same FC Magnetite fw image (built as | |
| 98 ramimage.srec) into the same FCDEV3B via the same FT2232D adapter at 812500 | |
| 99 baud takes 2m54s. | |
| 81 | 100 |
| 82 Why does XRAM loading take longer than flashing? Shouldn't it be faster because | 101 Why does XRAM loading take longer than flashing? Shouldn't it be faster because |
| 83 the flash programming step on the target is replaced with a simple memcpy()? | 102 the flash programming step on the target is replaced with a simple memcpy()? |
| 84 Answer: fc-xram is currently slower than flash program-bin because the latter | 103 Answer: fc-xram is currently slower than flash program operations because the |
| 85 sends 256 bytes at a time to loadagent, whereas fc-xram sends one S-record at a | 104 latter send 256 bytes at a time to loadagent, whereas fc-xram sends one |
| 86 time; the division of the image into S-records is determined by the tool that | 105 S-record at a time; the division of the image into S-records is determined by |
| 87 generates the SREC image, but TI's hex470 post-linker generates images with 30 | 106 the tool that generates the SREC image, but TI's hex470 post-linker generates |
| 88 bytes of payload per S-record. Having the operation proceed in smaller chunks | 107 images with 30 bytes of payload per S-record. Having the operation proceed in |
| 89 increases the overhead of command-response exchanges and thus increases the | 108 smaller chunks increases the overhead of command-response exchanges and thus |
| 90 overall time. | 109 increases the overall time. |
| 110 | |
| 111 Additional complication with FTDI adapters and newer Linux kernel versions | |
| 112 ========================================================================== | |
| 113 | |
| 114 If you are using an FTDI adapter and a Linux kernel version newer than early | |
| 115 2017 (the change was introduced between 4.10 and 4.11), then you have one | |
| 116 additional complication: a change was made to the ftdi_sio driver in the Linux | |
| 117 kernel that makes many loadtools operations (basically everything other than | |
| | 118 flash dumps, which are entirely target-driven) unbearably slow (much slower than |
| 119 the Slackware 14.2 reference times given above) unless you execute a special | |
| 120 setserial command first. After you plug in your FTDI-based USB-serial cable or | |
| 121 connect the USB cable between your PC or laptop and your FTDI adapter board, | |
| 122 causing the corresponding ttyUSBx device to appear, execute the following | |
| 123 command: | |
| 124 | |
| 125 setserial /dev/ttyUSBx low_latency | |
| 126 | |
| 127 (Obviously change ttyUSBx to your actual ttyUSB number.) Execute this | |
| 128 setserial command before running fc-loadtool or fc-xram, and then hopefully you | |
| 129 should get performance that is comparable to what I get on classic Slackware. | |
| 130 I say "hopefully" because I am not able to test it myself - I refuse to run any | |
| 131 OS that can be categorized as "modern" - but field reports of performance on | |
| 132 non-Slackware systems running newer Linux kernels (4.11 or later) are welcome. |
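
The flash programming text above attributes much of the gap between 30-byte and 256-byte units to command-response overhead. As a rough illustration only (this is not FreeCalypso code; the function names, the 8N1 framing, and the `wire_expansion` and `per_cmd_overhead` parameters are all assumptions), the number of programming commands and a simple two-component time model can be sketched in C:

```c
#include <assert.h>
#include <stddef.h>

/* Number of programming commands (AMFW/INFW in the text) needed to send
 * `payload` bytes when each command carries at most `unit` bytes. */
size_t command_count(size_t payload, size_t unit)
{
	assert(unit > 0);
	return (payload + unit - 1) / unit;	/* ceiling division */
}

/* Hypothetical time model covering two of the three components listed in
 * the text: serial bit transfer and per-command round-trip overhead.
 * Flash chip programming time is omitted because it does not depend on
 * the unit size.  Assumptions (not from FreeCalypso): 10 bits on the wire
 * per byte (8N1 framing), `wire_expansion` = wire bytes per payload byte
 * in the command encoding, `per_cmd_overhead` = seconds lost per
 * command-response exchange. */
double estimate_seconds(size_t payload, size_t unit, double baud,
			double wire_expansion, double per_cmd_overhead)
{
	double transfer = (double)payload * wire_expansion * 10.0 / baud;
	double overhead = (double)command_count(payload, unit) * per_cmd_overhead;
	return transfer + overhead;
}
```

With the 2376448-byte image size from the GTA02 example, command_count() gives 9283 commands in 256-byte units versus 79215 in 30-byte units, so any fixed per-command round-trip cost is paid roughly 8.5 times as often with the smaller unit.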
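
The pre-r12 program-m0/program-srec path and the current fc-xram both work one S-record at a time. A minimal sketch of parsing a single Motorola S-record line and verifying its checksum (the ones' complement of the low byte of the sum of the count, address and data bytes) might look like this; `struct srec` and `parse_srec()` are hypothetical names, not taken from the FreeCalypso sources:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A parsed S1/S2/S3 data record. */
struct srec {
	uint32_t addr;		/* load address */
	size_t len;		/* number of payload bytes in data[] */
	uint8_t data[252];	/* max payload: 255 - 2 address bytes - 1 checksum */
};

static int hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Decode two hex digits at p; -1 if either is not a hex digit. */
static int hexbyte(const char *p)
{
	int hi = hexval(p[0]), lo = hexval(p[1]);
	return (hi < 0 || lo < 0) ? -1 : (hi << 4) | lo;
}

/* Returns 0 for a valid data record (filled into *rec), 1 for a valid
 * non-data record (S0, S5, S7-S9, ...), -1 for a malformed line or a
 * checksum mismatch.  Trailing characters such as CR/LF are ignored. */
int parse_srec(const char *line, struct srec *rec)
{
	uint8_t bytes[255];
	size_t linelen = strlen(line);
	unsigned sum;
	int type, count, alen, i;

	if (linelen < 4 || line[0] != 'S' || line[1] < '0' || line[1] > '9')
		return -1;
	type = line[1] - '0';
	count = hexbyte(line + 2);	/* bytes that follow: address+data+checksum */
	if (count < 1 || linelen < 4 + 2 * (size_t)count)
		return -1;
	sum = (unsigned)count;
	for (i = 0; i < count; i++) {
		int b = hexbyte(line + 4 + 2 * i);
		if (b < 0)
			return -1;
		bytes[i] = (uint8_t)b;
		sum += (unsigned)b;
	}
	/* checksum byte = ~(sum of the other bytes), so the full sum ends in 0xFF */
	if ((sum & 0xFF) != 0xFF)
		return -1;
	if (type < 1 || type > 3)
		return 1;
	alen = type + 1;		/* S1: 2, S2: 3, S3: 4 address bytes */
	if (count < alen + 1)
		return -1;
	rec->addr = 0;
	for (i = 0; i < alen; i++)
		rec->addr = (rec->addr << 8) | bytes[i];
	rec->len = (size_t)(count - alen - 1);
	memcpy(rec->data, bytes + alen, rec->len);
	return 0;
}
```

In a one-pass design each successful parse_srec() result would become one command to loadagent, which is exactly why the per-record payload size (30 bytes for hex470 output) sets the unit size.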
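
The r12 two-pass approach described above can be sketched as follows, assuming data records arrive in ascending address order: contiguous records are merged into regions during the host-only first pass, and the second pass programs each region in chunks of at most 256 bytes that never cross a gap. This is an illustrative reconstruction with hypothetical names (`struct region_list`, `add_record()`, `count_chunks()`, `gap_bytes()`); it omits the temporary file, 16-bit word alignment and error reporting that a real implementation needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 64	/* hypothetical fixed limit */
#define UNIT 256	/* payload bytes per programming command, per the text */

struct region {
	uint32_t start;
	uint32_t len;
};

struct region_list {
	struct region r[MAX_REGIONS];
	size_t n;
};

/* Pass 1: feed each data record's address and length in file order.  A
 * record that begins exactly where the previous region ends extends it;
 * anything else starts a new discontiguous region.  Returns -1 if full. */
int add_record(struct region_list *rl, uint32_t addr, uint32_t len)
{
	if (rl->n > 0) {
		struct region *last = &rl->r[rl->n - 1];
		if (last->start + last->len == addr) {
			last->len += len;
			return 0;
		}
	}
	if (rl->n == MAX_REGIONS)
		return -1;
	rl->r[rl->n].start = addr;
	rl->r[rl->n].len = len;
	rl->n++;
	return 0;
}

/* Pass 2: commands needed to program every region in chunks of at most
 * UNIT bytes; a chunk never spans the gap between two regions. */
size_t count_chunks(const struct region_list *rl)
{
	size_t total = 0;
	for (size_t i = 0; i < rl->n; i++)
		total += (rl->r[i].len + UNIT - 1) / UNIT;
	return total;
}

/* Bytes that a straight-binary conversion would fill with 0xFF between
 * regions (requires ascending, non-overlapping regions). */
uint32_t gap_bytes(const struct region_list *rl)
{
	uint32_t gaps = 0;
	for (size_t i = 1; i < rl->n; i++) {
		uint32_t prev_end = rl->r[i - 1].start + rl->r[i - 1].len;
		assert(rl->r[i].start >= prev_end);
		gaps += rl->r[i].start - prev_end;
	}
	return gaps;
}
```

Feeding three contiguous 30-byte records at 0x0000, 0x001E and 0x003C plus one 600-byte record at 0x1000 yields two regions and 1 + 3 = 4 chunks; gap_bytes() reports the 0xFF filler that a flat mokosrec2bin-style conversion would add between them, which grows large for images with big gaps such as the 8 MiB TI firmwares mentioned above.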
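
For use from a program rather than the shell, the effect of `setserial /dev/ttyUSBx low_latency` can be reproduced in C on Linux: setserial reads the port's `struct serial_struct` with the TIOCGSERIAL ioctl, sets the ASYNC_LOW_LATENCY flag and writes it back with TIOCSSERIAL. The sketch below assumes Linux kernel headers are installed; `set_low_latency()` is a hypothetical helper, and like the setserial command it has to be run again each time the adapter is plugged in and a new ttyUSBx device appears.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/serial.h>	/* struct serial_struct, ASYNC_LOW_LATENCY */
#include <sys/ioctl.h>		/* ioctl(), TIOCGSERIAL, TIOCSSERIAL */
#include <unistd.h>

/* Set ASYNC_LOW_LATENCY on a serial port, as `setserial DEV low_latency`
 * does.  Returns 0 on success, -1 on failure with errno set. */
int set_low_latency(const char *dev)
{
	struct serial_struct ss;
	int fd, saved;

	assert(dev != NULL);
	/* O_NONBLOCK so the open does not wait for carrier detect */
	fd = open(dev, O_RDWR | O_NOCTTY | O_NONBLOCK);
	if (fd < 0)
		return -1;
	if (ioctl(fd, TIOCGSERIAL, &ss) < 0)
		goto fail;
	ss.flags |= ASYNC_LOW_LATENCY;
	if (ioctl(fd, TIOCSSERIAL, &ss) < 0)
		goto fail;
	return close(fd);
fail:
	saved = errno;		/* keep the ioctl error, not close()'s */
	close(fd);
	errno = saved;
	return -1;
}
```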
