ESP-IDF OTA update with cellular network interface #236

Closed
ws998116 opened this issue May 20, 2024 · 6 comments

@ws998116

Hi, I need the ability to do an OTA update for my MCU, an ESP32, over a SARA-R5 cell network. I am currently using the esp-idf with ubxlib and have an idea on how to approach this, but wanted to get feedback before I attempted it.

My thought is that if I use the PPP interface as in this sockets example, I should be able to use the esp-idf OTA API similar to this example.

Is this approach correct?

  1. Initialize esp-idf network interface (like this).
  2. Initialize ubxlib network interface and bring up the cellular network.
  3. Use esp-idf OTA API to download and install update.

A few questions...

  1. Do I load TLS credentials onto the ESP32, SARA-R5, or both?
  2. Would I be able to use the built-in WiFi interface on the ESP32 in parallel to this?

I wasn't able to find an existing example for this scenario but if it's out there, please point me in that direction. Thanks!

@RobMeades
Contributor

Hi there: yup, that sounds like the right approach to use the native ESP-IDF IP stack and OTA download. The other option you have is to use cellular sockets but not cellular HTTP (so avoiding any internal HTTP-download file-size limitations within the cellular module) to download your update file and then locally apply it within ESP-IDF. I wrote an example of how to do that for another customer here; the example does not include locally applying the file within ESP-IDF as I didn't know how to do that.

The reason I mention the latter is that using PPP will invoke CMUX and, since both CMUX and PPP are framed protocols, should your UART suffer any character loss the connection will likely mess up. That said, we have other customers doing exactly this (on ESP-IDF) who have not had a problem that I am aware of, I just thought I'd mention it.

On your questions:

Do I load TLS credentials onto the ESP32, SARA-R5, or both?

If you're using PPP, and therefore ESP-IDF's IP stack, the cellular module will not need to know about any TLS credentials.

Would I be able to use the built-in WiFi interface on the ESP32 in parallel to this?

I would have thought so but this is really an ESP-IDF question; I don't know how they do routing within their IP stack.

We don't have an example of doing FOTA stuff as it tends to end up being quite platform/application specific.

@ws998116
Author

Thanks for your thoughts @RobMeades. I went ahead and tried the PPP approach and was able to successfully download and apply a small .bin file (169 KB).

Success Log

I (22632) cellular: Attempting to download update
I (24102) esp-x509-crt-bundle: Certificate validated
I (26492) esp-x509-crt-bundle: Certificate validated
I (28022) esp_https_ota: Starting OTA...
I (28022) esp_https_ota: Writing to partition subtype 16 at offset 0x180000
I (69232) esp-x509-crt-bundle: Certificate validated
I (93552) esp-x509-crt-bundle: Certificate validated
I (112252) esp_image: segment 0: paddr=00180020 vaddr=3f400020 size=04a80h ( 19072) map
I (112262) esp_image: segment 1: paddr=00184aa8 vaddr=3ffbdb60 size=033b0h ( 13232)
I (112262) esp_image: segment 2: paddr=00187e60 vaddr=40080000 size=081b8h ( 33208)
I (112272) esp_image: segment 3: paddr=00190020 vaddr=400d0020 size=18af8h (101112) map
I (112302) esp_image: segment 4: paddr=001a8b20 vaddr=400881b8 size=01968h (  6504)
I (112312) esp_image: segment 0: paddr=00180020 vaddr=3f400020 size=04a80h ( 19072) map
I (112322) esp_image: segment 1: paddr=00184aa8 vaddr=3ffbdb60 size=033b0h ( 13232)
I (112322) esp_image: segment 2: paddr=00187e60 vaddr=40080000 size=081b8h ( 33208)
I (112332) esp_image: segment 3: paddr=00190020 vaddr=400d0020 size=18af8h (101112) map
I (112362) esp_image: segment 4: paddr=001a8b20 vaddr=400881b8 size=01968h (  6504)
I (112442) cellular: OTA Succeed, Rebooting...

Unfortunately, I've run into issues when trying to download a realistic update (1.3 MB). I'm not sure if there is a timeout or memory limit that I'm hitting, or if there is some loss causing a problem like the one you mentioned.
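
As a rough sanity check on the timeout theory, the timestamps in the success log can be turned into an average download rate. The sketch below is illustrative only: it assumes "169 KB" means 169 * 1024 bytes, that "1.3 MB" means 1,300,000 bytes, and that the interval between "Starting OTA..." and the first esp_image line approximates the download time. The constant and function names are made up for this example.

```python
# Figures taken from the success log; the byte sizes are assumptions (see above).
OTA_START_MS = 28022        # "esp_https_ota: Starting OTA..."
IMAGE_CHECK_MS = 112252     # first "esp_image: segment 0" line after the download
SMALL_IMAGE_BYTES = 169 * 1024
LARGE_IMAGE_BYTES = 1_300_000


def throughput_bytes_per_s(size_bytes: int, start_ms: int, end_ms: int) -> float:
    """Average payload rate between two log timestamps given in milliseconds."""
    return size_bytes / ((end_ms - start_ms) / 1000.0)


rate = throughput_bytes_per_s(SMALL_IMAGE_BYTES, OTA_START_MS, IMAGE_CHECK_MS)
estimated_large_s = LARGE_IMAGE_BYTES / rate

print(f"observed rate: {rate:.0f} B/s ({rate * 8 / 1000:.1f} kbit/s)")
print(f"estimated time for 1.3 MB at that rate: {estimated_large_s / 60:.1f} min")
```

On those assumptions the link averaged roughly 2 KB/s, so the larger image would need on the order of ten minutes of uninterrupted transfer. That gives a sense of how long the connection has to stay healthy for the update to complete.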

Fail Log

D (299821) transport_base: remain data in cache, need to read again
D (299831) HTTP_CLIENT: need_read=512, byte_to_read=512, rlen=506, ridx=512
D (299831) HTTP_CLIENT: http_on_body 506
D (299841) cellular: HTTP_EVENT_ON_DATA, len=506
D (299841) HTTP_CLIENT: is_data_remain=1, is_chunked=1, content_length=-1
D (299671) event: no handlers have been registered for event ESP_HTTPS_OTA_EVENT:5 posted to loop 0x3ffb8224
D (299861) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:4 posted to loop 0x3ffb8224
D (299871) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:4 posted to loop 0x3ffb8224
D (299881) event: no handlers have been registered for event ESP_HTTPS_OTA_EVENT:5 posted to loop 0x3ffb8224
D (299891) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:4 posted to loop 0x3ffb8224
D (299901) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:4 posted to loop 0x3ffb8224
D (299911) event: no handlers have been registered for event ESP_HTTPS_OTA_EVENT:5 posted to loop 0x3ffb8224
D (299921) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:4 posted to loop 0x3ffb8224
D (299931) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:4 posted to loop 0x3ffb8224
E (322481) esp-tls-mbedtls: read error :-0x004C:
I (322481) esp-tls-mbedtls: (FFFFFFB4): UNKNOWN ERROR CODE (004C)
E (322481) transport_base: esp_tls_conn_read error, errno=Software caused connection abort
D (322491) HTTP_CLIENT: need_read=6, byte_to_read=6, rlen=-76, ridx=1018
W (322491) HTTP_CLIENT: esp_transport_read returned:-76 and errno:113
E (322501) HTTP_CLIENT: transport_read: error - -1 | ESP_FAIL
D (322511) esp_https_ota: Written image length 774138
D (322511) HTTP_CLIENT: is_data_remain=1, is_chunked=1, content_length=-1
D (322521) event: no handlers have been registered for event ESP_HTTPS_OTA_EVENT:5 posted to loop 0x3ffb8224
E (322521) transport_base: poll_read select error 113, errno = Software caused connection abort, fd = 54
D (322541) HTTP_CLIENT: need_read=1024, byte_to_read=512, rlen=-2, ridx=0
E (322551) HTTP_CLIENT: transport_read: error - 57347 | ERROR
D (322551) HTTP_CLIENT: Chunks were not completely read
D (322561) cellular: HTTP_EVENT_ERROR
D (322561) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:0 posted to loop 0x3ffb8224
E (322561) esp_https_ota: data read -1, errno 0
D (322581) event: no handlers have been registered for event ESP_HTTPS_OTA_EVENT:8 posted to loop 0x3ffb8224
D (322581) cellular: HTTP_EVENT_DISCONNECTED
D (322591) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:6 posted to loop 0x3ffb8224
D (322591) cellular: HTTP_EVENT_DISCONNECTED
D (322601) event: no handlers have been registered for event ESP_HTTP_CLIENT_EVENT:6 posted to loop 0x3ffb8224
E (322611) cellular: Firmware upgrade failed

I tried implementing your sockets example, but I wasn't able to make a connection to my server. This is likely something that's wrong on my end. I will try to see if I can figure that out.

The issues I'm having don't seem to be related to ubxlib or the modem itself, but I'd like to leave this issue open for a few days while I try to figure it out if that's ok. There doesn't seem to be very much consistency to when the errors occur, but it's always the same esp_tls_conn_read error. I'm open to any advice or ideas. Thanks!

Another Fail Log

I (23672) cellular: Attempting to download update
I (25192) esp-x509-crt-bundle: Certificate validated
I (27452) esp-x509-crt-bundle: Certificate validated
I (28932) esp_https_ota: Starting OTA...
I (28932) esp_https_ota: Writing to partition subtype 16 at offset 0x180000
I (53982) esp-x509-crt-bundle: Certificate validated
I (77352) esp-x509-crt-bundle: Certificate validated
I (98912) esp-x509-crt-bundle: Certificate validated
I (128422) esp-x509-crt-bundle: Certificate validated
I (159622) esp-x509-crt-bundle: Certificate validated
I (182312) esp-x509-crt-bundle: Certificate validated
I (204312) esp-x509-crt-bundle: Certificate validated
I (230082) esp-x509-crt-bundle: Certificate validated
I (258122) esp-x509-crt-bundle: Certificate validated
I (291182) esp-x509-crt-bundle: Certificate validated
I (324542) esp-x509-crt-bundle: Certificate validated
I (365642) esp-x509-crt-bundle: Certificate validated
I (389302) esp-x509-crt-bundle: Certificate validated
E (416952) esp-tls-mbedtls: read error :-0x004C:
E (416952) transport_base: esp_tls_conn_read error, errno=Software caused connection abort
W (416952) HTTP_CLIENT: esp_transport_read returned:-76 and errno:113
E (416962) HTTP_CLIENT: transport_read: error - -1 | ESP_FAIL
E (417042) transport_base: poll_read select error 113, errno = Software caused connection abort, fd = 54
E (417042) HTTP_CLIENT: transport_read: error - 57347 | ERROR
E (417052) esp_https_ota: data read -1, errno 0
E (417052) cellular: Firmware upgrade failed

@RobMeades
Contributor

RobMeades commented May 21, 2024

I'd like to leave this issue open for a few days while I try to figure it out if that's ok

Of course, it will be interesting to see how things go.

errno 113 is EHOSTUNREACH in lwIP's own errno.h, though the log text ("Software caused connection abort") suggests ESP-IDF is reporting it as ECONNABORTED. Either way it seems a bit odd, as the host had been reachable, but given the frequent "certificate validated" prints it might be that the OTA client is opening a new TCP connection for each HTTP chunk and the establishment of one of these TCP connections has failed. Not sure what error code -76/0x4c is though, and it looks like the mbed TLS error print doesn't either.
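
The numbers in the failure logs can be decoded on a host to make them easier to look up. The sketch below is a hedged aid rather than an authoritative table: the mbed TLS entries are a small subset copied from its net_sockets.h as I understand it, the errno entries follow newlib's numbering (the C library ESP-IDF uses), and the helper and dictionary names are hypothetical.

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit unsigned value as two's-complement signed."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


# Subset of mbed TLS network error codes (net_sockets.h).
MBEDTLS_NET_ERRORS = {
    -0x004C: "MBEDTLS_ERR_NET_RECV_FAILED",
    -0x004E: "MBEDTLS_ERR_NET_SEND_FAILED",
    -0x0050: "MBEDTLS_ERR_NET_CONN_RESET",
}

# newlib errno numbering, which differs from the Linux-style values in lwIP's errno.h.
NEWLIB_ERRNO = {113: "ECONNABORTED", 118: "EHOSTUNREACH"}

tls_code = to_signed32(0xFFFFFFB4)      # "(FFFFFFB4)" in the esp-tls-mbedtls line
print(tls_code, hex(-tls_code), MBEDTLS_NET_ERRORS.get(tls_code, "unknown"))
print(113, NEWLIB_ERRNO.get(113, "unknown"))
print(57347, hex(57347))                # hex form is easier to search for in ESP-IDF headers
```

Read this way, 0xFFFFFFB4, -76 and -0x004C are all the same value, which mbed TLS defines as a socket receive failure. That would be consistent with the TLS layer simply reporting that the underlying connection went away.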

I guess you have the UART flow control lines connected? That might become important if there is a lot of data to download.

If you suspect a PPP issue it might be that one or both of these would give you diagnostic information, if they don't drown the debug output of course:

CONFIG_LWIP_PPP_DEBUG_ON=y
CONFIG_LWIP_DEBUG=y

Another debug option would be to attach something like a Saleae probe to the UART lines and capture the entire transaction; might be a big file of course but would have all of the raw detail we might need in it.

@ws998116
Author

Thanks for the insight.

the OTA client is opening a new TCP connection for each HTTP chunk

I believe you are correct on this.

I guess you have the UART flow control lines connected? That might become important if there is a lot of data to download.

Flow control is not connected on my hardware. More on this later.

I enabled the debug logging as you suggested and it doesn't give too much info, but there are a few things to note. First, the update is aborted after what appears to be a timeout (see consecutive pppos_netif_output[1]: proto=0x21, len = 40 in the log below). Another interesting thing is that every now and then I get this pppos_input[1]: Dropping bad fcs 0xd9bb proto=0x21. I think this is a checksum failure, which would indicate corrupt data.
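
For context on the "Dropping bad fcs" message: PPP in HDLC-like framing (RFC 1662) ends every frame with a 16-bit frame check sequence, and the receiver discards any frame whose check fails. The following is a minimal host-side sketch of that check, not the lwIP implementation; the frame contents are invented for illustration and the function names are hypothetical.

```python
# PPP FCS-16 as described in RFC 1662 (reflected polynomial 0x8408).
PPP_INIT_FCS16 = 0xFFFF   # initial FCS value
PPP_GOOD_FCS16 = 0xF0B8   # value left after running over payload + FCS


def fcs16(data: bytes, fcs: int = PPP_INIT_FCS16) -> int:
    """Run the bitwise FCS-16 over data and return the raw register value."""
    for byte in data:
        fcs ^= byte
        for _ in range(8):
            fcs = (fcs >> 1) ^ 0x8408 if fcs & 1 else fcs >> 1
    return fcs


def append_fcs(payload: bytes) -> bytes:
    """Append the complemented FCS, least significant byte first."""
    fcs = fcs16(payload) ^ 0xFFFF
    return payload + bytes([fcs & 0xFF, fcs >> 8])


def frame_is_good(frame: bytes) -> bool:
    """True when the receiver-side check over payload + FCS passes."""
    return fcs16(frame) == PPP_GOOD_FCS16


# Hypothetical frame: address 0xFF, control 0x03, protocol 0x0021 (IPv4), dummy payload.
frame = append_fcs(bytes([0xFF, 0x03, 0x00, 0x21]) + b"firmware chunk")

flipped = bytearray(frame)
flipped[6] ^= 0x01                  # a single bit error on the wire
dropped = frame[:6] + frame[7:]     # one character lost by the UART

print("intact :", frame_is_good(frame))
print("flipped:", frame_is_good(bytes(flipped)))
print("dropped:", frame_is_good(dropped))
```

A single flipped bit is always caught by this check, and a lost character is caught in all but rare cases, so either kind of UART fault would show up as a discarded frame rather than as silently corrupted data.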

Debug log

ppp_input[1]: ip in pbuf len=1400
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_netif_output[1]: proto=0x21, len = 40
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 6 bytes
ppp_input[1]: ip in pbuf len=1400
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_input[1]: got 128 bytes
pppos_netif_output[1]: proto=0x21, len = 40
pppos_netif_output[1]: proto=0x21, len = 40
pppos_netif_output[1]: proto=0x21, len = 40
pppos_netif_output[1]: proto=0x21, len = 40
pppos_netif_output[1]: proto=0x21, len = 40
E (151707) esp-tls-mbedtls: read error :-0x004C:
E (151707) transport_base: esp_tls_conn_read error, errno=Software caused connection abort
W (151707) HTTP_CLIENT: esp_transport_read returned:-76 and errno:113
E (151717) HTTP_CLIENT: transport_read: error - -1 | ESP_FAIL
E (151727) transport_base: poll_read select error 113, errno = Software caused connection abort, fd = 54
E (151737) HTTP_CLIENT: transport_read: error - 57347 | ERROR
E (151737) esp_https_ota: data read -1, errno 0
E (151747) cellular: Firmware upgrade failed
E (151747) cellular: Something went wrong.

This, along with your comment about hardware flow control, led me to try different hardware that does have hardware flow control (SparkFun SARA-R5 breakout board with an ESP32 devkit). Surprisingly, this board downloaded the update on the first try (and at a much faster speed)! It also never reported the checksum failure. This definitely indicates that I have a hardware issue. Do you really think that UART flow control is that important? I think another hardware difference with the SparkFun board is that it has level-shifting chips, whereas I am using transistors to match voltage levels.

I appreciate your help!

@RobMeades
Contributor

RobMeades commented May 21, 2024

Do you really think that UART flow control is that important?

We always advise that it is connected; even at 115200 it can be an issue. Thing is that you are running two framed protocols: CMUX and then PPP on top of that. In particular, CMUX is running a CMUX control channel, an AT channel (which will be doing nothing unless you use a ubxlib API at the same time) and a PPP channel. The code on the MCU side within ubxlib has to pull apart the CMUX channels and reply to the control messages, otherwise the CMUX protocol will get upset, then there's PPP on top of that which will throw an entire PPP message away if there's a single bit in error.

Basically, the link is critically sensitive to bit errors on the UART.
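
To put rough numbers on that sensitivity, here is a back-of-envelope sketch. It assumes independent per-byte errors, 1400-byte frames (the size seen in the "ip in pbuf len=1400" debug lines) and a 1,300,000-byte image; the error rates are hypothetical, since nothing in this thread measures the actual rate, and header and byte-stuffing overhead is ignored.

```python
import math

FRAME_BYTES = 1400          # from "ppp_input[1]: ip in pbuf len=1400"
IMAGE_BYTES = 1_300_000     # "1.3 MB", assumed decimal


def frame_survival(p_byte_error: float, frame_bytes: int = FRAME_BYTES) -> float:
    """Probability that every byte of one frame arrives intact."""
    return (1.0 - p_byte_error) ** frame_bytes


def expected_bad_frames(p_byte_error: float,
                        image_bytes: int = IMAGE_BYTES,
                        frame_bytes: int = FRAME_BYTES) -> float:
    """Expected number of discarded frames over one full download."""
    frames = math.ceil(image_bytes / frame_bytes)
    return frames * (1.0 - frame_survival(p_byte_error, frame_bytes))


for p in (1e-7, 1e-6, 1e-5):    # hypothetical per-byte error rates
    print(f"p={p:.0e}: survival per frame {frame_survival(p):.4f}, "
          f"expected bad frames {expected_bad_frames(p):.2f}")
```

Even one bad byte per million would be expected to cost about one frame over a download of this size. TCP would normally retransmit a dropped segment, so this only shows how quickly losses add up; it does not model what happens when the CMUX framing itself is disturbed.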

In terms of level shifting chips v transistors, I couldn't really say; all I know is that standard bi-directional level-shifting chips definitely don't work with flow control lines, for some reason, but I guess the SparkFun HW has properly designed level-shifting HW. For this reason we actually end up without flow control on a few of our test instances and we relatively rarely have a problem but, on the other hand, we aren't doing 1.3 Mbyte downloads over PPP either.

So, if at all possible, have HW flow control I think.

@ws998116
Author

Thanks for your suggestions @RobMeades!
