Copyright © https://mongoose-os.com

Mongoose OS Forum

frame

glitch in UART code

Hi,
The scenario:

ESP32 running Mongoose OS 1.20.1, UART1 connected to a microcontroller which sends telemetry msgs. The ESP32 forwards
them via MQTT.
Application was first done via simple modification of init.js and conf9.json, it was running OK for several days
with mos 1.18.
Then we needed to add WiFi configuration so I made a proper project, compiled, and uploaded to the device.
It ran for some time and then I experienced this glitch. Upgraded to 1.20.1, rebuilt the app, provided for some
debugging via MQTT, and left it running again.
Within 48 hours it occurred again.

The observed problem:

Msgs are not forwarded since the ESP fails to check the msg framing. Read on.
This looks like "a glitch", and it happened not only once but twice today.

First time:

--- This is working
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8103 1511398284.247 1|Received UART data: [T{"H": 30, "C": 29, "EV": 0, "delta": 3}]
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8104 1511398284.260 1|Published: 1
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8105 1511398286.926 1|Received UART data: [T{"H": 30, "C": 30, "EV": 0, "delta": 2}]
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8106 1511398286.939 1|Published: 1
--- Now the glitch
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8107 1511398288.265 1|Received UART data: [T{"H": 30, "C": 29, "EV": 0, "delta": 2}
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8108 1511398288.273 1|*** DISCARDED ***
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8109 1511398289.606 1|Received UART data: ][T{"H": 30, "C": 30, "EV": 0, "delta": 1}
/Cika/debug/PSA/Dispenser/stdout = @SergioLab 8110 1511398289.611 1|*** DISCARDED ***
---

Notice here (msg 8107) that the ESP has missed the trailing ']' and then (msg 8109) has added it to the front of
the next incoming message. Since it sees an incorrect framing, it does not forward messages anymore.
Rambling:
Received data length seems to be correct, and so it leaves 1 char in the buffer forever. Looks like somehow
(glitch in the queue handling function ?) it lost one char in the count and it stayed inside the buffer, pointer correct (for 1 char present), but char count is now 0.
Every time a new msg comes in, the right number of chars is counted, and it comes out, but is preceded by the old char
left alone inside the buffer, so one is still left and this repeats forever

More info attached, including debug logs. Needed a hardware reboot for it to work again, and no, it is not the microcontroller, attached is also a logic analyzer capture where you can see the microcontroller sends always a correct frame.

I'm available for chat, same user (scaprile)

Regards

Comments

  • Actually UART data isn't guaranteed to be received by exact frames it is sent. So, it is legal to first receive [T{"H": 30, "C": 29, "EV": 0, "delta": 2} and then immediately receive ].

    However, given the log you posted, it looks like the ] does not immediately follow the preceding data, this looks suspicious indeed. We'll try to look into it.

    But in any case, your current code:

        if ((data.slice(0, 1) === '[') && (data.slice(ra-1, ra) === ']')) {
          ....
    

    Is fragile, since as I said above, the data might be received in portions which don't match exact frames as they were sent. The code should collect received data in some buffer, and parse each particular message (and remove that data from the buffer) only when a complete message is received.

    Thanked by 1scaprile
  • Thanks for your comment,
    For now, I send my messages spaced enough and each char in a message tight enough.
    I will properly write "message-aware" UART code when proper development on the ESP32 module starts, we've chosen to use dead-time signalling for our message framing system. That will be C code anyway.

    As you mention, there is no ´]´ alone after the message, looks like the fifo code got interrupted when it should have not, and even though the queue ingress remained pointing at the right place (2nd position in buffer), the char count was set to zero. Then for every n new chars coming in, they will be taken out starting with the former left out one, and since n chars will egress, there will always be one lurking inside wreaking havoc.
    The fact that it does not go away on a software reset makes me think something happened inside the ESP32 UART. If you don't reset its FIFO when setting up the driver, only a hardware reset would restore. Unfortunately I'm too busy with the main board code, so I can't peek myself, and I'm the only developer.

    Thanks for your support.

  • farzanfarzan Singapore
    edited December 2017

    Hello,
    I have also been experiencing similar issues with the ESP32. In my application, I'm communicating with an external device via UART. It is a master-slave type of configuration in which I need to send a command and wait for the device to reply. The steps followed by the program are:
    * On reset, the default baud rate of the device is 600, which I change to 9600 after sending the appropriate instruction.
    * Both devices then start communicating on a baud rate of 9600.
    * After sending a command, the program expects to see a reply of the appropriate length.
    * If no reply is received (or number of bytes is less than expected) within the timeout interval of 1 second, it discards any data and attempts to resend the command.
    * Complete data is published periodically to MQTT.
    Summarizing the key observations of the issue I am facing:
    * The code starts off with no issues.
    * After a random interval of time (I've observed times as short as 1 hour to as long as several days), my program detects a UART Receive timeout (i.e. UART dispatcher sees less than the expected number of bytes).
    * Everything received on the UART after that is 'corrupted', as it seems to contain byte(s) from previous data (even though mgos_uart_read_avail returns 0 prior to the next sent command).
    * After a software reset, I am not able to receive anything via UART (this part differs from the scaprile's issue, probably because the communication reverts to 600 baud on reset in my case).
    * A hardware reset fixes the issue until it occurs again.
    I currently have 2 prototypes deployed in a test-bed with the same firmware (built using mos 1.18). What is interesting is that in the first few weeks of continuous operation, there were only 1-2 occurrences of this issue across both devices. But recently, this frequency has increased to 1-2 occurrences daily for just 1 of the devices, while the other is still behaving as before. As yet, I have no reason to suspect it is a hardware problem since both devices are in the same environment.
    Because of the semi-deployed nature of this project, I don't have access to any debug logs via UART. I'll investigate the issue further when I have the next version of the prototype for bench testing (expected in January) but would be grateful if a solution is found before then.

  • @farzan Well, you are describing (more or less) exactly what I've reported.
    Did you try the newest release ?
    I noticed this started when I did some changes, like adding wifi soft-AP and tweaking the "linger time" parameter. Managed to reproduce the bug again without the soft-AP enabled, and seen they refactored the timers code for ESP32. Haven't had the time to reflash and let it run yet. Perhaps this week or next one.

    PS: Don't waste your time with logging, I've setup level 3 (debug) logging and captured log data during days, there's nothing in there regarding this.

  • farzanfarzan Singapore
    @scaprile I was about to go to the recent release. Need to do some changes in my code to make it compatible. I stopped when I read your post and saw you had already tried later releases.
    The latest esp-idf has some bug fixes related to uart hardware fifo. I'm wondering if that might fix the problem. Guess we need to wait for the next mos release incorporating esp-idf 3.0 to find out.
    I'm currently at a loss to explain why the issue is so random, and why one of my devices has suddenly become more susceptible to it
  • @farzan I'll try to explain it for you:
    To me, and that is why I used the word "glitch" but I haven't seen the code, this is some sort of a race condition in the code. It depends on many internals, among them the rate of UART reading code relative to the data ingress. Many variables can affect that, even a slightly different clock speed will effect a major difference among devices, you have to fall just in time to trigger the defect.
    It looks like when the input interrupt comes in the middle of the queue pop function handling character count. Suddenly there is one extra character in the FIFO but the queue missed it and silenced interrupts. It will stay there until another character (from the next message) wakes the subsystem, but then N chars will be counted and N retrieved, leaving the last one in the new message behind.
    You have to hit the code in the right moment to make it happen, and that only "succeeds" after many trials. Both devices have different clocks so rx ingress and queue retrieve are not in sync. When clock drifting makes them "closer", the chance to hit in the right moment is higher.
    My bet is that if you leave your devices running for a long time, you will see alternate (and different length) periods with more and less gltches.
    Since I am a hardware guy, I don't feel comfortable digging into this OS stuff, and also don't have the time to do it; I've chosen this path so I can concentrate on the application. So I guess we'll have to wait for 3.0. I will try to find the time to test the newer Mongoose-OS, my code is still trivial JS so it is only unplugging+flashing+plugging+log watching

    Thanked by 1farzan
  • farzanfarzan Singapore
    @scaprile Thanks, you've explained it very nicely. Hopefully, we'll have a fix soon. Being a hardware guy myself, I would also prefer to avoid going deep into the OS
  • @farzan On thursday I've built with latest release and left it running. No glitches so far, and usually they showed within a 48-hour period. You should upgrade. Regards

  • scaprilescaprile Argentina
    edited December 2017

    @farzan Well... I just posted and it glitched...

  • I believe this is fixed; I got a stable repro for that glitch (sending a 250K file through uart was enough: a few bytes were usually missing), fixed with this https://github.com/cesanta/mongoose-os/commit/c8b54e901bfee412cdda9cdc42902885860432e7

    You'll need to use latest mos, or it will be released in 1.24, after a couple of weeks.

    Thanked by 1farzan
  • farzanfarzan Singapore

    Just wanted to confirm, I rebuilt my program with mos 1.25 and it's been running continuously without issues for about a week. Thanks for the fix @dimonomid

  • scaprilescaprile Argentina

    Great!
    Thanks guys for your support.

Sign In or Register to comment.