Copyright © https://mongoose-os.com

Mongoose OS Forum

Core 0 Panic'ed

2

Comments

  • So there may be some red herrings here. I added some external +5V power from a desk supply, and things are working more reliably. It seems that if it boots into a state that "works", it stays working for some time, but if it comes up "broken", it stays that way.

    One odd thing that I've seen, with and without the external power, is an occasional difference in the startup messages:

    mgos_atca_init ATECC508 @ 0x60: rev 0x5000 S/N 0x123df869b6090d5ee, zone lock status: no, yes; ECDH slots: 0xe0

    vs

    mgos_atca_init ATECC508 @ 0x60: rev 0x5000 S/N 0x123df869b6090d5ee, zone lock status: yes, yes; ECDH slots: 0xe0

    Both zones are locked, and the chip behaves that way. Not seeing any correlation, just another oddity that may or may not be a clue.

  • rojer (Dublin, Ireland)

    no, --clean and/or removing build should be all that's needed for a clean build.
    the issues with power are a complete mystery to me.
    fwiw, my test rig is a DevKitC with ATECC508, SSD1306 I2C and SPI displays and an MCP9808 temperature sensor, all powered from 3.3V rail coming from the regulator on the devkit board.
    i have no issues communicating to any of the I2C devices, signing and verification works reliably as well.

    you don't happen to have a logic analyzer that you could connect to the I2C bus, do you? i'd like to see "bad" and "good" captures.
    i think as an interim workaround i'll provide a way to fall back to bit-banging I2C impl.

  • Thanks, @rojer. I honestly don't know what to report at this point. I think there may be noise issues or something, as moving it to a different spot on my desk seems to have cleared up a lot of the issues. Turning up the debug level seems to aggravate it. Best case, it will run great for several minutes, then quietly stop taking incoming network connections (but no core dump, no error messages). If I run its monitoring app, which makes a rest call every three seconds, it will fail much faster, often hitting the watchdog panic. The "background" tasks that are always running are a high-frequency timer that drives a string of nine neopixels, and the AWS shadow calls. I would imagine that the cpu should be able to keep up with all that, but I could see how tickling the watchdog could get lost in that.

  • Oh, and I don't have a logic analyser, but I may be able to get my hands on one. That would be interesting to see.

  • rojer (Dublin, Ireland)

    ok, i made it possible to switch between HW and bitbang I2C impl on ESP32 (here) and also optimized it a bit. please try and see if it gets better.
    in particular, we no longer disable ints during bitbanging - this may stretch some bits, but that should be safe, we're not breaking the protocol.
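
    To illustrate why stretched bits are harmless: I2C is clocked by the master, so the slave only cares about the order of edges, not how long each half-period lasts. Below is a minimal host-side sketch of a bit-banged byte write, not the actual Mongoose OS implementation. The `bb_i2c` hook struct and the `fake_bus` slave are hypothetical names invented for this example; on real hardware the hooks would wrap GPIO calls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pin hooks; on hardware these would wrap the GPIO driver. */
struct bb_i2c {
  void (*set_scl)(void *ctx, bool level);
  void (*set_sda)(void *ctx, bool level);
  bool (*get_sda)(void *ctx);
  void (*half_period)(void *ctx); /* nominal delay; may run long if preempted */
  void *ctx;
};

/* Clock out one byte, MSB first, then sample the ACK bit. True means ACK. */
static bool bb_i2c_write_byte(const struct bb_i2c *b, uint8_t byte) {
  for (int i = 7; i >= 0; i--) {
    b->set_sda(b->ctx, (byte >> i) & 1); /* SDA changes only while SCL is low */
    b->half_period(b->ctx);
    b->set_scl(b->ctx, true);            /* slave samples SDA on this edge */
    b->half_period(b->ctx);
    b->set_scl(b->ctx, false);
  }
  b->set_sda(b->ctx, true);              /* release SDA for the slave's ACK */
  b->half_period(b->ctx);
  b->set_scl(b->ctx, true);
  bool ack = !b->get_sda(b->ctx);        /* ACK is SDA held low by the slave */
  b->half_period(b->ctx);
  b->set_scl(b->ctx, false);
  return ack;
}

/* Host-side fake slave, for experimenting without hardware. */
struct fake_bus {
  bool scl, sda, nack;
  int rising_edges;
  uint8_t shifted;
};

static void fake_set_scl(void *ctx, bool level) {
  struct fake_bus *f = ctx;
  /* Shift in SDA on each of the first eight rising edges of SCL. */
  if (level && !f->scl && ++f->rising_edges <= 8) {
    f->shifted = (uint8_t) ((f->shifted << 1) | f->sda);
  }
  f->scl = level;
}

static void fake_set_sda(void *ctx, bool level) {
  ((struct fake_bus *) ctx)->sda = level;
}

static bool fake_get_sda(void *ctx) {
  struct fake_bus *f = ctx;
  return f->sda && f->nack; /* open-drain: low if the slave pulls it low */
}

static void fake_half_period(void *ctx) {
  (void) ctx; /* any duration works; the protocol has no maximum bit time */
}

static struct bb_i2c fake_bus_hooks(struct fake_bus *f) {
  struct bb_i2c b = {fake_set_scl, fake_set_sda, fake_get_sda,
                     fake_half_period, f};
  return b;
}
```

    Writing 0xC0 (the 7-bit address 0x60 from the startup log, shifted left, with the write bit clear) to the fake bus produces nine SCL rising edges regardless of how long `half_period` takes, which is the property that makes leaving interrupts enabled safe.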

  • rojer (Dublin, Ireland)

    oh, and btw - you mentioned neopixel. yeah, those are fairly power-hungry, driving them takes a fair amount of current. who knows, maybe turning on hw I2C just tipped it over the edge.

  • Yeah, that's why I power them separately. Little piggies!

    I have a logic analyzer on the way. Should have more data next week.

  • @rojer I have some captures. It looks like SCL gets running and never stops, like it's not getting a NAK or something. I can post the data somewhere, if that helps, otherwise I have a bunch of screen shots. I'm using a Saleae Logic 4:







  • rojer (Dublin, Ireland)

    that looks like https://github.com/espressif/esp-idf/issues/606 except i thought i found a workaround for it.
    can you post the fragment where SCL loop begins? does it look like what i posted on the bug?

  • This section? It doesn't look exactly the same, but does have similar results.

  • rojer (Dublin, Ireland)

    no, where constant SCL pulsing starts. here i see device sending remaining data, but host is not really listening anymore, it's stuck.

  • Sorry, still trying to comprehend what I'm seeing. Here is one I caught just after reset.

    What I assume is the initial wake-up:

    Then the next event, in detail:

    and zoomed out a bit:

    and at the end, I assume a reset you do after the 10k iterations:

    Hope that helps.

  • rojer (Dublin, Ireland)

    ok. i think i see the same thing, only triggered at a different point.

    can you confirm that switching to bitbang i2c (MGOS_ENABLE_I2C_GPIO: 1) solves the problem?

  • I will verify this tomorrow.
  • Hi, @rojer. So bit-banging seems to work. I have other issues there, like the NeoPixel timer not getting enough attention, and it suddenly refusing incoming connections after a few minutes, but I think it's unrelated.

  • rojer (Dublin, Ireland)

    definitely not related to I2C. thanks for confirming. i'll try to find more time to produce an isolated simple repro and poke ESP folks harder.
    i want to know more about other issues. timer not getting enough attention... multicore could help here, it would at least off-load network processing to the other core.
    however, seeing all the activity happening upstream having to do with fixing multicore issues, i'm a bit wary. but if you're feeling adventurous, you can turn off CONFIG_FREERTOS_UNICORE.
    i suggest you wait a bit for the build image update, which will update the IDF (2.0-r5, PR is already in review).

    re: stopping accepting connections - i think i've seen this happen once. i'll spend more time this week looking at server performance, it seems a bit low. i'll keep an eye on this too.

  • Cool. I was thinking about enabling the other core. Would I then have to manage the allocation of tasks to it, or will RTOS take care of that? And if it does, would I want to add semaphores? The simplest thing I could do is just set my timer tasks to run on the second core.

  • And I can catch the server failing, so I'll up the debug level and see if that gives any helpful information.

  • rojer (Dublin, Ireland)

    you can try with no changes at all. many of the tasks are networking-related, so those should be off-loaded automatically. mgos main task is started with no pinning to any core, so i expect that it should freely migrate between them.
    if you run your own timer tasks, it's up to you what to do there.

    i should generally caution you that mgos is not (yet) truly multithreaded - a few key places are protected by locks, but generally we expect that things will stay on mgos main task. mgos_* apis are expected to be called from the main task, for example. mg_* networking services as well.
    if you use multitasking, this, in fact, may be the reason for instability.

  • carl (US)
    edited May 24

    Nothing fancy, just calling mgos_set_timer to blast out NeoPixel data to make a rainbow effect. The point is to give the app something heavy to do while it's being polled for state. You can imagine how that goes, but here it is on the logic analyzer:

    Closer up:

    Of course the beautiful rainbow effect pauses and flashes occasionally. Not a big deal for this toy app, but might be if it was doing something more critical.
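
    For readers who want a comparable load to reproduce this, here is a rough sketch of the kind of per-frame work such a timer callback might do. It is not the poster's code: the `wheel` and `rainbow_frame` names are made up, and it assumes nine WS2812-style pixels that expect bytes in GRB order. Actually clocking the buffer out to the strip is left out.

```c
#include <stdint.h>

#define NUM_PIXELS 9 /* the thread mentions a string of nine NeoPixels */

struct rgb {
  uint8_t r, g, b;
};

/* Map a position 0..255 on a colour wheel: red -> green -> blue -> red. */
static struct rgb wheel(uint8_t pos) {
  struct rgb c;
  if (pos < 85) {
    c.r = (uint8_t) (255 - pos * 3);
    c.g = (uint8_t) (pos * 3);
    c.b = 0;
  } else if (pos < 170) {
    pos -= 85;
    c.r = 0;
    c.g = (uint8_t) (255 - pos * 3);
    c.b = (uint8_t) (pos * 3);
  } else {
    pos -= 170;
    c.r = (uint8_t) (pos * 3);
    c.g = 0;
    c.b = (uint8_t) (255 - pos * 3);
  }
  return c;
}

/*
 * Fill one frame, spreading the wheel evenly across the strip and rotating
 * it by `step`. Bytes are stored in GRB order (an assumption about the LEDs).
 */
static void rainbow_frame(uint8_t step, uint8_t out[NUM_PIXELS * 3]) {
  for (int i = 0; i < NUM_PIXELS; i++) {
    struct rgb c = wheel((uint8_t) (step + i * 256 / NUM_PIXELS));
    out[i * 3 + 0] = c.g;
    out[i * 3 + 1] = c.r;
    out[i * 3 + 2] = c.b;
  }
}
```

    Calling `rainbow_frame` with an incrementing `step` from a periodic callback gives the moving rainbow; any delay in that callback shows up as the pauses described above.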

  • rojer (Dublin, Ireland)

    yeah, ok, this is expected. mgos timers run in the same event loop as everything else. so here i would make neopixel timer a freertos timer, or even make it a separate task with priority higher than MGOS_TASK_PRIORITY. it's a fairly isolated piece, just banging out its bits, so should be safe to do.
    even without multicore, it should be able to preempt mgos task that may be doing i2c at the moment to do the neopixel shiny thing. this would stretch out i2c, but that should be fine.
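
    The effect of sharing one event loop can be shown with a small host-side simulation. This is only a model, not Mongoose OS or FreeRTOS code: `worst_timer_lateness_ms` is a hypothetical helper that assumes each handler runs to completion and that overdue timer ticks fire as soon as the loop regains control.

```c
#include <stdint.h>

/*
 * Simulate a single-threaded event loop with one periodic timer.
 * handler_ms[i] is how long the i-th handler holds the loop.
 * Returns the worst lateness, in ms, of any timer tick.
 */
static uint32_t worst_timer_lateness_ms(uint32_t period_ms,
                                        const uint32_t *handler_ms,
                                        int n_handlers) {
  uint32_t now = 0, next_due = period_ms, worst = 0;
  for (int i = 0; i < n_handlers; i++) {
    now += handler_ms[i];       /* the handler cannot be preempted */
    while (now >= next_due) {   /* loop is back in control: fire overdue ticks */
      uint32_t late = now - next_due;
      if (late > worst) worst = late;
      next_due += period_ms;
    }
  }
  return worst;
}
```

    With a 10 ms timer and handlers taking 1, 1, 35 and 1 ms, the model reports a worst-case lateness of 27 ms, all caused by the one slow handler. A separate task with a higher priority avoids this because the scheduler can preempt the slow handler instead of waiting for it to return.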

  • Yeah, that worked great. Using the FreeRTOS timer made things better, but separate task was best.

    Before:

    Using freertos timer:

    Using freertos task:

  • But still seeing the original issue:

    ...
    mgos_http_ev         0x3ffb6354 HTTP connection from 192.168.1.8:55990
    ssl_socket_send      0x3ffb6354 31 -> 31
    Guru Meditation Error of type LoadProhibited occurred on core  0. Exception was unhandled.
    Register dump:
    PC      : 0x40124c4e  PS      : 0x00060430  A0      : 0x801219d6  A1      : 0x3ffc9a70
    A2      : 0x3ffd5738  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x00000000
    A6      : 0x7ff00000  A7      : 0x00000000  A8      : 0x8011e4ec  A9      : 0x3ffc9a40
    A10     : 0x3ffb6354  A11     : 0x00000000  A12     : 0x00000003  A13     : 0x00000005
    A14     : 0x00000000  A15     : 0x00000830  SAR     : 0x00000001  EXCCAUSE: 0x0000001c
    EXCVADDR: 0x00000004  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0x00000000
    
    Backtrace: 0x40124c4e:0x3ffc9a70 0x401219d6:0x3ffc9a90 0x401216aa:0x3ffc9ab0 0x4012218d:0x3ffc9b10 0x4012c86b:0x3ffc9b30 0x40082a51:0x3ffc9b50 0x401463e0:0x3ffc9b70
    
    --- BEGIN CORE DUMP ---
    {"arch": "ESP32", "cause":28,
    "REGS": {"addr": 1073504832, "data": "
    TkwSQNYZEoBwmvw/OFf9PwAAAAAAAAAAAAAAAAAA8H8AAAAA7OQRgECa/D9UY/s/AAAAAAMAAAAFAAAAAAAAADAIAADvvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e
    776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e
    776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd7vvq3e776t3u++rd5sxABAd8QAQAAAAAABAAAAAAAAAAEAAADvvq3e
    776t3iAEBgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"},
    "DRAM": {"addr": 1073405952, "data": "
    s8y79lSh8cnEv7KNKcXqbQAAAABMKQhAAAAAAAAAAACQFPw/kBT8PwAAAAAAAAAAqcT5P6nE+T8BAAAAAAAAAEoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABKAAAA
    ...
    
    % xtensa-esp32-elf-addr2line -a -f -p -e build/objs/fw.elf 0x40124c4e:0x3ffc9a70 0x401219d6:0x3ffc9a90 0x401216aa:0x3ffc9ab0 0x4012218d:0x3ffc9b10 0x4012c86b:0x3ffc9b30 0x40082a51:0x3ffc9b50 0x401463e0:0x3ffc9b70
    0x40124c4e: mg_ssl_if_handshake at /mongoose-os/mongoose/mongoose.c:1759
    0x401219d6: mg_lwip_ssl_do_hs at /mongoose-os/mongoose/mongoose.c:1759
    0x401216aa: mg_lwip_if_poll at /mongoose-os/mongoose/mongoose.c:1759
    0x4012218d: mg_mgr_poll at /mongoose-os/mongoose/mongoose.c:1759
    0x4012c86b: mongoose_poll at /mongoose-os/fw/src/mgos_mongoose.c:56
    0x40082a51: mgos_mg_poll_cb at /mongoose-os/fw/platforms/esp32/src/esp32_main.c:159
    0x401463e0: mgos_task at /mongoose-os/fw/platforms/esp32/src/esp32_main.c:227
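
    A note for anyone decoding a similar dump: each entry on the Backtrace line is a PC:SP pair, and the PC half is what addr2line resolves. Below is a minimal sketch of pulling the PCs out of such a line. `backtrace_pcs` is a hypothetical helper written for this example, not part of any ESP-IDF or Mongoose OS tooling.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy up to `max` PC values out of a "Backtrace: PC:SP PC:SP ..." line.
 * Returns the number of PCs stored in `pcs`.
 */
static int backtrace_pcs(const char *line, uint32_t *pcs, int max) {
  int n = 0;
  const char *p = strstr(line, "0x"); /* skip the "Backtrace:" label */
  while (p != NULL && n < max) {
    char *end;
    unsigned long pc = strtoul(p, &end, 16); /* base 16 accepts a 0x prefix */
    if (end == p || *end != ':') break;      /* not a PC:SP pair, stop here */
    pcs[n++] = (uint32_t) pc;
    p = strchr(end, ' ');                    /* jump over the SP half */
    if (p != NULL) p++;
  }
  return n;
}
```

    Run against the Backtrace line above, this yields seven addresses, starting with 0x40124c4e, which is the frame reported as mg_ssl_if_handshake.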
    
  • rojer (Dublin, Ireland)

    hm. ok, i will take a look at it soon.

  • rojer (Dublin, Ireland)

    couldn't repro... been stress-testing HTTPS serving for a while now - works fine, with two concurrent fetchers.
    do you have HTTPS enabled? can you turn debug.level up to 4 and mbedtls_level to 3?

  • I'll give it a try. At the moment I'm hitting watchdog timeouts when I set the debug level that high.

    So I've got both HTTPS and MQTT/AWS using ATCA:4, polling HTTPS every three seconds.

  • carl (US)
    edited May 26

    Not the same error, but using wget rather than Chrome, I was able to catch it not accepting connections. Note that the MQTT pings continue.

    I saved the boot sequence, in case you need that (01-boot.txt). Once that settled, I started a looped wget, with no delay between. The tenth request stalled. Annotated console log attached (02-start-polling.txt).

    Hope that helps.

  • rojer (Dublin, Ireland)

    ok, i'll poke at it some more
