Copyright © https://mongoose-os.com

Mongoose OS Forum

frame

ESP32 Onewire Reliability Issues

mookiedogmookiedog Washington, USA

This might be a general Onewire question or a Mongoose library implementation thing, but how reliable is Onewire communication expected to be?

I have a two separate systems, each with one or more Dallas DS18B20 sensors. The wire lengths are short - two feet tops. I am noticing that I get a pretty consistent level of communication errors in the both systems. For example, one system takes a temperature reading of one of its sensors every 5 seconds. After close to 3 days of operation, that system reports that its success rate at reading the temp sensor is only 97.9% (979 errors in 47351 attempts). The success rate is pretty consistent for me: over the long run, the success rate always ranges around 98%. I have tried pullup resistors from about 10K to 2K, and there is no change to the success rate.

So: why is it not closer to 100%? Is the Onewire protocol simply that unreliable? A 98% success rate for the other serial links like UART communications or I2C or SPI would be abysmal.

In a related vein: when the Onewire library gets initialized, it is a fairly common occurrence for the number of sensors that are detected to be wrong. To look at that specific issue, I changed one test system to re-initialize the Onewire library once per second. All the test does is log how many sensors it believes are attached after each attempt. For the record, that system has exactly 2 sensors attached. After running for over half an hour, this is what I got:

0 sensors observed: 63 times
1 sensor observed: 44 times
2 sensors observed: 2005 times

That would be a success rate of 94.7%.

Once again, is Onewire just lame, or is there something else going on?

Getting the sensor count being wrong as the system boots can be extremely problematic. If a system is "unlucky" and sees the wrong number of sensors as it boots, then it is screwed up from that point on. To mitigate that, I added wrapper code around the initialization that makes multiple initialization attempts, and then "votes" on the correct answer. So far, that has always worked to get me the right sensor count at least, but then I'm back to a 2% error rate on any subsequent Onewire communication.

Is this a standard Onewire experience, or should I be looking for a problem?

Comments

  • mookiedogmookiedog Washington, USA

    OK, I'm looking for problems.

    A logic analyzer shows that a typical good transfer looks like a giant, continuous wad of serial activity:

    After collecting a number of bad transfer events, it would appear that a bad transfer always appears to have one or more interruptions during the transfer. In the screen shot below, a failed transfer appears to have been stalled (DATA stuck at '1') twice during the latter half of the transfer:

    And just to make things more interesting, there are rare times when I have seen good transfers get stalled momentarily but without error, as seen below:

    In theory, Onewire allows arbitrary delays in between bits (where the DATA bus would be parked at '1'), so those stalled 'good' transfers should be fine. The interesting part is that the stalled failing transfers should be fine too, and clearly, they are not.

    Something seems amiss. I will keep looking.

  • mookiedogmookiedog Washington, USA

    OK, I found something. If I disable interrupts around my Onewire calls to read a device temperature, then all Onewire reads start working with 100% reliability over thousands of sequential read operations. For example:

    mgos_ints_disable();
        float degreesF = s_dt->getTempF(s_sens[0]);
     mgos_ints_enable();

    The same applies during the initialization of the Onewire library. If I disable interrupts, then I get the proper sensor count every single time:

    mgos_ints_disable();
          s_dt->begin();
    mgos_ints_enable();

    Reading the underlying library code shows that the Onewire read-bit code does attempt to lock out interrupts during the critical Onewire timing section:

    bool mgos_onewire_read_bit(struct mgos_onewire *ow) {
      if (ow == NULL) return 0;
      mgos_ints_disable();
      mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_OUTPUT);
      mgos_gpio_write(ow->pin, LOW);
      mgos_usleep(3);
      mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_INPUT);
      mgos_gpio_set_pull(ow->pin, MGOS_GPIO_PULL_UP);
      mgos_usleep(10);
      bool res = mgos_gpio_read(ow->pin);
      mgos_ints_enable();
      mgos_usleep(52);
      return res;
    }

    Even though it would appear that the timing is being managed properly, I think there is a bug.

  • mookiedogmookiedog Washington, USA

    I found the source of my Onewire errors: once in a while, the Onewire bit receive routine will mis-read a '0' bit on the bus as a '1' bit.

    To debug the issue, I modified the Arduino onewire library code in a few ways: 1) I changed the management of the critical sections to be on a per-byte basis, not a per-bit basis, 2) I added a 300 uSec delay between bytes so that they are easily delineated on a logic analyzer and 3), I added another signal on GPIO 26 to mark various events during the Onewire bit receive operation. The read_bit routine now looks like this:
    `
    #define DELAY_A 3
    #define DELAY_E 10
    #define DELAY_F 52

    bool mgos_onewire_read_bit(struct mgos_onewire *ow) {
    mgos_gpio_set_mode(26, MGOS_GPIO_MODE_OUTPUT);
    mgos_gpio_write(26, 1);

    if (ow == NULL) return 0;
    //mgos_ints_disable();
    mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_OUTPUT);
    mgos_gpio_write(ow->pin, LOW);

    mgos_gpio_write(26, 0);
    mgos_usleep(DELAY_A); // was 3
    mgos_gpio_write(26, 1);
    mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_INPUT);
    mgos_gpio_set_pull(ow->pin, MGOS_GPIO_PULL_UP);

    mgos_usleep(DELAY_E); // was 10
    bool res = mgos_gpio_read(ow->pin);
    mgos_gpio_write(26, 0);
    mgos_usleep(DELAY_F); // was 52
    //mgos_ints_enable();
    return res;
    }`

    (Sorry, the code formatting markdown is not working again)

    The DELAY_A/E/F refer to the various delays as per the Onewire spec. It should be noted that the 3/10/52 selections in the existing library are not quite within the Onewire min/typ/max limits, but in this case, their errors should mostly be to our benefit.

    With this new read_bit routine in place, I can demonstrate Onewire errors of the variety I have been observing:

    In the screen shot above, you can clearly see the 300 uSec delays I added to delineate the received bytes. However, you can also see that one of the delays is longer than that, at slightly over 1 millisecond in duration. The only explanation is that there must have been an interrupt during that period. If we zoom in on the first two bits received after that interrupt, we see the following:

    In the second screen shot, you can see two '0' bits being transferred. They are known to be '0' bits because the DATA line goes low for about 30 uSec. A Onewire '1' bit would show the DATA line being low for less than 15 uSec. The interesting part is to see what is happening on the signal called SAMPLE. From the modified read_bit code above, you can see that we drive SAMPLE high as soon as we enter the bit receive routine mgos_onewire_read_bit(). Sample is then driven low just before the usleep(3) and high again right after usleep(3) completes. That allows the 3 usec delay to be measured. It measures out at just under 5 uSec, but that is to be expected for such short delays plus the additional code to wiggle the SAMPLE line. Getting back to the screen shot, as soon as SAMPLE goes high, we tri-state the DATA bus and then start the 10 uSec delay before sampling the bus state to read the bit. Finally SAMPLE is driven low just after reading the bit state from the bus.

    ...And the problem should be quite obvious now: the supposed 10 uSec delay is actually closer to 30 uSec, and the data bus gets sampled after the zero bit's time slot has completed.

    The strange part is that if you look at the next bit being received, the same 10 uSec delay is much shorter, although still close to 20 uSec. The 3 uSec delays look about identical between bits 0 and 1, but the 10 uSec delay is significantly longer during bit 0. Weird. Interrupts are disabled at the byte level, so it should not be some random interrupt that stretched bit 0's 10 usec pulse.

    There are still a number of things to consider.

    First, this extra-long delay only happens after an interrupt, and a pretty lengthy one at that. There is about a millisecond of code executing in between the received bytes. That is enough code running that it becomes likely that our bit-receive code got flushed from the processor's icache. That would mean that when we get back to our own code, it is going to run more slowly during the first bit as it gets reloaded into the cache. I think this is unlikely though. That's because the extra long delay was the _second _time that we called usleep() after the interrupt. Any icache fetching delays would affect the first call after the interrupt, not the second. The only code that has not already been executed would be the code to set the pullup. So I guess it could be that. I could probably come up with an experiment to prove that one way or another...

    Secondly, is it possible that there is some system event like an unmaskable interrupt or exception which is screwing up the usleep(10) timing? The ESP32 is a complex system, and vast amounts of it are basically opaque to those of us operating at this level.

    Thirdly, in experimenting with WIFI_PS_MODEM power save mode, it appeared like all of the timing delays for the first bit after an interrupt would be running about half as fast as subsequent bits. That is not the case here: the 3 usec delay looks like it is identical on bit 0 and bit 1. It's only the 10 uSec delay that is way too long. But it makes me wonder if various interrupt handlers might be messing with selecting different processor speeds, and the switching between them might not be perfect.

    Finally, I am wondering what the implementation of usleep() looks like. Is it possible that usleep() is somehow calculating the passage of time incorrectly after certain interrupts? Is the mgos_usleep() code available to see somewhere?

  • mookiedogmookiedog Washington, USA

    Thanks very much! I was hoping to see the delay calculation based off a hardware timer.

    I think I have the problem nailed down. Today's investigation points towards a problem resulting from the bit-reading getting flushed from the icache, which was one of the possibilities I mentioned in an earlier post. To confirm that, I inserted some bit toggling code throughout mgos_onewire_read_bit() to get a bit of insight into the timing involved with each step that the routine performs. What I found is that the first time the routine is called after one of the long interrupt situations, certain parts of mgos_onewire_read_bit() take noticeably longer than 'normal'. For example, the call to set DATA to its tristated input mode is typically about 5 uSec, but on the first time through after an interrupt it can be as long as 13.2 uSec. Similarly, the call to enable the pullup on the DATA input is normally about 4 uSec, but once again, first time after a long interrupt it can be as long as 8 uSec. So in my previous comment, I could not understand how an icache issue would cause the delay routine to be slower the second time through. Today's testing showed that the delay routine was not the thing going slower: it was the two calls to manipulate DATA GPIO just before the delay. I had assumed that the GPIO maniuplation code would be really fast, but even when icache-resident, it took a lot longer than I had been assuming.

    To test the icache flushing theory, I added the following lines to the start of mgos_onewire_read_bit():

    bool mgos_onewire_read_bit(struct mgos_onewire *ow) {
      // This first set of calls is just to make sure the icache is primed
      mgos_gpio_set_mode(26, MGOS_GPIO_MODE_INPUT);
      mgos_gpio_set_pull(26, MGOS_GPIO_PULL_UP);
      mgos_usleep(1);
      ...

    Pre-executing the calls (using a random GPIO so that the onewire bus is not affected), and doing the shortest possible time delay ensures that all the instructions in those routines become resident in the icache so that when the timing critical code runs moments later, the routines will run at a consistent speed since there will be no icache misses.

    ...and the results are very encouraging over the first hour of testing read transfers once per second:
    [Jan 8 16:43:22.360] Disconnection reports: 1 disconnects in 3692 attempts (100.0% success rate)
    On the one had, I have never seen the OneWire interface be so reliable! On the other hand, the single error that still occurred is probably due to some other part of mgos_onewire_read_bit() that got flushed from the icache slowing things down. All the test did was to make sure that most of the code was in the icache, not all of it.

    So: how to really fix it? Ultimately, bit-banging solutions like this don't need speed as much as they need consistency. One possible solution would be to have mgos_onewire_read_bit() start by disabling the icache, and re-enabling it as it leaves. The bit read instructions will run slowly, but consistently.

    I suspect that there must be an SDK call to enable or disable the icache. Does anyone out there know what it is?

  • mookiedogmookiedog Washington, USA

    For the benefit of Google search, here is a concise description of the existing OneWire library's unreliable read issue as I see it:

    The OneWire library makes an assumption that the execution of its "bit banged" software bit receive operation will occur in constant time. On an ESP32, it is possible that a side effect of interrupt execution elsewhere in the system will result in some or all of the OneWire library's bit-read routine getting displaced from the internal cached memory. Since the detection of OneWire '0' bits is very time-sensitive, if the first bit after the interrupt is a '0', the delays involved while the bit-read routine gets re-cached may cause that '0' bit to be interpreted as a '1'. This only affects '0' bits: the timing to detect of a '1' bit is open-ended from a master's point of view.

    As far as complete solutions go, I'm not sure there is a great one here.

    From what I can find on the web, it seems that the ESP32 is strictly a RAM-based processor, containing no embedded flash for code storage. All ESP32 code must be stored in an external serial flash. The ESP32's icache and MMU serve to move the contents of the serial flash into the internal SRAM as required. That would explain why the cache misses in the Onewire bit-banging routines appeared to be so expensive: dealing with a miss involves going off-chip to access an SPI serial flash. A side effect of all that is that I'm not sure that the ESP32 is really designed to operate with its icache disabled. My attempt to [mostly] preload the cache before receiving the a byte is working well. After running overnight, the system has had only 1 error in 44,335 tries. That is certainly better than the original 2% error rate, or 1 error per 50 tries. But putting that sort of a solution into the library certainly craps up the library sources for all the other processors that don't suffer from this issue. I have problem modifying a local copy of the library to serve my needs, but it just means that this issue will continue to bite ESP32 users who continue to use the standard library.

  • mookiedogmookiedog Washington, USA

    For the record, here is the final replacement code I came up with for the OneWire routine that reads data bits. The original library code was exhibiting an error rate of over 2% in my application. This version has performed over 500K read operations on a pair of test systems without a single error so far.

    Once again, my apologies for the lack of code formatting:

    You can use Markdown in your post.

    Nope. See below for a perfect example. The entire codeblock is in between backquotes, but for some reason, the markdown process is only formatting 4 random lines in the middle of the codeblock.

    `bool mgos_onewire_read_bit(struct mgos_onewire *ow) {
    bool res;

    if (ow == NULL) return 0;
    mgos_ints_disable();

    // Issue: if any of the code executed in this critical section is not cache-resident
    // and the bit being received is a '0', the execution delays involved in caching the
    // code may cause the '0' bit to be interpreted as a '1'.
    //
    // Fix: We ensure that the critical code is cached by executing it twice: once to make sure it is cached,
    // and a second time to actually perform the time-critical read operation.
    //
    // The first time through:
    // - the Onewire data pin is driven to '1' instead of '0' to avoid starting the Onewire READ operation
    // - the loop timing delays are set to minimal values since they are not needed first time through
    // The second time through, the code is known to be cache-resident. Therefore:
    // - The OneWire data pin gets driven to '0' which starts the time-critical READ operation
    // - timing delays are executed with their proper values

    const int delay_param_F = 52; // Min 50, Typ. 55, Max

    // Init params to have little or no effect the first time through:
    int delay_param_A = 1;
    int delay_param_E = 1;
    int data_state = 1;

    int iterations = 2;

    do {
    mgos_gpio_write(ow->pin, data_state); // To avoid potential glitches, set the new output state...
    mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_OUTPUT); // ...before changing the mode
    mgos_usleep(delay_param_A);
    mgos_gpio_set_pull(ow->pin, MGOS_GPIO_PULL_UP); // To avoid potential glitches, enable pullups...
    mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_INPUT); // ...before changing the mode
    mgos_usleep(delay_param_E);
    res = mgos_gpio_read(ow->pin);

    // Prepare for the actual read operation:
    data_state = 0;
    delay_param_A = 6;      // Min 5, Typ. 6, Max 15    Test system started producing errors at < 5 
    delay_param_E = 9;      // Min 5, Typ. 9, Max 12    Test system started producing errors at > 17
    

    } while (--iterations);

    mgos_ints_enable();
    mgos_usleep(delay_param_F);
    return res;
    }
    `

  • @mookiedog , in markdown, a single backquote is used for oneline expressions, like this one. If you need to have multiline block of code, wrap it into three backquotes:

    ```
    foo
    bar
    ```
    
  • mookiedogmookiedog Washington, USA

    Thanks, that solves a mystery for me. I have been using the web editor's 'paragraph' pulldown menu, and then selecting 'code'. It only inserts single backquotes. But weirdly, single back-quotes work most of the time to properly format multiple-line entries.

    Anyway, here is a more readable version of the fix:

    bool mgos_onewire_read_bit(struct mgos_onewire *ow) {
      bool res;
    
      if (ow == NULL) return 0;
      mgos_ints_disable();
    
      // Issue: if any of the code executed in this critical section is not cache-resident
      // and the bit being received is a '0', the execution delays involved in caching the
      // code may cause the '0' bit to be interpreted as a '1'.
      //
      // Fix: We ensure that the critical code is cached by executing it twice: once to make sure it is cached,
      // and a second time to actually perform the time-critical read operation.
      //
      // The first time through:
      //   - the Onewire data pin is driven to '1' instead of '0' to avoid starting the Onewire READ operation
      //   - the loop timing delays are set to minimal values since they are not needed first time through
      // The second time through, the code is known to be cache-resident.  Therefore:
      //   - The OneWire data pin gets driven to '0' which starts the time-critical READ operation
      //   - timing delays are executed with their proper values
    
      const int delay_param_F = 52;     // Min 50, Typ. 55, Max <none>
    
      // Init params to have little or no effect the first time through:
      int delay_param_A = 1;
      int delay_param_E = 1;
      int data_state = 1;
    
      int iterations = 2;
    
      do {
        mgos_gpio_write(ow->pin, data_state);                 // To avoid potential glitches, set the new output state...
        mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_OUTPUT);   // ...before changing the mode
        mgos_usleep(delay_param_A);
        mgos_gpio_set_pull(ow->pin, MGOS_GPIO_PULL_UP);       // To avoid potential glitches, enable pullups...
        mgos_gpio_set_mode(ow->pin, MGOS_GPIO_MODE_INPUT);    // ...before changing the mode
        mgos_usleep(delay_param_E);
        res = mgos_gpio_read(ow->pin);
    
        // Prepare for the actual read operation:
        data_state = 0;
        delay_param_A = 6;      // Min 5, Typ. 6, Max 15    Test system started producing errors at < 5 
        delay_param_E = 9;      // Min 5, Typ. 9, Max 12    Test system started producing errors at > 17
      } while (--iterations);
    
      mgos_ints_enable();
      mgos_usleep(delay_param_F);
      return res;
    }
    
Sign In or Register to comment.