Sunday, June 20, 2021

Solving lwip troubles.

Some time after the introduction of lwip core locking into ChibiOS port, I started to notice that MQTT still tends to disconnect and connect right after. Same problem as before core locking. This is not very good that all other lwip process seems to work well, but MQTT is keep on reconnecting. Well lwip is a big and complex library to jump in and decode all possible problems, and I decided to start with MQTT client part only. There are quite a few debugging facilities in lwip and I turned on many to start seeing the whole picture.

Nice informative way to see information about lwip is to enable LWIP_STATS. There are many options what to see, like to examine possible memory problems by defining:
#define LWIP_STATS_DISPLAY 1
#define MEMP_STATS 1
#define PBUF_STATS 1
#define SYS_STATS 1

And adding:

#ifdef LWIP_STATS
  stats_display(); // Show lwip statistics
#endif

into while (true) lwip thread.

I started to torture again OHS application fir frequent radio packets resulting into MQTT publish messages. All was fine even with heavy usage of web interface and MQTT subscribe. But after 5 days the statistics started to show an error on MEMP TCP PCB and the in used number of same did not cease even after no other TCP communication then MQTT was used. I increased the number of MEMP_NUM_TCP_PCB to 50 and started again. 5 days later the same problem.

Examining MQTT library or searching forums did not help much. In logs I found that the client receives an empty pbuf which is forcing the client to disconnect. Mosquitto broker then does not see the keep alive periodic packet and also closes the connection. But I started to measure the exact time when the MQTT get out of pbufs, and it a bit surprised me that it happens exactly after 4.9 days. To be precise after 2^32 / 10000 = 429496,7296 seconds. There are timers in lwip that make various time dependent tasks, like removing used pbufs or dhcp. Deep diving into it I found out that the problem is not in lwip, but in ChibiOS port. The sys_now() function does not overflow on u32_t boundary but actually much earlier, causing the timers wait forever.

A patch of the sys_now() function seems to fix the problem completely. It comes with new OHS version 1.3.8 with a few cosmetics enhancements described in particular commits on GitHub. 

It is recommended to upgrade if you use MQTT.