Hi,
I have 2 T3-BBs in production. One is working just fine, with a couple of error bursts a day from my custom BMS. The other is being a problem, with several bursts per hour. I am still seeing between 1-7% of reads failing. I don’t see consecutive reading cycles failing; instead, every 5-15 minutes I get a burst of between 4 and 9 consecutive failed reads over several seconds. It recovers on its own and keeps on going.
I poll my ~8 sensors on this BB every 30s, plus checking sound binary outputs every couple of minutes. Thinking I might be flooding the unit, I added a random 0-5s wait to make sure the reads were spread out. That didn’t help at all.
The BMS and the problem T3-BB are plugged into the same gigE switch, and broadcast traffic is on the order of a few packets per second. These are very low traffic numbers IMO.
Where do I go from here to sort out why one unit is failing so many reads?
-I would get started with a firmware update.
-Then check the program page, look for any programs which are in an endless loop. They will show up as ‘Loop’ in the scan time column. Search this forum for more info on loops.
-Next check the Wireshark logs and look for any bad packets. Check the timing between them in particular.
Wireshark was a great idea.
I am not running any programs at all, just remote read/write from my BMS.
I want to attach a 50 kB capture in the thread, but couldn’t figure out how.
In the trace, starting at 10:06:09, my BMS code is sending out read requests at 1-2 per second. The BB doesn’t respond to any requests until 10:06:33. My code starts reporting errors at 10:06:21.
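For anyone wanting to find this kind of silent stretch in their own capture without scrolling through it by hand, here is a rough sketch. It assumes you have already exported the request and response timestamps (in seconds) from Wireshark into two lists. The function name, the 5 s threshold, and the sample numbers are made up for illustration and only loosely mimic the trace described above.

```python
def silent_windows(requests, responses, min_silence_s=5.0):
    """Find stretches where the device sent no response while requests kept coming.

    requests, responses: timestamps in seconds (any common time base).
    Returns a list of (last_response, next_response, requests_sent_between).
    """
    windows = []
    ordered = sorted(responses)
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev < min_silence_s:
            continue  # normal spacing between responses
        sent = [t for t in requests if prev < t < nxt]
        if sent:
            windows.append((prev, nxt, len(sent)))
    return windows


# Hypothetical data: seconds after 10:06:00, one request per second.
requests = [float(t) for t in range(9, 33)]   # 9.0 .. 32.0
responses = [8.0, 33.0, 33.5]                 # nothing between :08 and :33

for start, end, count in silent_windows(requests, responses):
    print(f"silent from {start:.1f}s to {end:.1f}s, {count} requests sent")
```

Note that the last request in each window is probably the one that finally got answered, so the count is "requests sent during the silence", not an exact count of lost requests.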
I prefer not to upgrade code while chasing problems. When things “just disappear”, it’s hard to know whether they are gone for good or will come back.
Are there logs on the T3-BB that might give a clue?
jerry
Chelsea will check over your logs, but the first thing she will ask is what version you’re at. If it’s not the latest, updating would be the first step.
long final status.
Many thanks to Chelsea and Maurice for pointing me in the right direction.
I am writing my own BMS using BACnet, so I see all the details that most people will just pass over. In particular, all of my work is done asynchronously, so there is no blocking of BACnet requests.
It turns out that the BACnet code built into the T3 firmware has at its base an IP layer called uIP. Those designers made a strange decision not to support receiving more than one packet at a time. The system has to finish one request before it can accept another packet in. All my experience is with larger systems that can handle hundreds of incoming packets and queue them for processing.
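To make the effect concrete, here is a toy model of a receiver that can only hold one request at a time. This is not uIP's actual code, just a sketch under the assumption that any packet arriving while the previous request is still being handled is lost. The function name and the 7 ms service time are assumptions chosen for illustration.

```python
def dropped_requests(arrivals_s, service_s=0.007):
    """Return the arrival times that would be lost by a one-at-a-time receiver.

    arrivals_s: request arrival times in seconds.
    service_s:  assumed time the device needs to finish one request.
    """
    busy_until = float("-inf")
    dropped = []
    for t in sorted(arrivals_s):
        if t < busy_until:
            dropped.append(t)           # device still busy: packet is lost
        else:
            busy_until = t + service_s  # device accepts and starts working
    return dropped


# Two async requests landing a few ms apart: the second of each pair is lost.
print(dropped_requests([0.000, 0.003, 0.010, 0.012, 0.060]))

# The same number of requests spaced well apart: nothing is lost.
print(dropped_requests([0.00, 0.05, 0.10, 0.15, 0.20]))
```

A client that fires requests asynchronously will occasionally land two inside the same service window, which is consistent with the bursts described, though the real firmware behavior may differ in detail.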
For the BACnet reads, which are the bulk of my traffic, the turnaround time is about 7ms. To be safe, I added code that guarantees a 50ms space between my requests. This got rid of the error bursts. I now see single errors at a rate of 0.1-0.2%, but no bursts, and the retries always work.
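For others hitting the same thing, one way to enforce that kind of spacing in an asyncio client is sketched below. This is not the poster's actual code. `RequestPacer` and its parameters are hypothetical names, and it assumes the gap is measured from the end of one request to the start of the next, with all requests to one device funneled through a single pacer.

```python
import asyncio
import time


class RequestPacer:
    """Serialize requests to one device and keep a minimum gap between them."""

    def __init__(self, min_gap_s=0.050):
        self.min_gap_s = min_gap_s
        self._lock = asyncio.Lock()   # only one request in flight at a time
        self._last_done = None        # monotonic time the last request finished

    async def run(self, send):
        """Await send() once the lock is free and the gap has elapsed."""
        async with self._lock:
            if self._last_done is not None:
                wait = self.min_gap_s - (time.monotonic() - self._last_done)
                if wait > 0:
                    await asyncio.sleep(wait)
            try:
                return await send()
            finally:
                self._last_done = time.monotonic()


async def demo():
    pacer = RequestPacer(min_gap_s=0.050)
    started = []

    async def fake_read():
        # Stand-in for a real BACnet read; sleeps ~7 ms like the device would.
        started.append(time.monotonic())
        await asyncio.sleep(0.007)
        return len(started)

    # Four reads issued "at once" still reach the device one at a time.
    results = await asyncio.gather(*(pacer.run(fake_read) for _ in range(4)))
    return started, results


started, results = asyncio.run(demo())
```

In real use you would wrap each read, e.g. `await pacer.run(lambda: read_sensor(obj))`, where `read_sensor` is whatever coroutine your own stack provides. Keeping one pacer per device means a slow unit does not hold up requests to the others.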
It was suggested that I shift to using block reads, which would be more efficient, but that means changing the design of the async device layer. That’s not worth it in my situation.
Thanks for the update.
The code has been around since the old days, and it’s overdue for a performance update as you suggest. Chelsea has an upgrade to asynchronous networking on her task list.