Rushing in with bug fixes
After the release of integration of Nuimo-Click with Senic Hub, we
came across this bug where if a paired Sonos speaker went
down(unplugged, IP changes), both Nuimo-Control and Nuimo-Click will
become unresponsive. Nuimo-Control shows an
X, on its LED display
matrix when something goes wrong.
Look and feel of Nuimo-Click is very similar to traditional switches, which rarely malfunction. While carrying that impression(expectation that it will always work), when user realizes that click is not working, it irks.
We are using
DBus for managing connection between smart home devices
and the controllers. For communicating with Sonos we use websockets.
In above mentioned situation, when Sonos was not reachable, there was
timeout for such requests and
senic-core DBus API throws
nuimo_app ERROR [MainProcess-ipc_thread][dbus.proxies:410] Introspect error on :1.16:/ComponentsS ervice: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy bl ocked the reply, the reply timeout expired, or the network connection was broken.
Given that there was no timeout, this error/exception is not instant,
it takes a while for the DBus API to raise it. And in the meantime,
hub remains in a state of suspension. Also, events are queued for
processing, so this issue was also clogging up events. As mentioned
earlier, a bad UX, sort of show-stopper, and on top of that we were
on tight schedule to ship. I rushed in with first PR, adding timeout
DBus method calls. All the places where I was calling
methods, I wrapped them around
except block, pass extra
timeout argument and handle the
DBusException. This fix, although
it worked, was HUGE. Changeset was sprawling through lot of files. It
was not comforting to get it reviewed thoroughly and to make sure that
I was not introducing new regression(we are still in process of adding
unittests to the stack, maybe will write a post around it).
Initially when we were working on design of
senic-core, we had
decided to handle exceptions at their point of origin. With respect to
that decision, my first PR(impulsive fix), was in totally wrong
direction. While I waited for reviews and comments, I started looking
into websocket library, which was waiting forever for reply from a
host which was not reachable. As I googled for terms like
timeout, I quickly came across lot of links and pointers
on how to set a
timeout for a websocket connection. This was
promising. I quickly put together something as simple as:
# library creating websocket connection REQUEST_TIMEOUT = 5 ws = websocket.create_connection( url, header=headers, sslopt=sslopt, origin=origin ) ws.settimeout(REQUEST_TIMEOUT) # connection instance will now raise WebSocketTimeoutException
Sonos library interaction already had set of
except blocks, I
WebSocketTimeoutException to those exceptions and handled it
there itself. This fix was small, precise and it was comforting in a
way. I tested out the fix, unplugging sonos speaker and interacted
with Nuimo-Control and within 5 seconds(timeout), I noticed
X on it.
I additionally confirmed that system was able to work with Hue and
interactions weren't clogging up. It was easier to get it reviewed
from colleague, got it verified and merge.
At times, symptoms can be distracting and putting a bandage won't fix the leak. Don't rush, take your time, identify root of the problem, and then fix it.
This situation also made us think about how to improve the way we are
using DBus APIs. We had put together the first working version of API
following a blog series around the same subject and from examples
which were shipped with
dbus-python (Python bindings for D-Bus)
package. There is lot of room for improvement. We tried to understand
better on how to use these APIs, document things and stress test them.
I will write about them too, sometime.
PS: The included pencil sketch work was done by Sudarshan