Rushing in with bug fixes

After the release of integration of Nuimo-Click with Senic Hub, we came across this bug where if a paired Sonos speaker went down(unplugged, IP changes), both Nuimo-Control and Nuimo-Click will become unresponsive. Nuimo-Control shows an X, on its LED display matrix when something goes wrong.

Look and feel of Nuimo-Click is very similar to traditional switches, which rarely malfunction. While carrying that impression(expectation that it will always work), when user realizes that click is not working, it irks.

We are using DBus for managing connection between smart home devices and the controllers. For communicating with Sonos we use websockets. In above mentioned situation, when Sonos was not reachable, there was no timeout for such requests and senic-core DBus API throws following error:

nuimo_app ERROR [MainProcess-ipc_thread][dbus.proxies:410] Introspect error on :1.16:/ComponentsS
ervice: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy bl
ocked the reply, the reply timeout expired, or the network connection was broken.

Given that there was no timeout, this error/exception is not instant, it takes a while for the DBus API to raise it. And in the meantime, hub remains in a state of suspension. Also, events are queued for processing, so this issue was also clogging up events. As mentioned earlier, a bad UX, sort of show-stopper, and on top of that we were on tight schedule to ship. I rushed in with first PR, adding timeout for all DBus method calls. All the places where I was calling DBus methods, I wrapped them around try, except block, pass extra timeout argument and handle the DBusException. This fix, although it worked, was HUGE. Changeset was sprawling through lot of files. It was not comforting to get it reviewed thoroughly and to make sure that I was not introducing new regression(we are still in process of adding more unittests to the stack, maybe will write a post around it).

Initially when we were working on design of senic-core, we had decided to handle exceptions at their point of origin. With respect to that decision, my first PR(impulsive fix), was in totally wrong direction. While I waited for reviews and comments, I started looking into websocket library, which was waiting forever for reply from a host which was not reachable. As I googled for terms like websocket client, timeout, I quickly came across lot of links and pointers on how to set a timeout for a websocket connection. This was promising. I quickly put together something as simple as:

# library creating websocket connection
REQUEST_TIMEOUT = 5
ws = websocket.create_connection(
	    url, header=headers,
	    sslopt=sslopt, origin=origin
	)
ws.settimeout(REQUEST_TIMEOUT)

# connection instance will now raise WebSocketTimeoutException

Sonos library interaction already had set of try, except blocks, I added WebSocketTimeoutException to those exceptions and handled it there itself. This fix was small, precise and it was comforting in a way. I tested out the fix, unplugging sonos speaker and interacted with Nuimo-Control and within 5 seconds(timeout), I noticed X on it. I additionally confirmed that system was able to work with Hue and interactions weren't clogging up. It was easier to get it reviewed from colleague, got it verified and merge.

At times, symptoms can be distracting and putting a bandage won't fix the leak. Don't rush, take your time, identify root of the problem, and then fix it.

This situation also made us think about how to improve the way we are using DBus APIs. We had put together the first working version of API following a blog series around the same subject and from examples which were shipped with dbus-python (Python bindings for D-Bus) package. There is lot of room for improvement. We tried to understand better on how to use these APIs, document things and stress test them. I will write about them too, sometime.

PS: The included pencil sketch work was done by Sudarshan