bugs and lost opportunities

I think bugs in code are good: they give us an opportunity to correct ourselves. As with any skill, mistakes eventually become wisdom, and bugs are those mistakes for coding. But at times, because of deadlines or a bug-in-production situation, we miss those learning moments. We keep those lines commented out, in the hope that as things cool down we will revisit them and have our aha moments.

I had blogged previously about using Zulip as a platform for connecting to customers over chat and for controlling how customers are paired with sales representatives. I was running that service on an EC2 Small instance, which had far fewer resources than the required specs mentioned in the Zulip documentation. And true enough, on 30th of March, 2015, just one day before the last day of yearly filing, the server gave out.

I was frantically combing through code, optimizing it at random places, performing CPR. With most of the fixes, as soon as I brought the services back up, all the zombie browser clients would try to connect and it would crash again. I didn't understand what was causing the problem, and in my attempts to fix it I was possibly introducing new ones without realising. There were no error logs in the application stack and none in the system logs. I was looking at resources (CPU, loadavg, RAM) and the running services (db, rabbitmq, nginx), and I wasn't able to make sense of them. As the server started, loadavg spiked and memory usage fit within RAM, but the frontend kept on showing 500.
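For a later look to be possible, readings like loadavg and RAM have to be written down somewhere while the incident is happening. What follows is a minimal sketch of that idea, an illustration rather than anything that ran on that server. It assumes a Linux host with `/proc/meminfo`; the function names and the log path are hypothetical.

```python
import json
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def snapshot(meminfo_path="/proc/meminfo"):
    """Collect one timestamped reading of load average and memory."""
    record = {"time": time.time()}
    try:
        # 1, 5 and 15 minute load averages (Unix only).
        record["loadavg"] = list(os.getloadavg())
    except (AttributeError, OSError):
        record["loadavg"] = None
    try:
        with open(meminfo_path) as handle:
            mem = parse_meminfo(handle.read())
    except OSError:
        mem = {}
    # MemAvailable is missing on older kernels, so it may be None.
    record["mem_total_kb"] = mem.get("MemTotal")
    record["mem_available_kb"] = mem.get("MemAvailable")
    return record


def append_snapshot(log_path):
    """Append one snapshot as a JSON line and return that line."""
    line = json.dumps(snapshot())
    with open(log_path, "a") as handle:
        handle.write(line + "\n")
    return line
```

Calling `append_snapshot("/tmp/health.jsonl")` once a minute, from cron for example, would leave a small trail of readings that can be lined up against the moment the frontend started returning 500.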

On the 31st, instead of continuing the wild goose chase, we decided to set up a fresh instance of the chat system on an EC2 Medium instance and revisit the old server later. The new system came up and handled the load decently for the rest of the filing season. We created a backup image of it and were able to upgrade to a Large EC2 instance before the peak. But as things became stable, the focus shifted to making sure the existing service was always working. I think the older machine's logs, looked at closely, could have given insight into what went wrong and how. By not looking at them, we missed that opportunity. And as time passes, the inertia against revisiting old mistakes gets bigger and bigger.