In June this year we migrated our entire operational platform across to the same environment as Xero. The primary drivers for this upgrade were:
- Leveraging Xero’s dedicated operations team
- Increased capacity from new server infrastructure. We provisioned two more IIS web servers to support our growing customer base, doubling our customer-facing web server capacity. Each server runs Windows Server 2008 R2 with IIS 7.5 as a virtual machine on ESXi 5.1, with 16GB of RAM and eight logical processors.
- Significantly improved monitoring across the entire platform. The servers are monitored using Nagios, which alerts on a number of important metrics including CPU load, memory load, disk capacity and website availability. Nagios also outputs its data to a tool called Graphite, which presents the system utilisation history and helps us evaluate load trends.
- Capacity for continued growth in customer numbers
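To illustrate the monitoring pipeline described above, here is a minimal sketch of how a utilisation metric can be pushed into Graphite. It assumes Graphite's plaintext protocol (one `path value timestamp` line per metric, sent to the Carbon listener, which defaults to TCP port 2003). The metric path and the host name are hypothetical, and this is not necessarily how our Nagios data is actually forwarded.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted metric lines to a Carbon plaintext listener.

    The host is an assumption supplied by the caller; 2003 is Carbon's
    default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical metric name, for illustration only.
line = format_metric("servers.web01.cpu_load", 0.75, 1377993600)
```

Calling `send_metrics([line], "graphite.example.internal")` would then deliver the sample to a Graphite instance at that (made-up) address, where it becomes one point in the load trend graph.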
As part of this upgrade we migrated our back-end database to SQL Server 2012. This was a significant undertaking in itself and required a large development effort and an even larger QA testing effort.
Despite the huge investment in this upgrade, we encountered a number of teething issues across the new platform and database infrastructure after the migration, and these impacted performance for all users. From June through to the end of August we experienced five short, unexpected and unrelated outages.
Each outage taught us something different. The operations and database teams were able not only to rectify the issues quickly, but also to provide key statistics and audit logs to the development team so that they could tune the application. The performance of the entire platform has improved significantly across this period and WorkflowMax is running faster than ever before.
One issue that we still have to contend with is month end billing. At the start and end of each month we hit peak demand, as the majority of customers carry out their invoicing at this time. The Estimated Billings routine is the main cause of this peak load, as it consumes a huge amount of processing power to calculate the value available to invoice.
To overcome this final hurdle in application performance, we are working on two solutions:
1) We added two additional dedicated servers to the platform this week. This provides significant additional capacity, ensuring that we have the processing power to manage peak loading during month end billing in the short term.
2) We are addressing the root cause of the issue, which is how the Estimated Billings report is calculated. It takes much more processing power than it should.
Over the past three months, our development team has been working on a complete rewrite of our underlying billing engine. The new framework is referred to as the Work in Progress (WIP) Ledger, as it is a running ledger of all time and cost entries against a job and any corresponding invoicing related to that job. The WIP Ledger opens up a huge amount of reporting functionality to us. More importantly, it completely resolves all issues to do with the Estimated Billings report: rather than having to calculate what is available to invoice each time, we can read the value straight from the database.
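The idea behind a running ledger can be sketched in a few lines. This is an illustrative model only, not our actual billing engine: the class names, the field names and the convention that invoiced amounts are posted as negative values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class LedgerEntry:
    job_id: str
    kind: str        # "time", "cost" or "invoice" (hypothetical categories)
    amount: Decimal  # positive for time/cost, negative for invoiced amounts


@dataclass
class WipLedger:
    entries: list = field(default_factory=list)
    balances: dict = field(default_factory=dict)  # job_id -> running uninvoiced value

    def post(self, entry: LedgerEntry) -> None:
        """Record an entry and update the job's running balance at write time."""
        self.entries.append(entry)
        current = self.balances.get(entry.job_id, Decimal("0"))
        self.balances[entry.job_id] = current + entry.amount

    def available_to_invoice(self, job_id: str) -> Decimal:
        """Read the stored balance directly; no recalculation is needed."""
        return self.balances.get(job_id, Decimal("0"))

    def recalculate(self, job_id: str) -> Decimal:
        """The expensive alternative: re-sum every entry for the job on demand."""
        return sum(
            (e.amount for e in self.entries if e.job_id == job_id),
            Decimal("0"),
        )


ledger = WipLedger()
ledger.post(LedgerEntry("J1", "time", Decimal("500.00")))
ledger.post(LedgerEntry("J1", "cost", Decimal("120.50")))
ledger.post(LedgerEntry("J1", "invoice", Decimal("-300.00")))
```

Here `available_to_invoice("J1")` is a single lookup, while `recalculate("J1")` walks every entry. Both return the same figure, but only the second grows more expensive as jobs accumulate history, which is the kind of cost that peaks at month end.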
We know how critical our services are to our customers’ businesses, and we know that any disruption to our service is unacceptable. As always, we sincerely apologise for any disruption these outages have caused. We are confident that the continuing improvements we are making to the platform and product, combined with the additional capacity we continue to add, will keep any future outages to an absolute minimum.
If you have any further questions, or would like to discuss any of this, please feel free to email me directly at firstname.lastname@example.org, Co-Founder.