Disclaimer: As a practice, I do not use this space to directly promote RDM and our solution. This post discusses a real-life example of a topic of technical interest, using an RDM-specific experience. For more heavy-handed self-promotion, feel free to call or email.
Like most software companies, RDM is loud and proud in proclaiming our solution to be scalable. Nothing we have seen in the field has led us to believe otherwise, but we have always pondered what the load capacity of a single NEOCAST Media Server should be pegged at. We have our own selfish reasons for wanting to understand that number. First, we want to establish a clear threshold at which we would add another server to the rack, thereby avoiding any potential traffic jams or degradation in service. Add it too soon, and you have expended capital unnecessarily. Add it too late, and you have disappointed customers. Second, a scalability test framework would allow the team to test the impact of any new feature on server performance without having to take a “put it in production and wait” approach.
Until recently, we had set an arbitrary and comfortably conservative number for our threshold, and held to it. We had also designed capacity test plans, but did not have an efficient way to execute them. Even the design was tricky, because it is nearly impossible to design a generic network customer. Every customer’s size, content strategy, update frequency, content types, reporting requirements, usage of features such as SpotSwap and other important criteria vary as much as snowflakes.
Recently, we were challenged by a potential customer to produce and run a capacity test that could be replicated and verified by a third party if need be. In this case, we were able to use that network’s key parameters as the gold standard for testing capacity. So what remained was the method for creating an army of “zombie” media players and the method for having them “attack” a production server until it cried “uncle”. Since our infrastructure is replicated exactly between our primary and disaster recovery sites, we were able to temporarily recommission the DR site to serve as the target for the zombies. To the engineering team’s credit, they devised a way to create the army of simulated players utilizing a customized OS image deployed to virtual servers in Amazon’s EC2 cloud. What they did was write a simulated NEOCAST Media Player that could spawn any number of zombie players with the exact network characteristics of the potential customer. Tests were run to find out what the maximum number of simulated players each virtual server could handle, so we would know the natural increment of players we would add with each server. The simulated players were then programmed to begin their cycle of call-ins, status reports, log dumps and content updates randomly over a 15-minute period, and to continue running until the test ended. Then the fun began, as more and more virtual servers were deployed from the EC2 cloud to communicate with the server. We had defined a number of “fail” conditions that would be indicators of having reached capacity, and kept adding players to the zombie army until we got to a fail condition.
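For readers who like to see the shape of such a test in code, here is a minimal Python sketch of the two ideas described above: spreading each simulated player's first call-in randomly across a 15-minute window, and adding players one virtual server's worth at a time until a fail condition trips. This is an illustration only, not the engineering team's actual simulator. The function names, the `players_per_server` figure and the `is_failing` check are all hypothetical stand-ins, and a real run would measure the live server rather than call a Python function.

```python
import random

# The 15-minute start window mentioned in the post, in seconds.
WINDOW_SECONDS = 15 * 60


def schedule_start_offsets(num_players, window_seconds=WINDOW_SECONDS, seed=None):
    """Return one random start offset (in seconds) per simulated player.

    Spreading the start times avoids every player calling in at the
    same instant, which would not resemble a real network.
    """
    rng = random.Random(seed)
    return [rng.uniform(0, window_seconds) for _ in range(num_players)]


def ramp_until_failure(players_per_server, is_failing, max_servers=1000):
    """Add one virtual server's worth of players at a time.

    `is_failing` is a hypothetical callback that takes the total player
    count and returns True once any predefined fail condition is met.
    Returns (last_passing_count, first_failing_count); the second value
    is None if no failure was seen within `max_servers` increments.
    """
    last_passing = 0
    for servers in range(1, max_servers + 1):
        total_players = servers * players_per_server
        if is_failing(total_players):
            return last_passing, total_players
        last_passing = total_players
    return last_passing, None


if __name__ == "__main__":
    # Made-up numbers purely for illustration; not results from the test.
    offsets = schedule_start_offsets(5, seed=1)
    print([round(o) for o in offsets])
    print(ramp_until_failure(250, lambda total: total > 1100))
```

The increment size matters: because players are added a whole virtual server at a time, the measured capacity is only known to lie between the last passing count and the first failing count.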
The process was eye-opening in many respects. During initial dry runs of the test plan, we were able to uncover and patch a few bottlenecks that are not obvious unless you run a server to the “max.” We also came up with several potential code optimizations based on analysis of the load test data, but actually ran the official test without implementing them, as we wanted to report “as-is” metrics to the customer. We could have improved the results quite dramatically by quietly introducing the code optimizations, but we carry this millstone called integrity around our necks, and we chose instead to disclose the optimizations as potential upside. Hopefully, that did not go unnoticed, because that is how we roll. And we had some fun internally by having all the employees post their guesses as to the final capacity number, with the winner picking the restaurant for our next team dinner.
As a result of the process, we ended up with an improved capacity test plan, a great method for executing the test, clear direction for throughput optimizations going forward, a nice benchmark number for server capacity, and a baseline against which to measure the impact of new features. The potential customer was able to assess our claim of scalability against competitors based on a test that they had a hand in designing. It was a process well worth the investment of time and money. From the cloud we got some clarity. And it doesn’t look like I will get off cheaply on that team dinner, but at least I won’t have to invite all those zombies.