When you run a SaaS platform, especially a transparent programmatic DSP, your highest motivation is to keep up with your SLA commitments while maintaining the lowest possible operational cost, so that you can run a profitable business and still offer clients the best possible price. One of the biggest challenges you face as a technology owner is server infrastructure cost. Trust me, this aspect of technology ownership comes to your attention quite quickly, as it is one of the most expensive parts of a company's cost base. It is such a delicate matter that leaving even a single loose end means burning a lot of money, shrinking your margins and driving up your cost of operation.
So here I found myself contemplating Kayzen's server infrastructure cost. As Kayzen's co-founder and CTO, I felt this was something we had to address. We decided to review our full deployment process in the larger context. We are a SaaS DSP platform handling 1.2M+ bid requests per second under controlled, strict latency requirements (on the order of 20-25 ms). This generates a massive amount of data and also incurs a huge amount of outbound network traffic. The deployment spans 4k+ CPU cores, 10+ TB of RAM, and around 1 PB of outbound network traffic per month. This creates a multi-faceted, real-world problem and a challenging scenario that every engineer wants to get exposure to and ultimately solve.
The challenges
- The software quality.
- The amount of testing done. You will of course do feature testing; here it is more about performance testing of your application code.
- The POCs you run to test your application in both test and production (or production-like) environments.
- The complexity of the deployment.
- The strength of your engineering and DevOps teams.
- Thinking about the future. Every piece of code written today will need to be revisited once you have managed to run your business successfully.
Engineering Considerations
- Write code that extracts every cycle out of the hardware you throw at it.
- Profile your code and find bottlenecks at every level: the code itself, system calls, network calls, redundant calls, heavy string manipulation, blocking calls, I/O waits, etc. (a profiling sketch follows this list).
- Optimize, optimize, and optimize, until you reach a point where further optimization is possible only with a rewrite.
- Do performance testing: generate load beyond the level your system can handle and find the limits of both application and hardware.
- Check the libraries used in your applications; don't be afraid of fixing bugs or tweaking those libraries if the vendor allows it or if they are open source.
- Keep your code as close to the hardware as possible. Avoid abstractions and wrappers as much as possible; this lets your application see the real processing power of the underlying hardware.
- Have the right set of tools in place to monitor and measure the performance of your applications and servers. One I really recommend is Grafana, a visualization tool that lets you define alerts on your graphed metrics. I found it highly beneficial (a metrics-exposure sketch also follows this list).
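To make the profiling advice concrete, here is a minimal sketch, assuming a Go service purely for illustration, that exposes Go's built-in net/http/pprof endpoints. Whatever your stack, the idea is the same: keep low-overhead profiling always available so you can pull CPU and heap profiles from a system under real load.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Run the profiling listener alongside your application (or on a
	// separate internal port, as here). Then, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30   (CPU)
	//   go tool pprof http://localhost:6060/debug/pprof/heap                 (memory)
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```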
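And since Grafana graphs whatever your metrics backend stores, here is a minimal sketch of exposing a latency histogram via the Prometheus Go client. The metric name, buckets, and the /bid handler are hypothetical stand-ins for your own instrumentation; once Prometheus scrapes /metrics, Grafana can graph the quantiles and alert on them.

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// bidLatency is a hypothetical histogram tracking per-request handling time.
var bidLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "bid_request_duration_seconds",
	Help:    "Time spent handling a bid request.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1ms .. ~2s
})

func main() {
	prometheus.MustRegister(bidLatency)

	http.HandleFunc("/bid", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond) // stand-in for real work
		bidLatency.Observe(time.Since(start).Seconds())
		w.WriteHeader(http.StatusNoContent)
	})

	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrape endpoint
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```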
Hardware Considerations
- Performance testing is the “key”. It is not just about testing the performance of your application code; it is also about finding the vertical limit of the hardware best suited to your workload. Try different combinations to find the best match (see the load-test sketch after this list).
- Every machine you add comes with a constant base cost that tends to stay hidden and can be a killer: space in the cabinet, motherboard, power supply, cooling, networking gear, etc. For every server you add, this hidden cost adds up, so adding a number of commodity-type servers may not always be the best choice (the cost sketch after this list illustrates the effect).
- Don't let your systems sit idle; if they do, you are paying for capacity you are not using.
- Consolidate. This is key when you want a robust deployment with backup servers/nodes for almost every critical service/application. Consider deploying multiple services together instead of spreading them across a number of small servers. In some scenarios you will need a clear separation, but that should be a conscious choice.
- Standardize your server types, or put them into buckets depending on your payload. Some example buckets: CPU-frequency intensive, CPU-count intensive, storage heavy, and mixed load; storage can be split into further buckets such as HDDs and SSDs.
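As a sketch of the kind of performance test meant above, here is a hypothetical closed-loop load generator (in Go, for illustration; a purpose-built tool such as wrk or vegeta is usually the better choice) that ramps concurrency against a target endpoint and reports throughput and p99 latency. The vertical limit shows up where QPS stops growing while p99 keeps climbing.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

// run fires GET requests from `workers` goroutines for `duration`,
// then returns achieved throughput and the 99th-percentile latency.
func run(target string, workers int, duration time.Duration) (qps float64, p99 time.Duration) {
	var mu sync.Mutex
	var latencies []time.Duration
	deadline := time.Now().Add(duration)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				start := time.Now()
				resp, err := http.Get(target)
				if err != nil {
					continue
				}
				resp.Body.Close()
				mu.Lock()
				latencies = append(latencies, time.Since(start))
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	qps = float64(len(latencies)) / duration.Seconds()
	if n := len(latencies); n > 0 {
		p99 = latencies[n*99/100]
	}
	return qps, p99
}

func main() {
	// Hypothetical target; point this at a staging replica, never production.
	for _, w := range []int{8, 16, 32, 64, 128} {
		qps, p99 := run("http://localhost:8080/bid", w, 10*time.Second)
		fmt.Printf("workers=%d qps=%.0f p99=%v\n", w, qps, p99)
	}
}
```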
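To illustrate the hidden base-cost point, a toy calculation with deliberately made-up numbers: the same total core count packaged into fewer, larger chassis avoids paying the per-server overhead (rack space, PSU, cooling, NICs) many times over.

```go
package main

import "fmt"

// Hypothetical monthly figures, purely for illustration.
const (
	baseCostPerServer = 150.0 // fixed chassis overhead per server
	costPerCore       = 10.0  // incremental cost per CPU core
)

func monthlyCost(servers, coresPerServer int) float64 {
	return float64(servers)*baseCostPerServer +
		float64(servers*coresPerServer)*costPerCore
}

func main() {
	// Same 512 cores, packaged two ways.
	fmt.Printf("32 x 16-core: $%.0f/mo\n", monthlyCost(32, 16)) // $9920
	fmt.Printf(" 8 x 64-core: $%.0f/mo\n", monthlyCost(8, 64))  // $6320
}
```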
Server Configuration Considerations
- Closely observe your application and pay close attention to its needs. It might need more CPUs, a higher CPU frequency, or both.
- Don't just split one server into two because your RAM requirements have grown. If adding RAM can save a server, go for it.
- SSDs can also help you save a lot of servers. Typically, one SSD gives you roughly double the write performance of an HDD, and your CPU cycles are better utilized because faster I/O means fewer blocking calls.
- Going further with SSDs, compare the performance of NVMe vs. SATA; this requires running your application in A/B test mode for some time. Typically, NVMe disks give you faster ingestion, so less overall compute is needed to do the same job compared to SATA (a disk micro-benchmark sketch follows this list).
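Before committing to a full A/B test in production, a quick micro-benchmark can sanity-check the disks themselves. This is a rough sketch with hypothetical mount points; a dedicated tool like fio will give you far more rigorous numbers.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// writeBench does sequential 1 MiB writes with periodic fsync, roughly
// mimicking an ingestion pipeline, and returns the achieved MB/s.
func writeBench(dir string, totalMB int) (float64, error) {
	path := filepath.Join(dir, "bench.tmp")
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer os.Remove(path)
	defer f.Close()

	buf := make([]byte, 1<<20) // 1 MiB block
	start := time.Now()
	for i := 0; i < totalMB; i++ {
		if _, err := f.Write(buf); err != nil {
			return 0, err
		}
		if i%64 == 0 { // flush periodically so the page cache doesn't hide the disk
			if err := f.Sync(); err != nil {
				return 0, err
			}
		}
	}
	f.Sync()
	return float64(totalMB) / time.Since(start).Seconds(), nil
}

func main() {
	// Hypothetical mounts: one SATA-backed, one NVMe-backed.
	for _, dir := range []string{"/mnt/sata", "/mnt/nvme"} {
		mbps, err := writeBench(dir, 1024)
		if err != nil {
			fmt.Println(dir, "error:", err)
			continue
		}
		fmt.Printf("%s: %.0f MB/s\n", dir, mbps)
	}
}
```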
Takeaways
These are my takeaways from the massive deployment process we went through. In terms of timing, you should take on this kind of deployment work once your applications are built and your business has become stable; that is the right time to remove bottlenecks from the system. This is the first post in a series I intend to write. In the next post, I will share the strategies, the deployment planning, and the road to massive cost savings.
Some food for thought
- Are you relying on the free tools provided by various cloud providers? If your answer is yes, think twice. They give you a quick initial scratchpad for faster prototyping, but in the long term they lead to vendor lock-in and higher costs of operation.
- Are you looking for the flexibility to shrink your server footprint every now and then? The basic question to answer is how seasonal your workload is. Do you really shrink it, or do you autoscale every night, every week, or every month?
- Are you open to testing your applications on both virtual and physical servers? Will your application run out of the box on physical servers, or will you need to make changes to support either of them? These are design and deployment strategies that should be thought through during the initial phases. They might seem ignorable in the early stages, but they become a major challenge later on when you have to choose between virtual and physical.