One of the key components of TeamCity’s ecosystem is the bundled Amazon Cloud Agents plugin, which allows our customers to leverage cloud agents to scale their build farms on demand. Given its widespread use and importance, ensuring its optimal performance is essential.
As our user base and workload have grown, we’ve noticed some initial performance oversights becoming more pronounced, prompting a closer look at the plugin’s performance.
Performance issues with the Amazon Cloud Agents plugin
The main culprit behind the plugin’s performance issues was thread management. Each Cloud Profile would create its own thread pool to manage instance operations, plus additional service threads for internal purposes. While the number of Cloud Profiles stayed low, the performance hit was hardly noticeable.
However, the impact worsened as users kept adding more and more profiles. Given that TeamCity itself is a complex system continuously operating hundreds of threads, adding a considerable number of extra threads is not something we should take lightly.
Implementing parameterizable shared thread pools for recurring and one-off tasks solves this problem. It allows asynchronous operations, such as instance provision requests that don’t wait for an instance to start, to execute promptly without needlessly straining the system; a rough sketch of the idea follows below.
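To make the idea concrete, here’s a minimal Java sketch of what such shared pools could look like. All class and method names, as well as the sizing parameters, are hypothetical illustrations rather than the plugin’s actual code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical names, not the plugin's real classes.
public final class SharedCloudExecutors {

    // One scheduled pool, shared by all cloud profiles, for recurring
    // work such as polling instance states.
    private final ScheduledExecutorService recurring;

    // One shared pool for one-off tasks such as fire-and-forget
    // provision requests.
    private final ExecutorService oneOff;

    // Pool sizes are constructor parameters, so they can be tuned
    // per installation instead of growing with the profile count.
    public SharedCloudExecutors(int recurringThreads, int oneOffThreads) {
        this.recurring = Executors.newScheduledThreadPool(recurringThreads);
        this.oneOff = Executors.newFixedThreadPool(oneOffThreads);
    }

    public void schedule(Runnable task, long periodSeconds) {
        recurring.scheduleWithFixedDelay(task, 0, periodSeconds, TimeUnit.SECONDS);
    }

    public void submit(Runnable task) {
        // Returns immediately; the caller never blocks waiting for
        // an instance to start.
        oneOff.submit(task);
    }

    public void shutdown() {
        recurring.shutdown();
        oneOff.shutdown();
    }
}
```

The key point is that every Cloud Profile submits work to the same two pools, so the total thread count is bounded by configuration rather than increasing with every profile a user adds.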
But what happens when the number of threads grows beyond what’s optimal for a given system?
The short answer: it causes gradual performance degradation. Eventually, even a highly parallel system will suffer from excessive threads. Common problems include, but are not limited to, context switching and synchronization overhead (e.g. locks). Here, we’ll focus on context switching.
What exactly is context switching?
Context switching is a complex topic with many technical details, but for the purposes of this post, we’ll keep it brief. A context switch is a fundamental OS operation that saves the state of the running thread and restores the state of another. This includes saving and loading CPU registers, stack pointers, and other information crucial to resuming a thread’s execution from an arbitrary point.
What is the impact of this? Each thread is allocated a slice of CPU time known as a “quantum”. The overhead of switching between threads eats into the effective CPU time available for actual work, and the more runnable threads compete for the same cores, the more often those switches occur. Depending on the system and workload, this overhead can also include processor cache misses and memory contention.
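To get a rough feel for this overhead, here’s a small Java experiment that runs the same CPU-bound workload on a pool sized to the machine’s core count and then on a heavily oversubscribed pool. It’s an illustration only, not a rigorous benchmark (serious measurements would use a harness like JMH and account for JIT warm-up), and the numbers will vary by machine:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class OversubscriptionDemo {

    // Arbitrary CPU-bound work; the exact computation doesn't matter.
    static long burn() {
        long acc = 0;
        for (int i = 0; i < 5_000_000; i++) acc += i * 31L;
        return acc;
    }

    // Runs `tasks` copies of the workload on a pool of `poolSize`
    // threads and returns the wall-clock time in milliseconds.
    static long runWithPool(int poolSize, int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long start = System.nanoTime();
        List<Future<Long>> results = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            results.add(pool.submit(OversubscriptionDemo::burn));
        }
        for (Future<Long> f : results) {
            f.get(); // wait for every task to finish
        }
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        // Same total work, different degrees of oversubscription.
        System.out.println(cores + " threads: " + runWithPool(cores, 400) + " ms");
        System.out.println((cores * 50) + " threads: " + runWithPool(cores * 50, 400) + " ms");
    }
}
```

The total work is identical in both runs, so any extra wall-clock time in the second run comes from scheduling and context-switch overhead, plus the cost of creating the threads themselves.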
As with any performance problem: measure, don’t guess.
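On the JVM, a simple place to start is ThreadMXBean from the standard java.lang.management package, which exposes the same live and peak thread counts that monitoring tools chart over time:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadStats {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Thread statistics for the current JVM.
        System.out.println("Live threads:   " + threads.getThreadCount());
        System.out.println("Peak threads:   " + threads.getPeakThreadCount());
        System.out.println("Daemon threads: " + threads.getDaemonThreadCount());
        System.out.println("Started so far: " + threads.getTotalStartedThreadCount());
    }
}
```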
The solution: applying the patch
Below are some performance charts covering the period between July 1 and August 25. The patch addressing the issues described above was applied roughly midway through this period, on August 9.
The first graph shows the thread count.
Although the reduction might look dramatic at first, keep in mind that the starting point is around 575 threads, so the overall reduction is about 25%, or roughly 145 threads.
The next graph shows the queue size of builds waiting to run.
Before the patch was applied, the queue hit 16,000–20,000 builds on multiple occasions, with frequent spikes above 10,000 queued builds. After August 9, the queue is clearly more stable, with much more moderate spikes.
The final graph shows the number of starting cloud agents.
Lower average values mean we process agent provision requests more efficiently and, as a result, cloud agents start noticeably faster. That’s exactly what we observed after August 9: both the frequency of spikes and their maximum values dropped.
What does this mean for our users?
Performance metrics are a great tool for measuring the results of your efforts, but one could argue those efforts are more or less futile if end users gain no real advantages. With cloud agents, that’s definitely not the case: users directly benefit from faster build processing, shorter build queues, and less time required for an agent to spin up.
And as a cherry on top, eliminating so many threads should improve the overall performance and responsiveness of the entire system, ultimately making it more stable and efficient.