TeaBreak (http://teabreak.pk) is a Pakistani blog aggregator. It launched in April 2008 with the aim of organizing the expanding Pakistani blogosphere. The engine supports automated tagging and categorization of posts into relevant topics such as politics, sports, business, and technology.
Initially the project was completely based on WordPress and used a feed aggregation plugin called WP-O-Matic.
Within a matter of months, however, TeaBreak grew to over 500 local blogs, and the aggregation overhead combined with the growing traffic was enough to exhaust the VPS resources. The site was running on a decent VPS, but even after a lot of fiddling with MySQL and Apache tweaks and optimizations, the WordPress + WP-O-Matic setup could not keep up.
The new distributed architecture
At that point I decided to split the monolithic system into stand-alone distributed systems. This way each system can be scheduled to run at a different time and possibly spread across multiple hosts.
So, the system was divided into:
1. WordPress Front-end
I didn’t want to re-invent the wheel, so I kept the front-end on WordPress because of its excellent support for managing posts and its editorial-level features (editing and tagging posts, etc.). The plugin base was another strong reason to retain WordPress, since it means a new feature (like polling) can be launched almost instantly.
This system is live at http://teabreak.pk.
2. CDN for static content
I achieved this with a modified version of the CDN Rewrite plugin. The CDN is live at: http://cdn.teabrk.com.
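To give a feel for what a CDN rewrite step does, here is a minimal sketch in Python. It is not the plugin's code (the plugin is a WordPress/PHP plugin and its internals are not shown here); the list of static file extensions and the function name `rewrite_static_urls` are assumptions made for illustration. Only the two hostnames come from this post.

```python
import re

# The two hostnames come from the post; everything else is an assumption.
SITE_HOST = "teabreak.pk"
CDN_HOST = "cdn.teabrk.com"
STATIC_EXTENSIONS = ("css", "js", "png", "jpg", "jpeg", "gif", "ico")  # assumed list

# Match absolute URLs on the site host whose path ends in a static extension.
# The lookahead requires a quote, whitespace, '?', '>' or end of input after the
# extension, so "/post.css-tricks/" style paths are left alone.
_STATIC_URL = re.compile(
    r"http://" + re.escape(SITE_HOST)
    + r"(/[^\s\"'<>]+\.(?:" + "|".join(STATIC_EXTENSIONS) + r"))"
    + r"(?=[\"'\s?>]|$)"
)


def rewrite_static_urls(html: str) -> str:
    """Point static asset URLs at the CDN host; leave page URLs untouched."""
    return _STATIC_URL.sub(lambda m: "http://" + CDN_HOST + m.group(1), html)
```

With this sketch, a stylesheet link on `teabreak.pk` would be served from `cdn.teabrk.com`, while links to posts keep pointing at the WordPress front-end.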
3. Management System
This is a non-WordPress system that handles sign-ups and registrations, as well as admin and editorial review of new blogs. The basic aim was to have a system that can be modified as our demands and requirements take shape when working with external partners.
Because this is a totally separate, stand-alone sub-system, we can modify it without being constrained by WordPress. This system is the source of truth about which blogs are registered, active, approved, etc.
This area is live at: http://site.teabreak.pk.
4. Aggregation Engine
This is the part of the system that does most of the heavy lifting. The engine is written in Java and runs on Google App Engine infrastructure. Among its various functions, the engine primarily parses XML / RSS feeds, processes posts, tags and classifies them into relevant topics, and puts them in the publishing queue.
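The parse, classify, and queue flow described above can be sketched roughly as follows. This is an illustration in Python rather than the engine's actual Java code, and it assumes a very simple keyword-matching classifier; the real tagging logic is not described in this post. The keyword lists, the `uncategorized` fallback, and the function names are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import deque

# Topic names come from the post; the keyword sets are made up for illustration.
TOPIC_KEYWORDS = {
    "politics": {"election", "parliament", "minister"},
    "sports": {"cricket", "hockey", "match"},
    "business": {"market", "economy", "startup"},
    "technology": {"software", "android", "internet"},
}


def parse_rss(xml_text):
    """Extract title, link and description from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        }
        for item in root.iter("item")
    ]


def classify(post):
    """Return the sorted topics whose keywords appear in the title or description."""
    text = (post["title"] + " " + post["description"]).lower()
    words = set(re.findall(r"[a-z]+", text))
    return sorted(t for t, keywords in TOPIC_KEYWORDS.items() if words & keywords)


def aggregate(xml_text, queue):
    """Parse a feed, attach topics to each post, and append it to the publishing queue."""
    for post in parse_rss(xml_text):
        post["topics"] = classify(post) or ["uncategorized"]  # assumed fallback
        queue.append(post)
    return queue
```

A caller would create one `deque()` as the publishing queue and call `aggregate(feed_xml, queue)` once per registered blog feed.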
5. Publishing Engine
This sub-system (part of the Aggregation Engine) picks up posts waiting in the publishing queue and posts them to TeaBreak’s WordPress front-end via the XML-RPC API.
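As a rough illustration of this step, the sketch below drains a queue and publishes each post through WordPress's `metaWeblog.newPost` XML-RPC method. It is written in Python for brevity, not taken from the Java engine, and it is an assumption that the real engine uses this particular method. The shape of the queued post (`title`, `description`, `topics`), the function name, and the credentials are hypothetical.

```python
import xmlrpc.client
from collections import deque


def publish_queue(queue, server, username, password, blog_id=1):
    """Publish every queued post via metaWeblog.newPost and return the new post IDs.

    `server` is anything exposing `metaWeblog.newPost`, typically an
    xmlrpc.client.ServerProxy pointed at the WordPress XML-RPC endpoint.
    """
    post_ids = []
    while queue:
        post = queue.popleft()  # oldest post first
        content = {
            "title": post["title"],
            "description": post["description"],  # post body in the metaWeblog struct
            "categories": post.get("topics", []),  # assumed mapping of topics to categories
        }
        # Final argument True asks WordPress to publish rather than save a draft.
        post_ids.append(
            server.metaWeblog.newPost(blog_id, username, password, content, True)
        )
    return post_ids


# Example wiring (not executed here; endpoint path and credentials are assumptions):
# server = xmlrpc.client.ServerProxy("http://teabreak.pk/xmlrpc.php")
# publish_queue(queue, server, "publisher", "secret")
```

Passing the server object in, rather than creating it inside the function, keeps the publishing loop easy to exercise against a stand-in object without touching the live site.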
Conclusion & Aftermath
After a soft launch in December 2010, this system went live in January 2011 when we deployed Version 3.0. The VPS now sits mostly idle, as it only serves WordPress requests. CPU load averages a meager 20%, compared to 78% before.
Offloading the aggregation and processing parts of our system to Google App Engine proved to be a good investment. Since the transition to the distributed system, the website has been running quite smoothly, with almost no downtime or unresponsive moments.