We are moving to an RTB platform for serving ads. More details will come in the near future.
I found the following posts to be a really good RTB 101:
Part 1: http://www.mikeonads.com/2009/08/30/rtb-part-i-what-is-it/
Part 2: http://www.mikeonads.com/2009/09/19/rtb-part-ii-supply-supply-supply/
Part 3: http://www.mikeonads.com/2010/02/22/rtb-part-iii-cookies-user-data/
Thursday, May 27, 2010
Thursday, May 6, 2010
Building a Scalable Ad Network
In this post, I will detail our almost one-year journey building a scalable Ad Network that can handle 10 million, 100 million, and, in the near future, 1 billion ads a day.
Phase 1: Building a CPA Network
Almost a year ago, we were mostly focused on a tool, the Publisher Gallery, which publishers could use to find relevant creatives/campaigns and drag and drop them into their playlist (usually matching a placement on their web site). At that time, as a business, we were mostly focused on being a performance (CPA) Ad Network. OpenX (Community Download) fit the bill perfectly: it was open source, had a rich set of APIs, and, most importantly, we found a way (XML-RPC) to talk to its API. We had to modify a few things, but it worked as advertised. We built a simple AdOps tool to upload IOs/creatives into the OpenX system, while the feature-rich Publisher Gallery allowed a specific creative to be added to a specific placement (the Gallery indirectly called the OpenX API).
Once we had a solution for serving ads (leveraging OpenX), we started building the plumbing to make the network scalable. The first step was to use Amazon Web Services (AWS) to spin up as many OpenX web servers as necessary. The next step was to create replicated MySQL servers so there was no single point of failure. This tested well for up to 10 million ads a day.
One hard problem we had to solve at this stage was creating and managing AWS EC2 instances. After finding the AWS Management Console too primitive, we quickly moved to ElasticFox as a better way to manage instances. (Long term, we had to build our own version, removing some functionality to avoid mistakes, e.g. deleting an instance.)
However, updating EC2 instances with the latest software packages was a huge chore and used to take us days. So, after a bit of research, we opted to use Chef to manage installation and configuration of many of our EC2 instances. Our thanks to the Chef team for helping us in the beginning. We still use this tool and can instantiate a couple of dozen EC2 instances in less than 30 minutes. But using the tool to manage ever-evolving software releases requires a dedicated engineer, so it is an expensive way to manage AWS EC2 instances.
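The win with Chef is idempotent convergence: each run compares the instance's current state to the desired state and applies only the missing steps, so re-running against an already-configured server is a no-op. A toy Python sketch of that model (this is not actual Chef code; the package names and the `installed` set are purely illustrative):

```python
# Toy model of idempotent configuration convergence, the idea behind Chef.
# A run declares the desired state and applies only what is missing, so
# running it twice against the same instance changes nothing the second time.

def converge(desired_packages, installed):
    """Install whatever is missing; return the list of actions taken."""
    actions = []
    for pkg in desired_packages:
        if pkg not in installed:
            installed.add(pkg)          # stand-in for a real install step
            actions.append("install " + pkg)
    return actions

state = {"mysql-client"}                # packages already on the instance
first_run = converge(["apache2", "php5", "mysql-client"], state)
second_run = converge(["apache2", "php5", "mysql-client"], state)
```

The first run installs the two missing packages; the second run takes no actions at all, which is exactly the property that makes re-running configuration safe across a fleet.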
The next challenge was to gather all impression/revenue data from OpenX, make it available in the Publisher Gallery, and surface it to AdOps/FinOps. For this purpose, we built a map/reduce process to periodically pull data from OpenX and load summary tables into our databases. (While the core code is not complicated, we found it operationally challenging due to occasional network failures, the need to rerun the process to restore data, and failures in the databases themselves.) The summary data was then made available in the Publisher Gallery. For Ad Operations, we exposed the data, and many custom reports, via Jasper.
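The shape of such a summary job can be sketched as a map step that keys each raw impression row by (publisher, day), and a reduce step that sums impressions and revenue per key. The field names below are illustrative, not OpenX's actual schema:

```python
from collections import defaultdict

# Map: key each raw impression row by (publisher_id, day).
# Reduce: sum impressions and revenue per key, ready to load
# into a summary table.
def summarize(rows):
    totals = defaultdict(lambda: {"impressions": 0, "revenue": 0.0})
    for row in rows:
        key = (row["publisher_id"], row["day"])
        totals[key]["impressions"] += 1
        totals[key]["revenue"] += row["revenue"]
    return dict(totals)

raw = [
    {"publisher_id": 7, "day": "2010-05-01", "revenue": 0.002},
    {"publisher_id": 7, "day": "2010-05-01", "revenue": 0.003},
    {"publisher_id": 9, "day": "2010-05-01", "revenue": 0.001},
]
summary = summarize(raw)
```

One operational detail worth noting: having the load step replace a (publisher, day) summary row rather than increment it makes the rerun-after-failure case idempotent, which matters given how often we had to rerun.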
The final piece of the puzzle was adding more fault tolerance. We put a couple of HAProxy load balancers in front of the OpenX web servers, along with a heartbeat monitor to detect a non-functioning load balancer and promote the working one to active. Next, we added a second MySQL instance as a replica of the OpenX database and set up continuous replication (yes, we were still on OpenX 2.6, which had a single master database). We experimented with several solutions (MySQL Proxy, HAProxy, AWS Elastic IPs), but none provided seamless, automated failover. So we decided to stay with manual failover to the replica database should the primary fail (see later regarding OpenX Distributed Statistics).
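The heartbeat logic is simple at its core: probe the active load balancer, and only after several consecutive failed checks promote the standby. A minimal sketch, with the health probe injected as a callable so the logic stays testable (in production it would be a TCP or HTTP check against the LB's address):

```python
# Minimal heartbeat-failover sketch: poll the active load balancer and
# promote the standby after repeated failed checks. Requiring several
# consecutive failures avoids flapping on a single dropped probe.

def monitor(lbs, active, health_check, max_failures=3):
    """Return the index of the LB that should be active after one pass."""
    failures = 0
    while failures < max_failures:
        if health_check(lbs[active]):
            return active                # active LB answered; keep it
        failures += 1
    # active LB failed max_failures consecutive checks: fail over
    return (active + 1) % len(lbs)

# lb-a healthy: stays active. lb-a dead: lb-b is promoted.
stays = monitor(["lb-a", "lb-b"], 0, lambda lb: True)
promoted = monitor(["lb-a", "lb-b"], 0, lambda lb: lb == "lb-b")
```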
Phase 2: Change over to CPM Network
Right about when we were about to release the feature-rich Gallery (v2) and had the CPA Ad Network functional, our core business shifted from CPA-focused to CPM-focused. While the entire end-to-end system functionally worked with minimal changes, all hell broke loose once the load started arriving!
Our first weakness showed in how we had tested OpenX. Real-world (ad-request) load patterns were different from what we assumed during testing, so we quickly added more OpenX web servers to handle the incoming load.
Our next set of challenges was caused by our lack of experience with MySQL administration and maintenance. First, we ran out of disk space, which we solved by putting MySQL tables (and later the binlog) on 100 GB (and later terabyte) Elastic Block Store (EBS) volumes. We also had to create an archive-and-delete process to reduce the size of the raw impression table so that read and write speeds stayed reasonable. Finally, we enabled a full backup and restore system so that, despite failures, we can now always restore data.
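The archive-and-delete step boils down to moving rows older than a cutoff into an archive table and deleting them from the hot table inside a single transaction. A sketch using Python's sqlite3 in place of MySQL (the table and column names are illustrative, not the real schema):

```python
import sqlite3

# Sketch of the archive-and-delete job, with sqlite3 standing in for MySQL.
# Rows older than the cutoff move to an archive table so the hot
# impression table stays small enough for fast reads and writes.
def archive_old_impressions(conn, cutoff_day):
    with conn:  # one transaction: either both statements apply or neither
        conn.execute("""CREATE TABLE IF NOT EXISTS impressions_archive
                        AS SELECT * FROM impressions WHERE 0""")
        conn.execute("""INSERT INTO impressions_archive
                        SELECT * FROM impressions WHERE day < ?""",
                     (cutoff_day,))
        conn.execute("DELETE FROM impressions WHERE day < ?", (cutoff_day,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE impressions (day TEXT, publisher_id INT)")
conn.executemany("INSERT INTO impressions VALUES (?, ?)",
                 [("2010-04-01", 7), ("2010-05-20", 7)])
archive_old_impressions(conn, "2010-05-01")
```

Running the copy and the delete in one transaction is what keeps a mid-job failure from either losing rows or double-archiving them on the rerun.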
We also learned that for rapid transactional writes of impression data, the default MyISAM storage engine was inadequate (table-level locking). So we migrated to the InnoDB storage engine (row-level locking) and then hit row-level contention at scale. This prompted us to use Percona's XtraDB to allow rapid, concurrent updates to the same row. Our next bottleneck was OpenX's single-master database deployment, so we moved to OpenX Distributed Statistics.
Over time, we built a very detailed monitoring and alerting system using Nagios and Munin, which alerts us within seconds when failures do happen. We also got better at simply assuming that failures will happen, so the final system designs were more fault-tolerant and redundant. For example, we set up a duplicate of the full ad rendering system in a separate location (using a second AWS Availability Zone).
Phase 3: CPM Pacing Problem
Once a substantial number of campaigns with varying parameters were loaded into the OpenX Ad Server, we started noticing that certain campaigns were sometimes over- or under-paced. Worse, blank ads were sometimes served. The latter caused embarrassment all around.
So we upgraded to OpenX 2.8.3 hoping it would improve pacing, but it did not. When we talked to the OpenX team, they recommended we abandon the OpenX Community Download version and move to OpenX Enterprise (which they host). We tested pacing on the Enterprise version and, except for a startup hiccup in the first hour or two, it did remarkably better than the open-source version. However, we could not rapidly move a running network to OpenX Enterprise because a) we found compatibility problems between our OpenX version and the hosted version (over the past year, when we had found bugs or limitations in the OpenX API, we had made direct calls to the OpenX database), and b) getting data out of hosted OpenX was non-trivial. By the way, we think very highly of the OpenX team and wish them all the success!
So we went ahead with our own pacing solution (Project Olympus), which we grafted into OpenX 2.8.3 (Community version) in the banner-selection code path. This way, when a creative request came in, we looked up in memory-based tables which campaign/creative made the best match and returned it. The rest of OpenX was used as-is (e.g. logging the request). Using this system, we were able to deliver campaigns at an even pace while meeting the daily and overall targets set by the AdOps team. Yes, we also handled frequency capping and geographic restrictions. Without spilling the beans, this was a best-of-breed design, and it worked at scale!
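Without revealing the specifics above, the general idea behind even pacing is to pro-rate a campaign's daily goal across the day and only serve it while delivery is at or behind that schedule. A generic sketch of that idea (this is not Project Olympus itself; the selection rule and field names are illustrative):

```python
# Generic even-pacing check, not the actual Project Olympus logic:
# a campaign is eligible only while its delivered impressions are behind
# the pro-rated share of its daily goal for the time elapsed so far.
def eligible(delivered, daily_goal, hours_elapsed, hours_in_day=24):
    target_so_far = daily_goal * hours_elapsed / hours_in_day
    return delivered < target_so_far

def pick_campaign(campaigns, hours_elapsed):
    """Return the eligible campaign furthest behind its pace, or None."""
    behind = [c for c in campaigns
              if eligible(c["delivered"], c["goal"], hours_elapsed)]
    if not behind:
        return None
    return min(behind, key=lambda c: c["delivered"] / c["goal"])

campaigns = [
    {"name": "A", "delivered": 600, "goal": 1000},  # ahead of pace at noon
    {"name": "B", "delivered": 100, "goal": 1000},  # behind pace at noon
]
choice = pick_campaign(campaigns, hours_elapsed=12)
```

At noon the pro-rated target is 500 of 1000, so campaign A (600 delivered) is held back and campaign B (100 delivered) is served, which is what pulls every campaign toward an even delivery curve.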
Phase 4: Business Requirements Change Again
Due to market pressure, we had to rapidly become a really large marketplace for publishers and advertisers. To do that, we needed the latest and best Ad Server features (e.g. professional-grade AdOps management and reports, real-time bidding, integration with DSPs) and marketplace features (e.g. finding advertisers and publishers outside our network).
In this phase, we are moving to a fully functioning Ad Server/Marketplace that has all these features and more, so we can focus on bringing transparency and control to our publishers. More on this in later posts.
