The evolving Guardian.co.uk architecture


The challenges faced and how they were overcome.

Digital History

  • 1995 – web site launched with a simple portal – experimental project
  • 2006 – Europe's largest online newspaper site
  • 2007 – aim to be the world's leading liberal voice

But by 2007 the site had outgrown its existing Vignette architecture, so the team began an 18-month redesign:

  • agile build
  • 4 development teams
  • more than 2 million pages to migrate
  • lots of new functionality.
  • built on a Java stack (Spring, Hibernate, EHCache and Velocity)
  • Endeca as a search engine

Issues to deal with

  • Develop in parallel with existing site
  • Zero downtime – migrating section by section – an Apache module was used to select which back-end served each request
  • Evolve the architecture as the system develops – start simple with Travel (14k articles), which fits in RAM, so there is less need to worry about the DB; the DB can even be turned off by stopping cache timeouts, which allows upgrades.
  • Co-located hosting across Manchester and London
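
To make the section-by-section idea concrete, here is a minimal sketch of the routing decision in Java. The talk only said an Apache module did this job; the class name, the section names and the two back-end labels below are assumptions made for illustration, not the Guardian's actual configuration.

```java
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical router: decides which back-end serves a request path. */
class SectionRouter {
    enum Backend { LEGACY, NEW_PLATFORM }

    // Sections already migrated; the names used are illustrative only.
    private final Set<String> migrated = ConcurrentHashMap.newKeySet();

    /** Mark a section (e.g. "travel") as served by the new platform. */
    void migrate(String section) {
        migrated.add(section.toLowerCase(Locale.ROOT));
    }

    /** Pick a back-end from the first path segment, defaulting to the legacy site. */
    Backend route(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        int slash = trimmed.indexOf('/');
        String section = (slash >= 0 ? trimmed.substring(0, slash) : trimmed)
                .toLowerCase(Locale.ROOT);
        return migrated.contains(section) ? Backend.NEW_PLATFORM : Backend.LEGACY;
    }
}
```

Because unknown sections fall through to the legacy back-end, each migration is a one-line switch and can be reversed just as easily.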

Once Travel had been completed, load testing was performed and the team selected other sites that were similar to Travel, allowing re-use of code and mitigating risk.  To further mitigate potential DB issues, they built in graceful degradation back to flat files.
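
The flat-file fallback might look roughly like the sketch below. This is an assumption about the general shape rather than the Guardian's code: the `FallbackArticleLoader` name, the `Function` standing in for the database layer and the `<id>.html` file layout are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

/** Hypothetical loader that degrades from a database lookup to flat files. */
class FallbackArticleLoader {
    private final Function<String, String> database; // stand-in for the real DB access layer
    private final Path flatFileDir;                  // assumed layout: one file per article, <id>.html

    FallbackArticleLoader(Function<String, String> database, Path flatFileDir) {
        this.database = database;
        this.flatFileDir = flatFileDir;
    }

    /** Try the database first; if it throws, serve the flat-file copy instead. */
    String load(String articleId) {
        try {
            return database.apply(articleId);
        } catch (RuntimeException dbFailure) {
            try {
                return Files.readString(flatFileDir.resolve(articleId + ".html"));
            } catch (IOException e) {
                throw new UncheckedIOException("No flat-file copy for " + articleId, e);
            }
        }
    }
}
```

Readers still get a page when the database is unavailable, at the cost of possibly serving a slightly stale copy.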

More performance testing was carried out before moving on to the more complex sites with 200k articles.  Later, after underestimates caused by database problems, the decision was made to move to a simple REST system for integration with third parties; this gives them a level of control and leaves the Guardian only having to know the model for the data, rather than hold the data.

Finally the news site was migrated; it has 1 million+ content pages.  This led to issues with related content and tags, as the database could not handle the load.  The solution used the search engine and the database together to determine which tags were the most used (these point to the most content).  The search engine looks at the database to find the tags (the database holds each tag and the number of items that reference it) and returns the content IDs, reducing the load on the database.  This was still not enough, but an Oracle consultant optimised the queries and performance increased (classic).
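
As a rough illustration of the "most used tags" step only, the sketch below ranks tags by their stored reference count. The talk did not describe the real query, so `TagRanker`, its method and the in-memory map standing in for the tag table are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: ranks tags by how many content items reference them. */
class TagRanker {
    /** Return up to `limit` tags, most-referenced first; ties are broken alphabetically. */
    static List<String> mostUsed(Map<String, Integer> referenceCounts, int limit) {
        return referenceCounts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Integer.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

With per-tag counts stored up front, picking the heaviest tags is a cheap sort over the tag table, and the expensive content lookups can be handed to the search engine.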

What does the future hold?

  • EHCache –> JBoss cache and memcached
  • Akamai reverse proxy
  • Open Platform – access to all Guardian content as XML, JSON and Atom
  • Open Database