
It's good to see this written up somewhere. The technique is widely known in the Rails community but I haven't seen it explained in one place. (For example, the Rails API documentation and guides don't even mention that the cache helper can take an Active Record instance as the key, so you'd be hard pressed to discover this technique without reading the Action Pack source.)

As DHH says, the upside is that it's simple and, as long as you get the cascading updates right, makes sure you don't ever show stale content.
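For anyone who hasn't seen the idea before, here is a minimal plain-Ruby sketch of it, not the actual Rails implementation. `Record`, `FragmentCache` and the integer timestamps are hypothetical stand-ins I made up for illustration; in a real app the key comes from the Active Record instance and the cascade from the `touch` option on `belongs_to`.

```ruby
# Hypothetical stand-in for an Active Record model: just an id, a timestamp,
# and an optional parent to cascade updates to.
class Record
  attr_reader :id, :updated_at

  def initialize(id, parent: nil, updated_at: 0)
    @id = id
    @parent = parent
    @updated_at = updated_at
  end

  # Bump our own timestamp, then cascade to the parent (the role played by
  # the `touch` option on `belongs_to` in Rails).
  def touch(now)
    @updated_at = now
    @parent&.touch(now)
  end

  # The timestamp is part of the key, so any update produces a new key and
  # the old fragment is simply never asked for again.
  def cache_key
    "#{self.class.name.downcase}/#{id}-#{updated_at}"
  end
end

class Project < Record; end
class Todo < Record; end

# Hypothetical in-memory fragment store that counts how often it has to render.
class FragmentCache
  attr_reader :misses

  def initialize
    @store = {}
    @misses = 0
  end

  def fetch(record)
    @store.fetch(record.cache_key) do
      @misses += 1
      @store[record.cache_key] = yield
    end
  end
end

project = Project.new(1)
todo = Todo.new(7, parent: project)
cache = FragmentCache.new

cache.fetch(project) { "project fragment" } # miss: rendered and stored
cache.fetch(project) { "project fragment" } # hit: same key, no render
todo.touch(5)                               # cascades up to the project
cache.fetch(project) { "project fragment" } # miss: the key changed
puts cache.misses                           # prints 2
```

Note that nothing is ever deleted explicitly; stale entries just stop being requested, which assumes a store that evicts old entries on its own.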

There are a couple of downsides.

Firstly, once a fragment becomes invalid, it's not regenerated until the next request that needs it, so someone somewhere gets a slow[er] page load; even worse, if yours is a large public site with lots of long-tail pages, that "someone" is likely to be the Googlebot, and the slower page times might adversely affect your search ranking. You could ameliorate this by having some kind of automated process for periodically pre-warming certain parts of the cache after certain kinds of update, but that can get difficult to maintain.

Secondly, this technique is very conservative: the goal is to invalidate all fragments which might have been affected by an update, rather than only the fragments which actually have been. (It's less brute-force than the even simpler "invalidate the entire cache every time any record is updated" strategy, but it's making a similar trade-off.) This works well for an application like Basecamp where the hierarchical presentation of the data matches its hierarchical organisation at the model level, but is more problematic in applications with lots of different ways of presenting the same underlying data; if you have lots of cached representations of projects which don't actually include any information about the child todo lists or todos, then tough shit, those are all going to be invalidated when a single todo is updated, and you'll have to spend (page load) time waiting for them to be regenerated. In principle you can get around this by having multiple timestamps per model to represent different kinds of update, but again that's more difficult to manage correctly over time.
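To make the "multiple timestamps" idea concrete, here is a rough sketch under my own assumptions: the `details_updated_at` / `todos_updated_at` split and the method names are hypothetical, not anything Rails or Basecamp provides.

```ruby
# Hypothetical project model with two timestamps instead of one.
class Project
  attr_reader :id, :details_updated_at, :todos_updated_at

  def initialize(id)
    @id = id
    @details_updated_at = 0
    @todos_updated_at = 0
  end

  # Called when the project's own attributes (name, description) change.
  def touch_details(now)
    @details_updated_at = now
  end

  # Called when any child todo list or todo changes.
  def touch_todos(now)
    @todos_updated_at = now
  end

  # Key for fragments that show only the project's own attributes.
  def summary_cache_key
    "project/#{id}/summary-#{details_updated_at}"
  end

  # Key for fragments that also render the todos, so either kind of
  # update must change it.
  def full_cache_key
    "project/#{id}/full-#{details_updated_at}-#{todos_updated_at}"
  end
end

project = Project.new(1)
summary_before = project.summary_cache_key
full_before = project.full_cache_key

project.touch_todos(5) # a single todo was updated

puts project.summary_cache_key == summary_before # true: summary fragments survive
puts project.full_cache_key == full_before       # false: full fragments are regenerated
```

The maintenance cost shows up quickly: every new kind of fragment means deciding which timestamps belong in its key, and getting that wrong brings stale content back.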

An interesting alternative approach is James Coglan's experimental Primer (https://github.com/jcoglan/primer), which dynamically tracks which Active Record instances & attributes are used during the generation of a cached fragment, and then actively invalidates that cache when any of those attributes change. The downside to Primer is that it's more complex, not production-ready, and may contain traces of magic.



I'm interested in trying out this technique for a site we built, but your comments about Googlebot and the associated downside with long-tail pages give me pause. Do you have any suggestions on how to evaluate whether or not a technique like this would bring advantages, or is it better to simply give it a try and see how it works out?

In our case, the site features real estate listings, and at any given time there are about 50,000 listings. Each listing has its own page. There are numerous other pages, some of which are just index pages (lists of listings), a couple of search pages, etc. But the majority of pages are the single listing view pages, and then following that, the lists of listings. Let's call it around 60,000 pages or so.

The total number of pageviews per day (not counting search engine crawlers) is less than the total number of pages, but I would be willing to bet that some individual listing pages are accessed at a much higher rate than others (namely, the newest ones). The listings themselves are updated on a daily basis, so the cache stores would get invalidated at least once a day.

In this situation, how do I determine the appropriate caching strategy?


If you have enough cache capacity to keep every individual listing in cache then you could simply write a script that periodically hits the detail page for each listing to make sure the latest version of each listing is always available in cache.
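A rough sketch of such a script, assuming the detail pages live at a URL like the one below (`BASE_URL`, `HTTP_FETCHER` and `warm_listings` are hypothetical names; swap in your real route and however you enumerate listing ids):

```ruby
require "net/http"
require "uri"

# Hypothetical URL scheme; substitute the site's real listing route.
BASE_URL = "http://localhost:3000/listings"

# Default fetcher: a plain GET, returning the HTTP status code as a string.
HTTP_FETCHER = ->(url) { Net::HTTP.get_response(URI(url)).code }

# Requests each listing's detail page so its fragments are regenerated
# before a visitor or crawler asks for them. Returns the ids that failed.
def warm_listings(ids, fetcher: HTTP_FETCHER, pause: 0)
  ids.reject do |id|
    begin
      status = fetcher.call("#{BASE_URL}/#{id}")
      sleep(pause) if pause > 0 # throttle so warming doesn't swamp the app
      status == "200"
    rescue StandardError
      false # count network errors as failures and keep going
    end
  end
end
```

You'd run something like `warm_listings(listing_ids, pause: 0.05)` from cron right after the daily update and log whatever ids come back. The fetcher is injectable only so the loop can be exercised without a live server.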


Best writeup on the topic of caching in Rails that I've seen is http://broadcastingadam.com/2011/05/advanced_caching_in_rail...


Hey, I wrote that! Thanks for the compliment :D


Every Rails developer should read this article.


This might already be covered, but unless you're running a threaded server or have developed some other sort of centralized locking, it's not hard to get race conditions in the cache generation.

For Basecamp it probably doesn't matter that much. For complex actions that serve mostly public requests that mostly hit "all at the same time" (think of a media embargo lifting on a new product announcement) it can cause real problems.

Ideally you want the first request to block the rest, so that the cache is only generated once, the other requests just spinning until the first is done, the cache is ready, and the lock released. That way all those other requests just consume sockets instead of unnecessary CPU/IO/DB time.
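A minimal sketch of that blocking behaviour, assuming all requests are threads inside a single process (`LockingCache` is a hypothetical class, not part of Rails). With multiple worker processes the lock would have to live somewhere shared, which this does not attempt.

```ruby
# Hypothetical in-process cache where only one caller per key runs the
# expensive block; concurrent callers wait and then read the stored value.
class LockingCache
  def initialize
    @store = {}
    @locks = Hash.new { |hash, key| hash[key] = Mutex.new }
    @guard = Mutex.new # protects the table of per-key locks
  end

  def fetch(key)
    lock = @guard.synchronize { @locks[key] }
    lock.synchronize do
      # Anyone who waited on the lock finds the value already present.
      return @store[key] if @store.key?(key)
      @store[key] = yield
    end
  end
end

cache = LockingCache.new
renders = 0

threads = 10.times.map do
  Thread.new do
    cache.fetch("home") do
      renders += 1 # only ever runs while holding the per-key lock
      sleep 0.05   # stand-in for slow CPU/IO/DB work
      "<html>"
    end
  end
end
threads.each(&:join)

puts renders # prints 1
```

The per-key lock means a slow render of one fragment doesn't hold up requests for unrelated fragments.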



