When you add a web site like Flickr or Google Reader to FriendFeed, FriendFeed's servers constantly download your feed from the service to get your updates as quickly as possible. FriendFeed's user base has grown quite a bit since launch, and our servers now download millions of feeds from over 43 services every hour.
One of the limitations of this approach is that it is difficult to get updates from services quickly without FriendFeed's crawler overloading other sites' servers with update checks. Gary Burd and I have thought quite a bit about ways we could augment existing feed formats like Atom and RSS to make fetching updates faster and more efficient. Our proposal, which we have named Simple Update Protocol, or SUP, is below. You can read more details and check out sample code on Google Code. Discuss the proposal in the SUP FriendFeed room.
SUP is just a proposal at this stage. We are eager to get feedback and ideas, and we expect to update the protocol based on feedback over the next few months.
Simple Update Protocol
SUP (Simple Update Protocol) is a simple and compact "ping feed" that web services can produce in order to alert the consumers of their feeds when a feed has been updated. This reduces update latency and improves efficiency by eliminating the need for frequent polling.
Benefits include:
- Simple to implement. Most sites can add support with only a few lines of code if their database already stores timestamps.
- Works over HTTP, so it's very easy to publish and consume.
- Cacheable. A SUP feed can be generated by a cron job and served from a static text file or from memcached.
- Compact. Updates can be about 21 bytes each (8 bytes with gzip encoding).
- Does not expose usernames or secret feed URLs (such as Google Reader Shared Items feeds).
SUP is designed to be especially easy for feed publishers to create. It's not ideal for small feed consumers because they will only be interested in a tiny fraction of the updates. However, intermediate services such as Gnip or others could easily consume a SUP feed and convert it into a subscribe/push model using XMPP or HTTP callbacks.
Sites wishing to produce a SUP feed must do two things:
- Add a special <link> tag to their SUP-enabled Atom or RSS feeds. This <link> tag includes the feed's SUP-ID and the URL of the appropriate SUP feed.
- Generate a SUP feed which lists the SUP-IDs of all recently updated feeds (both pieces are sketched below).
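To make these two steps concrete, here is a minimal publisher-side sketch in Python. The get_recent_updates() query is hypothetical, and the rel value, the SUP-ID-in-URL-fragment convention, and the JSON field names are illustrative assumptions; the authoritative format is in the documentation on Google Code.

import json, time

SUP_URL = "http://example.com/api/sup.json"   # hypothetical SUP feed location
PERIOD = 60                                   # seconds covered by each SUP document

# Step 1: each Atom/RSS feed advertises the SUP feed and its own SUP-ID, e.g.
# (the rel value and the SUP-ID carried in the URL fragment are illustrative):
#
#   <link rel="http://api.friendfeed.com/2008/03#sup" type="application/json"
#         href="http://example.com/api/sup.json#53924729" />

def get_recent_updates(since):
    """Hypothetical database query: (sup_id, update_token) pairs for feeds updated since `since`."""
    raise NotImplementedError

# Step 2: a cron job regenerates the SUP feed and serves it as a static file.
def generate_sup(path="/var/www/api/sup.json"):
    now = int(time.time())
    doc = {
        "period": PERIOD,
        "since_time": now - PERIOD,
        "updated_time": now,
        "updates": [[sup_id, token] for sup_id, token in get_recent_updates(now - PERIOD)],
    }
    with open(path, "w") as f:
        json.dump(doc, f)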
Feed consumers can add SUP support by:
- Storing the SUP-IDs of the Atom/RSS feeds they consume.
- Watching for those SUP-IDs in their associated SUP feeds.
By using SUP-IDs instead of feed urls, we avoid having to expose the feed url, avoid URL canonicalization issues, and produce a more compact update feed (because SUP-IDs can be a database id or some other short token assigned by the service).
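To illustrate the consumer side, here is a minimal polling sketch in Python. It assumes each feed's SUP-ID and SUP URL were recorded from its <link> tag at first crawl, and it reuses the illustrative "updates" layout ([SUP-ID, token] pairs) from the publisher sketch above; the SUP-ID-to-feed pairings shown are examples, not real mappings.

import json, time, urllib.request

SUP_URL = "http://friendfeed.com/api/sup.json"

# SUP-ID -> feed URL, captured from each feed's <link> tag when first crawled
tracked = {
    "53924729": "http://friendfeed.com/paul?format=atom",
    "44123b23": "http://friendfeed.com/bret?format=atom",
}

def changed_feeds():
    doc = json.load(urllib.request.urlopen(SUP_URL))
    seen = {sup_id for sup_id, _ in doc["updates"]}   # assumed [SUP-ID, token] pairs
    return [url for sup_id, url in tracked.items() if sup_id in seen]

while True:
    for feed_url in changed_feeds():
        # Re-crawl only the feeds whose SUP-ID appeared in the SUP document
        data = urllib.request.urlopen(feed_url).read()
        # ... parse the Atom/RSS document and store any new entries ...
    time.sleep(180)   # one request every 3 minutes covers every tracked feed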
Because it is still possible to miss updates due to server errors or other malfunctions, SUP does not completely eliminate the need for polling. However, when using SUP, feed consumers can reduce polling frequency while simultaneously reducing update latency. For example, if a site such as FriendFeed switched from polling feeds every 30 minutes to polling every 300 minutes (5 hours), and also monitored the appropriate SUP feed every 3 minutes, the total amount of feed polling would be reduced by about 90%, and new updates would typically appear 10 times as fast.
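A back-of-the-envelope check of that claim, using an illustrative fleet of 10,000 monitored feeds:

feeds = 10_000                  # illustrative number of monitored feeds
old = feeds * 24 * 60 / 30      # every feed polled every 30 minutes
new = feeds * 24 * 60 / 300     # every feed polled every 300 minutes...
new += 24 * 60 / 3              # ...plus one SUP fetch every 3 minutes
print(f"requests/day: {old:.0f} -> {new:.0f} ({1 - new / old:.0%} fewer)")
# requests/day: 480000 -> 48480 (90% fewer)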
You can see SUP <link> elements in FriendFeed's Atom feeds (e.g., http://friendfeed.com/paul?format=atom), and you can see FriendFeed's SUP feed at http://friendfeed.com/api/sup.json.
Update: Several people have asked how using SUP compares with using HTTP If-Modified-Since headers. The two features are complementary. With SUP, feed consumers can monitor thousands of feeds with a single HTTP request (to fetch the latest SUP document) instead of having to request each feed individually. For example, each user's feed on FriendFeed has a unique SUP-ID (mine is "53924729"), but all of the feeds point to a single SUP URL, http://friendfeed.com/api/sup.json. Therefore, it's possible to watch for activity on thousands of separate FriendFeed URLs by polling just one URL, http://friendfeed.com/api/sup.json. If my SUP-ID appears in that SUP document, then you know that my feed has updated and it's time to fetch a new copy. This is substantially more efficient than polling each of those thousands of URLs individually.
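To show how the two mechanisms fit together in code, here is a sketch of the re-fetch step from the consumer loop above, refined with a conditional GET: SUP decides which feeds are worth re-fetching, and If-Modified-Since keeps each individual re-fetch cheap. The last_modified cache is an illustrative detail, not part of SUP.

import urllib.error, urllib.request

last_modified = {}   # feed URL -> Last-Modified header from the previous fetch

def refetch(feed_url):
    req = urllib.request.Request(feed_url)
    if feed_url in last_modified:
        req.add_header("If-Modified-Since", last_modified[feed_url])
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None                      # server says nothing actually changed
        raise
    if "Last-Modified" in resp.headers:
        last_modified[feed_url] = resp.headers["Last-Modified"]
    return resp.read()                       # the updated Atom/RSS document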
34 comments:
So, basically the concept of a "push" and you guys are the cloud? Right?
What is wrong with doing an HTTP HEAD request and looking at the Last-modified: header like everyone else does and has done for years?
This is pants on head retarded. Just do HTTP HEAD requests and check the last-modified header.
David and Mike: our crawler already does HTTP If-Modified-Since requests. The main issue is that we still need to do one request for every feed we fetch. So if 80,000 Flickr users have connected their accounts to FriendFeed, we need to do 80,000 requests to Flickr to look for updates. And Flickr needs to look in their database for every request (to see if an update has occurred). If we want updates to have at most 10 minutes of latency between publishing a photo and an update coming to FriendFeed, that is 133 requests per second. With SUP, we can look for updates for ALL feeds with a single request, which reduces load on both ends. Does that make sense?
"With SUP, we can look for updates for ALL feeds with a single requests, which reduces load on both ends. Does that make sense?"
- Bret, can you expand on that a bit?
RaghuL: With SUP, when we first crawl a feed for a particular user, e.g., http://friendfeed.com/bret?format=atom, we get the SUP-ID out of the feed. In the feed above, the SUP-ID given is "44123b23". We can then monitor the SUP feed, which gives the most recent updates from all feeds, identified by SUP-ID: http://friendfeed.com/api/sup.json. When we see "44123b23" in the SUP feed, we know http://friendfeed.com/bret?format=atom has been updated, so we fetch it again to get the updates. Since we only need to monitor one feed (http://friendfeed.com/api/sup.json) to look for all the SUP-IDs we care about, we only need to poll one URL to trigger updates for all the feeds we are monitoring. We can get updates almost immediately after they happen, and we don't need to poll every feed to accomplish it.
@David..
Most HTTP HEAD requests on systems like PHP cause a full load on the server and the full page to be rendered.
Further, people FREAK out when they see you performing millions of HTTP HEADS on their servers because it shows up in their HTTP logs.
Here are my thoughts with how it relates to what we do with Spinn3r.
Thanks for the reply Bret.
Essentially, SUP is a new parent 'What's changed' feed listing the SUP-IDs of all the affected child feeds.
It sounds like a great idea to cut down on all the to-and-fro requests, most of which might return unchanged feeds.
This is one of those simple ideas that one wonders why no one thought of before. It's a good proposal and a required one in the rapidly growing aggregation/Lifestreaming world. I am sure the proposal will be widely and quickly SUPported.
Sounds a bit like what NEWNEWS did for NNTP, helping news traverse the various NNTP servers faster. Here's a crazy thought: maybe a new HTTP verb "poll" in combination with If-Modified-Since. If used against a single feed, it would limit the transmitted items to only new items. If used against a domain, it could return modified pages/feeds for that domain. It could even increase the speed of crawlers for search engines. The Accept header could even limit what types of updates are returned via poll.
Here's my comment from the Read Write Web article:
"""
"On July 21st, 2008, FriendFeed crawled Flickr 2.9 million times to get the latest photos of 45,754 users of which 6,721 of that 45,754 visited Flickr in that 24 hour period, and could have 'potentially' uploaded a photo."
Source: http://www.slideshare.net/kellan/beyond-rest (Slide 16)
RSS is simply not going to get us there.
"""
Given that this still requires the services y'all are scraping to Do Something, why not become the killer-app for XMPP?
Probably because spitting out a feed of updated items is a 20-minute job, while implementing XMPP would be a major PITA.
I like this very much!
Did you think of using the existing RSS format for the list of feeds that are updated instead of the new JSON SUP format?
Very simple and clever. Friendfeed-like!
I don't see why it would not work, but still, it feels like another layer of duct tape.
Why insist on HTTP? Those 43 services that you poll constantly are the ones who have the engineering resources to do this right.
Even Livejournal did it better 2 or 3 years ago with the TCP-based ping stream.
I understand that you have a problem to solve and you are looking for a quick way out, but really, do it right and move to a stream-based solution: TCP, XMPP, or something else, take your pick.
Your first and third benefits incorrectly assume that all information resides in a single database, which is not the case for any site large enough to benefit from your proposal. Most large architectures work on the concept of sharding: groups of systems that each handle a smaller subset of the overall user population, without needing to know about each other. Unless I am misunderstanding some vital component, what you are asking providers to do is create a single point of aggregation that must know about all updates within all shards. This in itself creates a far more complex engineering problem than an end site making 200 requests a second (which, remember, in a sharded architecture scales by adding a few extra queries each to a number of autonomous systems).
SUP would also multiply the volume of raw computing work that needs to be done by a data provider. In an RSS architecture, a system must generate output for only the users that are actively being polled. In a "SUP" system, resources must now be spent to generate output for all users, regardless of whether anyone cares to come looking for it. Your claim that it reduces load on both ends is false; the only reduction in work is on the consuming end.
Your claim of not exposing secret feed URLs is also incorrect, as it appears there is no way to actually fetch the contents of an update without knowing the URL that originally led you to a SUP-ID.
I do not dispute that it works over HTTP, or that it is compact.
Guys, what do you think of Atom streams: http://updates.sixapart.com/ ?
So Flickr would have one SUP feed that lists any recently updated feeds? That could be 80,000 items, right? Is that going to scale?
Is there a document somewhere with the format of the <link> tags and the SUP feed? Examples aren't enough.
Oh god... this is not a "simple idea that no one thought of before". This is weblogs.com's changes.xml (courtesy of Dave Winer) all over again.
In fact, the inherent scaling problems in such a solution led to the RSS <cloud> element, which attempted to solve the problem at the source, but through XML-RPC, which is why the cool kids don't support it.
(That and that it breaks if you get 1000s of timeouts from unavailable RPC endpoints)
Also: what Mike D said in his 2nd post.
His first one missed the point. HTTP HEAD is not going to solve this problem.
Please take this back to the drawing board. You're right, this problem needs a solution.
But the solution will not be simple, because it's a really complex problem.
Does FriendFeed even run a ping server?
I'm still having trouble envisioning what a provider's implementation of this SUP feed would look like. MikeD makes good points about federated databases making this far more complicated and expensive. Paul Watson rightly points out that this feed would be gigantic. If it were hypothetically to contain 80,000 items, Flickr's would completely turn over every 25 minutes or so.
Rather than continuing to complain endlessly, I'd like to hear from FriendFeed themselves how they would go about creating a SUP stream for FriendFeed itself. I want to be convinced that this is a valid easy-way-out of the much longer road of popularizing XMPP.
I like this idea: simple and elegant. I agree that constant polling sucks and is a terrible solution for what services like FriendFeed or Feedheads are doing; however, I struggle to imagine something like this being promptly adopted by most feed producers.
Google Reader's shared feed doesn't even make use of If-Modified-Since headers. They're Google!!
Atom feeds solved this problem a long time ago, as evidenced at six apart with http://updates.sixapart.com
This solution is neither simple nor does it take into account years of prior art of solving this problem through push.
Mike D, other:
Re: multiple databases. First, you could serve multiple SUP feeds corresponding to your multiple databases. Second, if a client like FriendFeed or Gnip is hitting all your feeds, generating a single SUP feed is provably easier on your servers: you could poll your own RSS feeds on the server generating the SUP.
re: "In an RSS architecture, a system must generate output for only the users that are actively being polled." SUP is useful when two conditions are satisfied: a lot of your feeds are being hit by a single client, and many of those feeds are not updated during the polling interval. SUP is not designed for the use case you're talking about.
Re: "it appears there is no way to actually fetch the contents of an update without knowing the URL that originally led you to a SUP-ID" Yes, that's the point.
Re: push solutions. Maintaining and producing/consuming an open connection is harder for both ends.
I'm not sure SUP would help much for a large-scale generator of feed-based activity like Netflix without some notion of subscription resulting in a SUP feed per feed consumer. The personalized Netflix feeds generate about 6 million posts per day (about 2M each of queue adds, shipped DVDs and received DVDs). Given that any given feed consumer is likely only interested in a small fraction of the 8.4M+ subscribers, the signal-to-noise ratio in a general SUP feed would be quite low.
curl -N -s http://updates.sixapart.com/atom-stream.xml | grep --line-buffered "application/atom+xml" | grep --line-buffered 'rel="self"'
http://www.aaronsw.com/2002/atomstream/atomstream.py
But what do you do when you're polling thousands of SUP feeds individually? Obviously you'll need a way to see if any of the feeds in any of the SUPs you are tracking have updated in order to save even more bandwidth. Hmm... maybe we need another protocol ;-)
Here's an extended version of Atomstream.py with XMPP support (it's buggy and may crash after some time).
The script turns ATOM feeds into XHTML-IM messages.
Might be fun to add a track capability with pubsub, though. But I digress.
As regards data exchange, XMPP pubsub looks like an interesting solution: for example, unlike the SixApart Atom stream, FriendFeed's servers would only receive messages from subscribed XMPP Flickr accounts, not all the data, which would avoid filtering.
SUP looks interesting but so 20th century.
This is cool, but what about the xmpp pubsub? Isn't that what a lot of people are recommending now? What are your thoughts on this?
What's wrong with having a simple Atom “update” feed listing recently changed feeds? That feed could use RFC 5005, and consumers wouldn't need to worry about polling faster than the SUP feed turns over completely.
If I understand this right, we have a network efficiency problem and limitations based on data structure. Large providers may have millions or potentially billions of individual nodes that have a status update flag.
Current efficient polling is done by comparing the remote timestamp with the node's last update time.
Excuse me for the modest description; this problem space is new to me (but important to what I'm working on).
Is there a viral hub hierarchy? This would be a limited number of remote servers who simply pass along the last update time to all subscribers (and any polling system could also serve the last node update time). You would subscribe to a virtual server represented by a dynamic list of your nearest neighbors. A network similar to BitTorrent, but only concerned with last update times.
The storage could get problematic for large databases (even of a simple timestamp), so each serving hub could keep what it needs for internal use and just push all the times to its short list of clients.
I'm trying to think of a decentralized status architecture and would appreciate expert opinions.
Looks like my last vague idea was not only already developed, but already implemented. Just finished reading a sentence of Louis Gray's post.