06/01 Haystack
Overview
Network File System (NFS) is a distributed file system protocol, allowing a user on a client computer to access files over a computer network much like local storage is accessed.
- Features of FB users
- upload a large number of photos each week
- visit often
- Long Tail Issue
- some popular photos accessed frequently
- a very large number of photos accessed rarely
Goals of Haystack
- high throughput and low latency
- provide a good user experience
- fault-tolerant
- handle server crashes and hard drive failures
- cost-effective
- save money over traditional approaches (reduce reliance on CDNs!)
- simplicity
- make it easy to implement and maintain
Features of Old Design
- each image is stored in its own file
- enormous amount of metadata (namespace directories and file inodes)
- the amount of metadata far exceeds the caching abilities of the NFS storage tier, resulting in multiple I/O operations per photo upload or read request
- high degree of reliance on CDNs = expensive
Haystack
Step
- web server receives the request
- uses Haystack Directory to construct URL
- http://⟨CDN⟩/⟨Cache⟩/⟨Machineid⟩/⟨Logical volume, Photo⟩
- from which CDN to request the photo
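A minimal sketch of the URL layout above, assuming hypothetical host and machine names. The `PhotoLocation` type and both helper functions are illustrative and not part of Haystack. The second helper reflects the paper's description that each hop strips its own portion of the URL before contacting the next one.

```python
from typing import NamedTuple


class PhotoLocation(NamedTuple):
    """Hypothetical container for the pieces the Directory supplies."""
    cdn: str
    cache: str
    machine_id: str
    logical_volume: int
    photo_id: int


def build_photo_url(loc: PhotoLocation) -> str:
    # Segments follow the order in the notes: CDN, Cache, Machine id,
    # then "<logical volume>,<photo id>" as the last segment.
    return (f"http://{loc.cdn}/{loc.cache}/{loc.machine_id}/"
            f"{loc.logical_volume},{loc.photo_id}")


def strip_cdn(url: str) -> str:
    """Drop the CDN portion, leaving the URL a CDN would send to the Cache."""
    rest = url[len("http://"):]
    _cdn, _, remainder = rest.partition("/")
    return "http://" + remainder


# Example values are made up for illustration.
url = build_photo_url(
    PhotoLocation("cdn.example", "cache1.example", "m42", 7, 12345)
)
print(url)
print(strip_cdn(url))
```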
Haystack Directory
Main functions
- provides a mapping from logical volumes to physical volumes.
- Web servers use this mapping when uploading photos and also when constructing the image URLs for a page request.
- load balances writes across logical volumes and reads across physical volumes.
- determines whether a photo request should be handled by the CDN or by the Cache.
- This functionality lets us adjust our dependence on CDNs.
- identifies those logical volumes that are read-only either because of operational reasons or because those volumes have reached their storage capacity. We mark volumes as read-only at the granularity of machines for operational ease.
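The Directory functions above can be sketched roughly as follows. This is an illustration under stated assumptions, not Facebook's implementation: the class and method names are hypothetical, and uniform random choice stands in for whatever load-balancing policy the real Directory uses.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LogicalVolume:
    volume_id: int
    physical_machines: list[str]  # machines holding a replica of this volume
    read_only: bool = False


@dataclass
class Directory:
    volumes: dict[int, LogicalVolume] = field(default_factory=dict)
    rng: random.Random = field(default_factory=random.Random)

    def add_volume(self, vol: LogicalVolume) -> None:
        self.volumes[vol.volume_id] = vol

    def pick_volume_for_write(self) -> LogicalVolume:
        # Writes are balanced across write-enabled logical volumes.
        writable = [v for v in self.volumes.values() if not v.read_only]
        if not writable:
            raise RuntimeError("no write-enabled logical volume available")
        return self.rng.choice(writable)

    def pick_machine_for_read(self, volume_id: int) -> str:
        # Reads are balanced across the physical replicas of one logical volume.
        return self.rng.choice(self.volumes[volume_id].physical_machines)

    def mark_machine_read_only(self, machine: str) -> None:
        # Read-only is set at machine granularity: every logical volume
        # with a replica on this machine stops accepting writes.
        for v in self.volumes.values():
            if machine in v.physical_machines:
                v.read_only = True
```

For example, after `mark_machine_read_only("a")`, any logical volume replicated on machine `a` is skipped by `pick_volume_for_write` but can still serve reads.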
Haystack Cache
- organized as a distributed hash table, using the photo's id as the key to locate cached data
- receives HTTP requests for photos from CDNs and also directly from users’ browsers.
- If the photo is in the Cache, return it
- If the photo is not in the Cache, fetch it from the Haystack Store and return it
- Add a photo to Cache if two conditions are met:
- The request comes directly from a user (browser) and not the CDN
- if the request comes from the CDN, the CDN can cache the photo itself.
- The photo is fetched from a write-enabled Store machine.
- which shows that this photo was uploaded recently
- achieves an 80% hit ratio
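The Cache admission rule can be sketched as below. This is a simplified illustration: a plain in-process dict stands in for the distributed hash table, and `PhotoCache`, `fetch_from_store`, and the two boolean parameters are hypothetical names.

```python
from typing import Callable


class PhotoCache:
    def __init__(self, fetch_from_store: Callable[[int], bytes]) -> None:
        self._fetch = fetch_from_store
        self._data: dict[int, bytes] = {}  # photo id -> photo bytes

    def get(self, photo_id: int, from_cdn: bool, store_write_enabled: bool) -> bytes:
        # Hit: return the cached photo.
        if photo_id in self._data:
            return self._data[photo_id]
        # Miss: always fetch from the Store and return the photo.
        photo = self._fetch(photo_id)
        # Admit only if the request came directly from a browser AND the
        # photo lives on a write-enabled Store machine (recently uploaded).
        if not from_cdn and store_write_enabled:
            self._data[photo_id] = photo
        return photo

    def __contains__(self, photo_id: int) -> bool:
        return photo_id in self._data
```

A request that arrives via the CDN is served but not admitted, since the CDN is expected to cache the result itself.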
Haystack Store
- Read
- Write
- Delete
- Store machine sets the delete flag in both the in-memory mapping and the volume file
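A rough sketch of the delete path, assuming a simplified record layout (photo id, flags byte, data size, then data) and an `io.BytesIO` buffer standing in for the on-disk volume file. The `Volume` class and its field layout are illustrative, not the real Store format.

```python
import io
import struct
from typing import Optional

FLAG_DELETED = 0x01
# Assumed layout: 8-byte photo id, 1-byte flags, 4-byte data size.
RECORD_HEADER = struct.Struct(">QBI")


class Volume:
    def __init__(self) -> None:
        self.file = io.BytesIO()          # stand-in for the volume file
        self.index: dict[int, dict] = {}  # photo id -> offset, size, deleted

    def write(self, photo_id: int, data: bytes) -> None:
        offset = self.file.seek(0, io.SEEK_END)  # append-only
        self.file.write(RECORD_HEADER.pack(photo_id, 0, len(data)) + data)
        self.index[photo_id] = {"offset": offset, "size": len(data), "deleted": False}

    def delete(self, photo_id: int) -> None:
        entry = self.index[photo_id]
        entry["deleted"] = True                 # 1) in-memory mapping
        self.file.seek(entry["offset"] + 8)     # flags byte follows the 8-byte id
        self.file.write(bytes([FLAG_DELETED]))  # 2) volume file

    def read(self, photo_id: int) -> Optional[bytes]:
        entry = self.index.get(photo_id)
        if entry is None or entry["deleted"]:
            return None
        self.file.seek(entry["offset"] + RECORD_HEADER.size)
        return self.file.read(entry["size"])
```

In this sketch the photo bytes stay in the file after a delete; only the flag changes, so the space would have to be reclaimed by a separate step.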
Needle
- A Store machine represents a physical volume as a large file consisting of a superblock followed by a sequence of needles.
- Each needle represents a photo stored in Haystack.
- cookie: security cookie supplied by the client app to prevent brute-force attacks (guessing valid photo URLs)
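To make the needle idea concrete, here is a minimal sketch of a volume as a superblock followed by needles, with a cookie check on read. The field order and widths, the CRC32 footer, and the `SUPERBLOCK` bytes are assumptions chosen for illustration and do not reproduce the exact on-disk format.

```python
import struct
import zlib

SUPERBLOCK = b"HAYSTACK"  # hypothetical placeholder for the real superblock

# Assumed header: 8-byte cookie, 8-byte key (photo id), 1-byte flags,
# 4-byte data size. Assumed footer: 4-byte CRC32 of the data.
HEADER = struct.Struct(">QQBI")
FOOTER = struct.Struct(">I")


def pack_needle(cookie: int, key: int, data: bytes, flags: int = 0) -> bytes:
    """Serialize one photo as header + data + checksum."""
    return (HEADER.pack(cookie, key, flags, len(data))
            + data
            + FOOTER.pack(zlib.crc32(data)))


def read_needle(volume: bytes, offset: int, expected_cookie: int) -> bytes:
    """Read the needle at `offset`, rejecting a wrong cookie or corrupt data."""
    cookie, _key, _flags, size = HEADER.unpack_from(volume, offset)
    if cookie != expected_cookie:
        # A guessed URL without the right cookie is refused.
        raise PermissionError("cookie mismatch")
    start = offset + HEADER.size
    data = volume[start:start + size]
    (checksum,) = FOOTER.unpack_from(volume, start + size)
    if zlib.crc32(data) != checksum:
        raise ValueError("checksum mismatch")
    return data
```

Because a reader must present the cookie stored in the needle, knowing only the photo id and volume is not enough to fetch the photo in this sketch.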
Haystack
Question & Discussion
- Album level abstraction
- better if photos from the same album are placed sequentially or at least close together
- Privacy concerns
- Are cookies sufficient protection? Is there a better way?
- Security level of Facebook?
- How is consistency maintained between the Haystack and the CDN?