Twitter is built on open-source software. Here are the projects the company has released or contributed to.
1. KESTREL
Kestrel is based on Blaine Cook’s “starling” simple, distributed message queue, with added features and bulletproofing, as well as the scalability offered by actors and the JVM.
Each server handles a set of reliable, ordered message queues. When you put a cluster of these servers together, with no cross communication, and pick a server at random whenever you do a set or get, you end up with a reliable, loosely ordered message queue.
In many situations, loose ordering is sufficient. Dropping the requirement on cross communication makes it horizontally scale to infinity and beyond: no multicast, no clustering, no “elections”, no coordination at all.
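To make the "loosely ordered" idea concrete, here is a minimal in-memory sketch. It is not Kestrel's code or client API: `QueueServer`, `Cluster`, and `drain` are hypothetical names, and the fallback to other servers when the randomly chosen one is empty is an assumption added so the example can drain the cluster.

```python
import random
from collections import deque


class QueueServer:
    """Stand-in for one server: a strictly ordered in-memory FIFO queue."""

    def __init__(self):
        self.items = deque()

    def set(self, item):
        self.items.append(item)

    def get(self):
        return self.items.popleft() if self.items else None


class Cluster:
    """Independent servers with no cross communication."""

    def __init__(self, size, seed=None):
        self.servers = [QueueServer() for _ in range(size)]
        self.rng = random.Random(seed)

    def set(self, item):
        # Pick a server at random for every write.
        self.rng.choice(self.servers).set(item)

    def get(self):
        # Visit servers in random order; fall through empty ones (assumption).
        for server in self.rng.sample(self.servers, len(self.servers)):
            item = server.get()
            if item is not None:
                return item
        return None


def drain(cluster):
    """Read until every server is empty."""
    out = []
    while (item := cluster.get()) is not None:
        out.append(item)
    return out
```

Each server hands its own items back in insertion order, but the interleaving across servers is random, so the cluster as a whole only promises that every item comes out once, roughly in order.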
Kestrel is:
- Fast – It runs on the JVM, so it can take advantage of the hard work people have put into Java performance.
- Small – Currently about 2K lines of Scala (including comments), because it relies on Apache Mina (a rough equivalent of Danger’s ziggurat or Ruby’s EventMachine) and actors — and frankly because Scala is extremely expressive.
- Durable – Queues are stored in memory for speed, but logged into a journal on disk so that servers can be shut down or moved without losing any data.
- Reliable – A client can ask to “tentatively” fetch an item from a queue, and if that client disconnects from kestrel before confirming ownership of the item, the item is handed to another client. In this way, crashing clients don’t cause lost messages.
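The "tentative" fetch described in the last bullet can be sketched as a small state machine. This is an illustration only, not Kestrel's implementation or wire protocol: the class and method names are hypothetical, and putting an unconfirmed item back at the front of the queue is an assumption.

```python
from collections import deque


class ReliableQueue:
    """Toy queue where a fetched item stays owned by the queue until confirmed."""

    def __init__(self):
        self.items = deque()
        self.pending = {}  # client_id -> item fetched but not yet confirmed

    def put(self, item):
        self.items.append(item)

    def fetch_tentative(self, client_id):
        if client_id in self.pending:
            raise RuntimeError("client already holds an unconfirmed item")
        if not self.items:
            return None
        item = self.items.popleft()
        self.pending[client_id] = item
        return item

    def confirm(self, client_id):
        # The client takes ownership; the item is gone for good.
        self.pending.pop(client_id, None)

    def disconnect(self, client_id):
        # An unconfirmed item goes back to the queue for another client.
        item = self.pending.pop(client_id, None)
        if item is not None:
            self.items.appendleft(item)
```

A client that crashes between `fetch_tentative` and `confirm` triggers `disconnect`, so the next client to fetch receives the same item.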
For more information about what it is and how to use it, check out the included guide.
Author: Robey Pointer
GITHUB : http://github.com/robey/kestrel
2. SCALA JSON
A Scala JSON toolkit originally lifted from Martin Odersky et al.’s book Programming in Scala. Twitter tightened up some edge cases and added complete test coverage.
Original code is under the Scala license (LICENSE.scala) and Twitter modifications are available under the Apache 2 license (LICENSE).
GITHUB : https://github.com/stevej/scala-json
3. QUERULOUS
Querulous is the agreeable way to talk to the database.
- Handles all the JDBC bullshit so you don’t have to: type casting for primitives and collections, exception handling and transactions, and so forth;
- Fault tolerant: configurable strategies such as timeouts, mark-dead thresholds, and retries;
- Designed for operability: rich statistics about your database usage and extensive debug logging;
- Minimalist: minimal code, minimal assumptions, minimal dependencies. You write highly-tuned SQL and we get out of the way;
- Highly modular, highly configurable.
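The fault-tolerance bullet names retries and mark-dead thresholds. The sketch below shows one generic way those two strategies can combine. It does not use Querulous's API: `FaultTolerantCaller`, `HostMarkedDead`, and every parameter name are hypothetical, and timeouts are left out for brevity.

```python
import time


class HostMarkedDead(Exception):
    """Raised while a host is inside its mark-dead window."""


class FaultTolerantCaller:
    def __init__(self, max_retries=2, dead_threshold=3, dead_seconds=30.0,
                 clock=time.monotonic):
        self.max_retries = max_retries
        self.dead_threshold = dead_threshold
        self.dead_seconds = dead_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.dead_until = 0.0

    def call(self, operation):
        # Fail fast while the host is marked dead.
        if self.clock() < self.dead_until:
            raise HostMarkedDead("host is marked dead")
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                result = operation()
            except OSError as error:  # covers ConnectionError and TimeoutError
                last_error = error
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.dead_threshold:
                    # Too many failures in a row: stop hammering the host.
                    self.dead_until = self.clock() + self.dead_seconds
                    break
            else:
                self.consecutive_failures = 0
                return result
        raise last_error
```

Passing the clock in as a parameter keeps the mark-dead window easy to exercise in tests without sleeping.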
GITHUB : https://github.com/nkallen/querulous
4. FLOCK DB
FlockDB is a distributed graph database for storing adjacency lists. FlockDB is much simpler than other graph databases such as neo4j because it tries to solve fewer problems. It scales horizontally and is designed for online, low-latency, high-throughput environments such as websites.
Twitter uses FlockDB to store social graphs (who follows whom, who blocks whom) and secondary indices. As of April 2010, the Twitter FlockDB cluster stores 13+ billion edges and sustains peak traffic of 20k writes/second and 100k reads/second.
It is designed to support:
- a high rate of add/update/remove operations
- potentially complex set arithmetic queries
- paging through query result sets containing millions of entries
- ability to “archive” and later restore archived edges
- horizontal scaling including replication
- online data migration
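A toy model of an adjacency-list store helps show what "set arithmetic" and "paging" mean here. This is a single-process sketch, not FlockDB's schema or query API: `EdgeStore` and `page` are hypothetical names, and sorting the full result on every page is a simplification that a real store would avoid.

```python
from bisect import bisect_right


class EdgeStore:
    """Adjacency lists held as source -> set of destination IDs."""

    def __init__(self):
        self.forward = {}

    def add(self, source, dest):
        self.forward.setdefault(source, set()).add(dest)

    def remove(self, source, dest):
        self.forward.get(source, set()).discard(dest)

    def following(self, source):
        return set(self.forward.get(source, ()))


def page(ids, cursor, count):
    """Return up to `count` IDs greater than `cursor`, plus the next cursor."""
    if count < 1:
        raise ValueError("count must be at least 1")
    ordered = sorted(ids)
    start = 0 if cursor is None else bisect_right(ordered, cursor)
    chunk = ordered[start:start + count]
    more = start + count < len(ordered)
    return chunk, (chunk[-1] if more else None)
```

For example, "accounts that both user 1 and user 2 follow" is `store.following(1) & store.following(2)`, and a large result can be walked by feeding each returned cursor back into `page`.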
GITHUB : https://github.com/twitter/flockdb
5. GIZZARD
Gizzard is a Scala framework: it offers a basic template for solving a certain class of problem. The template is not perfect for everyone’s needs, but it is useful for a wide variety of data storage problems. At a high level, Gizzard is a middleware networking service that manages partitioning data across arbitrary backend datastores (e.g., SQL databases, Lucene, etc.). The partitioning rules are stored in a forwarding table that maps key ranges to partitions. Each partition manages its own replication through a declarative replication tree. Gizzard supports “migrations” (for example, elastically adding machines to the cluster) and gracefully handles failures. The system is made eventually consistent by requiring that all write operations be idempotent and commutative; operations that fail (because of, e.g., a network partition) are retried at a later time.
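The two core ideas in that description, a forwarding table over key ranges and writes that are safe to retry in any order, can be sketched as follows. None of this is Gizzard's API: all class names are hypothetical, ranges are assumed to be identified by their lower bound, the replication "tree" is flattened to a list of replicas, and last-timestamp-wins is just one example of an idempotent, commutative write.

```python
from bisect import bisect_right


class ForwardingTable:
    """Maps key ranges to partition names via sorted lower bounds (assumption)."""

    def __init__(self, entries):
        self.entries = sorted(entries)  # (lower_bound, partition_name) pairs
        self.lower_bounds = [lower for lower, _ in self.entries]

    def lookup(self, key):
        index = bisect_right(self.lower_bounds, key) - 1
        if index < 0:
            raise KeyError(f"no partition covers key {key}")
        return self.entries[index][1]


class Replica:
    def __init__(self):
        self.rows = {}  # key -> (timestamp, value)
        self.available = True

    def apply(self, key, value, timestamp):
        if not self.available:
            raise ConnectionError("replica unreachable")
        candidate = (timestamp, value)
        current = self.rows.get(key)
        # Keeping the maximum makes the write idempotent and commutative.
        if current is None or candidate > current:
            self.rows[key] = candidate


class Partition:
    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.retry_queue = []

    def write(self, key, value, timestamp):
        for replica in self.replicas:
            self._apply_or_queue(replica, (key, value, timestamp))

    def retry_failed(self):
        # Replay failed operations; anything that fails again is re-queued.
        queued, self.retry_queue = self.retry_queue, []
        for replica, operation in queued:
            self._apply_or_queue(replica, operation)

    def _apply_or_queue(self, replica, operation):
        try:
            replica.apply(*operation)
        except ConnectionError:
            self.retry_queue.append((replica, operation))
```

Because applying a write twice, or applying two writes in either order, leaves a replica in the same state, the retry queue can be replayed at any later time and the replicas still converge.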
GITHUB : https://github.com/twitter/gizzard
6. SNOWFLAKE
Snowflake is a network service for generating unique ID numbers at high scale with some simple guarantees.
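The text above does not spell out how IDs are built, so the following is only a sketch of one commonly described scheme for such generators: a millisecond timestamp, a worker number, and a per-millisecond sequence packed into one integer. The bit widths, the epoch constant, and the `IdGenerator` name are assumptions for illustration, not taken from the source.

```python
import threading
import time

EPOCH_MS = 1288834974657  # assumed custom epoch; any fixed past instant works
WORKER_BITS = 10          # assumed width of the worker number
SEQUENCE_BITS = 12        # assumed width of the per-millisecond counter
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class IdGenerator:
    def __init__(self, worker_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.clock_ms = clock_ms
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = self.clock_ms()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    # Counter exhausted for this millisecond: wait for the next.
                    while now <= self.last_ms:
                        now = self.clock_ms()
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return ((timestamp << (WORKER_BITS + SEQUENCE_BITS))
                    | (self.worker_id << SEQUENCE_BITS)
                    | self.sequence)
```

With the timestamp in the high bits, IDs from one worker always increase, and IDs from different workers never collide as long as each worker has a distinct `worker_id`.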
GITHUB : https://github.com/twitter/snowflake