How a large site works
They have millions of visitors a day. What technologies do large sites use?
We will compare the infrastructure of some large sites, such as Tumblr (10 million visits per day), Facebook (250 million) and several more, to show possible options.

This review isn't complete because sites don't publish the details of their infrastructure, but the information I've been able to gather should be enough to give an idea of how they might work and of the software that handles huge amounts of data. And there will be surprises, especially around databases.
File system
Google
The search engine created a file system better suited to its workload, which consists of billions of read-only requests, and to sharing data across clusters. GoogleFS, or GFS, is proprietary and has been in use since the site's first days, around 1998.
Others
Other sites use similar systems available on the market, such as IBM GPFS, a parallel cluster file system chosen by many companies, or HDFS, the Hadoop Distributed File System, which is part of a software suite created by the Apache Foundation for processing data in clusters. Adobe, AOL, Facebook, eBay, Hulu, LinkedIn, and even IBM use Hadoop or some of its tools.
Database
One thing we'll notice with all these big sites is that none of them use Oracle! One reason is that they need huge numbers of servers, and since the license depends on the number of machines, the cost would be astronomical. In addition, with open source solutions, which all of them use, it is easier to make changes, and updates and fixes arrive faster. They mainly use MySQL, the same software that runs your blog, but with additional tools and a good development team to optimize the software at will and make it very efficient.
Tumblr
The site depends largely on MySQL, which is accelerated by Memcache, HAProxy, and sharding, a form of database partitioning.
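To make the idea of sharding concrete, here is a minimal sketch of hash-based routing. It is not Tumblr's implementation: the shard names and the `shard_for` helper are hypothetical, and a real setup would map each name to a MySQL host.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to database hosts.
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Return the shard that owns `key`, using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a shard.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for blog in ("blog:1", "blog:2", "blog:3"):
        print(blog, "is stored on", shard_for(blog))
```

The same key always lands on the same shard, so each database only holds a slice of the data. The drawback of this simple modulo scheme is that changing the number of shards moves most keys, which is why larger systems often use lookup tables or consistent hashing instead.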
But it also makes heavy use of Redis, a fast key-value store, which handles the notifications shown in the user dashboard. These notifications have a limited retention period, which suits Redis well. Redis is also used for other ephemeral data and even as a cache in place of Memcache (a URL/page pair can be stored as a key/value pair).
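The limited retention period can be pictured with a small in-memory store in which every entry expires. This is only a sketch of the principle, not Redis itself and not its client API; the `ExpiringStore` class and the key names are assumptions made for illustration.

```python
import time


class ExpiringStore:
    """In-memory key-value store where every entry has a time to live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable clock, which makes expiry easy to test
        self._data = {}      # key -> (value, expiry timestamp)

    def set(self, key, value, ttl_seconds):
        """Store `value` under `key` for `ttl_seconds` seconds."""
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        """Return the value, or `default` if the key is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazy expiry: drop the entry when it is read
            return default
        return value


if __name__ == "__main__":
    store = ExpiringStore()
    store.set("notification:1", "someone reblogged your post", ttl_seconds=60)
    print(store.get("notification:1"))
```

Because stale entries disappear on their own, the application never has to run a cleanup job for old notifications.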
Tumblr uses HBase for its URL shortener, for history and statistics, and for messaging.
Facebook
Although Cassandra was created by Facebook, which donated the project to the Apache Foundation, the company is no longer on its list of users. Facebook replaced it with HBase.
HBase is used for the messaging system, but most storage is handled by MySQL, supplemented by Memcache to speed up operations. More programmers work on HBase than on MySQL even though its use is more limited, which suggests that it is harder to put into practice.
Haystack is a specialized photo storage system. It sits halfway between a DBMS and a file system, with an in-memory index that speeds up reads and writes.
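The principle of an in-memory index over one large storage file can be sketched as follows. This is not Haystack's actual format: the `BlobStore` class is hypothetical and uses an in-memory buffer in place of a file on disk.

```python
import io


class BlobStore:
    """Append-only blob storage with an in-memory index of (offset, length)."""

    def __init__(self, backing=None):
        # Any seekable binary file object works; BytesIO keeps the sketch self-contained.
        self._file = backing if backing is not None else io.BytesIO()
        self._index = {}  # photo id -> (offset, length)

    def put(self, photo_id, data: bytes):
        """Append the blob at the end of the file and remember where it is."""
        offset = self._file.seek(0, io.SEEK_END)
        self._file.write(data)
        self._index[photo_id] = (offset, len(data))

    def get(self, photo_id) -> bytes:
        """Read a blob with one seek and one read, using the in-memory index."""
        offset, length = self._index[photo_id]
        self._file.seek(offset)
        return self._file.read(length)


if __name__ == "__main__":
    store = BlobStore()
    store.put("photo-1", b"first image bytes")
    store.put("photo-2", b"second image bytes")
    print(store.get("photo-2"))
```

The point of the design is that locating a photo needs no directory lookup on disk: the index lives in memory, so a read costs a single seek in one large file.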
Google
Google uses BigTable, which partly inspired Cassandra and HBase, mainly for storage. BigTable is even offered to users in App Engine. It runs on clusters of thousands of machines and has no performance issues with large amounts of data.
Others
Many sites that use Hadoop rely on HBase to store large amounts of data. It is a DBMS similar to Google's BigTable. It is better suited to mass storage with periodic updates, while Cassandra is better suited to transactions with continuous updates. Like Cassandra, it is non-relational: it is a NoSQL system.
Programming languages
For these large sites, the most used languages are Scala (designed, as the name implies, to scale) and PHP. Scala is compatible with Java and can use its APIs.
Google
Google also uses Python, but tends to replace it with Go, whose concurrency features make it well suited to websites. The JavaScript and Node pairing is also gaining ground.
Dailymotion and Facebook
Both sites were developed in PHP and have kept it despite its relative slowness. For Dailymotion, which uses the Symfony framework, this is understandable, since server-side processing time is negligible compared to video transfer time.
Facebook did not want to rewrite its entire system in another language. The rewrite would not be a problem in itself, but if errors appeared in the new code, which is inevitable, they would affect millions of users. The company therefore chose to speed up PHP instead, first by creating a compiler that produces binary code, then with a virtual machine.
Tumblr
Like many others, the site moved to Scala. Some of these sites started in PHP, others in Ruby, but when it comes to handling millions of requests per second, the Java virtual machine is more efficient.
In fact, Scala was weighed against JavaScript on Node. But at the time of the choice, little was known about how well Node would hold up for a site of this size, and its libraries were still recent and not very stable, whereas Scala has Finagle and other tools developed for large sites. JavaScript is an easier choice for a project that is just starting, where the API and the site are developed at the same time.
Other tools
Finagle is an RPC (Remote Procedure Call) tool: it gives the client a way to send requests to the server and is designed for a large number of users. Written in Scala, it runs on the JVM with any communication protocol.
It was created by Twitter and is also used by Tumblr, among others.
Kafka is an open source distributed messaging system created by LinkedIn. Tumblr uses it to store messages.
Thrift is a framework for connecting services written in different programming languages, and an Apache project. It is used by Facebook, Tumblr and probably many others.
Scribe is a log management system created by Facebook. Tumblr used it but quickly dropped it because it could not handle the load.
Facebook example
Facebook runs on a suite of software that can serve billions of pages a day while remaining responsive. With the exception of Haystack and BigPipe, which were developed internally, all of these programs are open source and can be used on any server.
Page language: The system, and applications in particular, used a language derived from HTML called FBML, which is now obsolete and has been replaced by HTML + JavaScript.
Databases: Facebook uses MySQL for day-to-day operations. It also created Cassandra, since donated to Apache, to replace MySQL for data of varying sizes, as it is more efficient on very large networks. But Facebook now seems to make little or no use of Cassandra.
This is complemented by Memcached, a cache layer between the applications and the database that avoids repeating frequent queries.
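The role of such a cache can be sketched with a dictionary placed in front of a slow lookup. This is only an illustration of the pattern, not Memcached or its client API; the `CachedReader` class and the `load_from_db` callable are hypothetical.

```python
class CachedReader:
    """Serve repeated reads from memory and query the database only on a miss."""

    def __init__(self, load_from_db):
        self._load = load_from_db  # callable that performs the real (slow) query
        self._cache = {}
        self.db_hits = 0           # counts how often the database was queried

    def get(self, key):
        if key in self._cache:
            return self._cache[key]  # cache hit: no database access
        self.db_hits += 1
        value = self._load(key)      # cache miss: fall back to the database
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after the underlying row changes."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    reader = CachedReader(lambda user_id: {"id": user_id, "name": "user%d" % user_id})
    reader.get(7)
    reader.get(7)
    print("database queries:", reader.db_hits)
```

The second read of the same key never reaches the database, which is where the speed-up comes from. The price is that cached entries must be invalidated when the data changes.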
Apollo is a NoSQL system similar to HBase, designed for low latency.
GraphQL is the query language used by mobile applications to fetch data. Its distinctive feature is that the request, written in a JSON-like syntax, has the same shape as the response: the request contains field names, and the response adds the data held in those fields.
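The way the response mirrors the request can be shown with a small field selector. This is not a GraphQL implementation and does not use any GraphQL library: the `select` helper, the `USER` record and the dictionary-based query format are assumptions made for illustration.

```python
# Hypothetical record as it might exist on the server, with more fields than the client needs.
USER = {
    "id": 1,
    "name": "Alice",
    "email": "alice@example.com",
    "friends": [
        {"id": 2, "name": "Bob", "email": "bob@example.com"},
        {"id": 3, "name": "Carol", "email": "carol@example.com"},
    ],
}


def select(data, fields):
    """Return only the requested fields.

    `fields` maps a field name to None (a plain value) or to a nested
    selection (for an object or a list of objects).
    """
    result = {}
    for name, sub in fields.items():
        value = data[name]
        if sub is None:
            result[name] = value
        elif isinstance(value, list):
            result[name] = [select(item, sub) for item in value]
        else:
            result[name] = select(value, sub)
    return result


if __name__ == "__main__":
    # Roughly equivalent to the query: { name friends { name } }
    query = {"name": None, "friends": {"name": None}}
    print(select(USER, query))
```

The result has exactly the keys named in the query, nested in the same way, so the client can read the shape of the response directly from the request it wrote.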
Programming languages: Since February 2010, PHP has been compiled with the HipHop compiler, built for Facebook but released as open source. The software thus ran as binary code, but the firm later turned to a virtual machine and even developed its own version of PHP, Hack, which has statically typed variables.
But Facebook uses many languages alongside PHP: Java, C++, Haskell, OCaml and even D. To let programs written in different languages interact, Thrift generates the appropriate code for web services.
Storage: In addition to MySQL for data, Haystack is a high-performance photo storage and access system. It manages 80 billion user-stored photos (as of June 2010).
Page Server: BigPipe controls page loading (wall, stream, chat, etc.) in parallel. It was made by Facebook.
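The idea of loading parts of a page in parallel can be sketched with a thread pool. This is not BigPipe itself, only an illustration of the principle; the `render_page` function and the pagelet names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def render_page(pagelets):
    """Render each pagelet concurrently and yield (name, html) as each one finishes.

    `pagelets` maps a pagelet name to a zero-argument function returning HTML.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(pagelets))) as pool:
        futures = {pool.submit(render): name for name, render in pagelets.items()}
        for future in as_completed(futures):
            # A finished pagelet is handed over immediately, without waiting for the slower ones.
            yield futures[future], future.result()


if __name__ == "__main__":
    page = {
        "wall": lambda: "<div>wall</div>",
        "stream": lambda: "<div>stream</div>",
        "chat": lambda: "<div>chat</div>",
    }
    for name, html in render_page(page):
        print(name, html)
```

Each part of the page is produced independently, so a slow block such as the chat does not hold back the rest. The order in which pagelets arrive is not fixed.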
Log management: Scribe collects the logs of user accesses to the site.
Data analysis: Hadoop is another Apache project, for computations on large volumes of data. Hive is an addition to Hadoop that lets those computations be run through SQL-like queries.
Content transfer: Varnish is an HTTP accelerator that acts as a cache.
Open Graph replaced Facebook Connect, a way to use Facebook services on a site, to which competitors, Google among them, responded with XAuth.
Open Graph aims to make websites the nodes of a single social network with Facebook at its center. Member profiles and their relationships with others are made available to websites and used by them.
Open Graph is undoubtedly interesting for corporate accounts (this is not its goal), but it raises concerns about privacy.
There were many negative reactions, and some Web personalities closed their accounts following the Open Graph announcement.
- Open Graph Protocol. Use Facebook on your site.
- OAuth. Open Graph is built on OAuth, which secures access between sites.
Conclusion
This article describes the software. Going further and describing how the pieces interact would be more difficult and would in fact depend on the site's activity. You should also know that using all these tools is not a matter of unpacking them from the box. Each one must first be tested on a specific, limited service before being integrated into the system and put in front of many users. Even with MySQL, there is a gap between simple use behind a CMS and large-scale use with all the optimization tools.
Do not let the fear of a future migration stop you from using the simplest tools when starting a project. They are necessary for a successful start. You can only learn to use tools built for heavy loads by testing them on a large audience.
A site is never finished.
See also...
Web site databases.