KLEPPMANN - DESIGNING DATA-INTENSIVE APPLICATIONS

chapter 1

Alan Kay on why programming is eternally young (in a bad way)

“Computing is pop culture…pop culture holds a disdain for history”

“I think the same is true of most people who write code for money. They have no idea where their culture came from”

the end of Moore’s Law = more distributed systems

“CPU clock speeds are barely increasing, but multi-core processors are standard, and networks are getting faster. This means parallelism is only going to increase.” [6]

‘big data’ = term for people who don’t know what they’re talking about (management, the newspaper)

the term Big Data is so over-used and under-defined that it is not useful in a serious engineering discussion. [9]

the sane reply to management’s utterance of ‘big data’

“you’re not Google or Amazon, stop worrying about scale and just use a relational database”. There is truth in that statement: building for scale that you don’t need is wasted effort, and may lock you into an inflexible design. In effect, it is a form of premature optimization. [8]

handy definitions for different data stores

database: store data so that the application, or another application, can find it again later [11]

cache: remember the result of an expensive operation, to speed up reads

index: allow users to search data by keyword or filter it in various ways

message queue: send a message to another process, to be handled asynchronously

stream processing: observe what is happening, and act on events as they occur

batch processing: periodically crunch a large amount of accumulated data
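the cache definition is the easiest to see in running code. a minimal sketch using Python's `functools.lru_cache` (the `expensive` function is a made-up stand-in for a slow query or computation):

```python
import functools
import time

@functools.lru_cache(maxsize=None)
def expensive(n):
    # stand-in for an expensive operation, e.g. a slow database query
    time.sleep(0.01)
    return n * n

expensive(12)  # first call: computes the result and remembers it
expensive(12)  # second call: answered from the cache, no recomputation
```

same idea as Redis or memcached in front of a database, just in-process.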

db? mq? both?

For example, there are data stores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Kafka), so the boundaries between the categories are becoming blurred. [12]
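a toy illustration of that blur, assuming nothing about how Kafka actually works internally (it uses an append-only log, not a SQL table): a "message queue" whose messages are rows in a database, and therefore get database-like durability for free.

```python
import sqlite3

# toy durable queue: each message is a row, so messages survive a
# process restart (use a file path instead of :memory: for that)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, body TEXT)")

def send(body):
    conn.execute("INSERT INTO queue (body) VALUES (?)", (body,))
    conn.commit()

def receive():
    # take the oldest message and delete it, queue-style
    row = conn.execute(
        "SELECT id, body FROM queue ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("DELETE FROM queue WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]

send("hello")
send("world")
```

is this a database or a message queue? exactly the point of the quote.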

what ‘scale’ really means

[scalability] is not a one-dimensional label that we can attach to a system: it is meaningless to say “X is scalable” or “Y doesn’t scale”. [18]

perhaps it’s requests per second, ratio of reads to writes, the number of simultaneously active users, or something else
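those load parameters are cheap to measure. a sketch over a hypothetical request log (the log contents below are invented for illustration):

```python
# hypothetical request log: (timestamp in seconds, operation) pairs
log = [(0.0, "read"), (0.2, "read"), (0.5, "write"), (0.9, "read"),
       (1.1, "read"), (1.4, "write"), (1.8, "read"), (1.9, "read")]

duration = log[-1][0] - log[0][0]          # seconds covered by the log
rps = len(log) / duration                  # requests per second
reads = sum(1 for _, op in log if op == "read")
writes = len(log) - reads
read_write_ratio = reads / writes          # here: 6 reads to 2 writes
```

pick whichever of these numbers actually dominates your system before calling it "scalable".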

50 ways to add latency

random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk

throughput

the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size [20]

For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for three requests per minute, each 2 GB in size—even though the two systems have the same data throughput. [24]
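the arithmetic behind that comparison checks out: both systems move the same number of bytes per second (using decimal units, 1 kB = 1,000 B and 1 GB = 10^9 B).

```python
# system A: 100,000 requests/second, each 1 kB
system_a = 100_000 * 1_000            # bytes per second

# system B: 3 requests/minute, each 2 GB
system_b = 3 * 2_000_000_000 / 60     # bytes per second

# both come out to 100 MB/s -- same data throughput,
# wildly different request rates and request sizes
```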

design for maintenance first

It is well-known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance [24]

chapter 2

the data model matters in a profound way

Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also how we think about the problem that we are solving. [31]

relational and SQL: invented in 1970, took off in the 1980s, atypically long-lasting for tech

by the mid-1980s, relational database management systems (RDBMS) and SQL had become the tool of choice for most people who needed to store and query data with some kind of regular structure. The dominance of relational databases has lasted around 25–30 years—an eternity in computing history. [32]

competitors to the relational model

CODASYL (the network model) and IMS (the hierarchical model) were the main alternatives…object databases came and went…XML databases appeared in the early 2000s, but have only seen niche adoption [32]

object-relational mismatch

if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows and columns. The disconnect between the models is sometimes called an impedance mismatch [33]
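the mismatch in miniature, with an invented résumé-style example (the `User`/`positions` names are mine, not from the book): one object in code, two tables in the database, translation code in between.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    positions: list = field(default_factory=list)  # nested list in the object graph

user = User("Alice", positions=["Engineer at X", "CTO at Y"])

# relational storage has no place for the nested list: it must be
# flattened into a second table keyed back to the user -- this hand
# mapping (or the ORM doing it for you) is the impedance mismatch
users_rows = [(1, user.name)]
positions_rows = [(1, p) for p in user.positions]  # (user_id, title)
```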

you’ll always have to do joins somewhere

If the database itself does not support joins, you have to emulate a join in application code by making multiple queries to the database. [37]
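what that emulation looks like in practice, sketched with sqlite (schema and data invented for illustration): two queries plus a dict lookup instead of one `JOIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (user_id INTEGER, text TEXT);
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO comments VALUES (1, 'hi'), (2, 'hello'), (1, 'bye');
""")

# query 1: fetch the comments
comments = conn.execute("SELECT user_id, text FROM comments").fetchall()

# query 2: fetch the referenced users, then join in application code
# with a dict lookup -- what a store without joins forces on you
names = dict(conn.execute("SELECT id, name FROM users").fetchall())
joined = [(names[uid], text) for uid, text in comments]
```

one network round-trip per query instead of one total, and the join logic now lives in (and must be maintained in) every application that reads this data.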

diff btw hierarchical model and network model

In the tree structure of the hierarchical model, every record has exactly one parent; in the network model, a record can have multiple parents. [40]
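the structural difference is just tree vs. graph. a sketch with made-up record names:

```python
# hierarchical model (IMS): every record has exactly one parent -> a tree
tree_parent = {
    "invoice_line": "invoice",
    "invoice": "customer",
}

# network model (CODASYL): a record may be linked from several parents -> a graph
network_parents = {
    "person": ["employer", "hometown"],  # two parents for the same record
}
```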

query optimizer = what’s actually running your SQL

the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use.

Query optimizers for relational databases are complicated beasts, and they have consumed many years of research and development effort. [41]
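you can watch an optimizer make that decision. with sqlite, `EXPLAIN QUERY PLAN` shows the chosen strategy; the query only says *what* to fetch, and the optimizer picks the index (table and column names below are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("CREATE INDEX idx_v ON t (v)")

# the declarative query never mentions idx_v; the optimizer
# decides on its own to search via the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE v = 'x'").fetchall()
```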

one-to-many = document model (NoSQL), many-to-many = relational (SQL)

If the data in your application has a document-like structure (i.e. a tree of one-to-many relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to use a document model.

If your application does use many-to-many relationships, the document model becomes less appealing. [42]
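both cases side by side, using an invented résumé example in the spirit of the chapter: the one-to-many tree nests naturally into a document, while a many-to-many relationship degenerates into hand-maintained references.

```python
# document model: a tree of one-to-many relationships,
# typically loaded as a single unit
resume = {
    "name": "Alice",
    "positions": [                      # one-to-many nests cleanly
        {"title": "Engineer", "org": "X"},
        {"title": "CTO", "org": "Y"},
    ],
}

# many-to-many (several users referencing the same organization record)
# has no natural home in one tree; the document model ends up storing
# references and joining by hand -- a join table in all but name
orgs = {"org1": {"name": "X"}}
user_orgs = [("alice", "org1"), ("bob", "org1")]
```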