
Sankar edited Load on the Database.tex over 9 years ago

Commit id: 88bf0eda5c7adc2177f66631032891530c6f1f58


The choice of data structures and the design of the individual components of a database system depend heavily on the load on the database. Estimates of the \textbf{Queries Per Second (QPS)} during average and peak load, and of how long the peak loads last, all come in handy. More than the raw number of calls that the database receives, the ratio between the types of database calls plays a big part in choosing a design. We can roughly classify database calls into the following four buckets:
\begin{itemize}
\item INSERT: To create new records (write)
\item UPDATE: To modify an existing record (write)
\item DELETE: To remove a record (delete)
\item SELECT: To fetch a record, optionally based on a condition (read)
\end{itemize}
If the database will spend more than 99\% of its time on writes (for example, as the backend for a logging application), then a Log Structured Merge Tree \cite{O_Neil_1996} will be effective. Conversely, if 99\% of the time will be spent on reads over a small set of hot keys, then using memory maps to load the pages from disk will be more efficient. If the reads are distributed over a large set of keys with no noticeable hot keys, then memory maps will not give big benefits, as the working set will be larger. If there is a combination of SELECT and UPDATE calls on the same set of keys, memory maps will be slower than regular I/O, because memory maps are not efficient for a mixture of reads and writes.

Application developers and database designers should each clearly understand the other side: how applications access the database, how the database makes its design choices, and the strengths and weaknesses of each. Facebook started the Cassandra\cite{Lakshman_2009} distributed database project initially to perform well during parallel writes and later switched to HBase as they started running more data mining queries on the huge datasets that they had accumulated.
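The workload classification discussed in this section can be sketched in a few lines of Python. This is only an illustration, not a sizing tool. The function name \texttt{suggest\_design}, the returned labels, and the hot-key parameters (\texttt{hot\_key\_percent}, \texttt{hot\_share}) are hypothetical. The sketch also uses call counts as a stand-in for the time spent on each call type, and it counts DELETE together with the write calls. Only the 99\% threshold comes from the text above.

```python
from collections import Counter

# Assumption: DELETE is counted with the write-like calls.
WRITE_OPS = {"INSERT", "UPDATE", "DELETE"}
READ_OPS = {"SELECT"}


def suggest_design(calls, skew=0.99, hot_key_percent=10, hot_share=0.8):
    """Suggest a storage direction from a sample of (operation, key) calls.

    skew            -- fraction of calls that must be reads (or writes) for the
                       workload to count as read-heavy (or write-heavy)
    hot_key_percent -- hypothetical: percentage of distinct keys treated as "hot"
    hot_share       -- hypothetical: share of reads the hot keys must receive
    """
    calls = list(calls)
    if not calls:
        raise ValueError("need at least one call to classify the workload")

    total = len(calls)
    writes = sum(1 for op, _ in calls if op in WRITE_OPS)
    reads = sum(1 for op, _ in calls if op in READ_OPS)

    # Write-dominated workload, e.g. a logging backend.
    if writes / total >= skew:
        return "lsm_tree"

    if reads / total >= skew:
        # Check whether a small set of keys receives most of the reads.
        key_counts = Counter(key for op, key in calls if op in READ_OPS)
        n_hot = max(1, len(key_counts) * hot_key_percent // 100)
        hot_reads = sum(count for _, count in key_counts.most_common(n_hot))
        if hot_reads / reads >= hot_share:
            return "memory_map"
        # Reads spread over many keys: the working set is large.
        return "regular_io_large_working_set"

    # Mixed reads and writes.
    return "regular_io_mixed"


if __name__ == "__main__":
    logging_load = [("INSERT", i) for i in range(1000)]
    print(suggest_design(logging_load))
```

In practice the sample of calls would come from query logs or load estimates, and the thresholds would have to be tuned against measurements on the target hardware.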