The choice of data structures and the design of individual database system components depend heavily on the load the database must serve. In addition to a raw \textbf{Input/Output Operations Per Second (IOPS)} estimate, we also need to know the ratio between the types of I/O requests (\textbf{read or write}). A generically designed distributed database may prove inefficient for many use cases that would perform better if the system were designed around the application's requirements. For example, if the database will spend more than 99\% of its time on writes (say, a logging application), then a Log-Structured Merge Tree \cite{O_Neil_1996} may be effective; conversely, if 99\% of the time will be spent on reads, then memory maps may prove more efficient. Understanding the application's needs is therefore very important when designing a distributed database. Even when choosing an existing database, knowledge of the workload that the application(s) on top will impose is useful. Facebook started the Cassandra \cite{Lakshman_2009} distributed database project initially to perform well under parallel writes, and later switched to HBase as they started running more data-mining style queries on the huge datasets they had accumulated.
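
To make the write-heavy trade-off concrete, the following is a minimal, illustrative sketch in Python (the class name \texttt{TinyLSM} and its buffer size are inventions for this example, not any real system's API) of a log-structured write path: writes land in an in-memory table and are flushed as immutable sorted runs, so the storage layer sees only sequential appends, while reads may have to consult several runs.

\begin{verbatim}
# Toy sketch, not a production design: a minimal log-structured write path.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # in-memory buffer of recent writes
        self.sstables = []            # flushed, immutable sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory until the buffer fills up.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort the buffer once and append the whole run sequentially.
        run = dict(sorted(self.memtable.items()))
        self.sstables.append(run)
        self.memtable = {}

    def get(self, key):
        # Reads may scan the memtable and every run (newest first):
        # this read amplification is the price paid for cheap writes.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):
            if key in run:
                return run[key]
        return None

db = TinyLSM()
for i in range(10):
    db.put("k%d" % i, i)
print(db.get("k3"))   # -> 3
\end{verbatim}

A read-heavy workload would invert this choice: keeping the data in a read-optimized, memory-mapped layout avoids scanning multiple runs per lookup, at the cost of more expensive updates.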