Sankar edited Introduction.tex over 9 years ago

Commit id: 3537702fcf8c2968b348969d1e5df3a921d55cb5

\subsection{Introduction}
Several factors need to be considered when designing a distributed database. Let us analyze some of them.

\subsection{Load on the Database}
The choice of data structures and the design of the individual database components depend heavily on the load the database must handle. In addition to the raw \textbf{Input/Output Operations Per Second (IOPS)} estimate, we also need to know the mix of I/O request types (\textbf{read or write}). A generically designed distributed database may prove inefficient for many use cases that would perform better with a design tailored to the application's requirements. For example, if the database will spend more than 99\% of its time on writes (say, a logging application), then a Log-Structured Merge Tree may be effective; conversely, if 99\% of the time will be spent on reads, then memory-mapped files may prove more efficient.

Understanding the application's needs is therefore very important when designing a distributed database. Even when choosing an existing database, it is useful to know the nature of the workload that the application(s) on top will generate. Facebook, for example, initially started the Cassandra distributed database project to perform well under parallel writes, and later switched to HBase as it began running more data-mining-style queries once it had accumulated a large dataset.
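
The workload reasoning above can be illustrated with a small Python sketch. It is only a rough illustration, not a sizing tool: the function names \texttt{write\_fraction} and \texttt{suggest\_structure}, the trace format (a list of the strings \texttt{"read"} and \texttt{"write"}), and the use of operation counts as a stand-in for time spent are all assumptions made for this example. The 99\% threshold simply mirrors the figure used in the text.

```python
from collections import Counter


def write_fraction(ops):
    """Return the fraction of operations in the trace that are writes.

    `ops` is a hypothetical trace: an iterable of "read" / "write" strings.
    An empty trace yields 0.0.
    """
    counts = Counter(ops)
    total = counts["read"] + counts["write"]
    return counts["write"] / total if total else 0.0


def suggest_structure(ops, threshold=0.99):
    """Map a read/write mix to a candidate storage design.

    Assumption: operation counts approximate the share of time spent on
    reads versus writes. The default threshold mirrors the 99% in the text.
    """
    frac = write_fraction(ops)
    if frac >= threshold:
        return "log-structured merge tree"   # write-dominated, e.g. logging
    if frac <= 1 - threshold:
        return "memory-mapped files"         # read-dominated
    return "general-purpose design"          # mixed workload


if __name__ == "__main__":
    logging_trace = ["write"] * 999 + ["read"]
    print(suggest_structure(logging_trace))
```

In practice the trace would come from measurements of the real application, and the decision would also weigh the raw IOPS estimate, not only the read/write ratio.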