Suppose your company, your team or just you have decided to try out MongoDB in your next project. Hopefully you’ve done some analysis on the matter and Mongo is indeed the best fit for this new project (if you are still uncertain, you might want to look into the following blog post). Almost certainly you and your team are going to read quite a lot about how MongoDB works and how you can, should or should not work with it (by the way, if you are new to MongoDB, I would recommend signing up for the free classes from 10gen, which can give you a good start). Unfortunately one’s mind is not always capable of memorizing everything on the fly, and you can simply miss important details when reading the docs for the first time. This is why in this blog post I’m going to cover some peculiar features of MongoDB which you should probably know before designing your application, or at least before taking it into production.
Full Field Names VS Shortened Aliases
I was quite surprised when I first read this, but it’s true: in MongoDB it’s not just your data that takes up space, but also the field names of every document you store. This happens because of MongoDB’s schema-less nature and the fact that it stores documents in BSON, so each document carries its own copy of its field names. It may turn out to be not as bad as you imagine, but nevertheless this is the reason why many projects have chosen to use shortened aliases instead of full field names.
In my opinion the decision to use shortened aliases should be based primarily on your application schema. If you see that your collections mostly contain numbers (e.g. measurements, metrics, analytics), then having a 20-character name for a field with 4 bytes of data stored in it is definitely going to be a big overhead. On the other hand, if you plan to store a lot of text/binary data in your collections, then it probably won’t matter if the field names occupy 3% of your database size.
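To put a number on that overhead, here is a rough back-of-the-envelope sketch. It relies only on the BSON layout of a single element (one type byte, the field name as a NUL-terminated string, then the value) and ignores the per-document header and any storage padding. The field names are made up for illustration.

```python
def int32_element_size(field_name: str) -> int:
    """Bytes that one int32 field occupies inside a BSON document."""
    type_byte = 1                                      # element type marker
    name_bytes = len(field_name.encode("utf-8")) + 1   # name + NUL terminator
    value_bytes = 4                                    # the int32 itself
    return type_byte + name_bytes + value_bytes


def name_overhead_ratio(field_name: str) -> float:
    """Share of the element that is NOT the 4 bytes of actual data."""
    size = int32_element_size(field_name)
    return (size - 4) / size


long_name = "sensor_temperature_c"  # hypothetical 20-character field name
short_name = "t"                    # hypothetical shortened alias

print(int32_element_size(long_name))   # 26 bytes per document for this field
print(int32_element_size(short_name))  # 7 bytes per document for this field
print(round(name_overhead_ratio(long_name), 2))
print(round(name_overhead_ratio(short_name), 2))
```

Multiply the difference by the number of fields and documents you expect, and you get a first estimate of whether aliases are worth the loss of readability.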
Fault Tolerance Is In Your Hands
You should always strive for the highest fault tolerance possible. That’s one of the obvious things you learn very early in your career as a software engineer. Everyone wants their solution to be reliable, stable and so on. With MongoDB your code should be wary of certain exceptional situations, such as network exceptions, master re-elections and possible data loss (indeed possible if you write in fire-and-forget mode, which is very fast but, as you can see, far from perfect). In some extremely rare edge cases the Mongo driver can even report a failure for a write operation which actually succeeded (this is because your write and Mongo’s getLastError are two separate commands sent over the network).
Anyway, what I’m trying to say here is that you should really implement some kind of retry logic where possible (e.g. for exceptions which are transient by nature) and consciously choose a trade-off between performance and safety. By the way, this can be a good place to practice AOP or to reuse some existing libraries (I know Spring Batch and Spring Integration both have retry mechanisms).
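A minimal sketch of such retry logic could look like the following. It is not tied to any driver: TransientError is a hypothetical stand-in for whatever exceptions your driver raises for network hiccups or re-elections, and the attempt count and delays are arbitrary assumptions.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver error that is safe to retry."""


def with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError retry with exponential backoff.

    Other exceptions propagate immediately, and the last TransientError
    is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise
            # Wait 0.1s, 0.2s, 0.4s, ... (with the defaults) before retrying.
            sleep(base_delay * 2 ** (attempt - 1))


# Usage (save_order is a hypothetical function that performs the write):
# with_retries(lambda: save_order(order), attempts=5)
```

Keep in mind that a write reported as failed may actually have succeeded, so blindly retrying is only safe for operations that are idempotent (for example setting a field to a value, as opposed to incrementing it).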
Read/Write Contention
You are probably not going to notice it at first glance, but your write and read operations will actually contend with each other, even if they are meant to access different documents or even different collections. In a nutshell, Mongo has a single read-write lock per database, which means that only a single write can run on a whole database at a time, and all reads addressed to the master node will have to wait for it. (If you wish, you can read more on this in the online docs from 10gen.) This is one of the key reasons why you should consider reading from secondaries whenever possible.
Eventual Consistency Is Real
Be pessimistic and read from the primary by default. This is what I can suggest. Eventual consistency is real with Mongo, and the lag between the secondaries and your master can range from 1-2 seconds to several minutes (or even more, depending on your configuration and write traffic). This is why you should read from secondaries only when you are 100% sure that you are OK with stale data, or when you are sure that your write has reached them (this can be achieved via WriteConcern.ALL, which is quite slow and may fail if one of the nodes is down).
Should I tell you what can happen if you are too optimistic and read stale data when you needed the latest version? Well, you can get an NPE in your code while trying to read an object which you’ve just inserted, or an end-user of your app can find her new blog post or order missing right after she added it.
Being pessimistic does not sound too pragmatic, though. What if you want to get better performance or throughput by offloading some of your reads from the master? The answer is obvious: in this case your app should keep some state (in the HTTP session, a cache, a clustered data grid etc.). This state may consist of the newly created or updated data, so that you don’t need to look for it in the DB. Alternatively, you can keep a timestamp of the last data modification done by the user in her session and use it to choose between reading from the master and reading from the secondaries. The key point here is that you will need to handle this problem in your application code, and thus should take care of it during the design phase.
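The timestamp idea can be sketched in a few lines. The function name and the max_replication_lag parameter are my own assumptions: you would have to pick the lag bound from monitoring your replica set, and if the real lag exceeds it, a user can still see stale data.

```python
import time


def choose_read_target(last_write_at, max_replication_lag, now=None):
    """Return "primary" or "secondary" for a user's next read.

    last_write_at: timestamp (seconds) of the user's last modification,
                   kept e.g. in her session; None if she has not written.
    max_replication_lag: assumed upper bound (seconds) on replication lag.
    """
    if now is None:
        now = time.time()
    if last_write_at is None:
        return "secondary"        # nothing of hers can be missing
    if now - last_write_at < max_replication_lag:
        return "primary"          # her write may not have replicated yet
    return "secondary"


# A user who wrote 2 seconds ago, with an assumed lag bound of 30 seconds:
print(choose_read_target(last_write_at=98.0, max_replication_lag=30.0, now=100.0))
```

The returned label would then be mapped to whatever read preference setting your driver offers.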
Concurrent Access Is Not A Joke
The problem of concurrent access (e.g. data races, race conditions etc.) does not belong exclusively to MongoDB, but without ACID transactions (where you could use a transaction isolation level to lock some rows), SELECT FOR UPDATE statements and the other features available in RDBMSes, you are left with only a few options to overcome it.
First of all, you can ignore the fact that concurrent access or modification can occur. Applications in certain domains can simply tolerate the case where the last user to update a document is the one whose data ends up there. In addition to this option, you can use operations like $set, $push and $inc, which allow you to modify only the part of the document you are interested in. Less contention means fewer chances of running into concurrency problems.
The second option is to use the findAndModify command paired with optimistic locking (an approach where each document has a version field which is used to ensure that the data has not changed between the read and the write). Taking into account that operations on a single document are atomic in MongoDB, this approach can really help you tackle concurrency inside one document.
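To illustrate the version-field idea without a running database, here is a small in-memory sketch. InMemoryCollection and update_if_version are hypothetical stand-ins, not driver API; with a real driver the same check would go into the query part of an update or findAndModify (match on _id and the version you read, then $set the changes and $inc the version).

```python
class InMemoryCollection:
    """Toy stand-in for a collection; single-threaded, illustration only."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        self._docs[doc["_id"]] = dict(doc)

    def find_one(self, _id):
        return dict(self._docs[_id])  # return a copy, like a fresh read

    def update_if_version(self, _id, expected_version, changes):
        """Apply changes only if nobody has bumped the version since our read."""
        doc = self._docs[_id]
        if doc["version"] != expected_version:
            return False              # somebody else won; re-read and retry
        doc.update(changes)
        doc["version"] = expected_version + 1
        return True


orders = InMemoryCollection()
orders.insert({"_id": 1, "status": "new", "version": 0})

# Two clients read the same version of the document...
a = orders.find_one(1)
b = orders.find_one(1)

# ...the first write succeeds, the second is rejected instead of silently
# overwriting the first one.
print(orders.update_if_version(1, a["version"], {"status": "paid"}))       # True
print(orders.update_if_version(1, b["version"], {"status": "cancelled"}))  # False
```

The client whose update was rejected has to re-read the document and decide whether its change still makes sense.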
The last option, which by the way also allows you to deal with concurrent modifications of multiple documents, is application-level locking. This really is the last resort and should be used with great caution, because a poor implementation can give you incredibly bad performance and the possibility of deadlocks.
Be Prepared For Sharding
When you choose MongoDB as your primary storage, you should immediately go through your future collections and determine the ones which are either very write-heavy or can grow indefinitely (e.g. historical data). Those are very good candidates for sharding, which means that you should a) determine your shard key, b) ideally, use this key in all queries to this collection, and c) be ready for uniqueness constraints on sharded collections being harder to maintain (if you want to have a unique field other than the shard key, you will probably have to use a separate collection to do this. See this link for more details.)
Aggregation Queries Are Still Not A Piece Of Cake
Even though MongoDB provides a nice feature called the Aggregation Framework, which is meant to reduce the need for Map-Reduce, you are still going to face quite a lot of challenges with data aggregation for reports or various analytics. First of all, the Aggregation Framework is not as easy to work with as SQL. Second, it may still require you to pre-collect some data into additional collections (because there are limits to what you can do in a query). And the last thing to remember is that a complicated aggregation query can run for a significant amount of time on your data set. For example, in my case an aggregation query which consists of 7 steps can run for about 1 minute on a collection with several million entries.
This is why I suggest testing the performance of your aggregation queries on a dummy data set before going into production, and considering adding more fields or collections with pre-calculated figures for your analytics.
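The pre-calculation idea can be sketched like this, again in memory and with made-up names. The totals dictionary plays the role of an additional summary collection that is updated at write time (with a real database this would typically be an $inc on a per-day summary document), so that a report reads one small record instead of scanning all events.

```python
from collections import defaultdict


class DailyStats:
    """Toy model: raw events plus a pre-aggregated summary per day."""

    def __init__(self):
        self.events = []  # stands in for the raw events collection
        self.totals = defaultdict(lambda: {"count": 0, "sum": 0})

    def record(self, day, amount):
        # Store the raw event and update the summary in the same step.
        self.events.append({"day": day, "amount": amount})
        summary = self.totals[day]
        summary["count"] += 1
        summary["sum"] += amount

    def average(self, day):
        # Answered from the summary alone, without touching self.events.
        summary = self.totals[day]
        return summary["sum"] / summary["count"] if summary["count"] else None


stats = DailyStats()
stats.record("2013-05-01", 10)
stats.record("2013-05-01", 30)
stats.record("2013-05-02", 5)
print(stats.average("2013-05-01"))  # 20.0
```

The price is an extra write per event and the need to decide in advance which figures your reports will ask for.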
Looking For More Details
If you’ve realized that your current architecture for MongoDB is not ideal, or you are looking for more information related to Mongo limitations, I would suggest reading the 10gen docs and probably this blog post too. And if you know more good sources, don’t hesitate to leave them in the comments.
Summary
Hopefully in this post I managed to cover most of the design gotchas you might encounter with MongoDB. The bottom line though is that any NoSQL solution has its pros and cons, and you should either be ready to overcome them in your design or look for a solution which fits your particular use-case better.