Seven features of MongoDB which can impact your design


Suppose your company, your team or just you have decided to try out MongoDB in your next project. Hopefully you’ve done some analysis on the matter and Mongo is indeed the best fit for this new project. (If you are still uncertain, you might want to look into the following blog post.) Almost certainly you and your team are going to read quite a lot about how MongoDB works and how you can, should or should not work with it. (By the way, if you are new to MongoDB, I would recommend signing up for the free classes from 10gen, which can give you a good start.) Unfortunately, nobody memorizes everything on the first pass, and it is easy to miss important details while reading the docs. This is why in this blog post I’m going to cover some peculiar features of MongoDB which you should know before designing your application, or at least before going into production with it.

Full Field Names VS Shortened Aliases

I was quite surprised when I first read this, but it’s true: in MongoDB it’s not just your data that takes space, but also the field names of every document you store. This happens because of MongoDB’s schema-less nature and the fact that it stores documents in BSON, where each document carries its own field names. It may not be as bad as you imagine, but it is nevertheless the reason why many projects have chosen to use shortened aliases instead of full field names.

In my opinion the decision to use shortened aliases should be based primarily on your application schema. If your collections mostly contain numbers (e.g. measurements, metrics, analytics), then a 20-character name for a field holding 4 bytes of data is definitely going to be a big overhead. On the other hand, if you plan to store a lot of text or binary data in your collections, then it probably won’t matter if the field names occupy 3% of your database size.
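To get a feeling for the numbers, here is a back-of-the-envelope sketch in Java. It assumes the element layout from the BSON specification (one type byte, the field name as a zero-terminated string, then the value) and ignores the _id field and any storage padding, so treat the results as rough estimates. The class name, method names and field names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Rough size estimate for a BSON document that holds only int32 fields.
final class FieldNameOverhead {

    // Document = int32 total length (4 bytes) + elements + 0x00 terminator (1 byte).
    // int32 element = type byte (1) + field name (UTF-8 bytes + 0x00) + value (4 bytes).
    static int int32DocumentSize(String... fieldNames) {
        int size = 4 + 1;
        for (String name : fieldNames) {
            size += 1 + name.getBytes(StandardCharsets.UTF_8).length + 1 + 4;
        }
        return size;
    }

    // Share of the document taken by field names (including their terminators).
    static double nameShare(String... fieldNames) {
        int nameBytes = 0;
        for (String name : fieldNames) {
            nameBytes += name.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return (double) nameBytes / int32DocumentSize(fieldNames);
    }

    public static void main(String[] args) {
        // Hypothetical metric field with a 20-character name versus a 1-character alias.
        System.out.println(int32DocumentSize("averageResponseTimes")); // 31 bytes
        System.out.println(int32DocumentSize("t"));                    // 12 bytes
        System.out.println(nameShare("averageResponseTimes"));
    }
}
```

In this hypothetical case the long name accounts for roughly two thirds of the document, which is the kind of ratio where aliases start to pay off.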

Fault Tolerance Is In Your Hands

You should always strive for the highest fault tolerance possible. That’s one of the obvious things you learn very early in your career as a software engineer: everyone wants their solution to be reliable and stable. With MongoDB your code should be wary of certain exceptional situations, such as network exceptions, master re-elections and possible data loss (which can indeed happen if you write in fire-and-forget mode: it is very fast, but as you can see far from perfect). In some extremely rare edge cases the Mongo driver can even report a failure for a write operation which actually succeeded (this is because your write and Mongo’s getLastError are two separate commands sent over the network).

What I’m trying to say here is that you should implement some kind of retry logic where possible (e.g. for exceptions which are transient by nature) and consciously choose a trade-off between performance and safety. By the way, this can be a good place to practice AOP or to reuse some existing libraries (I know Spring Batch and Spring Integration both have retry mechanisms).
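As an illustration of such retry logic, here is a minimal hand-rolled sketch in plain Java. It is not taken from any of the libraries mentioned above; the class name, the method name and the linear backoff are my own assumptions. Since a write reported as failed may actually have succeeded, the operation you pass in should be safe to repeat.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

final class RetrySketch {

    // Runs the operation up to maxAttempts times. Only exceptions accepted by
    // isTransient are retried; everything else is rethrown immediately.
    static <T> T withRetries(Supplier<T> operation, int maxAttempts, long backoffMillis,
                             Predicate<RuntimeException> isTransient) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !isTransient.test(e)) {
                    throw e; // out of attempts, or not worth retrying
                }
                pause(backoffMillis * attempt); // linear backoff between attempts
            }
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

A call could look like withRetries(() -> userDao.save(user), 3, 200, e -> isNetworkProblem(e)), where isNetworkProblem is a hypothetical helper that decides which exceptions of your driver are transient.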

Read/Write Contention

You are probably not going to notice it at first glance, but your read and write operations will contend with each other, even if they access different documents or even different collections. In a nutshell, Mongo has a single read-write lock per database, which means that only a single write can run on a whole database at a time, and all reads addressed to the master node have to wait for it. (If you wish, you can read more on this in the online docs from 10gen.) This is one of the key reasons why you should consider reading from secondaries whenever possible.

Eventual Consistency Is Real

Be pessimistic and read from the primary by default. This is what I can suggest. Eventual consistency is real with Mongo, and the lag between the secondaries and your master can range from 1-2 seconds to several minutes (or even more, depending on your configuration and write traffic). This is why you should read from secondaries only when you are 100% sure that you are OK with stale data, or when you are sure that your write has already reached them (this can be achieved via WriteConcern.ALL, which is quite slow and may fail if one of the nodes is down).

Should I tell you what can happen if you are too optimistic and read stale data when you needed the latest version? Well, you can get an NPE in your code while trying to read an object which you’ve just inserted, or an end-user of your app can find her new blog post or order missing right after she added it.

Being pessimistic does not sound too pragmatic, though. What if you want better performance or throughput by offloading some of your reads from the master? In this case your app should keep some state (in the HTTP session, a cache, a clustered data grid etc.). This state may consist of the newly created or updated data, so that you don’t need to look for it in the DB. Alternatively, you can keep the timestamp of the user’s last data modification in her session and use it to choose between reading from the master and reading from secondaries. The key point here is that you will need to handle this problem in your application code, and thus should take care of it during the design phase.
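The timestamp approach could be sketched roughly as follows. This is only an illustration: the in-memory map stands in for the HTTP session, the class and method names are invented, and maxExpectedLagMillis is a guess that you would have to derive from the replication lag you actually observe.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ReadRouter {

    enum Target { PRIMARY, SECONDARY }

    // Assumed upper bound for replication lag; real lag can exceed it.
    private final long maxExpectedLagMillis;

    // Stand-in for per-user session state: time of the user's last write.
    private final Map<String, Long> lastWriteByUser = new ConcurrentHashMap<>();

    ReadRouter(long maxExpectedLagMillis) {
        this.maxExpectedLagMillis = maxExpectedLagMillis;
    }

    void recordWrite(String userId, long nowMillis) {
        lastWriteByUser.put(userId, nowMillis);
    }

    // Users who wrote recently read from the primary; everyone else may read
    // from a secondary and tolerate slightly stale data.
    Target chooseTarget(String userId, long nowMillis) {
        Long lastWrite = lastWriteByUser.get(userId);
        if (lastWrite == null) {
            return Target.SECONDARY;
        }
        return nowMillis - lastWrite < maxExpectedLagMillis ? Target.PRIMARY : Target.SECONDARY;
    }
}
```

The returned target would then be translated into whatever read preference setting your driver or mapping framework offers.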

Concurrent Access Is Not A Joke

The problem of concurrent access (e.g. data races, race conditions etc.) does not belong exclusively to MongoDB, but without ACID transactions (where you could use a transaction isolation level to lock some rows), SELECT FOR UPDATE statements and the other features available in RDBMSes, you have only a few options to overcome it.

First of all, you can ignore the fact that concurrent modification can occur. Applications in certain domains can simply tolerate the case where the last user to update a document is the one whose data ends up stored. In addition, you can use operators like $set, $push and $inc, which allow you to modify only the part of a document you are interested in. Less contention means fewer chances of running into concurrency problems.

The second option is to use the findAndModify command paired with optimistic locking (an approach where each document has a version field which is used to ensure that the data has not changed between read and write). Taking into account that operations on a single document are atomic in MongoDB, this approach can really help you tackle concurrency inside one document.
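To make the version check concrete, here is a small in-memory simulation in Java. It does not talk to MongoDB: the ConcurrentHashMap stands in for a collection, and updateIfVersionMatches plays the role of an atomic single-document update whose query matches both the _id and the version that was read. All names, including the food field, are assumptions for the sake of the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongUnaryOperator;

final class OptimisticLockingSketch {

    // Immutable snapshot of a document: its payload plus the version it carries.
    static final class VersionedDoc {
        final long version;
        final long food;

        VersionedDoc(long version, long food) {
            this.version = version;
            this.food = food;
        }
    }

    private final Map<String, VersionedDoc> collection = new ConcurrentHashMap<>();

    void insert(String id, long food) {
        collection.put(id, new VersionedDoc(0, food));
    }

    VersionedDoc find(String id) {
        return collection.get(id);
    }

    // Atomically applies the update only if the stored version still equals the
    // version that was read, and bumps the version on success.
    boolean updateIfVersionMatches(String id, long expectedVersion, long newFood) {
        boolean[] applied = {false};
        collection.computeIfPresent(id, (key, current) -> {
            if (current.version != expectedVersion) {
                return current; // somebody else modified the document first
            }
            applied[0] = true;
            return new VersionedDoc(expectedVersion + 1, newFood);
        });
        return applied[0];
    }

    // Read-modify-write loop: re-read and try again whenever the check fails.
    long modify(String id, LongUnaryOperator change) {
        while (true) {
            VersionedDoc read = find(id);
            if (read == null) {
                throw new IllegalArgumentException("No document with id " + id);
            }
            long updated = change.applyAsLong(read.food);
            if (updateIfVersionMatches(id, read.version, updated)) {
                return updated;
            }
        }
    }
}
```

A writer holding a stale snapshot gets false back instead of silently overwriting newer data, and can then decide whether to re-read and retry or to report a conflict.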

The last option, which by the way also allows you to deal with concurrent modifications of multiple documents, is application-level locking. It should be used with great caution because, if poorly implemented, it can give you incredibly bad performance and the possibility of deadlocks.

Be Prepared For Sharding

When you choose MongoDB as your primary storage, you should immediately go through your future collections and determine the ones which are either very write-heavy or can grow indefinitely (e.g. historical data). Those are very good candidates for sharding, which means that you should a) determine your shard key, b) ideally, use this key in all queries to the collection, and c) be ready for uniqueness constraints on sharded collections being harder to maintain (if you want a unique field other than the shard key, you will probably have to use a separate collection for it. See this link for more details).
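The separate-collection idea from point c) can be illustrated with another in-memory sketch. Here one map stands in for a small unsharded collection whose _id is the unique value itself, and a second map stands in for the sharded collection. The names and the choice of login as the unique field are assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class UniqueLoginSketch {

    // Stand-in for a small unsharded collection whose _id is the login itself.
    private final ConcurrentMap<String, String> loginReservations = new ConcurrentHashMap<>();

    // Stand-in for the sharded users collection, keyed by its shard key (the user id).
    private final ConcurrentMap<String, String> usersById = new ConcurrentHashMap<>();

    // Returns false if the login has already been claimed.
    boolean register(String userId, String login) {
        // Step 1: claim the login. putIfAbsent plays the role of an insert
        // that fails with a duplicate key error on _id.
        if (loginReservations.putIfAbsent(login, userId) != null) {
            return false;
        }
        // Step 2: only now insert the user document itself.
        usersById.put(userId, login);
        return true;
    }

    int userCount() {
        return usersById.size();
    }
}
```

In a real system the two steps are two separate writes, so a failure between them leaves a reservation without a user, and the application has to clean that up itself.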

Aggregation Queries Are Still Not A Piece Of Cake

Even though MongoDB provides a nice feature called the Aggregation Framework, which is meant to reduce the need for Map-Reduce, you are still going to face quite a lot of challenges with data aggregation for reports or analytics. First of all, the Aggregation Framework is not as easy to work with as SQL. Second, it may still require you to pre-collect some data into additional collections (because there are limits to what you can do in a query). And last, remember that a complicated aggregation query can run for a significant amount of time on your data set. For example, in my case an aggregation query consisting of 7 steps can run for about 1 minute on a collection with several million entries.

This is why I suggest testing the performance of your aggregation queries on a dummy data set before going into production, and considering adding more fields or collections with pre-calculated figures for your analytics.
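Pre-calculated figures usually mean updating a counter at write time instead of scanning everything at report time. The following Java sketch shows the idea with an in-memory map standing in for a collection of daily totals. In the database itself the same effect would typically come from an upsert with $inc. The class, the method names and the harvest metric are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class DailyHarvestCounters {

    // Stand-in for a collection of pre-calculated figures, one entry per day.
    private final Map<String, Long> harvestedPerDay = new ConcurrentHashMap<>();

    // Called on every write; atomically adds the amount to the day's total.
    void recordHarvest(String isoDay, long amount) {
        harvestedPerDay.merge(isoDay, amount, Long::sum);
    }

    // The report reads a single pre-computed value instead of aggregating raw events.
    long totalFor(String isoDay) {
        return harvestedPerDay.getOrDefault(isoDay, 0L);
    }
}
```

The price is a slightly more expensive write path and figures that have to be decided on in advance, which is exactly why this belongs in the design phase.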

Looking for more details

If you’ve realized that your current architecture for MongoDB is not ideal, or you are looking for more information on Mongo’s limitations, I would suggest reading the 10gen docs and probably this blog post too. And if you know more good sources, don’t hesitate to leave them in the comments.

Summary

Hopefully in this post I managed to cover most of the design gotchas you might encounter with MongoDB. The bottom line though is that any NoSQL solution has its pros and cons, and you should either be ready to overcome them in your design or look for a solution which fits your particular use-case better.

Java ORMs for MongoDB

Some time ago I wrote a post covering the combination of Spring Data and MongoDB in action. That post demonstrated how convenient and dead simple data persistence can be with the combination of MongoDB and Spring Data. Thinking about this for a while, I realized that my post was somewhat unfair to the other ORMs compatible with MongoDB. After all, the concept of mapping domain objects to DB structures and vice versa does not belong to Spring Data exclusively and is widely known as the ORM paradigm. Furthermore, unlike with RDBMSes, where the usage of an ORM in your application is somewhat controversial and should be applied with caution, MongoDB already stores data as documents, so it’s not a crime to simplify the life of developers by making data persistence as easy and smooth as possible.

Long story short, I would like to introduce a list of ORMs which I know can work with MongoDB. I’m also going to cover some pros and cons of each and the state each project was in at the moment of writing. (The most viable options from my point of view appear at the top of the list.)

1) Spring Data for MongoDB is a project under the Spring umbrella which is meant to simplify the development of Java applications for Mongo. Its key features are seamless integration with the Spring Framework, a dynamic repository facility for rapid development, a complete ORM solution (unlike Spring JDBC Template), frequent releases and Spring community support.

2) Morphia is a very stable and feature-rich ORM designed explicitly for MongoDB. It has nice documentation and a Google group filled with answers to a huge number of questions. Unfortunately the version on Google Code seems to have been abandoned by the original author, which means that the newest Mongo features (like the Aggregation Framework and some new query keywords) are not going to appear in the ORM.

There is a fork of Morphia on GitHub, https://github.com/jmkgreen/morphia, which has some activity around it. So hopefully this ORM can survive and evolve.

3) The next option is Jongo, which is also a kind of ORM written in Java. Its most attractive features are the performance promised by its developers and the convenience of having queries in your Java code that are identical to those in the (JavaScript-based) Mongo Shell. The project is also hosted on GitHub and seems to have quite a vibrant community.

4) There are also a number of JPA 2.0 providers which support MongoDB: EclipseLink, Kundera and DataNucleus, to name a few. The major advantage in this case is that you can program to the familiar JPA interfaces and annotations, but on the other hand you may not be able to do things which require your code to be more “Mongo-specific” (e.g. specifying a WriteConcern). For more details you should look in the documentation of each product. For people interested in examples, here is a link with EclipseLink in action and a link about Kundera.

5) Hibernate OGM is the next framework in the list, inspired by the popular Java ORM Hibernate. At the time of writing the latest version is 4.0.0.Beta2 and the development process seems to be somewhat stalled (no activity for two months). A post showing it in action can be found at the following link. But until the project gets more stable I wouldn’t suggest using it seriously.

6) And there is also MJORM which lacks documentation(at least at Google Code) but can also be considered as an alternative. Its most interesting feature is a query language similar to SQL.

As you can see, there are a number of alternatives to choose from, but the best option always depends on your business requirements, available time and familiarity with certain technologies. Good luck!

MongoDB with Spring Data – Awesome combination for a social game

In my previous post “Considerations for choosing or not choosing MongoDB” I covered some pros and cons of choosing MongoDB as the primary storage for your project. Today I’m going to demonstrate how MongoDB and Spring Data can be applied to the development of a social game, greatly simplifying the development and providing really outstanding results.

Please note that this article in fact fits any solution whose data access pattern is similar to that of a social game. I’ll discuss this in more detail in the next section.

What is a social game?

A social game is basically a type of online game where players mostly interact with their own game areas (e.g. a player’s farm) and use a social network in order to share their achievements and ask friends for help. So the most common data access pattern in the game is retrieving or updating data by some kind of user identifier. Here are the key requirements which any typical social game has to meet:

  1. Optimal performance for the common data access operations (players don’t like to wait);
  2. Horizontal scaling is a must, because you will need to get quite a lot of users before your game starts making a profit;
  3. At some point you will definitely need to add some analytics and more sophisticated features (e.g. showing the top 5 players in the neighbourhood);
  4. Rapid prototype development and the ability to add new features quickly (i.e. minimizing time to market and applying the “fail fast” principle).

Domain Model

Any program written in an object-oriented language contains domain model classes which represent the domain of the problem the application is meant to solve. In our particular case we are going to review the domain model of our social game. Here is the class diagram built by IntelliJ IDEA:

[Image: social_game_classes_diagram]

As you can see, the root of our class hierarchy is UserModel, which in addition to its primitive fields contains a list of Achievements and a reference to a UserFarm object. An instance of UserFarm in turn contains a reference to its coordinates, a list of buildings and some other fields. The last detail worth noticing is that UserFarm contains a list of Building objects, which may be instances of Building as well as instances of CropsBuilding.

You can find the exact classes at the following link in a GitHub repo.

How it might look in the relational world.

Before I show you how MongoDB and Spring Data can be used for the persistence of our domain model, let’s dive into the world of relational databases and see what our ER diagram would look like. Here it goes:

[Image: social_game_er_diagram]

As you can see, there are 6 tables in this model. Whenever a new user registers you need to fill them with new rows, then select all the data on each visit of the user, and then update certain rows after game activities. Even having an ORM and caching in place won’t protect you from possible performance and scalability issues, and all the time you would spend fighting for performance and re-designing your ER schema could be spent on new features and user experience improvements.

Spring Data and MongoDB in Action

Setting up Spring Data for MongoDB in your project is quite simple and can be done as follows:

1) Add dependencies on Spring, Mongo Java Driver and Spring Data MongoDB into your project. (See the sample pom.xml)

2) Create your DAO interface and extend Spring Data’s CrudRepository interface. You won’t even need to implement your DAO, because Spring will generate an implementation at runtime.

3) Add a Spring configuration class looking like this:

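import com.mongodb.Mongo;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.MongoDbFactory;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.SimpleMongoDbFactory;
import org.springframework.data.mongodb.repository.config.EnableMongoRepositories;

@Configuration
@EnableMongoRepositories(basePackages = "org.simple.farm.dao")
@ComponentScan("org.simple.farm.dao")
public class SimpleFarmConfiguration {

    // Connects to a MongoDB instance on the default host and port and uses the "farm" database.
    @Bean
    public MongoDbFactory mongoDbFactory() throws Exception {
        return new SimpleMongoDbFactory(new Mongo(), "farm");
    }

    // Template used by the generated repositories to talk to MongoDB.
    @Bean
    public MongoTemplate mongoTemplate() throws Exception {
        return new MongoTemplate(mongoDbFactory());
    }
}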

Now you are ready to save and find your data in MongoDB. Here is some code which inserts a new user into the DB:

final String username = "testUserName";
UserFarm userFarm = UserFarmBuilder.createFarmBuilder("Test Farm", 15, 15).
        setResources(90, 130, 150).addCropsBuilding(1, 0, 0, new Date()).
        addSimpleBuilding(Building.BuildingType.SCARE_CROW, 1, 1, 1).build();
userDao.save(UserModelBuilder.createUserBuilder(username, "a password", new Date())
        .setFarm(userFarm).setExperience(0).addAchievement(Achievement.AchievementType.BUILDER, 1)
        .build());

The result of execution can be verified from the Mongo Shell:


> db.userModel.findOne()
{
	"_id" : ObjectId("513b6e1c84ae21612fafe599"),
	"_class" : "org.simple.farm.model.UserModel",
	"login" : "testUserName",
	"password" : "a password",
	"registered" : ISODate("2013-03-09T17:15:08.515Z"),
	"achievements" : [
		{
			"type" : "BUILDER",
			"level" : 1
		}
	],
	"experience" : NumberLong(0),
	"farm" : {
		"name" : "Test Farm",
		"location" : {
			"x" : 15,
			"y" : 15
		},
		"level" : 1,
		"food" : 90,
		"stone" : 130,
		"wood" : 150,
		"buildings" : [
			{
				"lastHarvest" : ISODate("2013-03-09T17:15:08.513Z"),
				"inFarmLocation" : {
					"x" : 0,
					"y" : 0
				},
				"level" : 1,
				"type" : "CROPS",
				"_class" : "org.simple.farm.model.embedded.CropsBuilding"
			},
			{
				"inFarmLocation" : {
					"x" : 1,
					"y" : 1
				},
				"level" : 1,
				"type" : "SCARE_CROW"
			}
		]
	}
}

Full source code of this example can be found in the GitHub repo.

As you can see, the persistence of your domain model has never been so easy. Nested arrays and objects fit perfectly into Mongo, and Spring Data removes any need to create DBObjects manually and allows you to focus on your business requirements in the first place.

Other advantages of MongoDB

Besides the ease of developing simple CRUD operations for your domain objects, there are certain features of MongoDB which can turn out to be quite handy:

  1. Easy schema migrations. Since your objects are not just arrays of bytes to the DB, you can use the power of MongoDB’s querying facility when writing DB migration scripts.
  2. Geospatial indexes can be quite useful for finding the neighbours of a player.
  3. The Aggregation Framework is nice for ad-hoc analytics, which would otherwise require setting up an RDBMS to run your queries on.

Summary

If you have read this far you can see why I said that Spring Data and MongoDB can be an awesome combination for certain types of applications. Hope you enjoyed the reading!

Considerations for choosing or not choosing MongoDB


After working with MongoDB for some time and completing the M101 and M102 classes from 10gen (the first for developers, the second for DBAs), I’ve decided to cover the topic of why you might consider using MongoDB in your application. In this post I’m also going to cover the reasons for not choosing MongoDB for your app (there are quite a few of those). Hopefully, after reading this article you will understand the pros and cons of building your application on top of MongoDB.

Introduction

In the modern world of software development you no longer have to choose only among RDBMSes when starting a new project. A number of products, generally referred to as NoSQL, were created to offer new approaches to data persistence. Some of them offer near-linear horizontal scalability, some offer better read/write performance than classical relational storage, and some focus on a data representation that is more convenient for a certain data access pattern or business domain. MongoDB is one of these NoSQL stores; it supports replication, sharding and document-oriented, schema-less persistence.

Reasons to choose Mongo

Document oriented and schemaless. Unlike relational DBs, MongoDB stores all your data in collections of BSON documents and has no schema, which in turn tremendously simplifies the mapping between domain objects and the DB. Things like embedded or nested objects and arrays inside your domain objects are transparently stored in the DB. This makes MongoDB a very good choice for domains with polymorphic data and for rapid software development, where you basically can’t afford to spend too much time on schema design.

Horizontal Scalability and High Availability. This is what many people associate with Cloud Architecture. MongoDB allows you to build a clustered topology with replication and sharding, where the former provides fail-over and eventually consistent read scaling and the latter facilitates scaling out writes (and reads as well).

Fast writes in the fire-and-forget mode. This is quite useful for collecting various statistics, where the possible loss of some data is acceptable in favor of a shorter response time.

Comprehensive Querying and Aggregation Framework. With MongoDB you can query your collections with a powerful querying facility which, by the way, takes advantage of suitable indexes if you have created any, and allows you to query nested or embedded objects and arrays. For queries which require things like MAX, AVG or GROUP BY from SQL, there is a comparatively new mechanism called the Aggregation Framework, which allows you to run ad-hoc aggregation queries without the need to write cumbersome Map-Reduce scripts.

It’s Free and OpenSource. Yep, and besides that it’s stable, has frequent releases (for example, an upcoming release will add support for full text search), nice documentation and a fast-growing, vibrant community.

Comparatively intuitive architecture. Because MongoDB has only a single master per replica set, things are definitely simpler compared to peer-to-peer architectures, where you can have concurrent writes and write conflicts.

Reasons not to choose Mongo

Having described the major advantages of adopting MongoDB, I would like to cover the other side of the coin and talk about the reasons for not choosing MongoDB for your project.

No SQL = No Joins. It should be obvious that with a NoSQL DB you won’t have the ability to use SQL. As a result, on those hopefully rare occasions where you need to pick up related or referenced data from several collections, you will have to do it manually and with no guarantees in terms of consistency. If you see yourself making mission-critical decisions inside your application which need data from multiple documents or collections, then you should think twice before using MongoDB.

No ACID transactions. Coming from the SQL world, you are going to be surprised how many things you lose when ACID transactions aren’t there anymore. When working with multiple documents (MongoDB guarantees atomic operations only on a single document) you have no automatic rollback, the possibility of inconsistent reads and so on. Occasionally you may overcome these limitations by using two-phase commit, entity versions and in-app locks, but generally, if you see yourself doing these things in more than 1% of operations, you have probably chosen the wrong DB.

Can’t be used as an integration DB. It’s generally a bad idea to use NoSQL storage as an integration DB which can be accessed by several apps simultaneously. No schema and eventual consistency are going to play against you here.

Your indexes should fit into memory. MongoDB performs well only if your indexes fit into RAM, and it’s ideal to have SSD drives on your prod servers. MongoDB is simply not as well optimized for HDDs as many RDBMSes are, and thus you can get into trouble in certain usage scenarios where you would be just fine with an RDBMS.

Summary

In this article I haven’t tried to get into the nitty-gritty details of how MongoDB works, but instead covered only the essentials. If you are new to the NoSQL world, I would suggest reading NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, which provides a very good introduction to it (e.g. it contains good explanations of what replication and sharding are).