Why My Team Went with DynamoDB Over MongoDB

Ain’t no headache like a database-related headache.

My team and I were hired by a non-profit organization to build a highly customized content system, one that would allow the client to store tens of thousands of articles in the cloud and publish them to the web. Although the organization was small, with only a few employees, its website received a good amount of traffic from a niche group of scientists and researchers. In the course of development, we faced a key question: which database should we use? We eventually narrowed it down to two very different ones: MongoDB and Amazon’s DynamoDB.

MongoDB is a database engine you install and manage yourself, whereas DynamoDB is a hosted NoSQL service distributed across Amazon’s servers. So on some levels, the comparison is almost like comparing apples and oranges. One isn’t necessarily better than the other. But for our purposes, and very possibly yours, one did prove more suited than the other.

(One quick note before we plunge into this: I don’t have room to publish the entire specifications here. Although you’re only getting a partial picture, I’ll do my best to walk you through the situation and the conclusions reached by our team.)

The articles meant for the database had a large set of metadata attached to them, including abstracts, information about the article’s publisher, biographies, etc. One thing to note is that few authors wrote more than one article, and the articles came from a huge range of publishers. In other words, it was clear that this type of project would be well suited for a document-oriented NoSQL database.

Even though our team specializes in MongoDB (and initially considered using CouchDB), we ended up using Amazon’s DynamoDB to complete the task. Here are the steps that led to the decision:

Some Thoughts About the Data

Each database article included a set of fields that normally lend themselves well to a traditional RDBMS, such as MySQL or Oracle. Among them:

  • The article’s title
  • The article’s text
  • An abstract spanning several paragraphs
  • The publication date
  • The name of the publisher
  • A unique ID used in the industry

However, the articles also contained data less suited to a traditional database. For example, each article could have multiple authors, so there were actually more authors than there were articles. Every article also needed a list of keywords associated with it, as well as a selection from a set of predefined categories; on rare occasions, the publisher wanted additional information attached to the article, such as a custom survey or contact information.

Although data like this could be put in a relational database, it doesn’t fit well. Using normalization rules, we could create a separate table to hold the authors and their biographies, and then another table joining articles to authors. For normalized databases, this is often considered a best practice. Another option would be to store multiple author IDs in the article table itself, but that would require extra fields, since most articles had more than one author, and it would force us to anticipate the maximum number of author fields needed, which could lead to problems down the road.

In the relational database world, it’s better to use an extra table joining articles to authors, with one author per row. Categories and keywords present a similar situation.

With regard to data structures in the code itself, we have an Article class with top-level fields for different items such as the title and text; inside the Article object, there are arrays containing the author information, categories, and keywords. In Java or C#, it would look more-or-less like this:

class Article {

    public String title;

    public String text;

    public String abstractText; // "abstract" is a reserved word in Java and C#

    public Author[] authors;

    public String[] categories;

    public String[] keywords;

}

This is where the NoSQL concept fits well: instead of dividing the data into multiple tables and then doing joins to pull in an individual article, we can just put the whole thing, as-is, into the table. In the non-relational MongoDB world, a table is called a collection, and the items in a collection are similar to JavaScript or JSON objects; they can hold complex data structures that include arrays and embedded objects.

Here’s an example of a JSON data structure as it might look inside a document database:

{
    "_id" : ObjectId("51262c865ca358946be09d77"),
    "title" : "Some title",
    "text" : "Here's the article's text",
    "abstract" : "This article is about...",
    "authors" : [
        {
            "name" : "George Washington",
            "employer" : "Some university",
            "bio" : "Born on..."
        },
        {
            "name" : "Albert Einstein",
            "employer" : "Another university",
            "bio" : "After working at such and such..."
        }
    ],
    "categories" : [
        "Theoretical Physics",
        "News"
    ],
    "keywords" : [
        "relativity",
        "general relativity",
        "quantum mechanics"
    ]
}

This entire document gets stored as-is right in the collection. The _id field is generated by MongoDB and is unique to each document in the entire database. Additionally, you can create indexes on any field, including the fields buried down inside an array or inner object, such as authors.name.
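For example, in the MongoDB shell, indexing that nested field is a one-liner (the `articles` collection name here is hypothetical):

```javascript
// Index a field buried inside the authors array; queries on
// authors.name can then use this index.
db.articles.createIndex({ "authors.name": 1 })

// The index serves queries such as:
db.articles.find({ "authors.name": "Albert Einstein" })
```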

A Very Brief Note Regarding Searching

Our client said they didn’t need a full-text search on the text or abstract of the documents; they only cared about people searching keywords and categories. That’s fine—we could always add further search capabilities later on, using third-party indexing and searching tools such as Apache Lucene.

More Requirements

As we did more analysis, the project specs began to grow, and so did the requirements. The client was a small organization with only six employees and a tight budget. They didn’t want to purchase servers, and they didn’t want to hire someone to manage hosted servers. After discussing the situation with them, we decided a cloud-based system would work best for them, especially one that required minimum management.

One option was paying for a hosted MongoDB service. Various companies offer different pricing plans, and some are quite low. But the low plans are often shared plans, where your data is stored on a server alongside other clients. Our client wasn’t happy about that, even though it probably wouldn’t have presented a problem. These hosting companies also offered “dedicated” plans, but they were very expensive and more than our client wanted to pay per month. Digging a bit deeper, we found these dedicated plans amounted to the hosting company allocating an Amazon EC2 instance and running MongoDB on it, and managing it for you. You would be paying for that server plus the management company’s services. And some of the servers on Amazon can get pretty expensive.

As an alternative, we could skip the hosting companies and allocate a couple of servers ourselves on Amazon, following best practices and putting them in different zones (which meant separate physical locations). Then we could install MongoDB and set up the replication ourselves. But again, the client didn’t really want to pay for that, because the cost of those servers would add up, and they would have to pay additional money to us for the work. Their budget was limited.

That’s when we started looking at different options. We wanted fast access through the cloud, plus replication. What about some of Amazon’s storage systems? Amazon has its S3 storage, but that’s more suited to blob data, not ideal for documents. More suited to document data is one of Amazon’s newer products, DynamoDB. (I say “newer,” but it’s actually an extension of a product they’ve used in-house for a number of years.) It’s a NoSQL database and can hold any type of data we want. We decided to run some tests and see if it would work for us.

Amazon DynamoDB’s Structure

In DynamoDB, you can store documents with a flexible structure just as in other NoSQL databases. There’s just one shortcoming: you can’t have embedded data structures like you can with MongoDB. This might be a problem. But before ditching it right away, we looked at some of the other features.

For one, DynamoDB uses a different kind of indexing and key system than MongoDB: you design your tables with either one key field or two. The first field is called the hash, and the second is called the range.

The first field is called a hash because that’s exactly how it’s used. In computer science, a hash can be used in a map structure to find a data item quickly, and also to distribute items evenly throughout a storage area. That’s how it works here, too. By designing the hash properly, your data will be spread evenly throughout the allocated storage. Ideally you won’t have what Amazon calls “hot spots,” where the majority of your data access hits one area.

For example, if you’re storing data about Amazon products, the products of a single category are grouped close together, and the majority of lookups are for that particular category, then the area where that category is stored becomes a hot spot. It’s better to spread the products out; that way, access tends to be uniformly distributed as well.

The second key is called the range because it gives you a way to keep a sort on the items within a single hash, as well as keep those items close together physically (beneficial for quick lookup once the hash is used to locate the records).

While all this sounds great, what about indexes beyond these two key fields? We want to be able to search authors, keywords, and categories. After going through the documentation and consulting with some DynamoDB experts, we discovered that there’s no way around it: If you want custom indexes, you have to build them yourself and put them in separate tables.

We weren’t thrilled about this, because writing your own indexes can be problematic. Any time we stored a document, we would have to update the index. That’s fine, except if anything goes wrong in between those two steps, the index would be wrong. However, the coding wouldn’t be too terribly difficult, and so we decided this wouldn’t be a showstopper. But just to be sure, we would need to follow best practices, and include code that periodically rebuilds the indexes.

To generate the keyword index, we would have to scan through the article table, pull in the keywords, and add rows to the keyword index. The hash key in the keyword index would be the keyword itself, and the range key would be the article ID. Easy enough. The index tables for the categories and authors would be similar.
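Stripped of the actual AWS SDK calls, that rebuild pass can be sketched like this (the class and method names are hypothetical; each output pair stands in for one row of the index table, hash key first, range key second):

```java
import java.util.*;

// Sketch of the keyword-index rebuild described above. A Map of
// articleId -> keywords stands in for the scanned article table;
// real code would read and write items via the AWS SDK.
class KeywordIndexBuilder {

    // Returns (hashKey, rangeKey) pairs: hash = keyword, range = article ID.
    static List<String[]> buildKeywordIndex(Map<String, List<String>> articles) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : articles.entrySet()) {
            String articleId = e.getKey();
            for (String keyword : e.getValue()) {
                rows.add(new String[] { keyword, articleId });
            }
        }
        return rows;
    }
}
```

The index tables for categories and authors would follow the same pattern, with the category name or author name as the hash key.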

Writing and maintaining our own indexes felt like a huge step backwards, something we hadn’t had to do for the last 20 years. But we did it, the code wasn’t hard, and it worked well.

But there was still a further problem. The rows in the tables require a flat structure, a simple name:value type field. That would work for the top-level fields such as Title and Abstract, but not for the authors. However, unlike a traditional relational database, the schema is flexible. That means we could actually flatten out our data structure and store multiple categories, keywords, and author information. So a sample structure in JSON might look like this:

{
    "title" : "Some title",
    "text" : "Here's the article's text",
    "abstract" : "This article is about...",
    "author_name_1" : "George Washington",
    "author_employer_1" : "Some university",
    "author_bio_1" : "Born on...",
    "author_name_2" : "Albert Einstein",
    "author_employer_2" : "Another university",
    "author_bio_2" : "After working at such and such...",
    "category_1" : "Theoretical Physics",
    "category_2" : "News",
    "keywords_1" : "relativity",
    "keywords_2" : "general relativity",
    "keywords_3" : "quantum mechanics"
}

This would work okay, and seems to be the Amazon-recommended way. But I wasn’t too excited about it, because pulling it into our classes in the code wouldn’t work very well. We would have to climb through it and piece together the objects. Again, not terribly difficult, but it seemed like a step in the wrong direction.
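To give a feel for that climbing, here’s a sketch of piecing the author objects back together from the numbered attributes (the names are hypothetical, and a plain Map stands in for the DynamoDB item):

```java
import java.util.*;

// Sketch: rebuild the authors from flattened attributes such as
// author_name_1 / author_employer_1 / author_bio_1, walking the
// numeric suffix until a slot is missing.
class FlattenedReader {

    static List<Map<String, String>> readAuthors(Map<String, String> item) {
        List<Map<String, String>> authors = new ArrayList<>();
        for (int i = 1; item.containsKey("author_name_" + i); i++) {
            Map<String, String> author = new HashMap<>();
            author.put("name", item.get("author_name_" + i));
            author.put("employer", item.get("author_employer_" + i));
            author.put("bio", item.get("author_bio_" + i));
            authors.add(author);
        }
        return authors;
    }
}
```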

One of the guys on the team came up with an interesting option. The idea was this: keep the entire structure with the arrays and inner objects, serialize it to a string containing the JSON, and then store that JSON in a single field. Additionally, for fields we needed to access often, such as title and abstract, include extra fields containing those as well. Thus, the non-index fields would look like this:

title: "Some Title",

categories: "Theoretical Physics,News",

authornames: "George Washington,Albert Einstein",

keywords: "relativity,general relativity,quantum mechanics",

json: "{ \"title\" : \"Some title\", \"text\" : \"Here's the article's text\", \"abstract\"... etc..."

When we needed to retrieve the article, we could just grab the JSON and then un-serialize it directly into an object. (The Amazon DynamoDB library we used allowed for this with a single line of code, and DynamoDB only sends down the fields requested.) Whenever we needed to rebuild our indexes, we could scan the table and pull down just the categories field, split it in our code into individual categories, and build the table. Ultimately, this approach worked. (And as an additional note, we also ended up compressing the JSON string to save a lot of space.)
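The compression side of that can be sketched with nothing but java.util.zip (the class name is hypothetical, and GZIP here stands in for whatever compression scheme you prefer; the DynamoDB calls are left out):

```java
import java.io.*;
import java.util.zip.*;

// Sketch: compress the serialized JSON before storing it in a single
// DynamoDB attribute, and decompress it on the way back out.
class JsonFieldCodec {

    static byte[] compress(String json) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(json.getBytes("UTF-8"));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decompress(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = gz.read(buf)) > 0) {
                    out.write(buf, 0, n);
                }
            }
            return out.toString("UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

JSON full of repeated field names and English prose compresses well, which is what bought us the space savings mentioned above.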

That compression proved to be important due to yet another shortcoming of DynamoDB, one that nearly made me pull my hair out and encourage the team to switch back to MongoDB. It turns out the maximum record size in DynamoDB is 64K. That’s not much, and it takes me back to the days of 16-bit Windows, when a text-field GUI element could hold a maximum of 64K. That was also, um, twenty years ago.

By compressing the data, we increased how much we could fit. After going through the documents, that turned out to be enough. However, in the future if it doesn’t work, we have some additional options that Amazon recommends, none of which sound particularly exciting to implement:

1. Break the record up into multiple records. To read the data, read all the records making up the larger record, and pull them back together. This wouldn’t be impossible, because we actually could use the range key for this.

2. Store the data itself in Amazon S3. In the DynamoDB record, simply store the identifier for the S3 object. That doesn’t sound like much fun, but it would be doable.
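The first option can be sketched as plain chunking logic, with the range key implied by each chunk’s position in the list (names hypothetical, SDK calls omitted):

```java
import java.util.*;

// Sketch: split an oversized payload into pieces no larger than 64K.
// Each piece would be stored as its own item, sharing the article's
// hash key, with the chunk number as the range key.
class RecordChunker {

    static final int MAX_CHUNK = 64 * 1024;

    static List<byte[]> split(byte[] payload) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < payload.length; off += MAX_CHUNK) {
            int len = Math.min(MAX_CHUNK, payload.length - off);
            chunks.add(Arrays.copyOfRange(payload, off, off + len));
        }
        return chunks;
    }

    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] c : chunks) total += c.length;
        byte[] out = new byte[total];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, out, off, c.length);
            off += c.length;
        }
        return out;
    }
}
```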

So there’s a workaround. Nonetheless, I was getting extremely frustrated, because the amount of code we were having to write to support DynamoDB was far more than the amount of code to support MongoDB.

DynamoDB’s Pricing

DynamoDB pricing ultimately proved excellent for our client. That said, Amazon’s pricing strategy is unusual and the documentation isn’t written very well. When you set up a table in DynamoDB, you specify how much read and write throughput you’ll need per second; that, combined with how much data you store, determines how much you pay at the end of the month.

Regardless of whether you actually use that much read and write throughput, you still pay for it. Think of your mobile phone; it’s the same idea. You pick how many minutes and messages you’ll need, and that’s what you pay, even if you don’t reach those minutes and messages. With DynamoDB, you pick (and pay for) how much read and write you’ll need.

This is an important note for developers doing tests: it’s very easy to create a dozen or so tables in DynamoDB using the web interface. But those tables count toward your final bill even if you never put data in them. One guy on our team created a couple of test tables before understanding the billing, and they sat there unused for a month. At the end of the month we ended up paying a little bit for them. (It was only about $5, but money is money.)

When we were setting up the tables, one guy on the team saw Amazon’s question about how much read and write throughput we’d be doing. He said, “How should I know?” True, it’s hard to get an exact figure. But it really is possible to come up with a ballpark estimate, test it out for a few days, and adjust accordingly. The web control panel that Amazon provides includes graphs that let you see exactly how much throughput you’re using.

When you go over the throughput, you get throttled, but it’s easy to increase the throughput from the control panel. One person told us that going beyond the throughput would produce a 404 error when accessing the data; fortunately, we didn’t find that to be the case. We didn’t want our articles failing to come back during times of high load; we’d much rather see things slow down and at least function. Amazon also lets you set up alarms, so we could be notified of the situation, log in, and bump up the throughput.

There was one more caveat with the pricing. DynamoDB includes auto-replication, which was important to us, but there’s an option regarding consistency. By default, reads are eventually consistent: after you write your data, it can take about a second before it’s available on all the different servers where it’s stored. You can instead request strongly consistent reads, where the latest data comes back almost immediately. But since our client was storing articles, not time-sensitive financial data, consistency really wasn’t an issue in this case.

In the End

Amazon says DynamoDB data is stored on fast solid-state drives, which keeps access speeds high.

Our client is paying less than $100 per month for the data. Yes, there are MongoDB hosting options for less than this; but as I mentioned earlier, those tend to be shared options where your data is hosted alongside other data. (Our client wasn’t too keen on that.) So for their particular needs, DynamoDB seemed to be a good fit. And although I’m not thrilled about the additional work we had to do (at times it felt like going back two decades in technology by writing indexes ourselves), we did end up with some nice reusable code to help us with the serialization and indexes and such, which will make future projects easier. But will we go with DynamoDB again?

Let’s put it this way: There’s no cut and dry answer. Every project is different. The decisions for this project were based on the client’s needs and requirements. If they were even slightly different, we could have ended up with a completely different implementation, possibly even a relational database. So I can’t say unequivocally, “DynamoDB is better than MongoDB.” But it proved to be a better fit for this project, and although it wasn’t without headaches, the end result was excellent.



Comments

  1. BY Mike says:

    It’s funny that your client is not keen on hosted Mongo because of “sharing”, but Dynamo is shared in the same way. It’s not like Amazon provisions a special server just for your $5/month database. It’s all “shared” and all safe side-by-side. This was not a valid consideration for choosing one over the other.

  2. BY A says:

    Your article actually convinced me to go with MongoDB! I was doing a similar comparison and I kept thinking.. no embedded structures? There has to be a clean way to work around this that I’m missing. The solution of having _1 _2 appended to records (as recommended by Amazon) is kinda hacky and the solution of having json structures stored is *super* hacky IMO. Seems like it’ll work for really simple apps, but it doesn’t seem ideal for more complex situations. Also, build my own indices!? No thanks!

    Nice article though, and good thoughts throughout.

  3. BY Allen Underwood says:

    Fantastic writeup. Thank you for taking the time to go through the thought process and the challenges you faced. I’m a long time RDBMS guy and am toying with the idea of giving a NoSQL database a shot. I really do appreciate this detailed writeup. Much better than the lame comparison tables that show up all over the web.

  4. BY Jeff Jones says:

    Thanks for the article. I’m currently evaluating noSQL/SQL on a project, and your article is food for thought. RDBMS seems better for your use case, especially when you show how easy it would have been to properly normalize and auto-index your data vs doing it manually. The only benefit I can see from the noSQL option is easier scaling, but I’m not sure sharding MySQL or Amazon RDS would have been much harder than what you had to do to make your relational data fit. And what happens when the user asks to see ‘all articles by the same author’, but the Author data is duplicated in each Article, potentially spelled differently in each copy, etc.
