LD Talent Blog

How to set up an optimized MongoDB replica set on AWS EC2

MongoDB Replica Set

Hire the author: Olumuyiwa A

Check out the MongoDB Replica Set Image and GitHub repo for this guide.

TL;DR

Great production results start with a well-built application and a correctly modeled dataset. This guide covers both, walking you through the creation of a well-tuned MongoDB replica set designed to serve up to 10,000 customers in a production environment.

Introduction

In the modern world of rapidly evolving applications with dynamic, data-intensive, real-time requirements, MongoDB has become the default choice for web developers. It is particularly popular in the JavaScript/Node.js ecosystem. Two compelling reasons account for this popularity:

  1. Massive Scalability with Document-Based NoSQL: MongoDB is, by design, a document-based NoSQL database, which gives developers a big advantage in terms of scalability. By storing data in flexible, schema-less documents, MongoDB enables seamless horizontal scaling, allowing applications to process large amounts of data and adapt to growing workloads. This is particularly beneficial for modern applications that demand high performance and responsiveness, and it makes MongoDB a preferred choice for managing complex, evolving data structures.
  2. JSON Everywhere: MongoDB integrates seamlessly with JSON (JavaScript Object Notation), the ubiquitous data-interchange format of the web. This ensures a natural workflow for developers, particularly those working within the JavaScript/Node.js ecosystem. With MongoDB, developers can leverage JSON throughout the entire application stack: the frontend, the backend, and the communication between services. The consistent use of JSON simplifies data manipulation, facilitates code reusability, and enhances developer productivity, making MongoDB an ideal fit for JavaScript-centric development environments.

By combining the benefits of massive scalability through its document-based NoSQL approach and its seamless integration with JSON, MongoDB empowers web developers to efficiently tackle the challenges posed by modern, data-intensive applications. It has become a go-to database solution for JavaScript/Node.js developers, enabling them to build robust, scalable, and flexible applications that meet the demands of today’s dynamic digital landscape.

Justification

Running MongoDB locally or using a managed service like MongoDB Atlas is a simple process, but running a self-managed database calls for a more nuanced approach. This guide aims to provide the information and resources you need to successfully self-manage MongoDB in a production environment.

There are numerous resources available on using MongoDB in a production environment. However, much of the information available online falls into one of the following categories:

The intention of this guide is to provide comprehensive and up-to-date information on self-managing MongoDB in production. The focus is on secure, highly available (HA), fault-tolerant (FT) deployments, with emphasis on the importance of proper data modeling and query structure.

You will learn how to deploy a production-grade, highly available, fault-tolerant, self-healing (HA/FT/SH) MongoDB replica set. The replica set will support a web application scalable enough for your startup/SMB’s first 10,000 customers.

Assumptions

Glossary

This guide uses working definitions of the terms that appear throughout.

Step-by-step Procedure

This part of the guide has two subsections: one for application development and another for database administration. Our goal: Deploy a production application against a MongoDB replica set on AWS and support 10,000 recurring paying customers.

Section 1: Application Development

Step 1: Setting up the connections

Everything starts here. We must set up a reliable database connection as early as possible in the app lifecycle.

Pro tip: As the application will run in multiple environments, parameterize the connection URI by importing it as an environment variable.

When considering connections, there are three environments to be aware of, each with its own peculiarities.

Below is a gist that demonstrates a reliable and robust pattern for standing up the web app backend:

Let’s walk through the code:

You’ll notice that we’ve discussed everything except the actual connection URI (called dBURL). We’ll dig into this later, but for now, in the Local/Dev environment, this variable typically has a value like mongodb://localhost/somedbname.
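Since the original gist is not reproduced above, here is a minimal hedged sketch of the connection bootstrap pattern it describes: the URI comes from an environment variable (dBURL, per the text), falls back to a local database in the Local/Dev environment, and the connection is retried on transient failure. The helper names, retry settings, and option values are my assumptions, not the author’s exact code.

```javascript
// Hedged sketch of a robust connection bootstrap. Helper names and
// retry/option values are illustrative assumptions.

// Resolve the connection URI from the environment, falling back to a
// local database for the Local/Dev environment.
function getDbUrl(env) {
  return env.dBURL || 'mongodb://localhost/somedbname';
}

// Connection options tuned for production: fail fast when no server is
// reachable, and cap the connection pool for concurrent requests.
const connectionOptions = {
  serverSelectionTimeoutMS: 5000, // give up selecting a server after 5s
  maxPoolSize: 50,                // upper bound on pooled connections
};

// Connect with a simple retry loop. mongoose is required lazily so this
// module can be loaded without the dependency installed.
async function connectWithRetry(uri, options, retries = 5, delayMs = 2000) {
  const mongoose = require('mongoose');
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      await mongoose.connect(uri, options);
      console.log('MongoDB connected');
      return mongoose.connection;
    } catch (err) {
      console.error(`Connection attempt ${attempt} failed: ${err.message}`);
      if (attempt === retries) throw err; // surface the failure to the app
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: connectWithRetry(getDbUrl(process.env), connectionOptions);
```

Starting the HTTP listener only after the connection promise resolves is a common variation of this pattern, so the app never serves requests against a dead database.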

Step 2: Addressing application type concerns

Determining the type of application you are developing is crucial. It helps in identifying whether the app is read-heavy or write-heavy. This affects the format of the MongoDB connection URI. It also influences the decision between a monolith or partitioning into two or more microservices. It’s essential to address these considerations before progressing beyond the initial app foundations.

A read-heavy app is one whose data flow involves fewer writes and many simple reads. Examples of read-heavy apps are e-commerce apps and social media feeds.

A write-heavy app performs a significantly larger number of data mutations. Examples of write-heavy apps are live tracking systems and card processing systems.

MongoDB behavior with different app types

This example illustrates a MongoDB replica set URI format. It fulfills the requirements of both read-heavy and write-heavy applications:

mongodb://user:pwd@server0-dns,server1-dns,server2-dns/dbname?replicaSet=replicasetName&retryWrites=true&retryReads=true&w=majority&readPreference=secondaryPreferred

Let’s break this down:

  1. user:pwd@server0-dns,server1-dns,server2-dns: the credentials plus the DNS names of the three replica set members. The driver can discover the full topology from any reachable member.
  2. replicaSet=replicasetName: tells the driver it is connecting to a replica set with this name.
  3. retryWrites=true and retryReads=true: the driver retries supported operations once after transient network errors or a failover, smoothing over brief primary elections.
  4. w=majority: the write concern; a write is acknowledged only after a majority of members hold it, so acknowledged writes survive a failover. This protects write-heavy apps.
  5. readPreference=secondaryPreferred: reads go to secondaries when available, offloading the primary, which benefits read-heavy apps. Writes always go to the primary.

By combining the flags/parameters in the MongoDB connection URI and the Mongoose connection options, we obtain a robust set of tuned parameters for MongoDB. These are suitable for most environments and use cases. With these options, you can fine-tune MongoDB to meet your specific needs and achieve optimal performance.
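To make the moving parts concrete, here is a hedged sketch of a small helper that assembles such a URI from its parts. The host names, credentials, and replica set name are the placeholders from the example above, not real endpoints.

```javascript
// Hedged sketch: assemble a replica set connection URI from its parts.
// All values shown are placeholders from the example URI above.
function buildReplicaSetUri({ user, pwd, hosts, dbName, replicaSet, readPreference = 'secondaryPreferred' }) {
  // URLSearchParams handles the query-string encoding for us.
  const params = new URLSearchParams({
    replicaSet,
    retryWrites: 'true',
    retryReads: 'true',
    w: 'majority',
    readPreference,
  });
  return `mongodb://${user}:${pwd}@${hosts.join(',')}/${dbName}?${params}`;
}

const uri = buildReplicaSetUri({
  user: 'user',
  pwd: 'pwd',
  hosts: ['server0-dns', 'server1-dns', 'server2-dns'],
  dbName: 'dbname',
  replicaSet: 'replicasetName',
});
```

Keeping the parts in environment variables and assembling the URI at startup keeps secrets out of source control while still producing the exact format shown above.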

With the replica set and connection options in place, you can ensure high availability, fault tolerance, and optimized performance for your MongoDB deployment.

This sets the stage for a reliable and resilient application that can handle a wide range of use cases and scale with your business needs.

The next stage is proper Data Modeling.

Step 3: Designing the data model

The first step in performance optimization is to understand your application’s query patterns so that you design your data model and select the appropriate indexes accordingly.

MongoDB team

It is essential to store related data together, whenever feasible, by utilizing embedded documents. Mastering this fundamental principle plays a pivotal role. It helps in assessing the performance of both applications and databases in a production environment. A flawed data model can lead to performance issues within the application, even if the database is finely tuned.

Let’s consider a photo gallery app as an example to provide a context for the discussion. This app enables users to upload any number of photos and save them to AWS S3. By default, the user dashboard displays a gallery with thumbnails of the 10 most recent photos. Users can access pagination/infinite scrolling to pull additional photos from the database. Users can view a full-size version of photos by clicking on the thumbnails and can delete or replace photos.

Basic data model

Keeping the app requirements in mind, an intuitive data model that seems to meet the app constraints would look like this:
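The model gist is not reproduced here, so below is a hedged sketch of the definition object you would pass to new mongoose.Schema(...). The field names and types are assumptions inferred from the review that follows, not the author’s exact code.

```javascript
// Hedged sketch of the basic (flawed) data model: the definition object
// for new mongoose.Schema(...). Field names/types are assumptions.
const basicUserSchema = {
  username: { type: String, required: true, unique: true }, // unique creates an index
  email: { type: String, required: true, unique: true },    // unique creates an index
  photos: [
    {
      filename: { type: String, index: true }, // secondary index for searching
      url: String,                             // location of the photo in S3
      uploadedAt: Date,
    },
  ], // note: this array has no upper bound
};
```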

Basic model review

At first glance, this data model may appear suitable for the requirements of the photo gallery app. The username and email fields are unique, which not only avoids duplication but also creates indexes on those fields, improving read performance. Additionally, photos is an array of objects, and a secondary index has been created against the filename property of the objects within it to speed up searching. What could be wrong?

The primary issue is that the photos array is unbounded, which means a user document could eventually exceed MongoDB’s 16 MB document size limit.

Model refactoring

Let’s improve the model by refactoring it: move the photos into a separate collection (let’s call it Albums), and use references (resolved with Mongoose’s populate(), similar to a JOIN) to populate the photos field on the Users model as needed. Below, you can find the improved models:

Users:
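A hedged sketch of the refactored User definition ('ObjectId' stands in for mongoose.Schema.Types.ObjectId; field names are assumptions):

```javascript
// Hedged sketch of the refactored User model: photos now holds
// references to Album documents instead of embedded photo objects.
const refactoredUserSchema = {
  username: { type: String, required: true, unique: true },
  email: { type: String, required: true, unique: true },
  photos: [{ type: 'ObjectId', ref: 'Album' }], // resolved via populate('photos')
};
```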

Albums:
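A hedged sketch of the Album definition introduced by the refactor (field names are assumptions):

```javascript
// Hedged sketch of the Album model: each album back-references its user,
// but each album document still carries an unbounded photos array.
const refactoredAlbumSchema = {
  owner: { type: 'ObjectId', ref: 'User', index: true }, // back-reference to the user
  photos: [
    {
      filename: { type: String, index: true },
      url: String,
      uploadedAt: Date,
    },
  ],
};
```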

Model refactoring review

The User model references the Album model (Mongoose translates the models into Users and Albums collections respectively on MongoDB) by tracking the _id field of each album document in the photos field. Correspondingly, the Album model references the User model by tracking the _id field of the user document in the owner field. 

You might think that the job is complete, right? But, we still have some work left to do. The problem of unbounded document size persists because it has merely shifted from User to Album. To make matters worse, MongoDB now has to make two queries to return the user document containing album data. From a performance point of view, this extra query is suboptimal.

Optimal data model

The following two points lead to the solution:

  1. The dashboard only ever needs the 10 most recent photos immediately, so that small, bounded set can stay embedded in the user document (store together what is accessed together).
  2. An unbounded array should be broken up into one document per item, so each document stays small and the collection, rather than the document, absorbs the growth.

These points lead us to an optimal set of data models:

Users:
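A hedged sketch of the optimal User definition (field names are assumptions):

```javascript
// Hedged sketch of the optimal User model: the 10 most recent photos are
// embedded directly, so a single query serves the default dashboard.
const optimalUserSchema = {
  username: { type: String, required: true, unique: true },
  email: { type: String, required: true, unique: true },
  photos: [
    {
      filename: String,
      url: String,
      uploadedAt: Date,
    },
  ], // capped at 10 entries by application logic, so it stays bounded
};
```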

Albums:
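A hedged sketch of the optimal Album definition (field names are assumptions):

```javascript
// Hedged sketch of the optimal Album model: one photo per document, so
// document size is bounded and the collection can grow without limit.
const optimalAlbumSchema = {
  owner: { type: 'ObjectId', ref: 'User' },
  filename: String,
  url: String,
  uploadedAt: Date,
};
// In Mongoose, the compound index is declared on the schema instance:
//   albumSchema.index({ owner: 1, filename: 1 });
```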

Optimal data model review

While it may not be immediately apparent, a closer examination reveals that the optimized data models have significantly improved performance.

The User model retains the photos field, allowing a single GET request to return both the user information and their 10 most recent photos. As a result, this satisfies the basic constraints of the app and follows a fundamental design pattern in MongoDB.

To improve the efficiency of the Album model, we have replaced the array of photos with a single photo document. A compound index on the photo owner and filename fields results in faster and easier retrieval of photos specific to a user. These changes are in line with best practices in database design and represent a significant improvement in the app’s performance and user experience.

To retrieve older photos in our app, users can initiate a single GET request to the album collection. Although it may take a little longer, users anticipate that older photos will become available after a short delay. The slight delay is unlikely to affect their experience. 

By implementing this approach, we are able to efficiently manage and retrieve a large volume of photos. We can do so while still maintaining a seamless user experience.

This approach results in a compact working set (indexes plus the most frequently accessed data) that fits inside RAM, ensuring that database performance is speedy and reliable.

Data modeling conclusion

The preceding covers the basics for performant data modeling with Mongoose/MongoDB. Here are the example queries that satisfy the different types of models:
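The original query gists are not reproduced here; as a hedged sketch, the query shapes might look like the following. The field names and the pagination helper are my own illustrations, not the author’s exact code.

```javascript
// Hedged sketch of the query shapes (field names and the pagination
// helper are illustrative assumptions).

// Dashboard view: a single lookup on Users returns the profile plus its
// embedded 10 most recent photos, e.g. await User.findOne(dashboardFilter)
const dashboardFilter = { username: 'alice' };

// Older photos: page through the Albums collection by owner. An index
// on { owner: 1, uploadedAt: -1 } would serve this sort efficiently.
function olderPhotosQuery(ownerId, page, pageSize = 10) {
  return {
    filter: { owner: ownerId },
    sort: { uploadedAt: -1 }, // newest first
    skip: page * pageSize,
    limit: pageSize,
  };
}
// e.g. const q = olderPhotosQuery(userId, 2);
//      await Album.find(q.filter).sort(q.sort).skip(q.skip).limit(q.limit);
```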

With the preceding information, you will have a solid foundation. This foundation will enable you to build a responsive, scalable app deployed against MongoDB in production.

Pro tips:
1. Read-heavy app tips: Caching, setting read preference to secondaries, and increasing the number of secondaries will boost the performance of your read-heavy app against MongoDB.
2. Write-heavy app tips: Using message queues for write logs and decomposing your app into microservices will help boost the performance of your write-heavy app against MongoDB.
3. Use a process manager for your Node.js app, e.g. PM2; this is especially useful when running your app on EC2. Note: a containerized app does not need a process manager.

Section 2: Database Administration

An efficient data model supporting a well-architected app provides the foundation for a high-performing MongoDB deployment.

In production, MongoDB is never deployed as a single-node service. The two recommended architectures are replica sets and sharded clusters.

In order to support our initial customer base of 10,000 users, I strongly recommend prioritizing replica sets over sharded clusters. By focusing on replica sets, we can simplify our architecture. This ensures that we are able to provide a reliable, high-performance experience to our users. 

As explained by MongoDB’s Chief Solutions Architect, opting for a single replica set often represents the optimal choice for production environments.

Let’s learn about how to make this happen step by step.

Step 1: Reviewing our needs

To support our app’s development and deployment, DevOps will be a critical component of this section. We will focus on achieving the following goals:

Step 2: Understanding the tools of the trade

To meet the requirements outlined above, we will leverage AWS ASGs, AWS NLBs, AWS EC2 Launch Templates (LTs), AWS S3, AWS SES, Ansible, bash scripts, and a small Node.js script.

The goal of using these tools is to design and deploy an HA/FT/SH MongoDB replica set. AWS ASGs and AWS NLBs provide HA/FT; Ansible, EC2 LTs, AWS S3, and bash scripts provide self-healing (SH); and AWS SES together with a Node.js script handles notifications.

Let’s dive in with the configuration of the tools.

Step 3: Setting up an S3 and a custom IAM Role

To lay the foundations, we first create an S3 bucket to hold backups. Next, we create a custom IAM role that grants SES access to send emails and S3 access to read and put objects in the backup bucket. Both tasks can be performed in the AWS Console.
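As a hedged sketch, the policy attached to that custom role might grant just the following actions. The bucket name is a placeholder; scope it to your actual backup bucket.

```javascript
// Hedged sketch of the custom IAM role's policy document. The bucket
// name "my-mongo-backups" is a placeholder.
const backupRolePolicy = {
  Version: '2012-10-17',
  Statement: [
    {
      Sid: 'AllowBackupBucketReadWrite',
      Effect: 'Allow',
      Action: ['s3:GetObject', 's3:PutObject', 's3:ListBucket'],
      Resource: [
        'arn:aws:s3:::my-mongo-backups',   // ListBucket applies to the bucket ARN
        'arn:aws:s3:::my-mongo-backups/*', // object-level actions need the /* suffix
      ],
    },
    {
      Sid: 'AllowSesNotifications',
      Effect: 'Allow',
      Action: ['ses:SendEmail', 'ses:SendRawEmail'],
      Resource: '*',
    },
  ],
};
```

Attaching this role to the EC2 instances via an instance profile lets the backup and notification scripts run without long-lived credentials on the nodes.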

Step 4: Identifying requirements for the replica set

With setup complete, this phase begins by identifying requirements for the replica set. Specifically, we want a 3-member replica set distributed across three AWS AZs, with the roles primary, secondary, and hidden.

To support this architecture, we will need three NLB/TG pairs (one per node), two EC2 Launch Templates, three ASGs (one per AZ), and a security group that permits MongoDB traffic on port 27017.

Step 5: Creating AWS EC2 NLBs and TGs

With the above step successfully completed, the next step is to create three sets of AWS EC2 NLBs and TGs.

In this deployment, we have opted for NLBs instead of ALBs because MongoDB speaks a raw TCP protocol on port 27017, while ALBs operate at layer 7 and only understand HTTP/HTTPS. NLBs work at layer 4 (TCP), so they can balance and route MongoDB’s network traffic efficiently.

This ensures reliable and efficient communication between our application and its database. This approach helps optimize performance and scalability for your MongoDB deployment. The result is an application that can handle larger volumes of traffic and data with ease.

Follow these steps:

Pro tip: Use the same naming scheme for each TG and NLB pair, e.g. mongo-pri-tg/mongo-pri-nlb, mongo-sec-tg/mongo-sec-nlb, etc.

Step 6: Creating EC2 LTs

In the previous step, we ensured that the NLB and TG pairs are properly configured so that they are ready to be bound to the ASG. Additionally, we confirmed that the SG for the instances was set up properly. The major task in this step is to create 2 EC2 LTs (AWS is deprecating Launch Configs in favor of LTs). The procedure for the LTs follows:

Step 7: Creating ASGs

To create the replica set nodes, we will define three ASGs. One for the primary node, one for the secondary node, and one for the hidden node. By using ASGs, we can automatically scale our infrastructure to meet changing demands. Also, this approach ensures that we have enough capacity to maintain high availability and performance.

Use descriptive names for the ASGs, such as “mongo-pri-asg” for the primary node ASG. Using clear and consistent naming conventions simplifies managing and troubleshooting of the infrastructure over time.

Repeat for the other 2 ASGs. At the end of this step, we will provision three MongoDB nodes. You will receive an email notification from SES when the nodes are fully set up.

Pro tip: Adopt a naming convention that associates each ASG and NLB with the same AZ. An example: “mongo-pri-nlb”/“mongo-pri-asg” for the NLB/ASG pair in the “eu-west-1a” AZ. This convention will make it easier to identify and manage resources as you scale your infrastructure over time.

Step 8: Configuring the replica set and creating an initial database

At the end of Step 7, three MongoDB nodes were provisioned and configured for production. Now all that’s left is to set up the replica set (a one-off exercise).

When configuring the hidden node, it’s important to follow the naming convention described above to avoid errors. Additionally, configuring the hidden node for its role requires extra steps beyond those needed for the primary and secondary nodes. Furthermore, it’s crucial to update the hostname of the primary node with the DNS name of the NLB. This ensures that the DNS for the primary node is publicly resolvable, making the replica set available for writes. If this step is not taken, the default private DNS name for the node will be used, resulting in an unidentifiable primary node. This will lead to write errors and potential downtime or data loss.
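As a hedged sketch, the one-off initiation in mongosh might look like the following. The host values are placeholder NLB DNS names (not real endpoints), and the member settings are standard MongoDB replica set options rather than the author’s exact commands. Note that a hidden member must have priority 0, which is the extra configuration the hidden node needs.

```javascript
// Hedged sketch of the one-off replica set initiation. Host values are
// placeholder NLB DNS names; replace them with your NLBs' DNS names.
const rsConfig = {
  _id: 'replicasetName',
  members: [
    { _id: 0, host: 'mongo-pri-nlb.example.elb.amazonaws.com:27017', priority: 2 },
    { _id: 1, host: 'mongo-sec-nlb.example.elb.amazonaws.com:27017', priority: 1 },
    // The hidden member never becomes primary and is invisible to clients.
    { _id: 2, host: 'mongo-hid-nlb.example.elb.amazonaws.com:27017', priority: 0, hidden: true },
  ],
};
// In mongosh on the primary node: rs.initiate(rsConfig)
// Then create the app database and user, e.g.:
//   use dbname
//   db.createUser({ user: 'user', pwd: 'pwd', roles: [{ role: 'readWrite', db: 'dbname' }] })
```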

During this step, we also create an initial database for our application and a user with appropriate permissions. In preparation for the next step of testing, we also inject some dummy data into the database.

Step 9: Testing replica set performance

In this concluding step, we test the replica set performance:

Thanks for making it all the way here.

This guide has provided the process for architecting the backend of your application for high performance against MongoDB in production. By following along, you have learned how to deploy a production-grade MongoDB replica set on AWS EC2. The replica set and app meet the constraints of high availability and fault tolerance. So, your application remains accessible and responsive even in the face of unexpected failures.

Learning Tools

If you want to gain deeper context on the topics covered in this guide, the following resources may be helpful:

HA/FT

SH

MongoDB Connection URI

MongoDB Schema Design

MongoDB Nodejs Developer Course

Learning Strategy

In 2018, a client assigned me the task of setting up a production-grade MongoDB replica set on AWS EC2.

To revise this guide for 2023, I had to read a lot of articles, documentation, and Q&As. I also put in considerable effort to decipher unclear instructions or outdated information. The Learning Tools section above features some of the most helpful resources for my assignment. 

Despite the difficulties, I believe this guide provides a comprehensive and up-to-date approach. It focuses on setting up a production-grade MongoDB replica set on AWS EC2.

Reflective Analysis

There is a significant difference between the approach I have taken for this guide and my prior work.

Previously, I employed an IP-address approach to handle URI immutability. This required me to intercept ASG lifecycle events and bind static ENIs to newly spawned EC2 instances; the MongoDB URIs were bound to the ENI IP addresses.

A major drawback was the need to SSH into the current primary instance and then reconfigure the replica set to bind the DNS of the new instance to the IP address.

Trust me, it was a hassle.

This time around, the use of NLBs significantly reduces database admin tasks since the setup is now a one-time task. The DNS is always the same (NLBs are HA by default). So, there is no need for human intervention to add a new member to the replica set. This approach is vastly superior to my previous method.

Conclusions and Future Directions

In summary, MongoDB can serve as a reliable production database for various application types. Deploying it on AWS EC2 can lead to significant cost savings over time. This guide demonstrates how to achieve this with confidence.

An enhancement is to use infrastructure as code to auto-create the AWS resources and deploy the replica set. This would be a true DevOps approach. Terraform would be an excellent tool for achieving this.

Another improvement is to establish a custom VPC with public and private subnets to deploy the app and replica set. This would enhance the security of both the database and the app.

The author and LD Talent are available to deliver these enhancements at your request. Remember to check out the GitHub repository to see all the code used in this guide in one place. It’s a good idea to review the README before you start.

