Case Study

On-Demand Infrastructure Deployment and Configuration for Cybus QA Team

Client
Cybus

Region
Osterstraße 124, 20255 Hamburg, Germany

Headquartered in Germany, Cybus Connectware serves as a Factory Data Hub, connecting IoT and IT systems company-wide. Cybus empowers organizations to effortlessly collect, standardize, process, and distribute data, enabling diverse Industrial IoT use cases with a seamless global data flow across all facilities. The Cybus QA team needs to test its applications on various platforms, under different scenarios and versions.
To address the specific needs of the Cybus QA team, we developed a comprehensive solution for on-demand infrastructure deployment and configuration. This solution integrates infrastructure provisioning using Terraform, ensuring consistent and scalable resource allocation. After deployment, environments are configured using Ansible or Helm, and applications run in Docker containers (EC2 deployments) or Pods (Kubernetes deployments), making them ready for immediate testing.
For monitoring, we implemented Loki, which provides real-time log tracking for each deployment, allowing the QA team to quickly identify and resolve issues. Additionally, a real-time monitoring API ensures visibility throughout the deployment process, enabling prompt detection and response to any potential problems. To optimize resource usage, an auto-cleaner job was introduced to automatically detect and clean up inactive or “zombie” deployments, preventing unnecessary resource consumption.
This integrated approach has significantly improved the Cybus QA team’s ability to deploy, manage, and monitor environments efficiently, leading to faster testing cycles.

Cloud Services: AWS (S3, EC2, EKS, VPN)

Technologies Used

Terraform

Ansible

Docker

Git

Fluentd

Prometheus

Grafana

cAdvisor

Loki

Challenges & Solutions

Handling Large Scale Infrastructure Provisioning

Problem

The DevOps team faced challenges in deploying Connectware and agents across different environments for the QA team, as the number of agents could range from 0 to 999. Manually provisioning infrastructure for each release of Connectware and agent versions proved to be repetitive, time-consuming, and costly.

Solution

To address this, we utilized Git as the source for Terraform modules, ensuring that the modules remained extendable and easily updatable without needing to redeploy the entire solution. By using Jinja templates, we dynamically generated Terraform scripts that leverage these defined modules, enabling efficient and scalable infrastructure provisioning across varying agent counts. This approach streamlined the deployment process, making it more cost-effective and adaptable to changing requirements.
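The rendering step above can be sketched in a few lines. This is a minimal illustration, assuming the jinja2 library; the module source URL and variable names are invented placeholders, not the actual Cybus Terraform modules.

```python
# Sketch: rendering a Terraform module call from a Jinja template.
# Module source and variable names are illustrative assumptions.
from jinja2 import Template

TF_TEMPLATE = Template("""\
module "connectware" {
  source      = "git::https://example.com/terraform-modules.git//connectware"
  agent_count = {{ agent_count }}
}
""")

def render_terraform(agent_count: int) -> str:
    # Guard the documented range of 0-999 agents.
    if not 0 <= agent_count <= 999:
        raise ValueError("agent_count must be between 0 and 999")
    return TF_TEMPLATE.render(agent_count=agent_count)
```

Because the module is referenced by its Git source, updating the module repository changes what gets provisioned without redeploying the generator itself.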

Handling Large Infrastructure Configuration

Problem

Configuring on-demand, large-scale infrastructure can be challenging, especially when managing numerous virtual machines or Kubernetes clusters. The complexity increases with the scale, making manual configuration impractical.

Solution

We addressed this by using Jinja templates to generate dynamic Ansible scripts, which automate the configuration of infrastructure. These scripts seamlessly integrate Docker Compose or Helm values files, enabling efficient and scalable setup across large environments. This approach simplifies the process, reduces manual effort, and ensures consistent configurations.
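As a rough sketch of the same templating idea on the configuration side, the snippet below renders an Ansible inventory for one Connectware host and a variable number of agents. Host group names and IPs are illustrative assumptions; in practice the addresses would come from Terraform outputs.

```python
# Sketch: generating an Ansible inventory with Jinja (hypothetical groups/IPs).
from jinja2 import Template

INVENTORY = Template("""\
[connectware]
{{ connectware_ip }}

[agents]
{% for ip in agent_ips %}{{ ip }}
{% endfor %}""")

def render_inventory(connectware_ip: str, agent_ips: list) -> str:
    return INVENTORY.render(connectware_ip=connectware_ip, agent_ips=agent_ips)
```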


Keeping The Software Updated

Problem

The Terraform modules, Helm values, and Docker Compose files are subject to change, and new files or modules can be easily added to the Git repositories. Without an efficient update mechanism, the solution would need to be redeployed repeatedly to incorporate these changes, leading to unnecessary overhead.

Solution

The solution clones the relevant repositories during the initial deployment. For each on-demand deployment, the repositories are pulled to fetch the latest updates. This approach ensures that any changes or additions are seamlessly integrated without requiring a complete redeployment, enabling a more efficient and streamlined update process.
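The clone-once, pull-per-deployment logic can be sketched as below. The command runner is injected so the function can be exercised without a real Git remote; `sync_repo` and its return values are illustrative names, not the actual implementation.

```python
# Sketch: clone a repo on first use, pull latest changes on later deployments.
import os
import subprocess
from typing import Callable, List

def _default_run(cmd: List[str]) -> None:
    subprocess.run(cmd, check=True)

def sync_repo(url: str, path: str,
              run: Callable[[List[str]], None] = _default_run) -> str:
    """Return 'cloned' on first deployment, 'pulled' on subsequent ones."""
    if os.path.isdir(os.path.join(path, ".git")):
        run(["git", "-C", path, "pull", "--ff-only"])
        return "pulled"
    run(["git", "clone", url, path])
    return "cloned"
```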

Log And Docker Metrics Monitoring For Each Deployment

Problem

Since these deployments are intended for QA testing, errors or bugs may arise with each release. Manually checking logs and performance metrics across multiple virtual machines or pods is not feasible, creating a need for centralized log and stats monitoring.

Solution

Our solution configures each virtual machine’s Docker setup to collect logs using Fluentd and monitor Docker stats via cAdvisor. The collected logs and metrics are then shipped to a centralized Grafana instance, where Loki and Prometheus are set up to aggregate data from all deployments. This configuration allows the QA team to efficiently monitor logs in Loki and view performance metrics in real time, simplifying the process of identifying and troubleshooting issues across the infrastructure.
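A per-VM docker-compose fragment for this setup might look roughly like the sketch below. Image names, the Fluentd address, and ports are illustrative assumptions; only the Fluentd logging driver options and the standard cAdvisor mounts reflect the actual mechanism described.

```yaml
# Sketch: app containers log via Fluentd; cAdvisor exposes metrics for Prometheus.
services:
  app:
    image: example/connectware-agent:latest   # placeholder image
    logging:
      driver: fluentd
      options:
        fluentd-address: "localhost:24224"
        tag: "docker.{{.Name}}"
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```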

Concurrent On-Demand Deployment Requests

Problem

Each deployment task is a long-running job. Utilizing Terraform and Ansible adds significant resource load. For load testing, we aim to support a minimum of 10 deployment requests simultaneously.

Solution

To meet this requirement, we implemented a Python background task system and deployed it using Gunicorn workers to handle requests efficiently. The application instance is hosted on a single EC2 instance (type c7i.xlarge) with 4 vCPUs. With this setup, we handled 10 concurrent deployment requests with different criteria while using just 32% of total CPU capacity.
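The background-task idea can be sketched with a worker pool: the request handler submits a long-running deployment job and returns a job id immediately, while workers run the jobs. This is a minimal stdlib illustration, not the production system; class and method names are assumptions.

```python
# Sketch: background deployment jobs behind a worker pool (illustrative names).
import uuid
from concurrent.futures import ThreadPoolExecutor

class DeploymentQueue:
    def __init__(self, max_workers: int = 10):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, deploy_fn, *args) -> str:
        """Start a deployment in the background and return its job id."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(deploy_fn, *args)
        return job_id

    def status(self, job_id: str) -> str:
        fut = self._jobs[job_id]
        if fut.running():
            return "running"
        if fut.done():
            return "failed" if fut.exception() else "finished"
        return "queued"
```

In production the same pattern sits behind Gunicorn workers so HTTP requests are never blocked by a running Terraform or Ansible job.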

Handling Failed or Zombie Deployments

Problem

Long-running deployments can fail due to configuration issues, system crashes, or network problems. Tracking every user’s deployments manually is not feasible, so zombie deployments increase costs: they consume cloud resources without adding any business value.

Solution

A cleaner job runs every day after office hours to check for zombie deployments and attempts to destroy all failed deployments (with 2 retries). It then notifies a Slack channel about the status of the cleaner run.
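The retry-and-notify logic can be sketched as follows. The destroy and notify callables are injected placeholders; the real ones would run `terraform destroy` and post to the Slack channel.

```python
# Sketch: destroy each zombie with up to 2 retries, then report results.
# `destroy` and `notify` are hypothetical injected callables.
def clean_zombies(zombies, destroy, notify, retries: int = 2):
    results = {}
    for name in zombies:
        for attempt in range(1 + retries):
            try:
                destroy(name)
                results[name] = "destroyed"
                break
            except Exception:
                results[name] = "failed"
    notify(f"Cleaner run: {results}")
    return results
```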

Handling Concurrent Git Pull for Deployment Requests

Problem

During concurrent deployment requests, each process attempts to pull updates from Git, resulting in conflicts when .git/index.lock is held. While one process holds this lock, others are unable to access Git, causing failures. If a pull operation fails (due to an interrupted operation, simultaneous operations, or a system crash), the .git/index.lock file must often be deleted manually to resolve the blockage.

Solution

To manage this, we use the Linux flock mechanism to globally lock the Git pull process. This ensures that only one deployment process can run the pull operation at a time. If a pull operation fails, the lock is automatically released, allowing other deployment processes to proceed without relying on .git/index.lock. The solution removes the dependency on Git’s own lock file (deleting it if it exists) and enables smoother concurrent deployments.
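The locking idea can be sketched with Python’s fcntl wrapper around flock (Linux/Unix only). The lock file path is an illustrative assumption.

```python
# Sketch: serialize `git pull` across concurrent deployments with an
# advisory flock lock. Lock file path is illustrative.
import fcntl
import subprocess
from contextlib import contextmanager

@contextmanager
def global_lock(path: str = "/tmp/deploy-git.lock"):
    with open(path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)   # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

def locked_pull(repo_path: str) -> None:
    """Only one process at a time runs the pull; others wait on the lock."""
    with global_lock():
        subprocess.run(["git", "-C", repo_path, "pull"], check=True)
```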

Real-Time Deployment Log Streaming of Our Solution

Problem

Provisioning and configuration logs from Terraform and Ansible are essential for users initiating deployments. Given that the log volume can vary based on deployment requests, multiple users may need to access these logs simultaneously.

Solution

We use Python’s subprocess to execute Ansible and Terraform commands, capturing all logs via a collector function that writes them to a dedicated log file for each deployment process. Additionally, we provide an API to stream these logs, enabling real-time viewing while ensuring a single-write, multiple-read setup. A copy of each log file is stored in an S3 bucket for further processing or inquiry.
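The single-write, multiple-read capture can be sketched as below: the deployment process streams subprocess output into one log file, and the read-side API tails that file. Function names are illustrative.

```python
# Sketch: capture subprocess output to a per-deployment log file (one writer),
# and expose a reader for the streaming API (many readers).
import subprocess

def run_and_log(cmd, log_path: str) -> int:
    """Run a command, writing each output line to the deployment's log file."""
    with open(log_path, "w") as log:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True)
        for line in proc.stdout:
            log.write(line)
            log.flush()          # make lines visible to readers immediately
        return proc.wait()

def stream_log(log_path: str):
    """Yield log lines for the read-side API."""
    with open(log_path) as fh:
        yield from fh
```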

Security

Problem

As the Cybus QA team manages an infrastructure deployment solution for Industrial IoT data, numerous security concerns arise beyond basic authentication. The solution’s extensive integration across environments demands rigorous measures to ensure data integrity, protection, and system resilience. While this comprehensive approach has enhanced the Cybus QA team’s deployment and monitoring efficiency, it introduces critical security challenges. Addressing these requires a solution with multi-layered security measures.

Solution

To address the security challenges in Cybus’s deployment solution, we implemented AWS Cognito for robust authentication, ensuring only authorized users access the system. The central control machine, responsible for executing the on-demand deployments, is hosted in a secure private subnet, isolating it from direct internet access. User access to APIs is enabled through a VPN hosted in a public subnet, providing a secure gateway to the infrastructure. All deployments occur within additional private subnets, further protecting sensitive resources by keeping them inaccessible from outside the VPN. This architecture not only secures user authentication but also restricts network access, ensuring that critical infrastructure components remain protected while still accessible for authorized QA team operations. A diagram is given below.

Team Involvement

Resources            Count
Backend Developers   2
Product Manager      1

Core Features of the Software

Infrastructure Deployment

Uses Terraform for consistent and scalable provisioning of resources across different environments.

Configuration Management

Configures environments with Ansible, Docker, or Helm, ensuring they are ready for testing and deployment.

Log Monitoring

Integrates Grafana with Loki and Prometheus for real-time monitoring. This setup enables quick troubleshooting and performance tracking by aggregating logs and metrics from all deployments in one centralized dashboard.

Real-Time Deployment Monitoring

Offers comprehensive visibility into the deployment process, allowing for prompt identification and resolution of issues as they arise.

Auto-Cleaner Job

Automatically detects and removes inactive (“zombie”) deployments, optimizing resource usage and ensuring efficient infrastructure management.

Slack Notifications

Sends notifications about each deployment, providing real-time updates and information directly to the team, enhancing communication and visibility throughout the deployment lifecycle.

Development Timeline

A high-level timeline of the project:

1. Design and Planning: 1 month

2. Development: 4 months

3. Testing and Quality Assurance: iterative; a testing phase follows each sprint, carried out by both the Cybus QA team and us.