Inside Informatica’s Spark-Based Data Integration Platform: Running 250K Enterprise Pipelines Daily
In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today we spotlight Shivangi Srivastava, Senior Director of Software Engineering at Salesforce, as she details the creation of Cloud Data Integration, the decentralized architecture driving Informatica’s platform that supports workflows for over 5,500 corporate clients managing about 250,000 daily tasks.
Explore how the team transformed Informatica's data integration framework from a single-node setup into a scalable Spark environment on Kubernetes, while preserving backward compatibility for thousands of active pipelines and applying FinOps automation to balance operational costs against processing performance for massive data sets.
What is your team’s mission as it relates to building Cloud Data Integration (CDI)?
We make enterprise data accessible and reliable across hybrid and multi-cloud environments. Cloud Data Integration serves as the engine for this mission by connecting systems, transforming datasets, and moving information to its destination.
Enterprises manage hundreds of sources, including SaaS platforms and legacy systems. CDI provides the necessary connectors for these environments. This allows teams to build pipelines that cleanse and reshape data as it moves through the organization.
Productivity remains a central focus for us. We prioritize graphical pipelines over handwritten code within the CDI model. Engineers define mappings while the runtime engine handles orchestration and scaling. This approach lets teams design integrations instead of managing infrastructure.
High-level overview.
What constraints drove CDI’s evolution from a single-node engine to a distributed Spark platform on Kubernetes?
The original Informatica integration engine served a different era of data processing. It functioned as a single-node system, which worked for datasets measured in gigabytes.
Modern enterprises now operate at a massive scale. SaaS platforms and digital applications generate volumes that reach terabytes and petabytes. This shift required a move toward a distributed architecture.
Backward compatibility remained the primary constraint during this transition. Since thousands of production pipelines already existed on the platform, asking users to rebuild them was not an option. We solved this by preserving the logical abstraction layer used to design pipelines. Engineers still create graphical mappings, but the runtime now converts those mappings into distributed Spark execution plans.
Open-source Spark alone lacked necessary enterprise capabilities like lineage tracking and deep connector support. To fix this, we extended the engine into Spark++. This version combines the distributed processing model of Spark with our transformation framework and governance features. This extended runtime allows CDI to run complex integration pipelines at scale while keeping the logical abstractions engineers already use.
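The key idea, that the logical mapping survives while the physical execution changes, can be pictured with a small sketch. Everything here (the `Mapping` class, the operator names, the translation table) is hypothetical and purely illustrative, not Informatica's actual internals:

```python
# Hypothetical sketch: compiling a logical mapping into a distributed plan.
# Class, step, and operator names are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class Mapping:
    """A logical pipeline as designed in the graphical editor."""
    name: str
    steps: list = field(default_factory=list)  # e.g. ["read", "filter", "join", "write"]

def compile_to_spark_plan(mapping: Mapping) -> list:
    """Translate each logical step into a (mock) distributed operator.

    In a real runtime each logical transformation would map onto a Spark
    physical operator; here we just tag each step to show the idea."""
    translation = {
        "read":   "DistributedScan",
        "filter": "PartitionedFilter",
        "join":   "ShuffleHashJoin",
        "write":  "PartitionedSink",
    }
    return [translation.get(step, f"Passthrough({step})") for step in mapping.steps]

plan = compile_to_spark_plan(Mapping("orders_pipeline", ["read", "filter", "join", "write"]))
print(plan)  # ['DistributedScan', 'PartitionedFilter', 'ShuffleHashJoin', 'PartitionedSink']
```

The engineer's view (the `Mapping`) never changes; only the translation layer did when the runtime moved from single-node to distributed execution.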
What reliability constraints shaped the architecture of CDI as the foundational engine underneath much of the Informatica stack?
CDI operates as the backbone of many data workflows. Reliability remains a core design requirement because integration pipelines power analytics systems and operational processes.
To maintain stability, CDI focuses on three reliability principles:
- Data integrity: The platform tracks execution at the row level, allowing the system to isolate problematic records without corrupting pipelines.
- Tenant isolation: Compute clusters operate within strict VPC boundaries, and ephemeral nodes disappear after execution.
- Infrastructure resilience: High-availability services run across multiple availability zones, preventing localized outages.
These systems allow CDI to function as a stable data integration backbone. The platform maintains a 99.9% control-plane availability target.
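Row-level isolation can be sketched in a few lines: failing records are routed to a quarantine list instead of aborting the whole run. The function and field names below are illustrative, not CDI's actual API:

```python
# Minimal sketch of row-level error isolation: bad records are quarantined
# with their error, and the pipeline completes for all remaining rows.

def run_pipeline(rows, transform):
    """Apply `transform` per row; isolate failures instead of aborting."""
    output, quarantined = [], []
    for i, row in enumerate(rows):
        try:
            output.append(transform(row))
        except Exception as exc:
            quarantined.append({"row": i, "record": row, "error": str(exc)})
    return output, quarantined

rows = [{"amount": "10"}, {"amount": "oops"}, {"amount": "25"}]
good, bad = run_pipeline(rows, lambda r: {"amount": int(r["amount"])})
# good holds the two parsed rows; bad holds the one quarantined record
```

A distributed engine applies the same principle per partition, but the contract is identical: one malformed record must not poison the pipeline.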
High-level architecture.
What scalability constraints emerged as enterprise data volumes expanded from gigabytes to terabytes and petabytes?
Scaling CDI presents challenges as enterprise data volumes expand. Infrastructure planning grows complex because cloud providers offer many compute and storage options.
Enterprise deployments face three specific scaling challenges:
- Infrastructure complexity: Selecting the right instance types and cluster configurations remains difficult.
- Dynamic workloads: Pipelines experience unpredictable spikes and idle periods.
- Cost-performance tradeoffs: Improving throughput often inflates infrastructure costs.
CDI uses FinOps automation to solve these issues. Users define cost and performance goals instead of configuring clusters manually. The platform analyzes workloads and selects the best infrastructure configuration. It scales compute resources dynamically across Kubernetes-managed clusters.
This approach lets the platform process large datasets efficiently while freeing engineers from managing infrastructure directly.
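The goal-driven selection can be sketched as a constrained search: the user states a cost ceiling and a throughput target, and the platform picks the cheapest configuration that meets both. The instance names, prices, and throughput figures below are invented for illustration:

```python
# Hedged sketch of goal-driven infrastructure selection. The catalog of
# configurations is hypothetical; a real system would derive these from
# cloud provider offerings and observed workload profiles.

CONFIGS = [
    {"name": "small",  "cost_per_hr": 1.0, "throughput_gb_hr": 100},
    {"name": "medium", "cost_per_hr": 2.5, "throughput_gb_hr": 300},
    {"name": "large",  "cost_per_hr": 6.0, "throughput_gb_hr": 800},
]

def pick_config(max_cost_per_hr, min_throughput_gb_hr):
    """Return the cheapest configuration satisfying both goals, or None."""
    candidates = [c for c in CONFIGS
                  if c["cost_per_hr"] <= max_cost_per_hr
                  and c["throughput_gb_hr"] >= min_throughput_gb_hr]
    return min(candidates, key=lambda c: c["cost_per_hr"], default=None)

print(pick_config(max_cost_per_hr=3.0, min_throughput_gb_hr=200))
# → the 'medium' configuration
```

The point is the interface: users express goals, not cluster specs, and the platform owns the mapping from goals to infrastructure.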
What engineering challenges arise when balancing infrastructure cost with performance as data workloads scale?
Scaling distributed systems by simply adding hardware often leads to rapidly increasing infrastructure costs. CDI automates infrastructure optimization to balance performance and cost. This architecture removes the manual burden of cluster management.
Three core systems power the FinOps architecture:
- Cluster Lifecycle Manager: Predicts job demand to start or shut down clusters automatically.
- Cluster Tuner: Selects the appropriate cluster configuration, including instance types, storage, and networking.
- Job Tuner: Adjusts Spark runtime parameters such as CPU and memory based on historical execution data.
These systems optimize infrastructure continuously. Production deployments show this architecture reduces infrastructure costs by a factor of roughly 1.65 while maintaining performance.
Sample performance results.
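The Job Tuner's history-based adjustment can be sketched as a simple heuristic: size the next run from observed peaks plus headroom. The heuristic, headroom factor, and parameter names are assumptions for illustration, not Informatica's actual algorithm:

```python
# Illustrative job-tuner sketch: derive executor memory and core counts
# for the next run from a job's historical peak usage plus a safety margin.

def tune_job(history, headroom=1.2):
    """Suggest runtime parameters from observed peaks.

    history: list of {"peak_mem_gb": float, "peak_cores": int} per past run.
    """
    peak_mem = max(run["peak_mem_gb"] for run in history)
    peak_cores = max(run["peak_cores"] for run in history)
    return {
        "executor_memory_gb": round(peak_mem * headroom, 1),
        "executor_cores": max(1, peak_cores),
    }

runs = [{"peak_mem_gb": 6.0, "peak_cores": 4}, {"peak_mem_gb": 8.0, "peak_cores": 4}]
print(tune_job(runs))  # {'executor_memory_gb': 9.6, 'executor_cores': 4}
```

A production tuner would weigh recency, variance, and spot-instance pricing; the sketch shows only the feedback loop from history to configuration.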
What scale challenges emerge when operating CDI across thousands of enterprises and hundreds of thousands of jobs?
Scaling CDI at an enterprise level requires more than just initial construction. The platform must manage growth across customer environments, job concurrency, and data throughput.
CDI currently supports approximately 5,500 enterprise customers who execute roughly 250,000 integration jobs daily. Two architectural choices maintain stability at this scale:
- Control plane separation: The platform separates orchestration and scheduling from the data processing layer.
- Distributed execution: The data plane executes Spark workloads independently of the control services.
This design ensures orchestration services remain stable during compute cluster spikes. Advanced scheduling also prevents large workloads from monopolizing shared infrastructure. These capabilities allow CDI to maintain performance during enterprise growth.
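One way such scheduling can prevent monopolization is round-robin interleaving across tenants, sketched below. This is a simplification for illustration; the actual scheduler described above is more sophisticated:

```python
# Rough sketch of fair scheduling across tenants: interleave queued jobs
# round-robin so one tenant's large batch cannot starve smaller tenants.

from collections import deque

def fair_order(queues):
    """queues: {tenant: [job, ...]} → interleaved (tenant, job) order."""
    pending = {t: deque(jobs) for t, jobs in queues.items() if jobs}
    order = []
    while pending:
        # One job per tenant per round; drop tenants with empty queues.
        for tenant in list(pending):
            order.append((tenant, pending[tenant].popleft()))
            if not pending[tenant]:
                del pending[tenant]
    return order

print(fair_order({"acme": ["a1", "a2", "a3"], "globex": ["g1"]}))
# [('acme', 'a1'), ('globex', 'g1'), ('acme', 'a2'), ('acme', 'a3')]
```

Because orchestration runs in a separate control plane, this queueing logic stays responsive even when the data plane's Spark clusters are saturated.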
Learn more
- Stay connected — join our Talent Community!
- Check out our Technology and Product teams to learn how you can get involved.
- Want a deeper dive into this topic? Check out these white papers:
  - INFA-FinOps for Cloud Data Integration
  - CDI-E: An Elastic Cloud Service for Data Engineering
https://engineering.salesforce.com/inside-informaticas-spark-based-data-integration-platform-running-250k-enterprise-pipelines-daily/