
Inside Informatica’s Spark-Based Data Integration Platform: Running 250K Enterprise Pipelines Daily

engineering.salesforce.com · by Scott Nyberg · April 1, 2026 · 1 min read


In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today we spotlight Shivangi Srivastava, Senior Director of Software Engineering at Salesforce, as she details the creation of Cloud Data Integration, the decentralized architecture driving Informatica’s platform that supports workflows for over 5,500 corporate clients managing about 250,000 daily tasks.

Explore how the team transformed the Informatica data integration framework from a single-node setup into a scalable Spark environment on Kubernetes, preserving backward compatibility for thousands of active pipelines and applying FinOps automation to balance operational cost against processing speed for massive data sets.

What is your team’s mission as it relates to building Cloud Data Integration (CDI)?

We make enterprise data accessible and reliable across hybrid and multi-cloud environments. Cloud Data Integration serves as the engine for this mission by connecting systems, transforming datasets, and moving information to its destination.

Enterprises manage hundreds of sources, including SaaS platforms and legacy systems. CDI provides the necessary connectors for these environments. This allows teams to build pipelines that cleanse and reshape data as it moves through the organization.
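A cleanse-and-reshape pipeline of this kind can be sketched in plain Python. CDI's actual connectors and transformations are proprietary, so the record shapes and field names below are purely illustrative:

```python
def cleanse(records):
    """Drop records missing required fields and trim string values."""
    cleaned = []
    for rec in records:
        if rec.get("id") is None or rec.get("email") is None:
            continue  # discard incomplete records
        cleaned.append({k: v.strip() if isinstance(v, str) else v
                        for k, v in rec.items()})
    return cleaned

def reshape(records):
    """Project and rename fields into the target schema."""
    return [{"customer_id": r["id"], "contact": r["email"].lower()}
            for r in records]

raw = [
    {"id": 1, "email": " Alice@Example.COM "},
    {"id": None, "email": "broken@example.com"},  # dropped by cleanse
    {"id": 2, "email": "bob@example.com"},
]
pipeline = reshape(cleanse(raw))
```

In a real CDI mapping these stages would be connected graphically and executed by the runtime engine rather than composed by hand.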

Productivity remains a central focus for us. We prioritize graphical pipelines over handwritten code within the CDI model. Engineers define mappings while the runtime engine handles orchestration and scaling. This approach lets teams design integrations instead of managing infrastructure.

High-level overview.

What constraints drove CDI’s evolution from a single-node engine to a distributed Spark platform on Kubernetes?

The original Informatica integration engine served a different era of data processing. It functioned as a single-node system, which worked for datasets measured in gigabytes.

Modern enterprises now operate at a massive scale. SaaS platforms and digital applications generate volumes that reach terabytes and petabytes. This shift required a move toward a distributed architecture.

Backward compatibility remained the primary constraint during this transition. Since thousands of production pipelines already existed on the platform, asking users to rebuild them was not an option. We solved this by preserving the logical abstraction layer used to design pipelines. Engineers still create graphical mappings, but the runtime now converts those mappings into distributed Spark execution plans.
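The idea of preserving the logical layer while swapping the runtime can be modeled in a few lines. The sketch below interprets a declarative mapping eagerly in-process; CDI's real compiler would instead emit a distributed Spark execution plan from the same logical steps (operation names and fields here are invented for illustration):

```python
# A logical mapping: an ordered list of declarative steps.
mapping = [
    {"op": "source", "name": "orders"},
    {"op": "filter", "predicate": lambda r: r["amount"] > 0},
    {"op": "project", "fields": ["order_id", "amount"]},
    {"op": "target", "name": "warehouse.orders_clean"},
]

def compile_mapping(mapping, tables):
    """Interpret the logical steps against in-memory record sets.

    A production runtime would translate the same steps into a
    distributed plan instead of executing them eagerly here."""
    rows = []
    for step in mapping:
        if step["op"] == "source":
            rows = list(tables[step["name"]])
        elif step["op"] == "filter":
            rows = [r for r in rows if step["predicate"](r)]
        elif step["op"] == "project":
            rows = [{f: r[f] for f in step["fields"]} for r in rows]
        elif step["op"] == "target":
            pass  # a real plan would write to the named destination
    return rows

tables = {"orders": [
    {"order_id": "A1", "amount": 120, "region": "EMEA"},
    {"order_id": "A2", "amount": -5,  "region": "AMER"},
]}
result = compile_mapping(mapping, tables)
```

Because users only ever touch the `mapping` layer, the execution backend underneath it can change without breaking existing pipelines.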

Open-source Spark alone lacked necessary enterprise capabilities like lineage tracking and deep connector support. To fix this, we extended the engine into Spark++. This version combines the distributed processing model of Spark with our transformation framework and governance features. This extended runtime allows CDI to run complex integration pipelines at scale while keeping the logical abstractions engineers already use.
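Lineage tracking of the kind Spark++ adds can be hedged into a toy model: wrap each transformation so the runtime records what produced each intermediate result. This is a minimal sketch, not Informatica's implementation; the class and field names are assumptions:

```python
class LineageTracker:
    """Record which transformation produced each intermediate result."""
    def __init__(self):
        self.steps = []

    def apply(self, name, fn, rows):
        out = fn(rows)
        self.steps.append({"step": name,
                           "rows_in": len(rows),
                           "rows_out": len(out)})
        return out

tracker = LineageTracker()
rows = [{"x": 1}, {"x": -2}, {"x": 3}]
rows = tracker.apply("drop_negative",
                     lambda rs: [r for r in rs if r["x"] > 0], rows)
rows = tracker.apply("double",
                     lambda rs: [{"x": r["x"] * 2} for r in rs], rows)
```

Even this toy version shows why lineage must live in the engine: only the layer that runs every step can observe row counts and provenance for all of them.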

What reliability constraints shaped the architecture of CDI as the foundational engine underneath much of the Informatica stack?

CDI operates as the backbone of many data workflows. Reliability remains a core design requirement because integration pipelines power analytics systems and operational processes.

To maintain stability, CDI focuses on three reliability principles:

  • Data integrity: The platform tracks execution at the row level. This allows the system to isolate problematic records without corrupting pipelines.

  • Tenant isolation: Compute clusters operate within strict VPC boundaries. Ephemeral nodes disappear after execution.

  • Infrastructure resilience: High-availability services run across multiple availability zones. This prevents localized outages.

These systems allow CDI to function as a stable data integration backbone. The platform maintains a 99.9% control-plane availability target.
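The row-level integrity principle above boils down to: never let one bad record abort a pipeline. A minimal sketch of that pattern, with illustrative data:

```python
def run_with_quarantine(rows, transform):
    """Apply transform row by row; divert failing rows to a
    quarantine instead of failing the whole pipeline."""
    good, quarantined = [], []
    for i, row in enumerate(rows):
        try:
            good.append(transform(row))
        except Exception as exc:
            quarantined.append({"row_index": i, "row": row,
                                "error": str(exc)})
    return good, quarantined

rows = [{"amount": "10"}, {"amount": "oops"}, {"amount": "3"}]
good, bad = run_with_quarantine(rows,
                                lambda r: {"amount": int(r["amount"])})
```

The quarantined records keep enough context (index, payload, error) for operators to reprocess them later.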

High-level architecture.

What scalability constraints emerged as enterprise data volumes expanded from gigabytes to terabytes and petabytes?

Scaling CDI presents challenges as enterprise data volumes expand. Infrastructure planning grows complex because cloud providers offer many compute and storage options.

Enterprise deployments face three specific scaling challenges:

  • Infrastructure complexity: Selecting the right instance types and cluster configurations remains difficult.

  • Dynamic workloads: Pipelines experience unpredictable spikes and idle periods.

  • Cost-performance tradeoffs: Improving throughput often inflates infrastructure costs.

CDI uses FinOps automation to solve these issues. Users define cost and performance goals instead of configuring clusters manually. The platform analyzes workloads and selects the best infrastructure configuration. It scales compute resources dynamically across Kubernetes-managed clusters.

This approach processes large datasets efficiently. Engineers no longer manage infrastructure directly.
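Goal-driven infrastructure selection can be sketched as a constrained search over a catalog: the user supplies a throughput goal and a cost ceiling, and the system picks the cheapest configuration that satisfies both. The catalog values below are invented; CDI's real FinOps engine works from observed workload profiles:

```python
CANDIDATES = [  # hypothetical catalog: ($/hour, rows/sec)
    {"name": "small",  "cost": 1.0, "throughput": 10_000},
    {"name": "medium", "cost": 2.5, "throughput": 30_000},
    {"name": "large",  "cost": 6.0, "throughput": 80_000},
]

def pick_config(required_throughput, max_cost):
    """Return the cheapest configuration meeting the throughput
    goal within the cost ceiling, or None if nothing qualifies."""
    viable = [c for c in CANDIDATES
              if c["throughput"] >= required_throughput
              and c["cost"] <= max_cost]
    return min(viable, key=lambda c: c["cost"]) if viable else None
```

Expressing goals rather than configurations is what lets the platform retune as workloads or instance pricing change.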

What engineering challenges arise when balancing infrastructure cost with performance as data workloads scale?

Scaling distributed systems by simply adding hardware often leads to rapidly increasing infrastructure costs. CDI automates infrastructure optimization to balance performance and cost. This architecture removes the manual burden of cluster management.

Three core systems power the FinOps architecture:

  • Cluster Lifecycle Manager: It predicts job demand to start or shut down clusters automatically.

  • Cluster Tuner: This system selects the appropriate cluster configuration, such as instance types, storage, and networking.

  • Job Tuner: It adjusts Spark runtime parameters like CPU and memory based on historical data.

These systems optimize infrastructure continuously. In production deployments, this architecture reduces infrastructure costs by a factor of roughly 1.65 while maintaining performance.
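The Job Tuner's feedback loop can be sketched as sizing from observed peaks plus headroom. The field names and the headroom factor are illustrative assumptions, not CDI's actual policy:

```python
def tune_job(history, headroom=1.2):
    """Suggest executor memory and cores from the peak usage seen
    across past runs, padded by a safety headroom factor."""
    peak_mem = max(run["peak_mem_gb"] for run in history)
    peak_cpu = max(run["peak_cores"] for run in history)
    return {
        "executor_memory_gb": round(peak_mem * headroom, 1),
        "executor_cores": peak_cpu,
    }

history = [
    {"peak_mem_gb": 6.0, "peak_cores": 4},
    {"peak_mem_gb": 7.5, "peak_cores": 4},
]
suggestion = tune_job(history)
```

In a Spark deployment the suggestion would map onto settings like `spark.executor.memory` and `spark.executor.cores`.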

Sample performance results.

What scale challenges emerge when operating CDI across thousands of enterprises and hundreds of thousands of jobs?

Scaling CDI at an enterprise level requires more than just initial construction. The platform must manage growth across customer environments, job concurrency, and data throughput.

CDI currently supports approximately 5,500 enterprise customers who execute roughly 250,000 integration jobs daily. Two architectural choices maintain stability at this scale:

  • Control plane separation: The platform separates orchestration and scheduling from the data processing layer.

  • Distributed execution: The data plane executes Spark workloads independently of the control services.

This design ensures orchestration services remain stable during compute cluster spikes. Advanced scheduling also prevents large workloads from monopolizing shared infrastructure. These capabilities allow CDI to maintain performance during enterprise growth.
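The control-plane/data-plane split described above can be modeled minimally: orchestration accepts and tracks jobs but never executes them, while a separate execution layer pulls work independently. Class names and the stand-in workload are illustrative:

```python
from collections import deque

class ControlPlane:
    """Orchestration only: accepts and schedules jobs, never runs them."""
    def __init__(self):
        self.queue = deque()
        self.status = {}

    def submit(self, job_id, payload):
        self.queue.append((job_id, payload))
        self.status[job_id] = "queued"

class DataPlane:
    """Execution only: pulls scheduled work and processes it."""
    def __init__(self, control):
        self.control = control

    def run_next(self):
        job_id, payload = self.control.queue.popleft()
        self.control.status[job_id] = "running"
        result = sum(payload)  # stand-in for a Spark workload
        self.control.status[job_id] = "done"
        return result

control = ControlPlane()
worker = DataPlane(control)
control.submit("job-1", [1, 2, 3])
control.submit("job-2", [10, 20])
first = worker.run_next()
```

Because the control plane only touches lightweight queue and status state, it stays responsive even when the execution side is saturated, which is the property the separation exists to guarantee.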

Learn more
  • Stay connected — join our Talent Community!

  • Check out our Technology and Product teams to learn how you can get involved.

  • Want a deeper dive into this topic? Check out these white papers:

  • INFA-FinOps for Cloud Data Integration

  • CDI-E: An Elastic Cloud Service for Data Engineering
