Define and drive the data architecture vision and strategy, ensuring it supports analytics, ML, and business operations at scale.
Architect and manage cloud-native data platforms using Databricks and AWS, leveraging the lakehouse architecture to unify data engineering and ML workflows.
Build and optimize large-scale batch and streaming pipelines using Apache Spark, Airflow, and AWS Glue, ensuring high availability and fault tolerance.
Design and develop data marts, warehouses, and analytics-ready datasets tailored for BI, product, and data science teams.
Implement robust ETL/ELT pipelines with a focus on reusability, modularity, and automated testing.
Enforce and scale data governance practices, including data lineage, cataloging, access management, and compliance with security and privacy standards.
Partner with ML Engineers and Data Scientists to build and deploy ML pipelines, leveraging Databricks MLflow, Feature Store, and MLOps practices.
Provide architectural leadership across data modeling, data observability, pipeline monitoring, and CI/CD for data workflows.
Evaluate emerging tools and frameworks, recommending technologies that align with platform scalability and cost-efficiency.
Mentor data engineers and foster a culture of technical excellence, innovation, and ownership across data teams.
8+ years of hands-on experience in data engineering, with at least 4 years in a lead or architect-level role.
Deep expertise in Apache Spark, with proven experience developing large-scale distributed data processing pipelines.
Strong experience with the Databricks platform and its ecosystem (e.g., Delta Lake, Unity Catalog, MLflow, job orchestration, workspaces, clusters, and the lakehouse architecture).
Extensive experience with workflow orchestration using Apache Airflow.
Proficiency in both SQL and NoSQL databases (e.g., Postgres, DynamoDB, MongoDB, Cassandra) with a deep understanding of schema design, query tuning, and data partitioning.
Proven background in building data warehouse/data mart architectures using AWS services like Redshift, Athena, Glue, Lambda, DMS, and S3.
Strong programming and scripting skills in Python (preferred) or another language well supported across AWS services (e.g., Scala, Java).
Solid understanding of data modeling techniques, versioned datasets, and performance tuning strategies.
Hands-on experience implementing data governance, lineage tracking, data cataloging, and compliance with regulations such as GDPR and HIPAA.
Experience with real-time data streaming using tools like Kafka, Kinesis, or Flink.
Working knowledge of MLOps tooling and workflows, including automated model deployment, monitoring, and ML pipeline orchestration.
Familiarity with MLflow, Feature Store, and Databricks-native ML tooling is a plus.
Strong grasp of CI/CD for data and ML pipelines, automated testing, and infrastructure as code (e.g., Terraform, AWS CDK).
Excellent communication, leadership, and mentoring skills with a collaborative mindset and the ability to influence across functions.