• Led development of Spark pipelines in Python & Scala processing 5M transactions/day (~2 TB) for national and international traders on the exchange market application.
• Designed scalable data models (fact/dimension schemas, bronze–silver–gold layers) to support reliable analytics and downstream systems, and produced Balanced Scorecard reports for management.
• Optimized SQL queries through partitioning and indexing, improving performance by 60%.
• Architected Azure Databricks/Data Lake pipelines with governance and cost optimization; collaborated with the cybersecurity team to ensure compliance and secure access.
• Built real-time Kafka streams with end-to-end latency under 5s, enabling near real-time market data delivery for traders and compliance monitoring.
• Built Spark/PySpark pipelines merging finance and procurement data from SAP, SAP Ariba, and other sources, reducing duplicate records by 80%.
• Implemented batch & streaming ETL with Spark Structured Streaming, Kafka, and Azure Data Factory, structuring data and building scalable models for both transactional and analytical needs.
• Optimized SQL for large datasets, cutting query runtime from hours to minutes, and enabled Balanced Scorecard reporting with KPIs and executive dashboards.
• Assisted in Airflow DAG deployments, Dockerized workflows, and CI/CD integration, ensuring production-grade ETL pipelines while working with cybersecurity on compliance and secure access.
• Directed training on systems, reports, and interfaces for hundreds of users across 570 companies and for a 7-person engineering team, reducing processing time by 50% and improving SLA adherence through best practices in data modeling, reporting, and governance.
• Developed Spark pipelines handling ~500k–1M daily finance and logistics events, enabling timely freight cost analysis, shipment tracking, and financial reconciliations.
• Tuned SQL queries (joins, window functions, partitioned aggregations), reducing reporting latency by 40% and improving visibility for finance and supply chain managers.
• Built Airflow workflows for production ETL with monitoring and alerts, and assisted in Docker-based deployment and CI/CD, ensuring reliable data delivery across global teams.
• Designed and maintained data lake zones (raw, curated, sandbox) with strong governance, and built scalable data models (fact/dimension schemas) to support analytics and reporting.