Databricks announced that the company will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release. In addition, the company announced MLflow 2.0, which includes MLflow Pipelines, a new feature to accelerate and simplify ML model deployments. Finally, the company introduced Spark Connect, which enables the use of Spark on virtually any device, and Project Lightspeed, a next-generation Spark Structured Streaming engine for data streaming on the lakehouse.
GigaOm Radar for Evaluating Data Warehouse Platforms
This new GigaOm Radar Report, provided by our friends over at Vertica, examines the leading platforms in the data warehouse marketplace, describes the fundamentals of the technology, identifies key criteria and evaluation metrics by which organizations can evaluate competing platforms, describes potential technology developments to watch for, and classifies platforms across those criteria and metrics.
StreamSets Launches StreamSets Transformer
StreamSets, Inc., provider of the DataOps platform for modern data integration, released StreamSets® Transformer, a simple-to-use, drag-and-drop UI tool to create native Apache Spark applications. Designed for a wide range of users — even those without specialized skills — StreamSets Transformer enables the creation of pipelines for performing ETL, stream processing and machine-learning operations. Now, data engineers, scientists, architects and operators gain deep visibility into the execution of Apache Spark while broadening usage across the business.
Podcast: HPC & AI Convergence Enables AI Workload Innovation
In this Conversations in the Cloud podcast, Esther Baldwin from Intel describes how the convergence of HPC and AI is driving innovation. “On the topic of HPC & AI converged clusters, there’s a perception that if you want to do AI, you must stand up a separate cluster, which Esther notes is not true. Existing HPC customers can do AI on their existing infrastructure with solutions like HPC & AI converged clusters.”
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Piotr Balcer and Cheng Xu from Intel gave this talk at the 2019 Spark+AI Summit. “Intel Optane DC persistent memory breaks the traditional memory/storage hierarchy and scales up the computing server with higher capacity persistent memory. Also it brings higher bandwidth & lower latency than storage like SSD or HDD. And Apache Spark is widely used in the analytics like SQL and Machine Learning on the cloud environment.”
NEC Embraces Open Source Frameworks for SX-Aurora Vector Computing
In this video from ISC 2019, Dr. Erich Focht from NEC Deutschland GmbH describes how the company is embracing open source frameworks for the SX-Aurora TSUBASA Vector Supercomputer. “Until now, with the existing server processing capabilities, developing complex models on graphical information for AI has consumed significant time and host processor cycles. NEC Laboratories has developed the open-source Frovedis framework over the last 10 years, initially for parallel processing in Supercomputers. Now, its efficiencies have been brought to the scalable SX-Aurora vector processor.”
Deep Learning Open Source Framework Optimized on Apache Spark*
Intel recently released BigDL. It’s an open source, highly optimized, distributed, deep learning framework for Apache Spark*. It makes Hadoop/Spark into a unified platform for data storage, data processing and mining, feature engineering, traditional machine learning, and deep learning workloads, resulting in better economy of scale, higher resource utilization, ease of use/development, and better TCO.
State of the Art Natural Language Processing at Scale
The two-part presentation below from the Spark+AI Summit 2018 is a deep dive into key design choices made in the NLP library for Apache Spark. The library natively extends the Spark ML pipeline APIs, which enables zero-copy, distributed, combined NLP, ML & DL pipelines that leverage all of Spark’s built-in optimizations.
Databricks Partners with RStudio To Increase Productivity of Data Science Teams
Databricks, a leader in unified analytics and founded by the original creators of Apache Spark™, announced a partnership with RStudio, providers of a free and open-source integrated development environment for R, to increase the productivity of data science teams. The partnership will allow the two companies to seamlessly integrate Databricks’ Unified Analytics Platform with the RStudio Server, simplifying R programming on big data.