Databricks, the company founded by the original developers of Apache Spark, has launched Delta Lake, an open source storage layer for Spark that provides ACID transactions and other data-management features for machine learning and other big data work.
Many kinds of data work need features like ACID transactions or schema enforcement for consistency, metadata management for security, and the ability to work with discrete versions of data. Features like these don't come standard with every data source out there, so Delta Lake provides them for any Spark DataFrame data source.
Delta Lake can be used as a drop-in replacement for accessing storage systems like HDFS. Data ingested into Spark through Delta Lake is stored in Parquet format in a cloud storage service of your choice. Developers can use their choice of Java, Python, or Scala to access Delta Lake's API set.
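The snippet below is a minimal sketch of that workflow in Python, assuming the Delta Lake package is on the Spark classpath and using an illustrative local path; a Delta table is written and read back through the ordinary DataFrame API by naming "delta" as the format.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

# Write a DataFrame out as a Delta table; the data lands as Parquet files
# alongside a transaction log that provides the ACID guarantees.
df = spark.range(0, 100)
df.write.format("delta").save("/tmp/delta/events")  # path is illustrative

# Read the table back through the same DataFrame reader API.
events = spark.read.format("delta").load("/tmp/delta/events")
events.show()
```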
Delta Lake supports most of the existing Spark SQL DataFrame functions for reading and writing data. It also supports Spark Structured Streaming as a source or destination, although not the DStream API. Every read and write through Delta Lake carries an ACID transaction guarantee, so multiple writers will have their writes serialized and multiple readers will see consistent snapshots.
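As a sketch of the streaming case, assuming the same illustrative paths and using Spark's built-in "rate" test source as stand-in input, a Delta table can serve as a Structured Streaming sink, with each micro-batch commit recorded as a transaction:

```python
# Generate a test stream with the built-in "rate" source and write it into a
# Delta table; the checkpoint location tracks streaming progress.
query = (spark.readStream
         .format("rate")
         .load()
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/events_stream")
         .start("/tmp/delta/events_stream"))
```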
Reading a particular version of a data set, what the Delta Lake documentation calls "time travel," works by simply reading a DataFrame with an associated time stamp or version ID. Delta Lake also ensures the schema of the DataFrame being written matches the table it's being written to; if there's a mismatch, it throws an exception rather than change the schema. (Spark's file APIs would replace the table in such a case.)
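A sketch of both behaviors, reusing the illustrative table from above: the versionAsOf and timestampAsOf reader options select an older snapshot, and an append with a mismatched column name is rejected rather than silently rewriting the table's schema.

```python
# Time travel: read an earlier snapshot of the table by version number...
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/delta/events"))

# ...or by an (illustrative) timestamp at or after the table's first commit.
old = (spark.read.format("delta")
       .option("timestampAsOf", "2019-04-29 00:00:00")
       .load("/tmp/delta/events"))

# Schema enforcement: appending a DataFrame whose schema does not match the
# table raises an AnalysisException instead of altering the table.
mismatched = spark.range(0, 10).withColumnRenamed("id", "identifier")
# mismatched.write.format("delta").mode("append").save("/tmp/delta/events")  # would throw
```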
Future releases of Delta Lake may support more of Spark's public API set, although DataFrameReader/Writer are the main focus for now.