A new, cloud-based development environment for Apache Spark from IBM aims to offer data scientists high-performance analytics in near real time, the company announced yesterday. Called the Data Science Experience, the new environment will be available on IBM’s Bluemix cloud platform with 250 curated data sets, open source tools and a collaborative workspace.
Big Blue has invested some $300 million to develop Apache Spark as a sort of operating system for analytics. Spark was originally developed at the University of California, Berkeley’s AMPLab before being donated to the Apache Software Foundation as an open source framework. IBM said it created the Data Science Experience to offer data scientists the computing speed and flexibility of the Spark platform, with faster access to more data.
Accelerating Innovation
“The Data Science Experience’s open and collaborative environment allows data scientists to accelerate and simplify data ingestion, curation, and analysis by bringing together the content, data, models, and open source resources from IBM and others including H2O, RStudio, Jupyter Notebooks on Apache Spark in a single security-rich managed environment,” the company said in a statement.
IBM said it is already working with a number of enterprises and organizations to develop and use data science applications built on the Apache Spark environment to generate business insights and improve efficiencies.
“With Apache Spark, we see an opportunity to significantly transform the role of the data scientist by providing access to curated data sets, open source tools and a collaborative platform to accelerate innovation,” said Bob Picciano, senior vice president, IBM Analytics, in the statement. “IBM’s Data Science Experience is the killer enterprise app for Apache Spark, and gives data scientists new opportunities to deliver insight-driven models to developers, and opens the door for unprecedented innovation from the open source community.”
Joining the R Consortium
In addition, IBM said the new development environment will be a major benefit to scientists working in the R programming language, thanks to IBM’s contributions to SparkR, SparkSQL, and Apache SparkML. R is an open source programming language and software environment used by data scientists to develop statistical software. Big Blue said it will also be joining the R Consortium, a group organized to support the development and growth of the R ecosystem.
IBM has incorporated the Spark framework into a number of products and services, including Watson, Commerce, Analytics, Systems, and Cloud. The company has also contributed extensively to the project through IBM BigInsights for Apache Hadoop, IBM Analytics on Apache Spark, Spark with Power Systems, Watson Analytics, SPSS Modeler and IBM Stream Computing. In addition, it open sourced its SystemML machine learning technology to advance Spark’s machine learning capabilities.
“Just as IBM played a critical role in the development of computer science, we can see many similarities today. Computer science went mainstream with the introduction of the PC,” said Picciano. “With data science, the major roadblock is having access to large data sets and having the ability to work with so much data. With today’s announcement, clients can have both.”