Open Source: A primer for Big Data

It is nearly impossible to talk about Big Data without making frequent reference to a broad ecosystem of computer code that has been made available for use and modification by the general public at no charge.

History of open source

In the early days of computing, computer code could be considered an idea or method, which prevented it from being copyrighted. In 1980, copyright law in the USA was extended to include computer programs (ref).

In 1983, three years later, Richard Stallman of MIT made a strong move in the opposite direction by establishing a movement aimed at promoting free and open collaboration in software development. In particular, he created a project (1983), a manifesto (1985) and a legal framework (1989) for producing software which anyone was free to run, copy, distribute or modify, subject to some basic conditions (such as distributing any modified versions under the same terms). For reasons beyond comprehension, he named it with the recursive acronym ‘GNU’s Not Unix’ and called it the GNU project (ref). The legal framework was the first of several versions of the General Public License (GPL).

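One of the foundational pieces of software released under the GPL was the kernel of the now ubiquitous operating system called Linux. Linus Torvalds first released the kernel in 1991 and placed it under the GPL in 1992; combined with tools from the GNU project, it forms a complete operating system. I would be exaggerating only slightly if I said that Linux is now or has at one time been used by just about every software developer on the planet.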

The other ubiquitous and functionally foundational piece of software released as open source in the 1990s was the Apache HTTP server, which played a key role in the growth of the web. This software traces its origins back to a 1993 project involving just eight developers. In the fifteen years after its initial release in 1995, the Apache HTTP server provided the basic server functionality for over 100 million websites (ref).

While companies such as Microsoft and Oracle built business models around not making their software freely available, and certainly not releasing the source code, a large number of software developers from around the world felt a personal drive to continue using and contributing to the open source community. Thus, both proprietary and open source streams of software development continued to grow in parallel in the 1980s and 1990s.

In 1998, something very interesting happened. Netscape Communications Corporation, which had developed a browser competing with Microsoft’s Internet Explorer, announced that it was releasing the source code for its browser to the public (ref). The world’s pool of open source code was now being fed by both corporate and private contributions.

In 1999, the originators of the already widely used Apache HTTP server founded the Apache Software Foundation, a decentralized open source community of developers. The Apache Software Foundation is now the primary venue for releasing open source Big Data software. Whereas it may be possible to have a conversation about Big Data without using the term ‘Big Data’, it is nonetheless impossible to have such a conversation without using the term ‘Apache’. (Hadoop, for example, became an Apache project in 2006, and a large number of software tools that run over HDFS have also been produced and released under the terms of the Apache License.)


There are numerous licenses in use today for open source software, varying in the details of how the code may be distributed, modified, sub-licensed, linked to other code, etc. The original GPL from GNU is now up to version 3. The Apache Software Foundation has its own license, as do organizations such as MIT.


Open source software is typically made available as source code, and often as compiled code as well. In addition, changes to the code are managed via a revision control system such as Git, hosted on sites such as GitHub, Bitbucket, etc. These systems provide a transparent way for anyone to view each development or change in the code, including who made that change and when. Software developers can use their public contributions to such projects as bragging rights on their CVs.
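To make the "who made that change and when" point concrete, here is a minimal sketch in Python. It assumes the history was exported from a Git repository with a command such as `git log --pretty=format:'%h|%an|%ad|%s' --date=short`, which prints one line per change containing a short identifier, the author, the date and a summary. The sample history, the author names and the helper `parse_git_log` are all hypothetical and exist only for illustration.

```python
from typing import List, NamedTuple


class Commit(NamedTuple):
    """One recorded change: who made it, when, and a one-line summary."""
    short_hash: str
    author: str
    date: str
    subject: str


def parse_git_log(text: str) -> List[Commit]:
    """Parse lines of the form 'hash|author|date|subject' into Commit records."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first three '|' only, so a '|' in the summary is kept.
        short_hash, author, date, subject = line.split("|", 3)
        commits.append(Commit(short_hash, author, date, subject))
    return commits


# Hypothetical sample output; a real project's history would be much longer.
SAMPLE_LOG = """\
a1b2c3d|Alice Example|2015-03-02|Fix off-by-one error in the block reader
e4f5a6b|Bob Example|2015-02-27|Add unit tests for the block reader
"""

history = parse_git_log(SAMPLE_LOG)
for commit in history:
    print(f"{commit.date}  {commit.author:<14} {commit.subject}")
```

Hosting sites such as GitHub present essentially the same information in the browser, so nobody needs to write code like this just to inspect a project's history.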

Advantages of open source

Many programmers contribute to open source projects largely out of principle and as a reaction against proprietary software. Many might wonder, though, why a corporation would spend significant resources developing software and then give its code away as open source. Software companies themselves have wondered this. In 2001, Jim Allchin, then an executive at Microsoft, said the following:

“I can’t imagine something that could be worse than this for the software business and the intellectual-property business.” (ref)

Microsoft has since itself contributed to open source repositories on several occasions.  I’m not sure how happy Jim was about this.

In fact, there are several reasons that a company might want to contribute to open source:

  1. To encourage the spread of software for which the company can sell support services or enhanced, non-public versions.
  2. To encourage the spread of software that will promote the company’s other revenue streams, for example when a company open-sources software that runs on hardware produced by the company or that integrates with another paid software offering.
  3. To harness the collective development power of the open source community in debugging and improving the software.
  4. To satisfy licensing requirements when the company has incorporated code subject to an open source license in a product that it is reselling.
  5. To present the company as technologically advanced for purposes of publicity and recruitment.

Open source for Big Data

The concepts behind Hadoop were originally developed at Google and published in research papers in 2003 and 2004. Hadoop itself began in 2006 as an Apache project. Since then, dozens of software tools have been created under the Apache Software Foundation for use on or in connection with Apache Hadoop.

Many Big Data tools outside of the Hadoop club are either part of the Apache Software Foundation or at least partly licensed under the Apache License. MongoDB, for example, has released its drivers under the Apache License, although the database server itself has been distributed under a different open source license. Spark, a Big Data framework developed at the University of California, Berkeley, which brings significant improvements in speed by using RAM for computations, also found its way into the Apache fold. Neo4j, a graph database, currently has a community version that is open-sourced under version 3 of the GPL. Thus, not all tools are under Apache, but a very large number of them are.

Of course, there are still quite a number of Big Data tools, both hardware and software, that are only available for purchase and for which no public source code is available. I discuss the use of one such tool in another article entitled Useful Big Data: The Fastest Way. Some organizations, depending on their particular circumstances, prefer to purchase these vendor solutions rather than use open source software.