I always enjoy writing Python code. It’s fun, expressive, and my go-to language when starting a new project. Python’s popularity has exploded over the past few years, and it has become the language of choice in many areas. No area has been affected more than data science and analytics.
In this post, I’ll be talking about why I think Python has become so prominent in this industry. We’ll go over its most important language features and tools. My hope is for this to serve as a means of convincing your boss or coworkers that Python, and its open source tools, cover all the requirements for completing an analytics project.
Warning: I’m not a programming language zealot. I believe in choosing — and understanding — the best tool for the job. That said, I am a little biased towards Python. Perhaps after reading this, you will be too.
The language
Python has a lot of eccentricities. For one, its name comes from the classic British comedy group Monty Python, not the snake. Core developers of the language pay homage to Monty Python throughout the documentation, most notably by using spam and eggs as variable names.
The first line of its Wikipedia article states:
Python is an interpreted, high-level, general-purpose programming language.
Let’s break that down.
Interpreted languages execute code directly without first compiling the code into machine language instructions (as you do for languages like C/C++).
Being interpreted gives the language more flexibility, but it usually comes at the cost of running slower than a compiled language.
Since Python is high-level, your code will be composed of elements that resemble natural language instead of confusing symbols. For example, if you had a group of elements and you needed to look at every single one and perform a computation on it, you would do:
for element in group:
    computation(element)
If you had already defined group and computation(), the above snippet would execute without issue. The loop itself takes care of assigning element.
In other words, it reads like English, but runs like code.
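To make that concrete, here’s a minimal runnable version. The names group and computation come from the snippet above, but the values and the squaring logic are made up purely for illustration.

```python
# Hypothetical data and function, chosen only to make the loop runnable.
group = [1, 2, 3]


def computation(element):
    """Return the square of a single element."""
    return element * element


results = []
for element in group:
    # Each pass hands the next item of group to computation()
    results.append(computation(element))

print(results)  # [1, 4, 9]
```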
Being general-purpose means Python is suited to a wide range of tasks. It wasn’t designed for one specific application or domain, and it handles most of them well.
The Zen of Python defines the guiding principles for the design of Python, and for the software you write with it. These five principles from it are a good illustration of why the language is so popular:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Readability counts.
Python is focused on readability and simplicity. The more time you spend sitting in front of code, the more you will appreciate this position.
You can see the full list by opening your Python interpreter and running:
import this
Math in Python
When it comes to performing math, Python’s greatest strength is also a big weakness. Python is dynamically typed, and its approach is often called duck typing (if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck). When you define a variable in Python, you don’t have to declare what type it is (integer, string, etc.). At run time, the interpreter looks up the type of each value and uses the appropriate method for that type.
Let’s look at an example with the ‘+’ operator. When you use ‘+’ with two integers, you expect to get back a new integer which is the sum of the two in the expression. However, when you use ‘+’ with two strings, you’d expect to get back a new string that contains the previous two combined.
1 + 2                 # evaluates to 3
"Hello " + "world!"   # evaluates to "Hello world!"
The system looks up the appropriate implementation of the ‘+’ method each time. While this gives you tremendous flexibility in writing code, it adds overhead when doing computation on a large collection of data, which is exactly what machine learning algorithms and data science tasks require.
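Here’s a small sketch of that lookup in action. The Money class is hypothetical and exists only to show that ‘+’ resolves to whatever \_\_add\_\_ the operand’s type provides; the same line of code ends up doing three different things.

```python
class Money:
    """Hypothetical type, used only to show how '+' is resolved."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called whenever a Money object appears on the left of '+'
        return Money(self.cents + other.cents)


# Three pairs of different types: each '+' is looked up at run time,
# based on the type of the left operand.
pairs = [(1, 2), ("Hello ", "world!"), (Money(150), Money(250))]
sums = [left + right for left, right in pairs]

print(sums[0])        # 3
print(sums[1])        # Hello world!
print(sums[2].cents)  # 400
```

That per-operation lookup is cheap once, but it’s repeated for every item when you loop over millions of values.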
This is where Numpy comes in.
Numpy
Numpy (NUMerical PYthon) is the foundational package for scientific computing in Python. Its core offering is the ndarray, a typed collection of homogeneous items. A typed collection means the system knows ahead of time the kind of values it will be dealing with. Since it’s homogeneous, there’s no need to look up how to handle each item, because they’re all the same.
Now that the system knows ahead of time the type of every item in a collection, and that the collection holds only that type, we’ve eliminated the per-item overhead created by duck typing.
Numpy makes the math go fast.
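Here’s a rough sketch of what a typed, homogeneous collection looks like in practice. The array contents are made up; the dtype attribute and element-wise multiplication are standard Numpy behavior.

```python
import numpy as np

# Every item in the array is stored with the same type
values = np.array([1.5, 2.5, 3.0])
print(values.dtype)      # float64

# One operation applied to the whole array, with no Python-level loop
doubled = values * 2
print(doubled.tolist())  # [3.0, 5.0, 6.0]

# Mixed input is converted to one common type rather than kept item by item
mixed = np.array([1, 2.5])
print(mixed.dtype)       # float64
```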
Pandas
Fast math and optimized data collections are a great start, but what we really want is an abstraction for dealing with vast amounts of data. We don’t want our screens filled with endless streams of numbers. We want an object that encapsulates all that information in a way that’s easier to reason about and work with. We need tools to read data in from files and databases, clean it, and shape it. This is where Pandas comes in. Sadly it has nothing to do with panda bears…
Pandas provides high-level data manipulation tools built on top of Numpy. The most important is the DataFrame, which looks like a sheet if you’re coming from Excel. A DataFrame has named columns, row ids, and ways to efficiently apply operations to an entire column. There’s even a method to quickly get summary statistics for an entire data set.
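As a small sketch of those ideas: the summary-statistics method is describe(), and the table of cities below is made up purely for illustration.

```python
import pandas as pd

# Made-up data, only for illustration
df = pd.DataFrame(
    {
        "city": ["Lima", "Oslo", "Kyoto"],
        "temp_c": [19.0, 6.0, 16.0],
        "rain_mm": [1.0, 80.0, 120.0],
    }
)

# Apply an operation to an entire column at once
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Count, mean, std, min, quartiles, and max for each numeric column
summary = df.describe()
print(summary)
```

By default describe() only reports on the numeric columns, so city is left out of the summary.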
Pandas’ DataFrame objects are common inputs for many tools, including the final one we’ll talk about.
Scikit-learn
This is the final piece. We have fast math and good data abstractions; now let’s do something with them.
Scikit-learn covers most of what you need for analysis and classical machine learning. You get the core algorithms for regression, classification, and clustering, plus tools for model selection and much more.
Pandas is what you use to read your data in, clean it, and do statistical analysis. Once that’s done, scikit-learn comes in to split your data into training and test sets, train machine learning models, and test their accuracy.
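That workflow might look something like the sketch below. The choices here are mine, not requirements: it uses the Iris data set that ships with scikit-learn, a logistic regression model, and a 75/25 split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# as_frame=True returns pandas objects: X is a DataFrame, y is a Series
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Hold out a quarter of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() reports accuracy on data the model has not seen
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

In a real project, X and y would come from the DataFrame you cleaned with Pandas instead of a bundled data set.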
Conclusion
Python is a great choice for your next analytics project. In combination with a few external packages, it becomes a powerhouse for working with data.
To summarize, here are the tools:
- Python: a flexible, general-purpose language that’s fun to write and easy to read.
Combined with the following packages:
- Numpy: makes math go fast!
- Pandas: high-level data abstraction and manipulation. Perfect for cleaning and preparing data.
- Scikit-learn: contains the machine learning algorithms you need.