Python is a popular general-purpose language, but it’s increasingly favored for statistics, data analysis, and data science. If you have a basic knowledge of statistics, how can you apply that to Python? Here’s how to get started sifting through data with Python faster than you ever could by hand.

Why Python for Data?

While Python is popular for data analysis, you might wonder why you would use Python instead of a spreadsheet like Excel, LibreOffice Calc, or Google Sheets.

The main reason to use Python is that you get a lot more options than what’s included in most spreadsheets. Spreadsheets are primarily designed for business and financial calculations. You can perform more advanced calculations with Python, since you can tap into Python’s large number of libraries.

LibreOffice Calc displaying laptop data in a spreadsheet.

The other issue is scalability. Spreadsheets work better with smaller datasets, though heavy users can create some multi-hundred-line monstrosities. To perform operations, you have to click and drag down columns. While this works for a few rows, if you have multiple screens of data, your fingers can get tired quickly.

Python’s data operations, with libraries like NumPy, pandas, Seaborn, and Pingouin, are much more efficient when working with large amounts of data. You can specify complex operations, like selecting data from several columns and performing calculations on them, in one or a few lines. Even better, you can write scripts to automate these operations so you only have to type them once in the first place.

Activating the Mamba stats environment and starting up IPython in the Linux terminal.

Spreadsheets still have their place. They’re great for quicker operations, as well as formatting data for use in Python. I’ll even show you how to import data from spreadsheets. If you need data to conform to constraints like character length or a specific numeric type, such as an integer, a small database management system like SQLite is even better.

For conciseness, this article will cover the usage of Python libraries for basic statistical calculations and won’t explain the theory much. If you would like to learn about statistical theory, there are lots of online and offline resources, including courses, textbooks, and videos. You might try OpenStax’s online textbook or Khan Academy for free learning options.

Getting the “head” of a pandas dataframe in Python

Setting Up Your Environment

To set up your environment for data analysis, you’ll have to install the libraries mentioned earlier. I’ll assume you’re using some form of Unix-like system, such as Linux, macOS, or Windows with the Windows Subsystem for Linux installed.

The first thing you’ll need to install is Mamba, which is a package manager for these libraries. Most Linux systems come with a package manager, so why would you need a package manager on top of your package manager? System package managers do carry Python and the libraries I mention, but they’re mostly meant to take care of the OS itself, not your programming projects. Developers tend to want newer versions than what’s supplied with most mainstream distros, which is why rolling-release distros are popular among this group. Mamba offers a third option, letting you run a stable base system while still getting access to newer development packages. You can follow the installation instructions on the Mamba website; it boils down to pasting a script into your terminal.

Generating a random number with NumPy.

With Mamba installed, you’ll need to install the libraries. The libraries we’ll use in this article are NumPy, pandas, SciPy, Seaborn, and Pingouin. We’ll also install IPython, because it’s much handier for interactive use than the standard Python interpreter.

We’ll create an environment called “stats” with Mamba to hold these packages.
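One way that command might look, assuming Mamba is already on your PATH (the Python version pin is just an example; we’ll add the libraries after activating, matching the steps below):

```shell
# Create a fresh environment named "stats"
mamba create -n stats python=3.12
```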

Examining tips data using “tips.head()” in Python.

Then we’ll need to activate it.
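A sketch of the activation command:

```shell
# Switch the shell over to the "stats" environment
mamba activate stats
```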

With our environment created, we can then install packages.
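With the environment active, installing the libraries named earlier might look like this:

```shell
# Installs into the currently active "stats" environment
mamba install numpy pandas scipy seaborn pingouin ipython
```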

Now that the environment is set up, we can begin with the calculations.

Getting Data

To perform statistical calculations, you’ll need some data. This can be data you already have, such as a spreadsheet. It could be data you downloaded from a site like Kaggle. Seaborn and Pingouin can access public datasets for you to play around with and learn from.

To start, make sure the “stats” environment is active, and then run IPython:
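In the terminal, that’s two commands:

```shell
mamba activate stats
ipython
```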

Displaying descriptive statistics using pandas in Python.

We’ll start by importing pandas:
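The import uses the conventional `pd` alias:

```python
import pandas as pd  # "pd" is the community-standard alias
```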

pandas has methods to read popular data file formats, including Excel spreadsheets (.xls/.xlsx) and comma-separated values (.csv), a format that’s widespread in data analysis.

We’ll use pandas' read_csv method to read in a file. I’ll demonstrate with my laptop price data that I used to build a complex model recently.
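The call itself is one line, e.g. `pd.read_csv("laptops.csv")`; the file name and columns below are stand-ins, not the author’s actual data, and an in-memory CSV is used so the snippet runs on its own:

```python
import io
import pandas as pd

# Stand-in for a laptop price file; in practice you'd pass a path,
# e.g. laptops = pd.read_csv("laptops.csv")
csv_data = io.StringIO(
    "brand,price,ram_gb\n"
    "Acme,499,8\n"
    "Globex,899,16\n"
    "Initech,1299,32\n"
)
laptops = pd.read_csv(csv_data)
```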

Python pandas descriptive statistics for the total_bill column.

This will create a data structure called a “DataFrame,” which is similar to a spreadsheet or relational database table. Think of it as a table containing the data. You can see the first few rows by calling the head method on the DataFrame:
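With any DataFrame in hand (a tiny stand-in frame here, in place of the laptop data), head returns the first five rows by default:

```python
import pandas as pd

# Small example frame standing in for real data
df = pd.DataFrame({"price": [499, 899, 1299], "ram_gb": [8, 16, 32]})
print(df.head())  # first five rows by default (all three here)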

You can also create data yourself by generating it randomly. This is good for creating test data. You can use NumPy’s random number generator for this.

Linear regression of tips vs total bill made with Seaborn.

First, import NumPy:
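As with pandas, there’s a conventional alias:

```python
import numpy as np  # "np" is the community-standard alias
```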

Then we’ll create a random number generator:
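NumPy’s recommended way to do this is default_rng; the seed is optional, but makes the examples reproducible:

```python
import numpy as np

# Seeding is an optional choice here, so repeated runs give the same numbers
rng = np.random.default_rng(seed=42)
```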

We can create an array of 50 random numbers taken from the normal distribution:
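That’s a single call to the generator’s normal method:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
# 50 draws from the standard normal distribution (mean 0, std dev 1)
samples = rng.normal(loc=0.0, scale=1.0, size=50)
print(samples[:5])  # peek at the first five values
```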

Descriptive Statistics: Mean, Median, Standard Deviation, Percentiles

It’s easy to calculate basic descriptive statistics using Python and pandas.

We’ll use the Seaborn tips dataset for our DataFrame:
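Seaborn fetches its example datasets from its online repository, so this needs an internet connection:

```python
import seaborn as sns

# Downloads the classic restaurant tips dataset on first use
tips = sns.load_dataset("tips")
```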

To see the columns, use the head() method mentioned above.

We can use the describe method to get descriptive statistics of all the numerical columns in the DataFrame.
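It takes no arguments for this basic usage:

```python
import seaborn as sns

tips = sns.load_dataset("tips")
# Summary statistics for every numerical column
print(tips.describe())
```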

pandas will print the data for the “total_bill,” “tip,” and “size” columns: the number of data points, the mean or average, the standard deviation, the minimum value, the lower quartile or 25th percentile, the median or 50th percentile, the upper quartile or 75th percentile, and the maximum value.

Generating a and b groups of random numbers in Python with NumPy.

You can also view the descriptive stats for an individual column:
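Indexing the DataFrame with a column name gives you a Series, which has its own describe method:

```python
import seaborn as sns

tips = sns.load_dataset("tips")
# Same summary, restricted to one column
print(tips["total_bill"].describe())
```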

You can also compute a single statistic for a column. For example, to see the median tip:
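Each statistic has its own method on the column:

```python
import seaborn as sns

tips = sns.load_dataset("tips")
print(tips["tip"].median())  # mean(), std(), min(), max() work the same way
```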

Regression: What’s the Trend?

Descriptive statistics, well, give descriptions of data. The power of data analysis comes from finding relationships in data. Linear regression is one of the simplest ways to do this.

We can visualize linear regression as fitting a line over datapoints.

Let’s go back to our tips dataset. We’ll use Seaborn to plot the tips vs. total bill. The bill amount, the independent variable, will be on the x-axis, and the tip, the dependent variable, will be on the y-axis. We can plot the regression line over the scatterplot to see how good a fit it is.
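Seaborn’s regplot draws the scatterplot and fitted line in one call; the axis labels and output file name below are my own choices:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")
# Scatterplot of tip vs. total bill with the regression line overlaid
ax = sns.regplot(x="total_bill", y="tip", data=tips)
ax.set(xlabel="Total bill ($)", ylabel="Tip ($)")
plt.savefig("tips_regression.png")  # or plt.show() in an interactive session
```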

We can also obtain a more formal analysis with Pingouin:

The “coef” column will contain the y-intercept and the coefficient for the x-value, in this case, the total bill. This will let you reconstruct the line as a standard slope-intercept equation, but the number to pay attention to is the square of the correlation coefficient, or r². In this case, it’s approximately .46; combined with the positive coefficient, that’s a pretty good fit, confirming the upward trend we saw in the plot.

Statistical Tests: Do the Differences Really Matter?

One thing that comes up a lot in experiments that have a control group and an experimental group, such as a clinical test of a new drug, is determining if the differences between the two are due to chance or not. Statistical tests between groups can help us determine if a difference is statistically significant or not.

One of the most popular in contemporary research is Student’s t-test, because it’s good at dealing with the small samples that are common in experiments.

We’ll use NumPy’s random number generator to create a couple of simulated groups of ten elements each:
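Both groups are drawn from the same normal distribution, so any difference between them really is pure chance; the seed is my addition for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Two simulated groups of ten values each, drawn from the same distribution
a = rng.normal(loc=0.0, scale=1.0, size=10)
b = rng.normal(loc=0.0, scale=1.0, size=10)
```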

Pingouin has a t-test function built in to test the null hypothesis that there’s no significant difference between the two groups:

The number that determines significance in the output is the p-value. We’ll use a significance level of .05. The result is approximately .61. Since that’s higher than .05, we can’t reject the null hypothesis, so we conclude that the difference isn’t statistically significant.

These examples are just scratching the surface when it comes to data analysis in Python. Now that you’ve seen how easy and powerful Python data operations can be with the help of these libraries, you can see why Python is a language of choice for data analysis and data science.