Python Data Processing Reference (Part 1)

Coming from a JavaScript/Node.js background, I find the Python standard library documentation kinda… wordy. This article is meant to be a more concise reference for basic data science needs. I like to just write out examples with comments, and you should be able to infer how things work from them. If something is self-explanatory, I won’t go into the details. I’m following the documentation for Python 3, so some things might not work in Python 2.

This article also serves as a quick reference for common data processing needs (specifically for image data) in Python. I will cover two main needs: data pre-processing and data augmentation. Pre-processing gets the data ready for training machine learning models, and augmentation generates new data from a limited dataset to help with the training process.

This article got a lot longer than I originally thought, so I’ve separated it into 5 parts. The content breakdown is as follows:

Numbers

Python is dynamically typed, but it does have some weird types that I’ve never seen in JavaScript, so I just wanted to explicitly list them here as well…
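Here is a minimal sketch of the built-in numeric types. The variable names are only illustrative.

```python
whole = 42            # int: arbitrary precision, it never overflows
real = 3.14           # float: double-precision floating point
imaginary = 2 + 3j    # complex: real and imaginary parts are floats
flag = True           # bool: a subclass of int (True == 1, False == 0)

big = 2 ** 100        # ints grow as large as memory allows
print(type(whole), type(real), type(imaginary), type(flag))
print(big)                             # 1267650600228229401496703205376
print(imaginary.real, imaginary.imag)  # 2.0 3.0
print(isinstance(flag, int))           # True
```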

Operations on numbers
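A short sketch of the arithmetic operators and built-in functions I reach for most often. The values of a and b are arbitrary.

```python
a, b = 7, 2

print(a / b)              # 3.5     true division always returns a float
print(a // b)             # 3       floor division
print(a % b)              # 1       remainder
print(a ** b)             # 49      exponentiation
print(divmod(a, b))       # (3, 1)  quotient and remainder together
print(-a // b)            # -4      floor division rounds toward negative infinity
print(abs(-a))            # 7
print(int(3.9))           # 3       truncates toward zero
print(round(3.14159, 2))  # 3.14
print(round(2.5))         # 2       halves round to the nearest even number
```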

Decimals: floats without floating point precision problems
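A minimal sketch using the standard decimal module. The precision of 6 is an arbitrary choice for the demo.

```python
from decimal import Decimal, getcontext

print(0.1 + 0.2)                    # 0.30000000000000004
total = Decimal("0.1") + Decimal("0.2")
print(total)                        # 0.3

# Build Decimals from strings: a float argument carries its binary error along.
same = Decimal(0.1) == Decimal("0.1")
print(same)                         # False

getcontext().prec = 6               # significant digits for arithmetic results
third = Decimal(1) / Decimal(3)
print(third)                        # 0.333333
```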

Fractions
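A quick sketch with the standard fractions module, which keeps exact rational numbers. The names are illustrative.

```python
from fractions import Fraction

half = Fraction(1, 2)
third = Fraction("1/3")        # strings are parsed too
total = half + third

print(total)                   # 5/6
print(Fraction(6, 8))          # 3/4, reduced automatically
print(total.numerator, total.denominator)  # 5 6
print(Fraction("0.25"))        # 1/4
print(float(total))            # back to a float when you need one
```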

Math Constants
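The constants live in the standard math module. A short sketch:

```python
import math

print(math.pi)    # 3.141592653589793
print(math.e)     # 2.718281828459045
print(math.tau)   # 6.283185307179586 (2 * pi)
print(math.inf)   # inf, floating-point positive infinity
print(math.nan)   # nan, "not a number"

# nan never compares equal to anything, including itself.
print(math.nan == math.nan)   # False
print(math.isnan(math.nan))   # True
```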

Math functions
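A sampler of functions from the math module; this is not the full list, just the ones that seem most useful for data work.

```python
import math

print(math.sqrt(16))          # 4.0
print(math.floor(2.7))        # 2
print(math.ceil(2.1))         # 3
print(math.trunc(-2.7))       # -2   drops the fractional part
print(math.factorial(5))      # 120
print(math.gcd(12, 18))       # 6
print(math.log(math.e))       # 1.0  natural log by default
print(math.log2(8))           # 3.0
print(math.log10(1000))       # 3.0
print(math.hypot(3, 4))       # 5.0  Euclidean length
print(math.degrees(math.pi))  # 180.0
print(math.isclose(0.1 + 0.2, 0.3))  # True, the safe way to compare floats
```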

Random numbers
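A minimal sketch with the standard random module. The seed value and the colors list are made up for the demo, and I've left out expected outputs since the point is the range of each result.

```python
import random

random.seed(42)                     # fix the seed so runs are reproducible

unit = random.random()              # float in [0.0, 1.0)
scaled = random.uniform(1.0, 5.0)   # float between 1.0 and 5.0
roll = random.randint(1, 6)         # int from 1 to 6, both ends included
even = random.randrange(0, 10, 2)   # one of 0, 2, 4, 6, 8

colors = ["red", "green", "blue"]   # illustrative data
pick = random.choice(colors)        # one random element
two = random.sample(colors, 2)      # two distinct elements, as a new list
random.shuffle(colors)              # shuffles the list in place

print(unit, scaled, roll, even, pick, two, colors)
```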

Sequences

The four main sequence types in Python are lists (mutable), strings (immutable), tuples (immutable), and ranges (immutable).

Sequence operations
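These operations work on every sequence type. The sketch below mostly uses a list, but indexing, slicing, and membership tests behave the same on strings, tuples, and ranges (ranges don't support + or * though).

```python
nums = [3, 1, 4, 1, 5]

print(4 in nums)                        # True
print(nums[0], nums[-1])                # 3 5   negative indexes count from the end
print(nums[1:3])                        # [1, 4]  slice: start included, stop excluded
print(nums[::2])                        # [3, 4, 5]  every second item
print(nums[::-1])                       # [5, 1, 4, 1, 3]  reversed copy
print(len(nums), min(nums), max(nums))  # 5 1 5
print(nums.index(4))                    # 2   position of the first 4
print(nums.count(1))                    # 2
print(nums + [9])                       # [3, 1, 4, 1, 5, 9]  concatenation
print((1, 2) * 2)                       # (1, 2, 1, 2)  repetition
print("abc"[1])                         # b
print(range(10)[2:5])                   # range(2, 5)
```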

Mutable sequence operations
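Lists are the mutable sequence you'll use most. A sketch of the in-place operations, with the list state after each step in the comments:

```python
nums = [1, 2, 3]

nums.append(4)          # [1, 2, 3, 4]
nums.extend([5, 6])     # [1, 2, 3, 4, 5, 6]
nums.insert(0, 0)       # [0, 1, 2, 3, 4, 5, 6]
last = nums.pop()       # removes and returns 6
nums.remove(3)          # removes the first 3 -> [0, 1, 2, 4, 5]
nums[0] = 10            # [10, 1, 2, 4, 5]
nums[1:3] = [7, 8, 9]   # slice assignment -> [10, 7, 8, 9, 4, 5]
del nums[-1]            # [10, 7, 8, 9, 4]
nums.reverse()          # [4, 9, 8, 7, 10]
nums.sort()             # [4, 7, 8, 9, 10]
snapshot = nums.copy()  # shallow copy, unaffected by later changes
nums.clear()            # []

print(last, snapshot, nums)   # 6 [4, 7, 8, 9, 10] []
```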

Sequence definition
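A quick sketch of how each sequence type is written or constructed. All names are illustrative.

```python
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)
single = (1,)               # the trailing comma makes a one-element tuple
not_a_tuple = (1)           # just the int 1 in parentheses
a_string = "hello"
a_range = range(0, 10, 2)   # start, stop (excluded), step

print(list(a_range))        # [0, 2, 4, 6, 8]
print(list(a_string))       # ['h', 'e', 'l', 'l', 'o']
print(tuple(a_list))        # (1, 2, 3)

x, y, z = a_tuple           # unpacking works on any sequence
print(x, y, z)              # 1 2 3
```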

String operations
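Strings are immutable, so every method below returns a new string. A sketch of the common ones, with made-up values:

```python
text = "Hello, World"

print(text.upper())                      # HELLO, WORLD
print(text.lower())                      # hello, world
print(text.replace("World", "Python"))   # Hello, Python
print(text.split(", "))                  # ['Hello', 'World']
print("-".join(["a", "b", "c"]))         # a-b-c
print("  padded  ".strip())              # padded
print(text.startswith("Hello"))          # True
print(text.find("World"))                # 7   (-1 when not found)
print(text.count("l"))                   # 3

name, score = "Ada", 0.98765             # illustrative values
print(f"{name} scored {score:.2f}")      # Ada scored 0.99
print("{} scored {:.1%}".format(name, score))  # Ada scored 98.8%
```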

Iterables

Iterables are objects that can be stepped through with an iterator object, which is created with the iter() function. You then pull items out of the iterator one at a time with the next() function. Self-explanatory enough? All the aforementioned sequence types are iterables, and here are some examples:
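A minimal sketch of iter() and next() on a list. The names are illustrative.

```python
letters = ["a", "b", "c"]
it = iter(letters)

first = next(it)     # 'a'
second = next(it)    # 'b'
third = next(it)     # 'c'

# Once the items run out, next() raises StopIteration...
try:
    next(it)
except StopIteration:
    print("iterator exhausted")

# ...unless you pass a default value to return instead.
fallback = next(it, "nothing left")
print(first, second, third, fallback)   # a b c nothing left

# A for loop calls iter() and next() for you.
for ch in "hi":
    print(ch)
```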

The enumerate() function creates a new iterator from another iterable, where each iteration returns a tuple of the format (index, value).
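For example, with a made-up list of fruits:

```python
fruits = ["apple", "banana", "cherry"]

for index, fruit in enumerate(fruits):
    print(index, fruit)      # 0 apple, then 1 banana, then 2 cherry

# The optional start argument changes the first index.
numbered = list(enumerate(fruits, start=1))
print(numbered)              # [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
```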

Cool trick to generate lists:
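The sketch below assumes the trick in question is a list comprehension, which builds a list from any iterable in a single expression. The examples themselves are made up.

```python
squares = [n * n for n in range(5)]
print(squares)    # [0, 1, 4, 9, 16]

# An optional condition filters the items.
evens = [n for n in range(10) if n % 2 == 0]
print(evens)      # [0, 2, 4, 6, 8]

# Multiple for clauses nest left to right.
pairs = [(x, y) for x in range(2) for y in "ab"]
print(pairs)      # [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]
```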

Sets

As the name suggests, this data structure behaves like a mathematical set. A set is an unordered collection with no duplicate elements. It can be used for membership testing and for eliminating duplicate entries.
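A short sketch of set creation and the usual set algebra. The values are arbitrary.

```python
a = {1, 2, 3, 3}       # duplicates collapse, so a is {1, 2, 3}
b = set([3, 4, 5])

print(a | b)           # union: {1, 2, 3, 4, 5}
print(a & b)           # intersection: {3}
print(a - b)           # difference: {1, 2}
print(a ^ b)           # symmetric difference: {1, 2, 4, 5}
print(2 in a)          # True

a.add(10)              # add one element
a.discard(10)          # remove it again, no error if it is missing

# Eliminating duplicates from a list (sorted, because sets have no order).
unique = sorted(set(["cat", "dog", "cat"]))
print(unique)          # ['cat', 'dog']

empty = set()          # {} would create an empty dictionary instead
```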

Dictionaries

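Unlike sequences, which are indexed by numbers, dictionaries are indexed by keys. Keys can be any immutable (hashable) type: strings, numbers, or tuples that contain only immutable values. A dictionary is a collection of key-value pairs with unique keys. Since Python 3.7 it keeps its items in insertion order; older versions made no ordering guarantee.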
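A minimal sketch of the everyday dictionary operations. The data is made up.

```python
ages = {"alice": 30, "bob": 25}

ages["carol"] = 41               # add or overwrite a key
print(ages["alice"])             # 30   (a missing key raises KeyError)
print(ages.get("dave"))          # None (get never raises)
print(ages.get("dave", 0))       # 0    with a default
print("bob" in ages)             # True, membership tests look at keys

del ages["bob"]

for name, age in ages.items():   # iterate over (key, value) pairs
    print(name, age)

print(list(ages.keys()))         # ['alice', 'carol']
print(list(ages.values()))       # [30, 41]

pixels = {(0, 0): 255, (0, 1): 128}      # tuples work as keys
squares = {n: n * n for n in range(3)}   # dict comprehension
print(pixels[(0, 1)], squares)           # 128 {0: 0, 1: 1, 2: 4}
```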

Regular Expressions

Everyone should know this powerful tool that is basically magic. If you don’t know what regular expressions are, they are search patterns for text. I won’t go into the details, but you can read about them here. I recommend practicing regular expression wizardry on regex101.com.
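In Python, regular expressions live in the standard re module. Here is a minimal sketch; the sample strings are made up, and the email pattern is deliberately simple rather than a real validator.

```python
import re

text = "Contact: alice@example.com, bob@example.org"
emails = re.findall(r"[\w.]+@[\w.]+", text)      # every non-overlapping match
print(emails)       # ['alice@example.com', 'bob@example.org']

# search() returns the first match object, or None if nothing matches.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released 2019-03-14")
print(match.group(0))    # 2019-03-14
print(match.groups())    # ('2019', '03', '14')

cleaned = re.sub(r"\s+", " ", "too    many   spaces")   # replace matches
print(cleaned)      # too many spaces

parts = re.split(r"[,;]\s*", "a, b;c")           # split on a pattern
print(parts)        # ['a', 'b', 'c']

# Compile a pattern once if you reuse it a lot.
digits = re.compile(r"\d+")
print(digits.findall("img_001.png, img_042.png"))   # ['001', '042']
```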
