# Python Data Processing Reference (Part 1)

Coming from a JavaScript/Node.js background, I find the Python standard library documentation kinda… wordy. This article is meant to be a more concise reference for basic data science needs. I like to write out examples with comments, and you should be able to infer how things work from them. If something is self-explanatory, I won’t go into the details. I use the Python 3 documentation, so some things might not work in Python 2.

This article also serves as a quick reference to common data processing needs (specifically image data) in Python. I will cover two main data processing needs: data pre-processing and data augmentation. Pre-processing is needed so that the data is ready to be used for training machine learning models, and augmentation is used for generating new data from a limited dataset to help with the training process.

This article got a lot longer than I originally thought, so I’ve separated it into 5 parts. The content breakdown is as follows:

- Part 1: Python basics (you are here)
- Part 2: File handling
- Part 3: Numpy basics
- Part 4: Image data pre-processing
- Part 5: Image data augmentation

# Numbers

Python is dynamically typed, but it does have some number types that I’ve never seen in JavaScript, so I wanted to list them explicitly here as well…

Operations on numbers

```python
6 / 5          # 1.2 (quotient)
6 // 5         # 1 (floored quotient)
abs(-100)      # 100
int("3")       # 3
int(3.14)      # 3
float("3.14")  # 3.14
quotient, remainder = divmod(9, 5)  # 1, 4
pow(2, 10)     # 1024
2 ** 10        # 1024
pow(0, 0)      # 1
0 ** 0         # 1
round(3.14)    # 3
round(2.72)    # 3
round(3.1415926535, 6)  # 3.141593 (round to n decimal places)
```
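Two details behind the operations above tend to surprise people coming from JavaScript: `//` floors toward negative infinity while `int()` truncates toward zero, and `round()` sends exact ties to the nearest even integer. A quick sketch (the variable names are mine, just for illustration):

```python
# // floors toward negative infinity; int() truncates toward zero
floored = -7 // 2        # -4
truncated = int(-3.5)    # -3

# divmod() follows the same flooring rule as //
quotient, remainder = divmod(-7, 2)   # -4, 1

# round() sends exact ties to the nearest even integer
ties = [round(0.5), round(1.5), round(2.5)]   # [0, 2, 2]
```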

Decimals: floats without floating point precision problems

```python
from decimal import Decimal

0.1 + 0.1 + 0.1 - 0.3              # 5.551115123125783e-17
point1 = Decimal('0.1')
point3 = Decimal('0.3')
point1 + point1 + point1 - point3  # Decimal('0.0')
```
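One thing worth knowing here: the exactness depends on building the `Decimal` from a string. A small sketch of what happens if you pass a float instead (variable names are illustrative):

```python
from decimal import Decimal

from_string = Decimal('0.1')   # exactly one tenth
from_float = Decimal(0.1)      # the binary float's value, slightly above 0.1

same_value = (from_string == from_float)            # False
adds_up = (from_string * 3 == Decimal('0.3'))       # True
```

So if the numbers come from text (a CSV file, user input), it is safer to hand the text straight to `Decimal` rather than going through `float` first.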

Fractions

```python
from decimal import Decimal
from fractions import Fraction

Fraction(16, -10)                 # Fraction(-8, 5)
Fraction('-3/7')                  # Fraction(-3, 7)
Fraction('-.125')                 # Fraction(-1, 8)
Fraction('7e-6')                  # Fraction(7, 1000000)
Fraction(Decimal('1.1'))          # Fraction(11, 10)
Fraction(3, 5) + Fraction(1, 10)  # Fraction(7, 10)
```

Math Constants

```python
import math

math.pi       # 3.14159...
math.e        # 2.71828...
math.tau      # 6.28318...
math.inf      # inf
float('inf')  # inf
math.nan      # nan
float('nan')  # nan
```
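`nan` and `inf` behave a little differently from ordinary numbers, which matters once they sneak into a dataset. A minimal sketch using only the standard `math` module (variable names are illustrative):

```python
import math

nan = float('nan')

nan_equals_itself = (nan == nan)     # False: NaN is not equal to anything
is_nan = math.isnan(nan)             # True: the reliable way to test for NaN
is_inf = math.isinf(math.inf)        # True
inf_is_bigger = math.inf > 10 ** 100 # True: inf compares above any number
```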

Math functions

```python
import math

math.ceil(3.14)          # 4
math.floor(2.7)          # 2
math.factorial(4)        # 24
math.exp(1)              # 2.71828...
math.log(math.exp(1))    # 1.0
math.log10(100)          # 2.0
```

Random numbers

```python
import random

# named low/high so the built-in min() and max() are not overwritten
low = 1
high = 10
random.random()            # random float in range [0, 1)
random.uniform(low, high)  # random float in range [low, high]
random.randint(low, high)  # random int in range [low, high]
```
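If you need the same "random" numbers on every run (handy when debugging), you can seed the generator first. A minimal sketch; the seed value 42 is arbitrary and the variable names are mine:

```python
import random

random.seed(42)   # fix the generator's starting state
first_run = [random.randint(1, 10) for _ in range(5)]

random.seed(42)   # same seed again
second_run = [random.randint(1, 10) for _ in range(5)]

repeatable = (first_run == second_run)   # True: same seed, same sequence
```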

# Sequences

The four main sequence types in Python are lists (mutable), strings (immutable), tuples (immutable), and ranges (immutable).

Sequence operations

```python
# sequences denoted as s, t
x in s
x not in s
s + t       # concatenation (doesn't work for ranges)
s * n       # s repeated n times (doesn't work for ranges)
s[i]        # item at index i of s
s[i:j]      # slice of s from i (inclusive) to j (exclusive)
s[i:j:k]    # slice of s from i to j with step k
len(s)
min(s)
max(s)
s.count(x)  # number of times x appears in s
s.index(x)  # index of the first occurrence of x
```

Mutable sequence operations

```python
s[i] = x
del s[i]
del s[i:j]
s.append(x)
s.clear()       # equivalent to del s[:]
s.insert(i, x)  # insert x at index i of s
s.pop(i)        # remove the item at index i and return it
s.remove(x)     # remove first item in s that is equal to x
s.reverse()     # reverse s in place
s.sort()        # sort s in place (lists only)
```
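The mutable/immutable split mentioned at the top of this section shows up as soon as you try item assignment. A small sketch with a list and a tuple (names are illustrative):

```python
numbers = [1, 2, 3]      # list: mutable
numbers[0] = 99          # fine, numbers is now [99, 2, 3]

point = (1, 2, 3)        # tuple: immutable
try:
    point[0] = 99
except TypeError as error:
    message = str(error)   # tuples do not support item assignment

# "changing" an immutable sequence means building a new one
moved = (99,) + point[1:]   # (99, 2, 3)
```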

Sequence definition

```python
# define directly
l = [1, 2, 3]    # list
t = (1, 2, 3)    # tuple
r = range(0, 4)  # range

# define from other iterables, which will be explained later
# for now, treat iterables as the same thing as sequences
list((5, 7, 8))           # [5, 7, 8]
tuple([5, 7, 8])          # (5, 7, 8)
list('cat')               # ['c', 'a', 't']
tuple('dog')              # ('d', 'o', 'g')
''.join(['c', 'a', 't'])  # 'cat'
''.join(('d', 'o', 'g'))  # 'dog'
```

String operations

```python
str1 = 'Dog'
str2 = 'The Quick Brown Fox'
str1.upper()             # 'DOG'
str1.lower()             # 'dog'
str1.capitalize()        # 'Dog'
str2.swapcase()          # 'tHE qUICK bROWN fOX'
str2.replace('o', 'a')   # 'The Quick Brawn Fax'
str2.split(' ')          # ['The', 'Quick', 'Brown', 'Fox']
str2.join(['', ' PROTECC. ', ' ATTACC. '])
# 'The Quick Brown Fox PROTECC. The Quick Brown Fox ATTACC. '
```
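`join()` reads backwards at first: the string you call it on is the separator, and the argument is the list of pieces. A short sketch that may make this clearer (variable names are illustrative):

```python
words = 'The Quick Brown Fox'.split(' ')   # ['The', 'Quick', 'Brown', 'Fox']

dashed = '-'.join(words)     # 'The-Quick-Brown-Fox'
restored = ' '.join(words)   # 'The Quick Brown Fox'

# join() only accepts strings, so convert other types first
csv_row = ','.join(str(n) for n in [1, 2, 3])   # '1,2,3'
```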

# Iterables

**Iterables** are objects that can be iterated through with an *iterator* object, which is created with the `iter()` function. You then go through *iterations* of the iterator object with the `next()` function. Self-explanatory enough? All the aforementioned sequence types are iterables, and here are some examples:

```python
# iterables (named so the built-in list and str are not overwritten)
numbers = [1, 1, 2, 3, 5, 8]
word = 'Puppies'

# create iterator objects
i1 = iter(numbers)
i2 = iter(word)

# go through iterations
next(i1)  # 1
next(i1)  # 1
next(i1)  # 2
next(i2)  # 'P'
next(i2)  # 'u'
next(i2)  # 'p'

# attempting to go past the end
# would result in a StopIteration error
next(i1)  # 3
next(i1)  # 5
next(i1)  # 8
next(i1)  # Error: StopIteration
```
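This is also roughly what a `for` loop does for you behind the scenes: it calls `iter()` once, then `next()` until `StopIteration` shows up. A hand-written sketch of that idea (not the interpreter's actual implementation; names are illustrative):

```python
numbers = [1, 1, 2, 3, 5, 8]

# roughly what `for n in numbers: collected.append(n)` does
collected = []
iterator = iter(numbers)
while True:
    try:
        n = next(iterator)
    except StopIteration:
        break              # the iterator is exhausted, so the loop ends
    collected.append(n)

# next() also takes a default to return instead of raising
leftover = next(iterator, None)   # None
```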

The `enumerate()` function creates a new iterator from another iterable, where each iteration returns a tuple of the format (index, value).

```python
# use enumerate() in for loops
questions = ['ho', 'hat', 'hen', 'here', 'hy']
enum = enumerate(questions)
for i, v in enum:
    print('W' + v, end=" ")
# Who What When Where Why

# since enumerate() returns an iterator object
# you can still do this:
enum = enumerate(questions)
next(enum)  # (0, 'ho')
next(enum)  # (1, 'hat')
next(enum)  # (2, 'hen')
next(enum)  # (3, 'here')
next(enum)  # (4, 'hy')
next(enum)  # Error: StopIteration
```
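`enumerate()` also accepts a `start` argument if you want the index to begin somewhere other than 0. A small sketch reusing the same list (the `numbered` name is mine):

```python
questions = ['ho', 'hat', 'hen', 'here', 'hy']

# start=1 makes the index count from 1 instead of 0
numbered = []
for i, v in enumerate(questions, start=1):
    numbered.append(f'{i}. W{v}')
# ['1. Who', '2. What', '3. When', '4. Where', '5. Why']
```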

Cool trick to generate lists (called a list comprehension):

```python
questions = ['ho', 'hat', 'hen', 'here', 'hy']
['W' + ending for ending in questions]
# ['Who', 'What', 'When', 'Where', 'Why']
```
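The same syntax can filter as it goes by adding an `if` at the end. A sketch with an arbitrary length condition, next to the plain loop it replaces (names are illustrative):

```python
questions = ['ho', 'hat', 'hen', 'here', 'hy']

# [expression for item in iterable if condition]
short = ['W' + ending for ending in questions if len(ending) <= 2]
# ['Who', 'Why']

# the same thing written as a plain loop
short_loop = []
for ending in questions:
    if len(ending) <= 2:
        short_loop.append('W' + ending)
```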

# Sets

As the name suggests, this data structure behaves like mathematical sets. A set is an unordered collection with no duplicate elements. It can be used for membership testing and eliminating duplicate entries.

```python
# create sets and inspect sets
basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
fruit = set(basket)
fruit              # {'orange', 'pear', 'apple', 'banana'} (order may vary)
'orange' in fruit  # True
'grapes' in fruit  # False

# operations on sets
a = set('abracadabra')
b = set('alacazam')
a - b  # {'r', 'd', 'b'} (in a but not in b)
a | b  # {'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'} (in a or b)
a & b  # {'a', 'c'} (in both a and b)
a ^ b  # {'r', 'd', 'b', 'm', 'z', 'l'} (in a or b but not both)
```
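Because a set has no order, the printed order of its items can differ between runs. When you need a stable result after removing duplicates, sorting the set is one simple option. A sketch under that assumption (names are illustrative):

```python
basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']

# remove duplicates, then sort to get a predictable list
unique = sorted(set(basket))   # ['apple', 'banana', 'orange', 'pear']

# sets are mutable: add() and discard() change them in place
fruit = set(basket)
fruit.add('grapes')
fruit.discard('pear')          # no error even if 'pear' were missing
remaining = sorted(fruit)      # ['apple', 'banana', 'grapes', 'orange']
```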

# Dictionaries

Unlike sequences, which are indexed by numbers, dictionaries are indexed by **keys**. Keys can be any immutable type (strings, numbers, or tuples made up of strings, numbers, and other such tuples). A dictionary is a set of *key-value pairs* with unique keys; since Python 3.7 it keeps the pairs in insertion order.

```python
ash = {'occupation': 'Pokemon Trainer', 'age': 10}
ash['occupation'] = 'Pokemon Master'
ash                # {'occupation': 'Pokemon Master', 'age': 10}
ash.keys()         # dict_keys(['occupation', 'age'])
ash['occupation']  # 'Pokemon Master'
del ash['occupation']
ash                # {'age': 10}
ash.keys()         # dict_keys(['age'])
```
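A few more everyday dictionary moves, sketched with the same `ash` example. The `'hometown'` key and the `grid` dictionary are made up for illustration:

```python
ash = {'occupation': 'Pokemon Trainer', 'age': 10}

# get() returns a default instead of raising KeyError for a missing key
hometown = ash.get('hometown', 'unknown')   # 'unknown'

# items() yields (key, value) tuples
lines = []
for key, value in ash.items():
    lines.append(f'{key}: {value}')
# ['occupation: Pokemon Trainer', 'age: 10']

# tuples work as keys; lists do not, because they are mutable
grid = {(0, 0): 'start', (2, 3): 'goal'}
goal = grid[(2, 3)]   # 'goal'
```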

# Regular Expressions

Everyone should know this powerful tool that is basically magic. If you don’t know what regular expressions are… Basically they are search patterns. I won’t go into the details, but you can read about them here. I recommend practicing regular expression wizardry on regex101.com.

```python
# examples in this section are from
# https://www.thegeekstuff.com/2014/07/python-regex-examples/
import re

# raw strings are used for regular expressions
text = 'apple \nbees'     # \n actually prints as a new line
rawStr = r'apple \nbees'  # raw strings keep the backslash as-is

# re.match()
# only finds matches if they occur at start of string
# returns a match object
text = 'dog cat dog'
re.match(r'dog', text)  # returns a match object
re.match(r'cat', text)  # returns None (no match)
match = re.match(r'dog', text)  # match object
match.group(0)  # 'dog'
match.start()   # 0 (start index of matching content)
match.end()     # 3 (end index of matching content)

# re.search()
# is like re.match() except it searches everywhere
# returns a match object
re.search(r'cat', text).group(0)  # 'cat'
re.search(r'dog', text).group(0)  # 'dog'

# re.findall()
# returns a list of matches instead of a match object
re.findall(r'dog', text)  # ['dog', 'dog']
re.findall(r'cat', text)  # ['cat']

# match.group() returns different parts of result
contactInfo = 'Doe, John: 555-1212'
match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)
match.group(0)  # 'Doe, John: 555-1212'
match.group(1)  # 'Doe'
match.group(2)  # 'John'
match.group(3)  # '555-1212'

# assign names to groups using (?P<name>[regex])
myRE = r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)'
match = re.search(myRE, contactInfo)
match.group('last')   # 'Doe'
match.group('first')  # 'John'
match.group('phone')  # '555-1212'
```
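Since `re.match()` and `re.search()` return `None` when nothing matches, calling `.group()` directly on the result can blow up on bad input. A minimal sketch of a safer pattern; `parse_contact` is a hypothetical helper I made up for this example, and `groupdict()` is the standard match method that returns all named groups at once:

```python
import re

pattern = r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)'

def parse_contact(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = re.search(pattern, line)
    if match is None:   # re.search() returns None when nothing matches
        return None
    return match.groupdict()

found = parse_contact('Doe, John: 555-1212')
# {'last': 'Doe', 'first': 'John', 'phone': '555-1212'}
missing = parse_contact('no contact here')   # None
```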