Python Data Processing Reference (Part 2)

Edward Gao
5 min read · Jul 30, 2018


This is the second part of a series.

Folder handling

import os
import shutil
DIRECTORY = './data' # example directory
os.path.exists(DIRECTORY) # check if directory exists
os.mkdir(DIRECTORY) # create directory
os.rmdir(DIRECTORY) # delete directory
shutil.rmtree(DIRECTORY) # recursively delete directory
os.stat(DIRECTORY).st_size # size of the directory entry itself, not its contents
os.listdir(DIRECTORY) # list all files (including hidden)
[f for f in os.listdir(DIRECTORY) if not f.startswith('.')]
# list all files excluding hidden files and folders
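Since os.stat() on a directory reports only the size of the directory entry itself, totaling the contents requires walking the tree. A minimal sketch (dir_size is a made-up helper, not part of the standard library):

```python
import os
import tempfile

def dir_size(path):
    """Sum the sizes of all regular files under path (illustrative helper)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

tmp = tempfile.mkdtemp()                      # scratch directory for the demo
with open(os.path.join(tmp, 'a.bin'), 'wb') as f:
    f.write(b'x' * 10)                        # one 10-byte file

print(dir_size(tmp))  # 10
```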

File handling

import os
DIRECTORY = './data' # example directory
FILENAME = 'dataset.tar.gz' # example file name
FILEPATH = os.path.join(DIRECTORY, FILENAME)
os.path.isfile(FILEPATH) # check if file exists
with open(FILEPATH, mode='w') as f: # open file in write (w) mode
    f.write('hello\nworld')
with open(FILEPATH) as f: # open file in read (r) mode (default)
    f.read()
os.stat(FILEPATH).st_size # size of file in bytes
os.remove(FILEPATH) # delete file

Note that there are several modes for opening files; here are the commonly used ones:

  • 'r' : open for reading (default)
  • 'w' : open for writing, replace file if it already exists
  • 'x' : open for exclusive creation, failing if the file already exists
  • 'a' : open for writing, appending to the end of the file if it already exists
  • 'b' : open in binary mode

Note that different operating systems have different conventions for ending a line. Unix-like systems such as Linux and macOS use \n (new-line), whereas Windows uses \r\n (carriage-return + new-line) for backwards compatibility with MS-DOS, which in turn followed the convention of typewriters. (Yes, physical typewriters.) The os.linesep constant holds the current platform's separator, which is useful when working with binary data; for files opened in text mode, however, Python already translates \n to the platform convention on write, so the documentation recommends writing plain \n there rather than os.linesep.
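To see the text-mode translation in action, here's a small sketch: write '\n' in text mode, then read the raw bytes back in binary mode to observe the platform's own line ending.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# In text mode, '\n' is translated to the OS line ending on write...
with open(path, 'w') as f:
    f.write('hello\nworld')

# ...which you can observe by reading the raw bytes in binary mode.
with open(path, 'rb') as f:
    raw = f.read()

print(raw)  # b'hello\nworld' on Unix, b'hello\r\nworld' on Windows
```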

with … as … ?!

You may have noticed earlier that I used with ... as ... when reading and writing files. This is called a “context manager” in Python, and it is used for allocating and releasing resources. It may seem strange at first, but there are good reasons to do it this way. In a nutshell, it guarantees that the file is closed after you are done with it.

Normally, to read or write a file, you would need to explicitly close it like this:

# create new file
f = open('test.log', 'w')
f.write('hello\nworld')
f.close()
# print its content
f = open('test.log')
print(f.read())
f.close()

If you don’t explicitly call f.close(), several problems can arise:

  1. The open file object keeps consuming memory and OS resources (a resource leak)
  2. Buffered writes may not reach the disk until the file is closed
  3. Operating systems limit how many files a process can have open simultaneously
  4. Some OSs, like Windows, treat open files as locked, which can block other programs trying to open them
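The robust manual fix is a try/finally block, which guarantees the close() even if an error occurs mid-write. A sketch (using a temporary path for the demo):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.log')

f = open(path, 'w')
try:
    f.write('hello\nworld')
finally:
    f.close()  # runs even if write() raises an exception

print(f.closed)  # True
```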

So to solve this, we use the with ... as ... compound statement, like this:

# create new file
with open('test.log', 'w') as f:
    f.write('hello\nworld')
# print its content
with open('test.log') as f:
    print(f.read())

For more details about with ... as ..., see this thread and this article.
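To demystify the mechanism a little: any object that defines __enter__ and __exit__ can drive a with statement. A minimal illustrative sketch (the Managed class and events list are made up for this example):

```python
events = []

class Managed:
    """Minimal context manager: __enter__ acquires, __exit__ releases."""
    def __enter__(self):
        events.append('acquire')
        return self
    def __exit__(self, exc_type, exc, tb):
        events.append('release')  # always runs, even if the body raises
        return False              # don't suppress exceptions

with Managed():
    events.append('work')

print(events)  # ['acquire', 'work', 'release']
```

open() works the same way: its __exit__ is what calls close() for you.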

Downloading files

import os
from urllib.request import urlretrieve
SOURCE_URL = 'http://yann.lecun.com/exdb/mnist/'
WORK_DIRECTORY = './data/mnist-data'
FILE_NAME = 'train-labels-idx1-ubyte.gz'
localpath = os.path.join(WORK_DIRECTORY, FILE_NAME)
remotepath = SOURCE_URL + FILE_NAME
os.makedirs(WORK_DIRECTORY, exist_ok=True) # the destination directory must exist
localpath, _ = urlretrieve(remotepath, localpath) # download file
with open(localpath, 'rb') as f:
    f.read() # read file in mode 'rb' b/c it's a binary file

The method above is what I see most people suggest when I look up “download files in Python.” However, according to the Python 3 documentation: “this function may become deprecated at some point in the future,” so here’s an alternative:

import os
from urllib.request import urlopen
SOURCE_URL = 'http://yann.lecun.com/exdb/mnist/'
WORK_DIRECTORY = './data/mnist-data'
FILE_NAME = 'train-labels-idx1-ubyte.gz'
remotepath = SOURCE_URL + FILE_NAME
f = urlopen(remotepath) # urlopen returns a file-like object
data = f.read() # read file

You do need an additional step of saving the file locally this way though:

localpath = os.path.join(WORK_DIRECTORY, FILE_NAME)
with open(localpath, 'wb') as f:
    f.write(data)
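For large downloads, holding the whole file in memory is wasteful; shutil.copyfileobj can stream from the response to disk in fixed-size chunks. A sketch, using an in-memory buffer to stand in for the urlopen response (any file-like object behaves the same):

```python
import io
import os
import shutil
import tempfile

# Stand-in for `urlopen(remotepath)` -- a 1 MB file-like object.
source = io.BytesIO(b'x' * 1_000_000)

localpath = os.path.join(tempfile.mkdtemp(), 'downloaded.bin')
with open(localpath, 'wb') as out:
    shutil.copyfileobj(source, out, length=64 * 1024)  # copy in 64 KiB chunks

print(os.path.getsize(localpath))  # 1000000
```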

Compressed and archive files

I will cover the most popular archive and compression formats on *nix systems: .tar (archiving) and .gz (compression).

Extract tar archives

import tarfile
with tarfile.open('/path/to/tarball.tar') as tar:
    tar.extractall('/new/path/for/extracted/archive')

The tarfile.open() function takes a filename and a mode. If the mode is not explicitly stated, it defaults to ‘r’ (read). Note that tar.extractall() also automatically takes care of decompression if it’s a .tar.gz or .tar.bz2 file. To create an uncompressed tar archive:

import tarfile
files = ['file1', 'file2', 'file3']
with tarfile.open('/path/to/tarball.tar', 'w') as tar:
    for file in files:
        tar.add(file)

To read or write a tar archive with different compression methods, explicitly state the open mode:

  • 'r' or 'r:*': Open for reading with automatic compression detection
  • 'r:': Open for reading exclusively without compression
  • 'r:gz': Open for reading with gzip compression
  • 'r:bz2': Open for reading with bzip2 compression
  • 'a' or 'a:': Open for appending with no compression
  • 'w' or 'w:': Open for uncompressed writing
  • 'w:gz': Open for gzip compressed writing
  • 'w:bz2': Open for bzip2 compressed writing

Also note that appending with compression is not possible.
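Putting the modes together, here's a small round-trip sketch: write a gzip-compressed archive with 'w:gz', then let plain 'r' auto-detect the compression when reading it back (the paths are temporary stand-ins):

```python
import os
import tarfile
import tempfile

tmp = tempfile.mkdtemp()
member = os.path.join(tmp, 'file1')
with open(member, 'w') as f:
    f.write('hello')

archive = os.path.join(tmp, 'ball.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:     # gzip compressed writing
    tar.add(member, arcname='file1')           # arcname keeps the tmp path out

with tarfile.open(archive) as tar:             # 'r' auto-detects the gzip layer
    names = tar.getnames()

print(names)  # ['file1']
```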

To open a gzip file directly

import os
import gzip
DIRECTORY = './data' # example directory
FILENAME = 'dataset.tar.gz' # example file name
FILEPATH = os.path.join(DIRECTORY, FILENAME)
with gzip.open(FILEPATH) as f:
    print(f.read(4)) # print the first 4 bytes
    print(f.read(4)) # print the next 4 bytes
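Successive reads continue where the previous one left off, and gzip.open defaults to binary mode. A self-contained round trip with a made-up file, for illustration:

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.gz')

# gzip.open defaults to binary mode ('rb' for reading, here 'wb' for writing).
with gzip.open(path, 'wb') as f:
    f.write(b'hello world')

with gzip.open(path) as f:
    first = f.read(4)   # first 4 decompressed bytes
    rest = f.read()     # picks up where the previous read stopped

print(first, rest)  # b'hell' b'o world'
```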

Dealing with binary data

Sometimes datasets are packed as binary data, such as the MNIST dataset I used for my examples. In situations like these, you can use the struct module in the Python standard library to inspect it. For example:

# continue example from previous section
import struct
with gzip.open(FILEPATH) as f:
    f.read(8) # skip first 8 bytes of metadata
    for i in range(100): # inspect first 100 labels
        label = struct.unpack('B', f.read(1))
        print(label[0], end=' ')

OK, this might be confusing, so I’ll go into the details…

The line label = struct.unpack('B', f.read(1)) reads 1 byte from the file and “unpacks” it, i.e. converts the binary data into an understandable format. Here 'B' is the “format” to read the binary data as, and ‘B’ stands for unsigned char. I will go more in depth about formats later.

On the next line, print(label[0], end=' '), you may be wondering why I had to include [0] while printing. That’s because struct.unpack() always returns a tuple: a single format string can describe several values at once, so even a one-value format yields a 1-tuple. The second argument, end=' ', just makes print append a space instead of a newline.
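To see why the tuple makes sense, compare a multi-value format with a single-value one (the byte strings here are made up for illustration):

```python
import struct

# One format string can describe several values, so unpack returns a tuple:
pair = struct.unpack('BB', b'\x01\x02')
print(pair)  # (1, 2)

# With a single value, a trailing comma unpacks the 1-tuple directly:
label, = struct.unpack('B', b'\x05')
print(label)  # 5
```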

Regarding format, binary data does not automagically indicate what kind of data it is, and different ways of interpreting it change its meaning. There are two factors to consider when dealing with binary data: byte order and type. You then choose the format character(s) for unpacking accordingly. (See the struct module documentation for detailed format character charts covering byte order and type.)

For example, a big-endian integer would be unpacked using ‘>i’, whereas an unsigned character would be unpacked using ‘B’. If you do not know what big-endian and little-endian means, I recommend reading this article.
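A quick sketch of how byte order changes the interpretation of the very same bytes:

```python
import struct

data = b'\x00\x00\x01\x00'  # the same 4 bytes, interpreted two ways

big, = struct.unpack('>i', data)     # big-endian signed int: 0x00000100
little, = struct.unpack('<i', data)  # little-endian signed int: 0x00010000

print(big, little)  # 256 65536
```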
