SageMaker 101

Edward Gao
6 min read · Sep 5, 2018


Are you new to the AWS ecosystem and overwhelmed by its millions of services? Fear not, for I’m here to help.

This is not meant to be a detailed guide on how to use SageMaker, but merely a conceptual overview of how the different components of AWS and SageMaker fit together. I also assume you already have a basic understanding of common programming and machine learning terminology. If you are looking for a step-by-step guide, turn back now!

The Basics

You probably already know AWS as part of “the cloud,” but there is no magic in the world; “the cloud” is not some fluffy, invisible buzzword that people throw around to get VC dollars (actually it is). What it really is is this:

Source: https://www.awsforbusiness.com/wp-content/uploads/2017/06/datacenter-620x400.jpg

Warehouses of computers in someone else’s backyard.

AWS’ ginormous list of services may be intimidating, but you really don’t need to care about most of them. Think of them as apps running on someone else’s computer, each with a web or command-line interface for you to use.

Here’s a brief introduction to a few of the services that you need to know for the purpose of SageMaker:

  • Identity and Access Management (IAM): This is basically account management. The person who owns the AWS account is the admin or “root user”, and they can create and delete “IAM users” at will, like god.
Source: http://stripgenerator.com/strip/298654/da-lord-giveth-and-the-lord-taketh-away/view/all/
  • Elastic Compute Cloud (EC2): Used for creating virtual servers, basically computers you get full control over. Think of EC2 as AWS assembling the hardware of the computer together for you and providing you with an interface to control the computer with.
  • Virtual Private Cloud (VPC): This service is for managing your virtual private network (pretty self-explanatory). If you don’t know what you’re doing, just stick with the default settings for now. The reason I say this is that 1) the defaults work, and 2) I once messed around and started a NAT gateway without knowing it costs nearly $30 per month, and I only noticed 3 months later.
  • Simple Storage Service (S3): This is basically like Google Drive, OneDrive, Dropbox, or whatever online file storage you use, except it has version control, no trash bin, you get to manage how and literally where your files are stored, and it’s highly integrated into the AWS ecosystem. S3 is literally just for storing files (see the short sketch after this list).
  • CloudWatch: This is where you monitor your resources, such as logs and usage metrics.
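
Since S3 shows up everywhere in SageMaker, here’s a minimal sketch of what reading and writing files in S3 looks like from Python, using boto3 (AWS’s Python SDK). The bucket name and file paths are made up; you’d substitute your own.

import boto3

s3 = boto3.client('s3')

# Upload a local file into a bucket (bucket and key names are placeholders)
s3.upload_file('train.csv', 'my-bucket', 'data/train.csv')

# Download it back to a local file later
s3.download_file('my-bucket', 'data/train.csv', 'train_copy.csv')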

Before I get into SageMaker, I just want to go on a quick tangent about containers, because they are crucial to understanding SageMaker.

You’ve probably heard of or played around with Virtual Machines before. Well, containers are basically that, or at least feel like that, except they’re not actually virtual machines. Don’t believe me? Try running a container, then check the process list on your host computer: you’ll see the container’s processes right alongside the host’s. Weird, right?

BAM! I just created a faint perception of a loud noise in your head, woken up yet? Great, let’s continue…

SageMaker

SageMaker is yet another service in the AWS ecosystem, specializing in streamlining the machine learning development pipeline. It uses EC2 to start container instances behind the scenes, runs those instances in your virtual private network (which you can control via VPC), works with S3 to store your training data, and can be monitored with CloudWatch.

There are three components to SageMaker: Notebook, Training, and Inference. Each of these components starts up and manages an EC2 instance for you. Learning how to use SageMaker is a time investment, but once you get through the learning curve, these three components work together to make life easier by automating the training and deployment process.

What SageMaker is good for: Situations where you already have a set neural network architecture, know that it works, and just need to tweak the hyperparameters or deal with an ever-changing dataset. SageMaker makes your job faster, easier, and cheaper by decoupling your code, training data, training instance, and inference endpoint.

Without decoupling, you would need one huge virtual server with lots of storage for your training data, plus a GPU and lots of RAM for training, which gets pretty expensive. With SageMaker, you can use a smaller instance for running your notebook and endpoint, spin up a larger instance for training only when necessary, and store your data in S3 for cheap.

Additionally, once you have the code set up, a lot of things can be automated. If you need to update your dataset, you just update it on S3. You can start hyperparameter tuning jobs to test out different hyperparameters. Starting a new endpoint or training job with the new data is then just a matter of re-running a few cells in the notebook!

What SageMaker is NOT good for: Situations where you are still figuring out which neural network architecture to use and need to debug your training code.

Starting a training job takes time, lots of time (as in minutes). If you need to debug your training code, waiting minutes for each test run is very time consuming. What you can do instead is test your code on a separate GPU instance first using a smaller dataset, and then deploy that code to SageMaker once it works. (A GPU instance, so you can make sure it works with CUDA.)

SageMaker Python SDK

The SageMaker Python SDK is a library that lets you start training jobs, hyperparameter tuning jobs, and inference endpoints from code. It also has a suite of built-in ML algorithms you can use directly.
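
As a rough idea of what using one of those built-in algorithms looks like, here’s a hedged sketch with the SDK’s KMeans estimator. The role ARN, instance type, and data are placeholders, and argument names may differ slightly between SDK versions.

import numpy as np
from sagemaker import KMeans

# The role ARN and instance type below are placeholders
kmeans = KMeans(role='arn:aws:iam::123456789012:role/MySageMakerRole',
                train_instance_count=1,
                train_instance_type='ml.c4.xlarge',
                k=10)

fake_data = np.random.rand(1000, 50).astype('float32')   # stand-in for real training data
kmeans.fit(kmeans.record_set(fake_data))                  # starts a SageMaker training job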

The SageMaker SDK has things called “estimators.” You can think of them basically as objects that run other Python scripts for you. The SDK provides different estimators for different ML frameworks. Estimators are used to start training jobs with the training script you tell them to run.

You create an instance of an estimator by passing in the location of your training script (the entry point), the training instance type, and various other parameters, including hyperparameters.

The following code examples are incomplete and only serve to give an idea of what the code would look like; they also assume you use PyTorch as your ML framework, and the IAM role ARN and S3 paths are placeholders. A detailed description of the SDK can be found in the official documentation.

For example, creating an instance of an estimator would look like this:

from sagemaker.pytorch import PyTorch

# Note: the IAM role ARN here is a placeholder
pytorch_estimator = PyTorch(entry_point='pytorch-train.py',
                            role='arn:aws:iam::123456789012:role/MySageMakerRole',
                            train_instance_type='ml.p3.2xlarge',
                            train_instance_count=1,
                            hyperparameters={
                                'epochs': 20,
                                'batch-size': 64,
                                'learning-rate': 0.1,
                                'momentum': 0.9
                            })

Various methods of the estimator are then used for different tasks, including starting a training job and deploying an inference endpoint:

# Start a SageMaker training job with training data from S3
pytorch_estimator.fit('s3://my_bucket/my_training_data')

# Deploy the model generated by fit() to an inference endpoint
pytorch_predictor = pytorch_estimator.deploy(initial_instance_count=1,
                                             instance_type='ml.p2.xlarge')

# Make a prediction using the endpoint ("data" is whatever input your model expects)
response = pytorch_predictor.predict(data)

# Delete the endpoint when you're done, so it stops costing money
pytorch_predictor.delete_endpoint()

Instead of starting a training job with fit() directly and manually deciding the hyperparameters, you can also use the hyperparameter tuner to find optimal ones automatically, like so:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Search learning rates between 0.05 and 0.06
lr_range = ContinuousParameter(0.05, 0.06)

# Configure the HyperparameterTuner
my_tuner = HyperparameterTuner(estimator=pytorch_estimator,
                               objective_metric_name='validation-accuracy',
                               hyperparameter_ranges={'learning-rate': lr_range},
                               metric_definitions=[{
                                   'Name': 'validation-accuracy',
                                   'Regex': 'validation-accuracy=(\d\.\d+)'
                               }],
                               max_jobs=100,
                               max_parallel_jobs=10)

# Start the hyperparameter tuning job with train and test channels from S3
my_tuner.fit({
    'train': 's3://my_bucket/my_training_data',
    'test': 's3://my_bucket/my_test_data'
})
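
Once the tuning job finishes, the tuner also has a deploy() method that deploys the model from its best training job, so the whole loop from data to endpoint stays in a few lines. A rough sketch (the instance type is a placeholder):

# Deploy the model from the tuner's best training job to an endpoint
best_predictor = my_tuner.deploy(initial_instance_count=1,
                                 instance_type='ml.p2.xlarge')
response = best_predictor.predict(data)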

Endnotes

Obviously this article is woefully inadequate for actually learning how to use SageMaker, but hopefully it gives you an idea of how its various components work together, so you can start delving into the details yourself.
