Using Notebook on Azure Machine Learning Workspace

In this tutorial, we are going to use the Notebooks feature of the Azure Machine Learning workspace to create and run our model.

Create a Notebook

In the Azure Machine Learning workspace, click on the Notebooks tab. Create a folder for your project, and a notebook file inside this folder.

This is my project structure:

  • The outputs subdirectory will contain all the output files of our program
  • digit-clustering.ipynb is the notebook file, where you can mix code cells and text cells

Create a compute instance

To run our code, we need to create a compute instance, which is a cloud computing resource. Open your notebook file and click the “Add” button.

Then follow the simple steps in the pop-up window to create your own compute instance. As you can see, I have created one named “nathan-instance”. Now you can type some Python code in your notebook and run it on this compute resource.

Note: remember to stop your compute instance after you finish running your code.

Problem

In this tutorial, we are going to build a clustering model that detects the digit shown in a picture. The algorithm we use is K-Means. Let’s take a quick look at our dataset.

The input is an 8x8 image, and the desired output is a label from 0 to 9. We flatten each image into a vector; each element of this vector is treated as a feature of the example, so we have 8x8 = 64 features.

Define Experiment and Run objects

For further inspection of the results, we need to create an Experiment and a Run object, run our code within this experiment, keep track of some metrics, and save the output files of multiple runs for comparison.

The start_logging() method is used for creating a new run within the current experiment.
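A minimal sketch of this step, assuming the v1 azureml-core SDK and a config.json already available on the compute instance (the experiment name is illustrative):

    from azureml.core import Workspace, Experiment

    ws = Workspace.from_config()                      # connect to the current workspace
    experiment = Experiment(workspace=ws, name="digit-clustering")
    run = experiment.start_logging()                  # create a new run in this experiment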

Get data and visualization functions

We use the handwritten-digits dataset from scikit-learn’s datasets module (sklearn.datasets.load_digits, an MNIST-style set of 8x8 images) for this tutorial.

We use a get_data() function to load the training data for our model.
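A possible shape for get_data(), assuming scikit-learn’s 8x8 digits dataset; the original implementation may differ:

    from sklearn.datasets import load_digits

    def get_data():
        digits = load_digits()
        X = digits.images.reshape(len(digits.images), -1)  # flatten 8x8 images -> 64 features
        y = digits.target                                   # labels from 0 to 9
        return X, y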

To shed light on the problem, we define some visualization functions for our dataset’s distribution, as well as for the clustering results.
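For example, a hypothetical helper for the label distribution could look like this (the function and file names are my own, not the original code):

    import os
    import matplotlib.pyplot as plt

    def plot_distribution(labels, title, file_name):
        os.makedirs("outputs", exist_ok=True)          # keep all output files in outputs/
        plt.figure()
        plt.hist(labels, bins=range(11), align="left", rwidth=0.8)
        plt.xticks(range(10))
        plt.xlabel("digit")
        plt.ylabel("count")
        plt.title(title)
        plt.savefig(os.path.join("outputs", file_name))
        plt.close()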

Let’s see the distribution of our dataset:

This function is used for visualizing some wrong predictions, comparing the “real” label and the “predicted” label of each example. We will use it later.
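A sketch of such a helper (the name plot_wrong_predictions and the layout are assumptions, not the original code):

    import matplotlib.pyplot as plt

    def plot_wrong_predictions(X, y_true, y_pred, n=8, file_name="outputs/wrong_predictions.png"):
        wrong = [i for i in range(len(y_true)) if y_true[i] != y_pred[i]][:n]
        fig, axes = plt.subplots(1, len(wrong), figsize=(2 * len(wrong), 2), squeeze=False)
        for ax, i in zip(axes[0], wrong):
            ax.imshow(X[i].reshape(8, 8), cmap="gray")
            ax.set_title(f"real {y_true[i]} / pred {y_pred[i]}")
            ax.axis("off")
        fig.savefig(file_name)
        plt.close(fig)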

Define the clustering model

As mentioned earlier, we use the K-Means algorithm for this problem, hoping that all images of the same digit end up in the same cluster.

Because we need to distinguish the digits 0 to 9, our K-Means model will have k = 10 clusters.
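A minimal sketch with scikit-learn’s KMeans (the helper name get_model is my own):

    from sklearn.cluster import KMeans

    def get_model(n_clusters=10):
        # one cluster per digit, fixed random_state for reproducibility
        return KMeans(n_clusters=n_clusters, random_state=0)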

Main script for our program

Get the data, visualize it, and save the image files to the run.

The upload_file() method saves a file from our current project directory to the run, so that you can access it later in the run’s Outputs + logs tab.
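Continuing the sketch above, this step might look like the following (metric and file names are illustrative):

    X, y = get_data()
    plot_distribution(y, "Real label distribution", "real_distribution.png")

    run.log("n_samples", len(X))                       # track a simple metric on the run
    run.upload_file(name="outputs/real_distribution.png",
                    path_or_stream="./outputs/real_distribution.png")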

Execute K-Means and save the clustering model.
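A possible version of this step; joblib is one common way to persist a scikit-learn model, though the original post may use another:

    import joblib

    kmeans = get_model(n_clusters=10)
    clusters = kmeans.fit_predict(X)                   # cluster index for every example

    joblib.dump(kmeans, "outputs/kmeans_model.joblib")
    run.upload_file(name="outputs/kmeans_model.joblib",
                    path_or_stream="./outputs/kmeans_model.joblib")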

Visualize all clusters to decide the label for each of them.
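One way to do this (not necessarily the original approach) is to reshape each cluster center back to 8x8 and plot it:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 5, figsize=(10, 4))
    for idx, ax in enumerate(axes.ravel()):
        ax.imshow(kmeans.cluster_centers_[idx].reshape(8, 8), cmap="gray")
        ax.set_title(f"cluster {idx}")
        ax.axis("off")
    fig.savefig("outputs/cluster_centers.png")
    plt.close(fig)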

You can see some images of clusters:

  • Cluster 0:

  • Cluster 1:

  • Cluster 2:

After visualizing each cluster, you can assign the most frequent digit as its label. The result is a dictionary like this:

The key of transfer_dict is the cluster index, and the value is the label of that cluster.
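The mapping below is illustrative only, since the actual values depend on your clustering result; one way to build it automatically is to take the most frequent real label in each cluster:

    import numpy as np

    # illustrative: map each cluster index to its most frequent real label
    transfer_dict = {c: int(np.bincount(y[clusters == c]).argmax()) for c in range(10)}
    y_pred = np.array([transfer_dict[c] for c in clusters])   # cluster index -> digit label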

We can save the results of our prediction model to a CSV file for further examination.
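For example (column and file names are assumptions):

    import pandas as pd

    pd.DataFrame({"real": y, "predicted": y_pred}).to_csv("outputs/predictions.csv", index=False)
    run.upload_file(name="outputs/predictions.csv",
                    path_or_stream="./outputs/predictions.csv")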

We can compare the distributions of the “real” and “predicted” labels (a short sketch of how to produce both plots follows the list):

  • “Real” labels:

  • “Predicted” labels:
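Reusing the hypothetical plot_distribution helper from earlier, both plots could be produced like this:

    plot_distribution(y, "Real label distribution", "real_distribution.png")
    plot_distribution(y_pred, "Predicted label distribution", "predicted_distribution.png")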

Here are a few lines of code to calculate the accuracy of our predictions:
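A minimal sketch, also logging the metric to the run:

    accuracy = (y_pred == y).mean()
    run.log("accuracy", float(accuracy))               # appears under the run's metrics
    print(f"Accuracy: {accuracy:.4f}")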

And the result is …

Let’s take a glance at some wrong predictions:
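Using the hypothetical plot_wrong_predictions helper defined earlier:

    plot_wrong_predictions(X, y, y_pred)
    run.upload_file(name="outputs/wrong_predictions.png",
                    path_or_stream="./outputs/wrong_predictions.png")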

Remember to close your run by calling this method:
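    run.complete()                                     # marks the run as finished in the experiment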

In the Experiments tab, you can see your run, like this:

Click on this run; in the Outputs + logs tab, you can see all your uploaded files: