MLflow Quick Guide

Introduction
Components (In detail)
Quick Start
Some technical things

MLflow Tracking: Tracking experiments to record and compare parameters and results of different runs.
MLflow Projects: Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production.
MLflow Models: Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms.

MLflow is library-agnostic. You can use it with any machine learning library, and in any programming language, since all functions are accessible through a REST API and CLI.

The code and examples shown here are exclusively in Python.

MLflow Tracking

The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results. MLflow Tracking lets you log and query experiments using the Python, REST, R, and Java APIs. The MLflow Python API logs runs locally to files in an mlruns directory wherever you ran your program. You can then run mlflow ui to see the logged runs:

  mlflow ui

The standard logging functions are listed below.

mlflow.start_run() #returns the currently active run (if one exists), or starts a new run.

You do not need to call start_run explicitly: calling one of the logging functions with no active run automatically starts a new one.

	mlflow.end_run() #ends the currently active run, if any, taking an optional run status.

	mlflow.log_param() #logs a single key-value param in the currently active run.

The key and value are both strings. Use mlflow.log_params() to log multiple params at once.

	mlflow.log_metric() #logs a single key-value metric.

The value must always be a number. MLflow remembers the history of values for each metric. Use mlflow.log_metrics() to log multiple metrics at once.

mlflow.log_artifact() #logs a local file or directory as an artifact, optionally taking an artifact_path to place it in within the run's artifact URI.

Run artifacts can be organized into directories, so you can place the artifact in a directory this way.

MLflow Projects

An MLflow Project is a format for packaging data science code in a reusable and reproducible way, based primarily on conventions. In addition, the Projects component includes an API and command-line tools for running projects, making it possible to chain together projects into workflows. Each project is simply a directory of files, or a Git repository, containing your code.

It mainly contains two YAML-formatted files:

A conda.yaml file, which is treated as a Conda environment specification
A more detailed MLproject file (written in YAML syntax, but named without the .yaml extension)

A conda.yaml file looks like this:

name: name
channels:
    - defaults
dependencies:
    - python=3.6
    - scikit-learn
    - pip:
        - mlflow>=1.0

Where:

name is any human-readable project name.
channels refers to where Conda, the environment management tool, looks for the declared dependencies. Currently, the defaults channel searches all URLs under https://repo.anaconda.com/pkgs/.
dependencies lists all the packages that must be installed to run the program(s).

And the MLproject file looks like this:

name: tutorial

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py {alpha} {l1_ratio}"

conda_env: the software environment that should be used to execute project entry points. This includes all library dependencies required by the project code (a Conda environment, in the scope of this project).
entry_points: commands that can be run within the project, and information about their parameters. Most projects contain at least one entry point that you want other users to call.
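The main entry point above runs "python train.py {alpha} {l1_ratio}", so train.py receives the two values as positional command-line arguments. The original train.py is not shown here; the following is only a hypothetical sketch of how such a script might read them, with the same defaults as in the MLproject file.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional values that the MLproject command passes in."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    # nargs="?" makes each positional optional, so the script also runs standalone.
    parser.add_argument("alpha", type=float, nargs="?", default=0.5)
    parser.add_argument("l1_ratio", type=float, nargs="?", default=0.1)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"alpha={args.alpha}, l1_ratio={args.l1_ratio}")
```

With this layout, a call such as mlflow run . -P alpha=0.3 would fill {alpha} with 0.3 and {l1_ratio} with its declared default of 0.1 before invoking the command.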

(Note: Not all Python packages are available as Conda packages. Some might only be available through PyPI, or may be released there first. Including pip in the dependencies of conda.yaml installs that Python-specific package manager. Listing packages below pip in the hierarchy indicates that pip should be used to install those packages.)

MLflow Models

An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools, for example real-time serving through a REST API or batch inference on Apache Spark. (Think of it as similar to pickling a model with Python's pickle module!)

MLflow includes integrations with several common libraries. For example:

mlflow.sklearn contains save_model, log_model, and load_model functions for scikit-learn models.
The mlflow.models.Model class is used to create and write models. There are built-in flavours, but custom models are also supported. The class supports adding flavours as well as loading, saving, and logging a model. The model can be deployed locally or as a Docker image.

Dataset: Forest Cover Type Dataset

This dataset surveys four areas of the Roosevelt National Forest in Colorado and includes information on tree type, shadow coverage, distance to nearby landmarks (roads, etc.), soil type, and local topography. Predicting the forest cover type is the classification problem.

TensorFlow

TensorFlow allows developers to create dataflow graphs (for language and hardware portability) in which each node represents a mathematical operation and each edge between nodes is a multidimensional data array, or tensor. These are directed acyclic graphs (DAGs). Useful APIs include:

tf.estimator - High-level API for distributed training. Estimators allow for quick models, checkpointing, out-of-memory datasets, distributed training, etc. They can be fed data from NumPy arrays or pandas DataFrames.
tf.layers, tf.losses, tf.metrics - For building custom neural network models.

Python APIs - used to build the DAG.

Simply put: build a DAG (a.k.a. the model), create a session to run the model, feed the model values using feed_dict (in Session.run()), and run the model in the session. Variables are trainable in TF (e.g. the weight vector of a neural network). To have formal parameters initialised at runtime, placeholders are used, which are filled in through feed_dict. The tf.estimator.train_and_evaluate function can be used for data parallelism.
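The build-then-run workflow described above can be sketched as follows. Sessions and placeholders belong to the TensorFlow 1.x API, so this sketch assumes a TensorFlow 2.x installation and reaches them through tf.compat.v1. The tiny linear model and its weights are made up for illustration.

```python
import tensorflow as tf


def run_linear_graph(features):
    """Build a small graph y = x @ w and evaluate it in a session."""
    graph = tf.Graph()
    with graph.as_default():
        # Placeholder: a formal parameter whose value is supplied at run time.
        x = tf.compat.v1.placeholder(tf.float32, shape=[None, 2], name="x")
        # Variable: trainable state, here a fixed illustrative weight vector.
        w = tf.compat.v1.Variable([[1.0], [2.0]], name="w")
        y = tf.matmul(x, w)
        init = tf.compat.v1.global_variables_initializer()

    with tf.compat.v1.Session(graph=graph) as sess:
        sess.run(init)  # variables must be initialised before use
        # feed_dict binds concrete values to the placeholder for this run.
        return sess.run(y, feed_dict={x: features}).tolist()


if __name__ == "__main__":
    print(run_linear_graph([[1.0, 1.0], [0.5, 2.0]]))
```

Nothing is computed while the graph is being built; values only flow through the nodes when Session.run is called.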

Use cases of TensorFlow include voice/sound/image recognition, text-based applications, time series, and video detection.

Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. We use the Random Forest classifier from Scikit-learn on the Forest Cover Type Dataset.

Both programs are configured to run on any classification dataset.

The first program is a demo for sklearn: it selects random hyper-parameter values from a given range and logs each run to MLflow, so that you end up with multiple models to compare in the MLflow UI. It records metrics, parameters, and artifacts (plots, the model, etc.) for each run. The number of runs is configurable.

The second program is a TensorFlow program adapted from Kaggle that records multiple metrics. You can pass hyper-parameters as arguments to get various results. With no arguments, it runs with default parameters and displays a message with instructions like this:

This program can be ran by manually tuning the following Parameters
Input the following parameters in a following way
python file.py mini_batch_size no_of_epochs total_data_size test_size(in format %total_data_size)
eg. python mlflow_tensor.py 10 1000 2000 0.4
Running on default params right now

Code Snippet

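# Logging the TensorFlow run to MLflow.
# sa_fig, tot_cost, fin_acc, ip_features, op_features, training_set,
# test_set and lr are defined earlier in the training script.
import mlflow

with mlflow.start_run():
    mlflow.log_artifact(sa_fig)  # sa_fig is a local file path
    # Log one loss value per entry so MLflow keeps the full history.
    for step, cost in enumerate(tot_cost):
        mlflow.log_metric("loss", cost, step=step)
    mlflow.log_metric("Accuracy", fin_acc)
    mlflow.log_param("Input Features", ip_features)
    mlflow.log_param("Output Labels", op_features)
    mlflow.log_param("Training_Size", len(training_set))
    mlflow.log_param("Test_Size", len(test_set))
    mlflow.log_param("Learning_rate", lr)
# No explicit mlflow.end_run() is needed: the with block ends the run.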

The packaged ML project should ideally be run inside a Python virtual environment (venv), so that it doesn't interfere with your existing directories and scripts.
There is an issue with Conda activation within the automatic script that MLflow Projects calls. A workaround is to add --no-conda at the end when running from the CLI:

  mlflow run projectname parameters(if any) --no-conda

More information here: https://github.com/mlflow/mlflow/issues/1507
