Snorkel Beginner's Guide

Introduction to Snorkel
Simple Labelling Task
Introduction to Labelling Functions
Analysis and performance evaluation
Bringing LFs together!
Training a Classifier
Further Steps

Snorkel is a system for programmatically building and managing training datasets. With Snorkel, users can develop training datasets in hours or days rather than hand-labeling them over weeks or months.
The core idea is to write multiple labeling functions, each of which labels training points. We don't hand-label the training set, but we keep a small labeled validation set so that we can measure how good the functions are.
The problem is that we have no ground truth for the training data while we are training, so there is no easy way to determine the accuracy of each labeling function directly.
Instead, Snorkel uses a generative model that combines the outputs of multiple labeling functions, which are written based on guidance, conditions, and domain expertise.

NOTE: Currently, Windows is not supported natively. It should still work after manually installing supported versions of specific packages. Make sure to install a specific version of PyTorch yourself, because letting it install through Snorkel's requirements or the Docker image might throw a PyTorch memory error. Also, make sure to install everything in a virtual environment. Other options for running on Windows are the Windows Subsystem for Linux (WSL) or a Docker image.

This tutorial serves as a basic guide to data labeling: it builds simple labeling functions for classifying YouTube comments as ham or spam. It also shows how to integrate MLflow to keep track of all the models.

By default, you can load the dataset directly by cloning the Snorkel Tutorials repository and writing
from utils import load_spam_dataset
It downloads the raw CSV files from the internet, divides them into splits, converts them into DataFrames, and shuffles them. The dataset contains comments from 5 of the most popular YouTube videos during a period between 2014 and 2015. However, that helper relies on a Linux filesystem, so it doesn't work on Windows. So I have written a function that does the same job. Download the data set from here

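import glob

import pandas as pd
# Assumption: shuffle is sklearn.utils.shuffle, which accepts DataFrames.
from sklearn.utils import shuffle


def data_loader():
    # Folder that holds the downloaded CSV files (placeholder; set your own path).
    path = r'directory of dataset'
    all_files = glob.glob(path + "/*.csv")

    # Read every CSV into a DataFrame and stack them into a single frame.
    li = []
    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)
    frame = pd.concat(li, axis=0, ignore_index=True)

    # Shuffle the rows so that every split mixes comments from all videos.
    frame = shuffle(frame)

    # Training split: drop the last column, which holds the gold label.
    train = frame.iloc[0:1369, :-1]
    # Development split: a labeled subset of the training rows.
    dev = frame.iloc[500:900]
    validation = frame.iloc[1370:1663]
    test = frame.iloc[1664:]
    # Note the return order: test comes first.
    return test, train, dev, validation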

The splits are as explained below:

Training Set: The largest split of the dataset, and the one without ground truth ("gold") labels. We will generate labels for these data points with weak supervision.
[Optional] Development Set: A small labeled subset of the training data (e.g. 100 points) to guide LF development. See the note below.
Validation Set: A small labeled set used to tune hyperparameters while training the classifier.
Test Set: A labeled set for a final evaluation of our classifier. This set should only be used for final evaluation, not error analysis.
```
#For clarity, we define constants to represent the class labels for spam, ham, and abstaining.
ABSTAIN = -1
HAM = 0
SPAM = 1
```
Introduction to Labelling Functions

Labeling functions (LFs) help users encode domain knowledge and other supervision sources programmatically. LFs are heuristics that take a data point as input and either assign a label to it (in this case, HAM or SPAM) or abstain (don't assign any label). Labeling functions can be noisy: they don't have perfect accuracy and don't have to label every data point. Moreover, different labeling functions can overlap (label the same data point) and even conflict (assign different labels to the same data point). There are many ways to write labeling functions. Here I show two of them:
- Keyword searches: looking for specific words in a sentence
- Heuristic labeling functions

Labeling functions in Snorkel are created with the @labeling_function decorator. The decorator can be applied to any Python function that returns a label for a single data point. Here is one example:

from snorkel.labeling import labeling_function

@labeling_function()
def check(x):
    # CONTENT is the comment-text column in the raw CSV files.
    return SPAM if "check" in x.CONTENT.lower() else ABSTAIN

This function checks the comment for the word "check". In practice, you usually define several such keywords. Instead of writing a nearly identical function for each one, we build a small factory around Snorkel's LabelingFunction class that takes a list of keywords, so we don't have to write a function per keyword by hand.

from snorkel.labeling import LabelingFunction
# Note: this is the LabelingFunction class, not the labeling_function decorator.

def keyword_lookup(x, keywords, label):
    if any(word in x.CONTENT.lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

# Spam comments talk about 'my channel', 'my video', etc.
keyword_my = make_keyword_lf(keywords=["my"])

# Spam comments ask users to subscribe to their channels.
keyword_subscribe = make_keyword_lf(keywords=["subscribe"])

# Spam comments post links to other channels.
keyword_link = make_keyword_lf(keywords=["http"])

# Spam comments make requests rather than commenting.
keyword_please = make_keyword_lf(keywords=["please", "plz"])

# Ham comments actually talk about the video's content.
keyword_song = make_keyword_lf(keywords=["song"], label=HAM)

Here is one heuristic labeling function:

@labeling_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return HAM if len(x.CONTENT.split()) < 5 else ABSTAIN

Here we also use MLflow to log these labeling functions as run parameters.

import mlflow

with mlflow.start_run():
    mlflow.log_param("keywords LF", "my, subscribe, link, please, song")
    mlflow.log_param("heuristic LF", "short comment")

from snorkel.labeling import PandasLFApplier
lfs = [
    keyword_my,
    keyword_subscribe,
    keyword_link,
    keyword_please,
    keyword_song,
    short_comment,
    ]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train)
L_dev = applier.apply(df=dev)

Snorkel provides the LFAnalysis utility, which computes common statistics for these LFs, such as coverage (i.e., the fraction of the dataset that they label). Its summary reports:

Polarity: The set of unique labels this LF outputs (excluding abstains)
Coverage: The fraction of the dataset the LF labels
Overlaps: The fraction of the dataset where this LF and at least one other LF label
Conflicts: The fraction of the dataset where this LF and at least one other LF label and disagree
Correct: The number of data points this LF labels correctly (if gold labels are provided)
Incorrect: The number of data points this LF labels incorrectly (if gold labels are provided)
Empirical Accuracy: The empirical accuracy of this LF (if gold labels are provided) (Here gold labels mean true labels, if available)
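
To make the first three definitions concrete, here is a minimal plain-Python sketch that computes coverage, overlaps, and conflicts for one LF from a tiny label matrix. The matrix and the helper name lf_stats are made up for illustration; this is not how Snorkel computes its summary internally, only the same definitions written out by hand.

```python
# Plain-Python sketch of per-LF coverage, overlaps, and conflicts.
# The label matrix below is made-up illustration data, not tutorial output.
ABSTAIN = -1
HAM = 0
SPAM = 1

# Rows are data points, columns are labeling functions.
L = [
    [SPAM, ABSTAIN, SPAM],
    [ABSTAIN, ABSTAIN, HAM],
    [SPAM, HAM, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]


def lf_stats(label_matrix, j):
    """Return (coverage, overlaps, conflicts) for the LF in column j."""
    n = len(label_matrix)
    covered = overlaps = conflicts = 0
    for row in label_matrix:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Labels emitted by the other LFs on this data point.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n


for j in range(3):
    print(j, lf_stats(L, j))
```

Every conflict is also an overlap, so the conflicts fraction can never exceed the overlaps fraction, which in turn can never exceed the coverage.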

Here is one example:

from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| LF        | J | COVERAGE | OVERLAPS | CONFLICTS | CORRECT | INCORRECT | EMP. ACC. |
| --------: | :-: | -------: | -------: | --------: | ------: | --------: | --------: |
| CHECK_OUT | 0 | 0.22     | 0.22     | 0.1       | 22      | 0         | 1.000000  |
| CHECK     | 1 | 0.30     | 0.22     | 0.1       | 29      | 1         | 0.966667  |

To improve our labeling functions further, Snorkel provides a helper method, get_label_buckets(...), which groups data points by their predicted label and true label. For example, we can find the indices of data points that the LF labeled SPAM but that actually belong to class HAM. This may give ideas for where the LF could be made more specific. (Note that we can only do this if a gold-labeled dev set is available!)

from snorkel.analysis import get_label_buckets

buckets = get_label_buckets(Y_dev, L_dev[:, 1])
df_dev.iloc[buckets[(HAM, SPAM)]]

# Here the tutorial performs an additional check: it also builds buckets for the training set to see whether
# the intuition was correct. In the real world, that usually isn't possible, because the training set has no gold labels.

| Author       | Date                | Text                                   | Label  | Video |
| ------------ | ------------------- | -------------------------------------- | ------ | ----- |
| Eanna Cusack | 2014-01-20T22:20:59 | Im just to check how much views it has | 0      | 1     |
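
The bucketing idea itself is simple enough to write by hand. Below is a minimal plain-Python sketch, assuming made-up gold labels and votes from a single LF; label_buckets is a hypothetical stand-in that mimics the (gold, predicted) keys used above, not Snorkel's own implementation.

```python
from collections import defaultdict

ABSTAIN, HAM, SPAM = -1, 0, 1


def label_buckets(golds, preds):
    """Map each (gold, predicted) pair to the list of row indices that have it."""
    buckets = defaultdict(list)
    for i, pair in enumerate(zip(golds, preds)):
        buckets[pair].append(i)
    return dict(buckets)


# Made-up gold labels and one LF's votes for five comments.
Y_gold = [SPAM, HAM, HAM, SPAM, HAM]
lf_votes = [SPAM, SPAM, ABSTAIN, SPAM, HAM]

buckets = label_buckets(Y_gold, lf_votes)
# Rows whose gold label is HAM but which the LF called SPAM.
false_spam = buckets.get((HAM, SPAM), [])
print(false_spam)
```

Reading the comments in the (HAM, SPAM) bucket is usually the quickest way to see which keyword or rule is too broad.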

Other, more complex ways to write labeling functions include (but are not limited to) regular expressions, preprocessors (e.g., TextBlob), and complex preprocessors (NLP).

We now combine the outputs of these functions into a single noise-aware, probabilistic (or confidence-weighted) label per data point. That is, we take into account what every labeling function thinks the label is and try to reduce the noise as much as possible. A simple baseline for doing this is to take the majority vote on a per-data point basis: if more LFs voted SPAM than HAM, label it SPAM (and vice versa).

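from snorkel.labeling import MajorityLabelVoter

# Apply the LFs to the validation split, then score a majority-vote baseline on its gold labels.
L_valid = applier.apply(df=validation)
majority_model = MajorityLabelVoter()

majority_acc = majority_model.score(L=L_valid, Y=validation.iloc[:, -1].to_numpy())["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
mlflow.log_metric('Majority_Vote_Acc', majority_acc * 100)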
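
To see what the majority vote does on individual rows, here is a minimal plain-Python sketch. The label matrix is made up, and treating ties as abstains is a choice of this sketch rather than a statement about Snorkel's internals.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


def majority_vote(row):
    """Return the most common non-abstain label, or ABSTAIN on a tie or no votes."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    # A tie between the top two labels is treated as "no decision" here.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Made-up label matrix: rows are comments, columns are LFs.
L = [
    [SPAM, SPAM, HAM],
    [HAM, ABSTAIN, ABSTAIN],
    [SPAM, HAM, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
preds = [majority_vote(row) for row in L]
print(preds)
```

The third and fourth rows end up unlabeled: one is a one-to-one tie and the other has no votes at all, which is exactly where a simple vote runs out of information.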

Another way to convert these functions is to use LabelModel. The LabelModel can learn weights for the labeling functions using only the label matrix as input. Note that no gold labels are used during the training process. The only information we need is the label matrix, which contains the output of the LFs on our training set.
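
Why weights matter is easiest to see with a toy example. The sketch below is NOT Snorkel's LabelModel: it assumes the LF accuracies are already known (the numbers are made up), that LFs err independently, and that both classes are equally likely, and then applies a plain naive-Bayes weighted vote. The real LabelModel has to estimate such weights from the label matrix alone.

```python
import math

ABSTAIN, HAM, SPAM = -1, 0, 1


def spam_probability(row, accuracies):
    """P(SPAM | votes) under a naive-Bayes model with a uniform class prior."""
    log_odds = 0.0
    for vote, acc in zip(row, accuracies):
        if vote == ABSTAIN:
            continue  # an abstain carries no evidence in this sketch
        # More accurate LFs get a larger say in the final label.
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if vote == SPAM else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))


# Assumed (made-up) accuracies: one strong LF and two weak ones.
assumed_acc = [0.9, 0.6, 0.6]
row = [SPAM, HAM, HAM]
p_spam = spam_probability(row, assumed_acc)
print(f"P(SPAM) = {p_spam:.2f}")
```

A majority vote would call this row HAM (two votes to one), while the weighted vote leans toward SPAM because the single SPAM vote comes from the much more reliable LF.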

The output of the Snorkel LabelModel is just a set of labels which can be used with most popular libraries for performing supervised learning, such as TensorFlow, Keras, PyTorch, Scikit-Learn, Ludwig, and XGBoost. In this tutorial, we demonstrate using classifiers from Keras and Scikit-Learn.

Featurization

For simplicity and speed, we use a simple "bag of n-grams" feature representation: each data point is represented by a sparse vector recording which words and 2-word combinations appear in the comment text. (By default, CountVectorizer stores how often each n-gram occurs rather than a strict 0/1 indicator.)
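
To show what such a representation looks like, here is a minimal plain-Python sketch that builds unigram and bigram counts for two made-up comments. The helper names are hypothetical, and the whitespace tokenization is deliberately cruder than what CountVectorizer does.

```python
from collections import Counter


def ngram_counts(text, n_max=2):
    """Count all 1..n_max word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def vectorize(texts, n_max=2):
    """Return a sorted vocabulary and one dense count row per text."""
    per_doc = [ngram_counts(t, n_max) for t in texts]
    vocab = sorted(set().union(*per_doc))
    rows = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, rows


vocab, X = vectorize(["check out my channel", "love this song"])
print(vocab)
print(X[0])
```

Each column corresponds to one n-gram in the vocabulary, so a bigram such as "check out" gets its own feature next to the single words "check" and "out".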

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())

X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())

Keras Classifier with Probabilistic Labels

We'll use Keras, a popular high-level API for building models in TensorFlow, to build a simple logistic regression classifier. We compile it with a categorical_crossentropy loss so that it can handle probabilistic labels instead of integer labels. We use the common settings of an Adam optimizer and early stopping (evaluating the model on the validation set after each epoch and reloading the weights from when it achieved the best score). For more information on Keras, see the Keras documentation.
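
The reason categorical cross-entropy suits probabilistic labels is that the target can be any probability distribution, not only a one-hot vector. Here is a minimal plain-Python sketch of the loss for a single data point; the probabilities are made up, and the helper names are hypothetical rather than part of Keras or Snorkel.

```python
import math


def categorical_crossentropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(
        t * math.log(max(p, eps))  # clip to avoid log(0)
        for t, p in zip(target_probs, predicted_probs)
    )


def to_one_hot(label, num_classes=2):
    """Turn an integer label into a one-hot probability vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


predicted = [0.3, 0.7]  # model output: P(HAM), P(SPAM)

# Hard label: the target is exactly SPAM.
hard_loss = categorical_crossentropy(to_one_hot(1), predicted)
# Probabilistic label: the target is "80% SPAM, 20% HAM".
soft_loss = categorical_crossentropy([0.2, 0.8], predicted)
print(hard_loss, soft_loss)
```

With a one-hot target only the true class contributes to the loss, while a probabilistic target lets every class contribute in proportion to its weight, so uncertain labels pull on the model less sharply.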

from snorkel.analysis import metric_score
from snorkel.utils import preds_to_probs
from utils import get_keras_logreg, get_keras_early_stopping

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])

keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)

preds_test = keras_model.predict(x=X_test).argmax(axis=1)
test_acc = metric_score(golds=Y_test, preds=preds_test, metric="accuracy")
print(f"Test Accuracy: {test_acc * 100:.1f}%")
mlflow.log_metric("test_accuracy", test_acc*100)

# Save this Keras model with MLflow
import mlflow.keras

mlflow.keras.save_model(keras_model, path='runs')

NOTE: This tutorial is adapted from the official Snorkel documentation and tutorials, and parts of it are used directly. See more here
