
This tutorial is a basic guide to data labeling: it builds simple labeling functions that classify YouTube comments as ham or spam. It also shows how to integrate MLflow into the workflow to keep track of all the models.

By default, you can load the dataset directly by cloning the Snorkel Tutorials repository and writing
from utils import load_spam_dataset
This helper downloads the raw CSV files from the internet, divides them into splits, converts them into DataFrames, and shuffles them. The dataset contains comments from 5 of the most popular YouTube videos during a period between 2014 and 2015. In my setup, however, the helper fails because of the Linux filesystem, so I wrote a function that does the same job. Download the dataset from here

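import glob

import pandas as pd
# shuffle() was called below but never imported; sklearn.utils.shuffle is assumed here
from sklearn.utils import shuffle

def data_loader():
    # Folder that holds the downloaded CSV files
    path = r'directory of dataset'
    all_files = glob.glob(path + "/*.csv")
    li = []

    # Read every CSV file and collect the DataFrames
    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)

    # Stack all files into one DataFrame and shuffle the rows
    frame = pd.concat(li, axis=0, ignore_index=True)
    frame = shuffle(frame)

    # train drops the last column (the label), so it is treated as unlabeled
    train = frame.iloc[0:1369, :-1]
    # dev keeps its labels and overlaps with the rows used for train
    dev = frame.iloc[500:900]
    validation = frame.iloc[1370:1663]
    test = frame.iloc[1664:]
    return test, train, dev, validation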

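The splits are as follows:

- train: rows 0 to 1368 with the label column removed. The labeling functions are applied to this split.
- dev: rows 500 to 899 with labels kept. This labeled slice of the training rows is used to develop and check the labeling functions.
- validation: rows 1370 to 1662 with labels kept. It is used to score the majority vote and to validate the Keras model during training.
- test: rows 1664 onward with labels kept. It is held out for the final accuracy.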

A basic keyword labeling function checks the comment text for one particular word, and you would usually define several such keywords. Instead of writing a near-identical function for each keyword, we build each one as an instance of the LabelingFunction class, which takes the keywords as a resource.

from snorkel.labeling import LabelingFunction
# Note: this is the LabelingFunction class, not the labeling_function decorator.

# Label values used throughout (assumed here; they follow the usual Snorkel convention)
ABSTAIN = -1
HAM = 0
SPAM = 1

def keyword_lookup(x, keywords, label):
    # Vote `label` if any keyword appears in the lowercased comment, else abstain
    if any(word in x.CONTENT.lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=SPAM):
    # The LF is named after its first keyword, e.g. "keyword_my"
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

# Spam comments talk about 'my channel', 'my video', etc.
keyword_my = make_keyword_lf(keywords=["my"])

# Spam comments ask users to subscribe to their channels.
keyword_subscribe = make_keyword_lf(keywords=["subscribe"])

# Spam comments post links to other channels.
keyword_link = make_keyword_lf(keywords=["http"])

# Spam comments make requests rather than commenting.
keyword_please = make_keyword_lf(keywords=["please", "plz"])

# Ham comments actually talk about the video's content.
keyword_song = make_keyword_lf(keywords=["song"], label=HAM)

Here is one heuristic labeling function:

from snorkel.labeling import labeling_function

@labeling_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return HAM if len(x.CONTENT.split()) < 5 else ABSTAIN
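
To see how these labeling functions vote, here is a minimal, dependency-free sketch that mimics keyword_lookup and short_comment on plain strings. It does not use Snorkel; the label values (ABSTAIN = -1, HAM = 0, SPAM = 1), the helper names, and the sample comments are assumptions made for illustration.

```python
# Assumed label encoding for this sketch.
ABSTAIN, HAM, SPAM = -1, 0, 1

def keyword_vote(text, keywords, label):
    """Return `label` if any keyword occurs in the lowercased text, else abstain."""
    return label if any(word in text.lower() for word in keywords) else ABSTAIN

def short_comment_vote(text):
    """Vote HAM for comments shorter than five words, else abstain."""
    return HAM if len(text.split()) < 5 else ABSTAIN

# Hypothetical comments, not taken from the dataset.
comments = [
    "Please subscribe to my channel",
    "cool video!",
    "This song never gets old, I listen to it every single day",
]

# One row of votes per comment: [subscribe keyword, song keyword, short comment]
votes = [
    [
        keyword_vote(c, ["subscribe"], SPAM),
        keyword_vote(c, ["song"], HAM),
        short_comment_vote(c),
    ]
    for c in comments
]
for comment, row in zip(comments, votes):
    print(row, comment)
```

Note that `word in text` is a substring test, so a keyword such as "my" also fires on words like "mystery". Short, common keywords therefore tend to cover more comments, at the cost of more wrong votes.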

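Here we also use MLflow to log those labeling functions as parameters of the run.

import mlflow
from snorkel.labeling import PandasLFApplier

# Start a run and leave it open so that the metrics logged later land in the same run
mlflow.start_run()
mlflow.log_param("keywords LF", "my, subscribe, link, please, song")
mlflow.log_param("heuristic LF", "short comment")

lfs = [
    keyword_my,
    keyword_subscribe,
    keyword_link,
    keyword_please,
    keyword_song,
    short_comment,
]
# Apply every LF to every row; the result is a label matrix with one column per LF
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train)
L_dev = applier.apply(df=dev)
L_valid = applier.apply(df=validation)  # needed later to score the majority vote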

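Snorkel provides a tool, LFAnalysis, that calculates common statistics for these LFs, such as coverage (i.e., the percentage of the dataset that an LF labels). Apart from that, it reports each LF's overlaps and conflicts with the other LFs and, when gold labels are available, its correct and incorrect counts and its empirical accuracy.

Here is one example: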

from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()

| LF        | j | COVERAGE | OVERLAPS | CONFLICTS | CORRECT | INCORRECT | EMP. ACC. |
| ---------:|:-:|---------:|---------:|----------:|--------:|----------:|----------:|
| CHECK_OUT | 0 | 0.22     | 0.22     | 0.1       | 22      | 0         | 1.000000  |
| CHECK     | 1 | 0.30     | 0.22     | 0.1       | 29      | 1         | 0.966667  |
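
The first three statistics can be read straight off the label matrix. The sketch below computes them by hand for a small, made-up matrix; it is not Snorkel's implementation, and the matrix values and the `lf_stats` helper are assumptions for illustration. Coverage is the share of rows an LF labels, overlaps the share where it and at least one other LF both label, and conflicts the share where another LF labels differently.

```python
ABSTAIN = -1  # assumed encoding for "no vote"

# Hypothetical label matrix: one row per comment, one column per labeling function.
L = [
    [1, -1, -1],
    [1, 0, -1],
    [-1, 0, 0],
    [-1, -1, -1],
]

def lf_stats(L, j):
    """Coverage, overlaps and conflicts of LF `j`, each as a fraction of all rows."""
    n = len(L)
    covered = overlaps = conflicts = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue  # LF j did not vote on this row
        covered += 1
        # Votes cast by the other LFs on the same row
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlaps += 1
        if any(v != row[j] for v in others):
            conflicts += 1
    return covered / n, overlaps / n, conflicts / n

for j in range(3):
    print(j, lf_stats(L, j))
```

An LF with high coverage but many conflicts is usually the first candidate to inspect more closely.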

To further improve our labeling functions, Snorkel provides a helper method, get_label_buckets(...), that groups data points by their predicted label and true label. For example, we can find the indices of data points that the LF labeled SPAM but that actually belong to class HAM. This may give ideas for where the LF could be made more specific. (Note that we can only do this if a gold-labeled dev set is available!)

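from snorkel.analysis import get_label_buckets

# Gold labels of the dev split (the last column holds the label)
Y_dev = dev.iloc[:, -1].values

# Compare the gold labels with the votes of the second LF (column 1 of the label matrix)
buckets = get_label_buckets(Y_dev, L_dev[:, 1])
# Rows whose gold label is HAM but which the LF labeled SPAM
dev.iloc[buckets[(HAM, SPAM)]]

# Here the tutorial performs an additional check: it builds buckets for the training
# set to see whether the intuition was correct. In the real world, that would not be
# possible most of the time!

| Author       | Date                | Text                                   | Label | Video |
| ------------ | ------------------- | -------------------------------------- | ----- | ----- |
| Eanna Cusack | 2014-01-20T22:20:59 | Im just to check how much views it has | 0     | 1     |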
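
The bucketing idea itself is simple. The following plain-Python sketch groups row indices by (gold label, LF label) pairs. It is an illustration under assumed label values, not Snorkel's get_label_buckets; `label_buckets` and the sample labels are hypothetical.

```python
from collections import defaultdict

ABSTAIN, HAM, SPAM = -1, 0, 1  # assumed label encoding

def label_buckets(y_gold, y_lf):
    """Map each (gold label, LF label) pair to the row indices where it occurs."""
    buckets = defaultdict(list)
    for i, pair in enumerate(zip(y_gold, y_lf)):
        buckets[pair].append(i)
    return dict(buckets)

# Hypothetical gold labels and one LF's votes for six dev comments.
y_gold = [HAM, SPAM, HAM, SPAM, HAM, SPAM]
y_lf = [SPAM, SPAM, ABSTAIN, ABSTAIN, HAM, SPAM]

buckets = label_buckets(y_gold, y_lf)
# Rows that are really HAM but that the LF labeled SPAM
false_spam = buckets.get((HAM, SPAM), [])
print(false_spam)
```

Reading the comments at those indices is often the quickest way to see which keyword or rule is too broad.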

Other, more complex ways to write labeling functions include (but are not limited to) regular expressions, preprocessors (e.g. TextBlob), and more complex NLP preprocessors.

We now combine the outputs of the labeling functions into a single noise-aware probabilistic (or confidence-weighted) label per data point, i.e. we take into account what every labeling function thinks the label is and try to reduce the noise as much as possible. A simple baseline for doing this is to take the majority vote on a per-data point basis: if more LFs voted SPAM than HAM, label it SPAM (and vice versa).

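from snorkel.labeling import MajorityLabelVoter
# If this import fails on a newer Snorkel release, try snorkel.labeling.model instead.

majority_model = MajorityLabelVoter()

# Score the majority vote against the gold labels of the validation split
majority_acc = majority_model.score(L=L_valid, Y=validation.iloc[:, -1].values)["accuracy"]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
mlflow.log_metric('Majority_Vote_Acc', majority_acc * 100)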
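
As a rough illustration of what a majority vote does per data point, here is a small plain-Python sketch. It is not Snorkel's MajorityLabelVoter; the label values and the label matrix are made up, and returning ABSTAIN on a tie is an assumption of this sketch.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1  # assumed label encoding

def majority_vote(row):
    """Return the most common non-abstain vote in `row`; abstain on no votes or a tie."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the top two labels received equally many votes
    return ranked[0][0]

# Hypothetical label matrix: one row per comment, one column per labeling function.
L = [
    [SPAM, SPAM, HAM],
    [HAM, ABSTAIN, ABSTAIN],
    [SPAM, HAM, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]
preds = [majority_vote(row) for row in L]
print(preds)
```

Every LF counts equally here, which is exactly the limitation that a learned weighting of the LFs tries to address.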

Another way to convert these functions is to use LabelModel. The LabelModel can learn weights for the labeling functions using only the label matrix as input. Note that no gold labels are used during the training process. The only information we need is the label matrix, which contains the output of the LFs on our training set.

The output of the Snorkel LabelModel is just a set of labels which can be used with most popular libraries for performing supervised learning, such as TensorFlow, Keras, PyTorch, Scikit-Learn, Ludwig, and XGBoost. In this tutorial, we demonstrate using classifiers from Keras and Scikit-Learn.
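
The code further down refers to df_train_filtered and probs_train_filtered, i.e. a training set from which rows without any LF vote have been removed. Snorkel ships a helper for this (filter_unlabeled_dataframe in snorkel.labeling); the sketch below is only a plain-Python stand-in that shows the idea. The function name, the comments, the label matrix, and the probabilities are all hypothetical.

```python
ABSTAIN = -1  # assumed encoding for "no vote"

def filter_unlabeled(rows, probs, L):
    """Keep only the rows (and their probabilistic labels) with at least one LF vote."""
    keep = [i for i, votes in enumerate(L) if any(v != ABSTAIN for v in votes)]
    return [rows[i] for i in keep], [probs[i] for i in keep]

# Hypothetical training comments, label matrix and probabilistic labels [P(HAM), P(SPAM)].
comments = ["subscribe to my channel", "nice", "great song, please check my video too"]
L = [[1, -1], [-1, -1], [1, 0]]
probs = [[0.1, 0.9], [0.5, 0.5], [0.4, 0.6]]

kept_comments, kept_probs = filter_unlabeled(comments, probs, L)
print(kept_comments)
print(kept_probs)
```

Rows on which every LF abstained carry no supervision signal, so dropping them keeps uninformative labels out of the classifier's training data.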

Featurization

For simplicity and speed, we use a simple "bag of n-grams" feature representation: each data point is represented by a one-hot vector marking which words or 2-word combinations are present in the comment text.

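from sklearn.feature_extraction.text import CountVectorizer

# df_train_filtered is assumed to hold the training rows that received at least one
# LF vote. The comment text lives in the CONTENT column, as in the labeling functions.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.CONTENT.tolist())

# Reuse the vocabulary learned on the training split for the other splits
X_dev = vectorizer.transform(dev.CONTENT.tolist())
X_valid = vectorizer.transform(validation.CONTENT.tolist())
X_test = vectorizer.transform(test.CONTENT.tolist())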
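
To make the representation concrete, here is a simplified, dependency-free sketch of a bag of 1-grams and 2-grams. It splits on whitespace only, which is an assumption of this sketch and differs from CountVectorizer's own tokenizer; the helper names and sample sentences are made up.

```python
from collections import Counter

def ngrams(text, n_max=2):
    """Return all 1-grams up to n_max-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

# Hypothetical training comments used to build the vocabulary.
docs = ["check out my channel", "love this song"]
vocab = sorted({g for d in docs for g in ngrams(d)})

def vectorize(text):
    """Count how often each vocabulary n-gram occurs; unseen n-grams are ignored."""
    counts = Counter(ngrams(text))
    return [counts[g] for g in vocab]

vec = vectorize("my channel my channel")
print(vocab)
print(vec)
```

CountVectorizer stores counts by default, as this sketch does; passing `binary=True` gives the strict presence/absence vector described above.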

Keras Classifier with Probabilistic Labels

We'll use Keras, a popular high-level API for building models in TensorFlow, to build a simple logistic regression classifier. We compile it with a categorical_crossentropy loss so that it can handle probabilistic labels instead of integer labels. We use the common settings of an Adam optimizer and early stopping (evaluating the model on the validation set after each epoch and reloading the weights from when it achieved the best score). For more information on Keras, see the Keras documentation.

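from snorkel.analysis import metric_score
from snorkel.utils import preds_to_probs
# These two helpers come from utils.py in the Snorkel Tutorials repository
from utils import get_keras_logreg, get_keras_early_stopping

# Gold labels of the validation and test splits (the last column holds the label)
Y_valid = validation.iloc[:, -1].values
Y_test = test.iloc[:, -1].values

# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])

# probs_train_filtered holds the probabilistic labels for the rows in X_train
keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)

# Turn the predicted class probabilities into hard labels and score them
preds_test = keras_model.predict(x=X_test).argmax(axis=1)
test_acc = metric_score(golds=Y_test, preds=preds_test, metric="accuracy")
print(f"Test Accuracy: {test_acc * 100:.1f}%")
mlflow.log_metric("test_accuracy", test_acc * 100)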
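
The reason for the categorical cross-entropy loss is that it accepts a full probability distribution as the target, not only a hard class. The sketch below shows this in plain Python. `preds_to_onehot` is a hypothetical stand-in for the role preds_to_probs plays above, and the probabilities are made-up values.

```python
import math

def preds_to_onehot(preds, num_classes):
    """Turn integer labels into one-hot rows so they look like probabilistic labels."""
    return [[1.0 if k == p else 0.0 for k in range(num_classes)] for p in preds]

def cross_entropy(target, predicted, eps=1e-12):
    """Categorical cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

predicted = [0.3, 0.7]                      # model output [P(HAM), P(SPAM)]
hard_target = preds_to_onehot([1], 2)[0]    # gold label SPAM as a one-hot row
soft_target = [0.2, 0.8]                    # probabilistic label, 80% SPAM

print(cross_entropy(hard_target, predicted))
print(cross_entropy(soft_target, predicted))
```

With a one-hot target only the true class contributes to the loss; with a soft target every class contributes in proportion to its probability, which is how the uncertainty in the labels reaches the classifier.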

# Save this Keras model with MLflow
import mlflow.keras

mlflow.keras.save_model(keras_model, path="runs")
mlflow.end_run()  # close the active MLflow run, if any

NOTE: This tutorial is adapted from the official Snorkel documentation and tutorials, and parts of it are used directly. See more here