Annotating new data using selected prototypes#

In the original design of the discovery-driven pipeline, prototype selection was intended as the final step following the HMM scan. However, some use cases may require labeling a new dataset using previously selected prototypes, without re-running the HMM scan or re-analyzing all prototypes.

Re-running the HMM scan on new data, even when it includes the original dataset, typically results in a permutation of the prototype labels. This permutation can, in principle, be corrected through automated label alignment, but the process would add complexity and slow down workflows. More importantly, the inclusion of new data can introduce new prototypes if novel behavioral patterns are detected. While this dynamic discovery is a valuable feature of the pipeline, it complicates analyses where the goal is specifically to track known prototypes across additional datasets, for example to study behavior consistency or investigate associated brain circuit dynamics.
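To make the label-alignment step mentioned above concrete, here is a minimal sketch of how two labelings of the same frames could be matched. It is not part of LISBET: the function name `align_labels` is hypothetical, and it assumes both runs produced one integer label per frame, over the same frames, with the same number of states. It tries every permutation, which is only practical for a handful of prototypes; a linear assignment solver would be the usual choice for larger numbers.

```python
from itertools import permutations


def align_labels(reference, relabeled, n_states):
    """Find the relabeling of `relabeled` that best matches `reference`.

    Returns (mapping, agreement), where mapping[new_label] is the
    reference label assigned to it.
    """
    if len(reference) != len(relabeled):
        raise ValueError("Both label sequences must cover the same frames")

    # Count how often each (new_label, reference_label) pair co-occurs
    overlap = [[0] * n_states for _ in range(n_states)]
    for ref, new in zip(reference, relabeled):
        overlap[new][ref] += 1

    # Try every one-to-one mapping and keep the one with the most matches
    best_mapping, best_matches = None, -1
    for perm in permutations(range(n_states)):
        matches = sum(overlap[new][ref] for new, ref in enumerate(perm))
        if matches > best_matches:
            best_mapping, best_matches = perm, matches

    return list(best_mapping), best_matches / len(reference)


# Toy example: the second run finds the same segments with shuffled labels
first_run = [0, 0, 1, 1, 2, 2, 2, 0]
second_run = [2, 2, 0, 0, 1, 1, 1, 2]
mapping, agreement = align_labels(first_run, second_run, n_states=3)
aligned = [mapping[label] for label in second_run]
```

Even in this simple form, the alignment only works when both runs contain the same set of states. That requirement no longer holds once new data introduces additional prototypes.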

To address this need, we have implemented two methods for annotating new data using the prototypes selected in the original dataset. The first method is to train a LISBET classifier on the selected prototypes and then use it to label new data. The second method uses cached HMMs to annotate new data, but it is not recommended for most users due to its complexity and potential safety issues.

It is worth noting that the cached HMMs method reproduces the original prototype labels exactly, whereas the classifier approach only approximates them. In our experience, the advantages of the classifier approach outweigh this drawback, as it is less error-prone and yields a reusable model that can be shared for future use.

Recommended Approach: Train a LISBET Classifier on Prototypes#

The recommended and most robust way to annotate new data with previously selected prototypes is to train a LISBET classifier using the prototype labels as ground truth. This approach is simple, reproducible, and produces a reusable model.

Prepare a Labeled Dataset with Prototype Annotations#

After running prototype selection, you will have CSV files with prototype labels for each sequence. You need to convert these into a dataset format suitable for LISBET training (e.g., a directory structure with tracking and annotations).

Example Python snippet to patch the CalMS21 dataset (Sun et al. 2021) with prototype labels. Please adapt it to your dataset:

import numpy as np
import pandas as pd
import xarray as xr

from lisbet.datasets import dump_records, load_records


def extract_labels(csv_path):
    df = pd.read_csv(csv_path, index_col=0)

    # Rows that already have at least one positive label
    covered = df.eq(1).any(axis=1)

    # Create / update the fallback class
    df["Other"] = (~covered).astype(int)

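    # Mark the first 1 in every row: the cumulative count equals 1 from the
    # first positive label until the next one
    first_mask = df.eq(1).cumsum(axis=1).eq(1)

    # Keep only the first positive label per row and return integer labels
    df = (df.eq(1) & first_mask).astype(int)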

    return df.values


def patch_dataset():
    records = load_records(
        data_format="movement",
        data_path="datasets/CalMS21/task1_classic_classification",
        data_scale="1024x570",
        data_filter="train",
    )["main_records"]

    patched_records = []
    for key, data in records:
        posetracks = data["posetracks"].unstack("features")

        labels = extract_labels(f"prototypes/{key}/machineAnnotation_hmmbest_6_32.csv")

        assert labels.shape[0] == posetracks.sizes["time"]

        # Convert to xarray Dataset
        annotations = xr.Dataset(
            data_vars=dict(
                label=(
                    ["time", "behaviors", "annotators"],
                    labels[..., np.newaxis],
                )
            ),
            coords=dict(
                time=posetracks.time,
                behaviors=[f"motif_{motif_id}" for motif_id in range(labels.shape[1])],
                annotators=["LISBET"],
            ),
            attrs=dict(
                source_software=posetracks.source_software,
                ds_type="annotations",
                fps=posetracks.fps,
                time_unit=posetracks.time_unit,
            ),
        )

        patched_record = (
            key,
            {"posetracks": posetracks, "annotations": annotations},
        )

        patched_records.append(patched_record)

    dump_records("datasets/proto_CalMS21", patched_records)


if __name__ == "__main__":
    patch_dataset()

This creates a new dataset in datasets/proto_CalMS21 with the prototype labels stored as annotations.
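To see how uncovered frames and overlapping prototypes are handled when the labels are extracted, the following sketch applies the same steps to a small in-memory table. The column names and values are made up for illustration; real annotation files may be laid out differently.

```python
import pandas as pd

# Hypothetical per-frame prototype annotations (one column per prototype)
df = pd.DataFrame(
    {"proto_a": [1, 0, 1, 0], "proto_b": [0, 0, 1, 1]},
    index=pd.RangeIndex(4, name="frame"),
)

# Frames with no active prototype fall back to the "Other" class
covered = df.eq(1).any(axis=1)
df["Other"] = (~covered).astype(int)

# True from the first 1 in a row up to (not including) the second 1
first_mask = df.eq(1).cumsum(axis=1).eq(1)

# Keep a 1 only where it is the first one in its row
one_hot = (df.eq(1) & first_mask).astype(int)

print(one_hot)
```

In this example the third frame, where both prototypes are active, keeps only `proto_a`, and the second frame, where neither is active, is assigned to `Other`. Every row ends up with exactly one positive label, which is what a single-label classifier expects.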

Train a Classifier on the Prototype Labels#

Use the LISBET CLI to train a classifier on your new labeled dataset. For example:

$ betman train_model \
    --run_id=proto_classifier \
    --data_format=movement \
    --data_scale="1x1" \
    --data_filter=train \
    --learning_rate=1e-4 \
    --epochs=10 \
    --load_backbone_weights=models/lisbet32x4-calms21U-embedder/weights/weights_last.pt \
    --freeze_backbone_weights \
    --save_history \
    -v \
    datasets/proto_CalMS21

Use --freeze_backbone_weights to ensure the classifier matches the embedding model used for prototype discovery.
Adjust --data_format and paths as needed for your dataset.

Annotate New Data Using the Trained Classifier#

Once trained, use the classifier to annotate new datasets:

$ betman annotate_behavior \
    --data_format=movement \
    --data_scale="1024x570" \
    --data_filter=test \
    -v \
    datasets/CalMS21/task1_classic_classification \
    models/proto_classifier/model_config.yml \
    models/proto_classifier/weights/weights_last.pt

The output will be CSV files with predicted prototype labels for each frame.

Note

This approach provides an approximation of the original prototype labels.
Overlapping prototypes are currently resolved by assigning the first label; multi-label support is planned.
Always ensure your new data matches the keypoint configuration expected by the model (see Data Preparation).
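Because the classifier only approximates the HMM labels, it can be useful to quantify the agreement on sequences where both are available. The sketch below is not a LISBET utility. It assumes both annotations have already been reduced to one label per frame (the function name and label strings are hypothetical) and reports the overall agreement plus the fraction of each prototype's frames that the classifier recovered.

```python
from collections import Counter


def label_agreement(hmm_labels, predicted_labels):
    """Compare per-frame labels from the HMM with classifier predictions."""
    if len(hmm_labels) != len(predicted_labels):
        raise ValueError("Label sequences must have the same number of frames")

    total = Counter(hmm_labels)
    correct = Counter(
        ref for ref, pred in zip(hmm_labels, predicted_labels) if ref == pred
    )

    overall = sum(correct.values()) / len(hmm_labels)
    # Fraction of each prototype's frames that the classifier recovered
    recall = {label: correct[label] / count for label, count in total.items()}
    return overall, recall


hmm = ["motif_0", "motif_0", "motif_1", "motif_1", "Other", "Other"]
pred = ["motif_0", "motif_1", "motif_1", "motif_1", "Other", "motif_0"]
overall, recall = label_agreement(hmm, pred)
```

A low recall for a specific prototype suggests that the classifier has trouble separating it from the others, which is worth checking before relying on its labels in downstream analyses.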

Alternative: Using Cached HMMs#

For advanced users, LISBET can use cached HMMs to annotate new data. This method is not recommended for most users because it is more complex and because loading pickle-based files carries potential safety issues.

If you wish to proceed:

Ensure you have the original HMM .joblib files saved from the initial scan.

Run:

$ betman segment_motifs \
    --pretrained_path=PATH_TO_HMM_MODELS \
    --output_path=NEW_OUTPUT_PATH \
    datasets/NewDataset

You can then extract the relevant prototype columns from the output annotation files.
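As a rough illustration of that last step, the sketch below selects a subset of columns from an annotation table with pandas. The file layout (a frame index followed by one column per HMM state) and the column names are assumptions made for the example; check them against the files actually written by segment_motifs.

```python
import io

import pandas as pd

# Stand-in for an annotation CSV; in practice pass the file path to read_csv
csv_text = """frame,state_0,state_1,state_2,state_3
0,1,0,0,0
1,0,0,1,0
2,0,0,0,1
"""

# Hypothetical names of the columns that correspond to selected prototypes
selected = ["state_0", "state_3"]

annotations = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Fail early if a selected prototype is missing from the file
missing = [name for name in selected if name not in annotations.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

prototypes = annotations[selected]
```

Frames where none of the selected columns is active (the second row here) belong to states that were not chosen as prototypes.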

Warning

Loading pickle/joblib files can be unsafe if the source is untrusted. Only use this method with files you generated yourself. DO NOT LOAD PICKLE FILES FROM UNTRUSTED SOURCES.
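One way to confirm that cached files are still the ones you generated is to record a checksum when they are first written and verify it before reuse. This is a generic sketch using only the Python standard library, not a LISBET feature; the helper names and the file name are hypothetical. A checksum only confirms that a file is unchanged. It does not make a pickle from an untrusted source safe to load.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_unchanged(path, expected_digest):
    """Raise if the file no longer matches the digest recorded earlier."""
    actual = sha256_of(path)
    if actual != expected_digest:
        raise ValueError(f"{path} has changed since its digest was recorded")
    return True


# Demonstration with a temporary stand-in file
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "hmm_example.joblib"
    model_file.write_bytes(b"placeholder model bytes")

    recorded = sha256_of(model_file)  # store this when the scan finishes
    unchanged = verify_unchanged(model_file, recorded)

    # Any later modification is detected before the file would be loaded
    model_file.write_bytes(b"modified bytes")
    try:
        verify_unchanged(model_file, recorded)
        tampered_detected = False
    except ValueError:
        tampered_detected = True
```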

References#

Sun, J. J., Karigo, T., Chakraborty, D., Mohanty, S. P., Wild, B., Sun, Q., Chen, C., Anderson, D. J., Perona, P., Yue, Y., & Kennedy, A. (2021). The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions (arXiv:2104.02710). arXiv. https://doi.org/10.48550/arXiv.2104.02710

Chindemi, G., Girard, B., & Bellone, C. (2023). LISBET: a machine learning model for the automatic segmentation of social behavior motifs (arXiv:2311.04069). arXiv. https://doi.org/10.48550/arXiv.2311.04069