Custom ML model for BPMN search using Spektral and Keras.

Overview

This project aims to create embeddings for BPMN files to facilitate tasks like search, similarity, and clustering. The embeddings capture both the semantics and structure of BPMN files, allowing for effective retrieval and comparison of similar BPMN diagrams.

Important Note

The current model uses embeddings created by the paraphrase-multilingual-MiniLM-L12-v2 sentence-transformer, which has an embedding dimension of 384. Using a different sentence-transformer will result in unexpected behavior. Make sure to use the matching sentence-transformer with the kerasEmbedder, and adjust the 'dims' parameter in mu-search accordingly.
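A mismatch in embedding dimension is easy to catch early. The following is a minimal sketch of such a guard, not part of the project's code: the function name check_embedding_dim and the constant EXPECTED_DIM are hypothetical, and only the value 384 comes from the note above.

```python
import numpy as np

# Dimension stated in this README for paraphrase-multilingual-MiniLM-L12-v2.
EXPECTED_DIM = 384


def check_embedding_dim(vectors, expected_dim=EXPECTED_DIM):
    """Return the vectors as a 2-D float32 array, or raise if the width is wrong.

    Hypothetical helper: it only validates shapes and does not call any model.
    """
    arr = np.asarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        # A single vector is treated as a batch of one.
        arr = arr[None, :]
    if arr.ndim != 2 or arr.shape[1] != expected_dim:
        raise ValueError(
            f"expected embeddings of shape (n, {expected_dim}), got {arr.shape}"
        )
    return arr
```

Assuming the sentence-transformers package is installed, the output of SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode([...]) could be passed through this check before it reaches the kerasEmbedder or mu-search.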

Motivation

The goal is to:

Capture the semantics of BPMN files, so that semantically similar BPMN files receive similar embeddings.
Capture the structure of BPMN files, so that structurally similar BPMN files receive similar embeddings.
Enable semantic search, so that BPMN files can be retrieved with keywords, key phrases, or natural language queries.

Design Choices:

Preprocessing BPMN Files: Adjust inputs to fit the fixed input size of the embedding models, or handle large inputs by splitting them into smaller parts.
Encoding Structure: Use graph embedding techniques (e.g., GNNs, GCNs) to encode the structure of BPMN diagrams.
Graph Representation: Convert BPMN diagrams into graph representations using NetworkX.
Node and Edge Information: Extract labels and documentation fields from nodes and edges, converting them into numerical vectors using pre-trained embeddings or custom-trained embeddings.
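To make the graph representation concrete, here is a minimal sketch of turning BPMN 2.0 XML into a NetworkX graph with the label and documentation text attached to nodes and edges. It is an illustration under assumptions, not the project's parser: the function names are hypothetical, only direct children of a process element are read (nested sub-processes are ignored), and every such child with an id is treated as a node.

```python
import xml.etree.ElementTree as ET

import networkx as nx

# Namespace of the BPMN 2.0 model elements.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
TAG_PREFIX = "{" + BPMN_NS + "}"


def _text_of(element):
    """Join the element's name attribute with its documentation children."""
    parts = [element.get("name", "")]
    parts += [(doc.text or "") for doc in element.findall(TAG_PREFIX + "documentation")]
    return " ".join(p.strip() for p in parts if p and p.strip())


def bpmn_to_graph(xml_string):
    """Hypothetical helper: build a directed graph from a BPMN XML string."""
    root = ET.fromstring(xml_string)
    graph = nx.DiGraph()
    flows = []
    for process in root.iter(TAG_PREFIX + "process"):
        for element in process:
            if not element.tag.startswith(TAG_PREFIX):
                continue  # skip vendor extensions outside the BPMN namespace
            kind = element.tag[len(TAG_PREFIX):]
            if kind == "sequenceFlow":
                flows.append(element)
            elif element.get("id"):
                # Simplification: any other child with an id becomes a node.
                graph.add_node(element.get("id"), kind=kind, text=_text_of(element))
    # Add edges last so that both endpoints are already known.
    for flow in flows:
        source, target = flow.get("sourceRef"), flow.get("targetRef")
        if source in graph and target in graph:
            graph.add_edge(source, target, text=_text_of(flow))
    return graph
```

The text attribute on each node and edge is what would then be passed to the sentence-transformer to obtain the numerical vectors.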

Current Approach:

Convert BPMN Files to Graphs: Use NetworkX to represent BPMN files as graphs.
Node and Edge Embeddings: Use pre-trained embeddings to create vector representations of nodes and edges.
Graph Embedding: Use these embeddings as features for a GNN or GCN model (e.g., Spektral) to create a single embedding for each BPMN file.
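The graph embedding step can be illustrated with plain NumPy. The sketch below shows a single graph-convolution layer followed by mean pooling, which is one common way to reduce per-node features to one vector per graph. It is not the project's Spektral/Keras model: the function names are hypothetical, the weights are untrained placeholders, and the directed BPMN graph is symmetrized as a simplifying assumption.

```python
import numpy as np


def normalized_adjacency(adjacency):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric adjacency matrix A."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    degree = a_hat.sum(axis=1)  # at least 1 per node because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, features, weights):
    """One graph-convolution step: ReLU(A_norm @ X @ W)."""
    return np.maximum(a_norm @ features @ weights, 0.0)


def embed_graph(adjacency, features, weights):
    """Hypothetical helper: one GCN layer, then mean pooling over all nodes.

    adjacency: (n, n) array, features: (n, d_in), weights: (d_in, d_out).
    Returns a single (d_out,) vector for the whole graph.
    """
    # Assumption: edge direction is dropped so the normalization is symmetric.
    symmetric = np.maximum(adjacency, adjacency.T)
    hidden = gcn_layer(normalized_adjacency(symmetric), features, weights)
    return hidden.mean(axis=0)
```

Because the pooling averages over nodes, the resulting vector does not depend on the order in which nodes are listed, which is a useful property when comparing BPMN files.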

Search Model Specifics:

Query Handling: Accepts keywords, key phrases, or natural language queries.
Similarity Calculation: Uses precomputed embeddings and cosine similarity to rank BPMN files based on query similarity.
Efficiency: Designed to handle large volumes of BPMN files and queries efficiently.
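The ranking step can be sketched as follows, assuming the BPMN embeddings are already available as rows of a matrix and the query has been embedded into the same vector space. The function name rank_by_cosine and the top_k parameter are hypothetical and chosen for illustration only.

```python
import numpy as np


def rank_by_cosine(query_vec, doc_matrix, top_k=5):
    """Return (row_index, cosine_similarity) pairs, best match first.

    query_vec: (d,) query embedding.
    doc_matrix: (n, d) precomputed embeddings, one row per BPMN file.
    """
    query = np.asarray(query_vec, dtype=np.float64)
    docs = np.asarray(doc_matrix, dtype=np.float64)
    query_norm = np.linalg.norm(query)
    doc_norms = np.linalg.norm(docs, axis=1)
    # Guard against division by zero for all-zero vectors.
    denom = np.maximum(doc_norms * query_norm, 1e-12)
    scores = docs @ query / denom
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

If the stored embeddings are normalized to unit length once, when they are computed, the per-query work reduces to a single matrix-vector product.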

Shortcomings:

Data Availability: Lack of sufficient BPMN files and real user queries for training and validation.

Future Steps:

Gather more BPMN files and user queries.
Train custom text embeddings for nodes and edges (e.g., using robbert-2023-dutch-base-abb).
Validate and refine the current model with new data.
Potentially merge the graph and text models into a unified architecture.

Suggestions:

Data Collection: Store search results and user interactions anonymously to gather diverse query data.
User Interaction Analysis: Use interaction data to train models for better search result ranking.
Integration with mu-search: Store all queries made to mu-search to gain an understanding of how users interact with OPH, Lokaal Beslist, ...
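One possible shape for such a query log is an append-only JSON Lines file that records the query, the returned results, and the clicked result, without any user identifier. This is a sketch under assumptions, not an existing mu-search feature: the function name and field names are hypothetical, and query text can itself contain personal data, so it may need review before being stored.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_search_event(log_path, query, result_ids, clicked_id=None):
    """Hypothetical helper: append one search interaction as a JSON line.

    No user or session identifier is stored in the record.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "result_ids": list(result_ids),
        "clicked_id": clicked_id,
    }
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Records in this form pair each query with the result that was clicked, which is the kind of signal a ranking model could later be trained on.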

Requirements for future steps:

A large and varied dataset of BPMN files to ensure the model generalizes well.
Real user queries to validate and improve the model's effectiveness.