---
license: apache-2.0
language:
- fi
- sv
- en
---

# Model Card for NatLibFi/Qwen2-0.5B-Instruct-FinGreyLit-GGUF

This is an LLM that is fine-tuned for the specific task of extracting metadata from grey literature PDF documents. It is based on Qwen2-0.5B-Instruct, a relatively small (0.5B) model that is suitable for running locally on a CPU. The model has been quantized and is provided as a single 400MB GGUF file that can be run using [llama.cpp](https://github.com/ggerganov/llama.cpp).

## Model Details

### Model Description

- **Developed by:** National Library of Finland
- **Model type:** Transformer large language model
- **Language(s) (NLP):** Finnish, Swedish, English (others may work but have not been extensively tested)
- **License:** Apache 2.0
- **Finetuned from model:** https://huggingface.co/Qwen/Qwen2-0.5B-Instruct

### Model Sources

- **Repository:** https://github.com/NatLibFi/FinGreyLit
- **Demo notebook:** https://github.com/NatLibFi/FinGreyLit/blob/main/experiments/llm-inference-api/Extract-metadata-LLM-API.ipynb

## Uses

The model is intended for the single purpose of metadata extraction, in particular the development of metadata extraction tools.

### Direct Use

The model supports a ChatML template and expects the following "conversation":

```python
"messages": [
  {"role": "system", "content": "You are a skilled librarian specialized in meticulous cataloguing of digital documents."},
  {"role": "user", "content": "Extract metadata from this document. Return as JSON." + "\n\n" + doc_json},
]
```

where `doc_json` is a JSON document that contains text and embedded metadata extracted from a PDF file, like this:

```json
{"pdfinfo":{"title":"AI coffee presentation: Extracting metadata using LLMs"},"pages":[{"page":1,"text":"Extracting metadata from grey literature using large language models\nOsma Suominen\n2023-11-01"},{"page":2,"text":"Grey literature? reports working papers government documents white papers preprints theses … semi-formal non-commercial\nPDFs published on the web – lots of them!"},{"page":6,"text":"First 5 pages of text from PDF"},{"page":7,"text":"Example of LLM extracted metadata\nDiff view: human vs. LLM generated"},{"page":8,"text":"What we found out"}]}
```

The response should be a JSON document that looks something like this:

```json
{"language": "eng", "title": "Extracting metadata using large language models", "creator": ["Suominen, Osma"], "year": "2023", "publisher": ["Yrkesh\u00f6gskolan Novia"], "type_coar": "research article"}
```

Note that the extracted publisher is incorrect in the above example, and the `type_coar` classification is also wrong. The full [metadata schema](https://github.com/NatLibFi/FinGreyLit/blob/main/schema.md) of the response is documented in the FinGreyLit repository.

### Out-of-Scope Use

All uses other than metadata extraction are out of scope, although the model seems to retain some of the basic chat capability of the base model.

## Bias, Risks, and Limitations

The model quite often produces inaccurate results. Extracted titles for non-English documents in particular are frequently wrong. The output may not be a valid JSON document, and even when it is, it does not necessarily follow the intended schema.

### Recommendations

The model is intended for local development of metadata extraction tools, not for any kind of production use. It can be more efficient for developers to work with a small, locally running LLM than with a larger, better-quality LLM that is slower to run and/or needs a GPU. Using JSON mode, i.e. setting `"response_format": {"type": "json_object"}`, is recommended when using this model with llama.cpp; this forces the model to output valid JSON.

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

See the [demo notebook](https://github.com/NatLibFi/FinGreyLit/blob/main/experiments/llm-inference-api/Extract-metadata-LLM-API.ipynb).
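As a minimal local sketch (separate from the demo notebook), the model can also be queried through the OpenAI-compatible API of a locally running llama.cpp server using the `openai` Python client. The server command, port, model name and the shortened `doc_json` below are illustrative and may need adjusting to your setup:

```python
# Minimal sketch, assuming a llama.cpp server is running locally, e.g.:
#   llama-server -m <path-to-the-GGUF-file-from-this-repo> --port 8080
# (binary name and options may differ depending on your llama.cpp version)
import json
from openai import OpenAI

# The local server does not require a real API key by default
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# Text and embedded metadata extracted from a PDF, in the format shown under "Direct Use"
# (shortened here for illustration)
doc_json = json.dumps({
    "pdfinfo": {"title": "AI coffee presentation: Extracting metadata using LLMs"},
    "pages": [
        {"page": 1, "text": "Extracting metadata from grey literature using large language models\nOsma Suominen\n2023-11-01"},
    ],
})

response = client.chat.completions.create(
    model="local",  # the llama.cpp server serves a single model; the name is not significant
    messages=[
        {"role": "system", "content": "You are a skilled librarian specialized in meticulous cataloguing of digital documents."},
        {"role": "user", "content": "Extract metadata from this document. Return as JSON.\n\n" + doc_json},
    ],
    response_format={"type": "json_object"},  # JSON mode, as recommended above
    temperature=0.0,
)

metadata = json.loads(response.choices[0].message.content)
print(metadata)  # e.g. {"language": "eng", "title": "...", "creator": ["..."], ...}
```

Note that JSON mode only constrains decoding so that the response parses as JSON; it does not guarantee that the response follows the metadata schema.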
## Training Details

### Training Data

The model was fine-tuned on 620 training documents from the [FinGreyLit](https://github.com/NatLibFi/FinGreyLit) data set.

### Training Procedure

The model was fine-tuned in the [University of Helsinki HPC environment](https://helpdesk.it.helsinki.fi/en/services/scientific-computing-services-hpc) on a single A100 GPU using the Axolotl tool and the LoRA method. See the [notebook used for training](https://github.com/NatLibFi/FinGreyLit/blob/main/experiments/axolotl-finetune-llm/Axolotl-fine-tune-Qwen2-0.5B.ipynb) for details such as hyperparameters.

## Evaluation

The model has been evaluated using an evaluation methodology developed specifically for the metadata extraction task. The methodology is still being refined; the current [evaluation code](https://github.com/NatLibFi/FinGreyLit/blob/main/experiments/eval.py) is available. The overall score for this model is 0.85. Full [field-by-field results](https://github.com/NatLibFi/FinGreyLit/blob/main/experiments/results-axolotl-Qwen2-0_5B-Instruct.md) can be found in the FinGreyLit repository.

## Model Card Contact

Please use the [FinGreyLit](https://github.com/NatLibFi/FinGreyLit) GitHub repository.