Instructions for using normalcomputing/extended-mind-llama-2-7b with libraries, inference providers, notebooks, and local apps.

Libraries

How to use normalcomputing/extended-mind-llama-2-7b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="normalcomputing/extended-mind-llama-2-7b", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-llama-2-7b", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use normalcomputing/extended-mind-llama-2-7b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "normalcomputing/extended-mind-llama-2-7b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "normalcomputing/extended-mind-llama-2-7b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/normalcomputing/extended-mind-llama-2-7b

SGLang

How to use normalcomputing/extended-mind-llama-2-7b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "normalcomputing/extended-mind-llama-2-7b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "normalcomputing/extended-mind-llama-2-7b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "normalcomputing/extended-mind-llama-2-7b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "normalcomputing/extended-mind-llama-2-7b",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Model Card for Extended-Mind-Llama-2-7b

Github: https://github.com/normal-computing/extended-mind-transformers/
ArXiv: https://arxiv.org/abs/2406.02332

Original architecture and code by Meta.

Developed by: Normal Computing, adapted from Meta
License: Apache 2.0

This model is part of the Extended Mind Transformers collection and implements the methods described in our paper. It retrieves and attends to an external cache of key-value pairs (or memories). It has not been finetuned: the original model weights are unchanged.

Model Usage

External Memory

To give the model external memories, pass their token ids at instantiation, as the examples below illustrate. The memories are generated and cached internally during the first model.generate() call. To replace them later, clear the existing memories and assign the new token ids:

model.clear_memories()
model.memory_ids = list_of_new_token_ids
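The two commands above can be wrapped in a small helper that also tokenizes the new text. This is a minimal sketch: `replace_memories` is a hypothetical name, not part of the repository, and it assumes only what the card states, namely that the model exposes `clear_memories()` and a `memory_ids` attribute, and that the tokenizer returns an object with `input_ids`.

```python
def replace_memories(model, tokenizer, text):
    """Swap the model's external memories for the tokens of `text`.

    Hypothetical helper: relies only on `model.clear_memories()` and
    `model.memory_ids`, the two names shown in the card.
    """
    # Tokenize without return_tensors so we get a plain list of token ids.
    new_ids = tokenizer(text).input_ids
    # Drop the cached key-value memories before assigning new token ids.
    model.clear_memories()
    model.memory_ids = new_ids
    return new_ids
```

Per the description above, the key-value cache for the new memories would then be built during the next model.generate() call.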

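Set trust_remote_code=True when loading the model, since it relies on custom modeling code hosted in the repository; this also avoids the related warnings. Pass the memories to the model as a list of token ids.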

from transformers import AutoModelForCausalLM, AutoTokenizer

ag_wiki_entry = """Alexander Grothendieck (/ˈɡroʊtəndiːk/; German pronunciation: [ˌalɛˈksandɐ ˈɡʁoːtn̩ˌdiːk] (listen); French: [ɡʁɔtɛndik]; 28 March 1928 – 13 November 2014) was a stateless (and then, since 1971, French) mathematician who became the leading figure in the creation of modern algebraic geometry.[7][8] His research extended the scope of the field and added elements of commutative algebra, homological algebra, sheaf theory, and category theory to its foundations, while his so-called "relative" perspective led to revolutionary advances in many areas of pure mathematics.[7][9] He is considered by many to be the greatest mathematician of the twentieth century.[10][11]"""

tokenizer_hf = AutoTokenizer.from_pretrained("normalcomputing/extended-mind-llama-2-7b")
memories = tokenizer_hf(ag_wiki_entry).input_ids

model_hf = AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-llama-2-7b", external_memories=memories, trust_remote_code=True)

After this, you can generate text with the model as usual. The model will automatically use the memories during generation. You can update any config parameters (we set topk below) by passing new values to the model.generate() method.

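prompt = "When did Alexander Grothendieck become a French citizen?"
inputs = tokenizer_hf(prompt, return_tensors="pt").input_ids

# return_dict_in_generate=True ensures the output exposes a `sequences` field
outputs_hf = model_hf.generate(inputs, max_length=40, topk=2, return_dict_in_generate=True)
print(tokenizer_hf.decode(outputs_hf["sequences"][0], skip_special_tokens=True))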

Citations

Set output_retrieved_memory_idx=True in the model.generate() call to return the indices of the memories retrieved during generation. The demo notebook walks through an example.
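Once the retrieved indices are available, they can be mapped back to the memory tokens to show which passages were used. The sketch below is illustrative only. The exact structure returned by the model is not described here, so it assumes the indices have already been flattened into plain integer positions into the memory token list. `retrieved_spans` is a hypothetical helper, not part of the repository.

```python
def retrieved_spans(memory_ids, retrieved_idx):
    """Group retrieved memory positions into contiguous spans.

    Assumption: `retrieved_idx` is a flat iterable of integer positions
    into `memory_ids`. Returns a list of (start, end, token_ids) tuples
    with `end` exclusive.
    """
    # Deduplicate, discard out-of-range positions, and sort.
    positions = sorted(
        {int(i) for i in retrieved_idx if 0 <= int(i) < len(memory_ids)}
    )
    spans = []
    for pos in positions:
        if spans and pos == spans[-1][1]:
            # Position directly follows the current span: extend it.
            spans[-1] = (spans[-1][0], pos + 1)
        else:
            spans.append((pos, pos + 1))
    return [(start, end, memory_ids[start:end]) for start, end in spans]
```

Each span's token ids can then be passed to the tokenizer's decode() method to recover the cited text.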

Additional configuration

The model has several other configuration parameters:

memory_type (string, optional, defaults to manual): Whether to store external memories manually or in a vector database.
mask_by_sim (bool, optional, defaults to True): Whether or not to mask retrieved memories by similarity.
sim_threshold (float, optional, defaults to 0.25): Threshold for masking retrieved memories.
tokenizer_all_special_ids (list, optional, defaults to [0, 50278]): Ids for special tokens to remove from memories.
remove_special_tokens (bool, optional, defaults to True): Remove memories that correspond to tokenizer special ids.
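To make the interaction between topk, mask_by_sim, and sim_threshold concrete, here is a minimal pure-Python sketch of top-k retrieval followed by similarity masking. It is an illustration under assumptions, not the repository's implementation: it assumes cosine similarity between a query vector and each memory key, and it treats the threshold as inclusive. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, memory_keys, topk=2, mask_by_sim=True, sim_threshold=0.25):
    """Return indices of the top-k most similar memory keys.

    With mask_by_sim=True, retrieved memories whose similarity falls
    below sim_threshold are dropped (assumed inclusive comparison).
    """
    scored = [(cosine(query, key), idx) for idx, key in enumerate(memory_keys)]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:topk]
    if mask_by_sim:
        top = [(sim, idx) for sim, idx in top if sim >= sim_threshold]
    return [idx for _, idx in top]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(retrieve([1.0, 0.0], keys, topk=3))                     # masking on
print(retrieve([1.0, 0.0], keys, topk=3, mask_by_sim=False))  # masking off
```

With masking on, the orthogonal key at index 1 (similarity 0) is filtered out even though it is among the top three; with masking off it is kept.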

Additionally, the stride used to compute the memory representations can be set within the generate_cache() method. Smaller strides generate higher-quality representations, while larger strides require fewer computations.
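The stride trade-off can be sketched with a generic sliding-window schedule. This is an assumption-laden illustration, not the internals of generate_cache(): `cache_windows` and the `max_len` parameter (the longest window processed in one forward pass) are hypothetical names. Each window covers up to `max_len` tokens, and only the tokens not seen in the previous window contribute new representations.

```python
def cache_windows(seq_len, max_len, stride):
    """List (begin, end, n_new) windows for a strided pass over memory tokens.

    Hypothetical schedule: each forward pass sees tokens [begin, end) and
    contributes representations for the last `n_new` of them.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must satisfy 0 < stride <= max_len")
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows


# Smaller stride: more passes, each new token sees more preceding context.
print(len(cache_windows(seq_len=10, max_len=4, stride=2)))
# Larger stride: fewer passes, less context for tokens at a window's start.
print(len(cache_windows(seq_len=10, max_len=4, stride=4)))
```

In this schedule a stride equal to `max_len` means the first new token of each window is encoded with no preceding context, which is one way to read the quality versus cost trade-off described above.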

Limitations

This model is part of ongoing research at Normal Computing.

Paper for normalcomputing/extended-mind-llama-2-7b

Extended Mind Transformers

Paper • 2406.02332 • Published Jun 4, 2024 • 1