macadeliccc (Tim Dolan)

updated a dataset 20 days ago

macadeliccc/US-SupremeCourt-RAG

Viewer • Updated 20 days ago • 1.1k • 24

liked a model 20 days ago

microsoft/OmniParser

Image-Text-to-Text • Updated 22 days ago • 11.1k • 1.34k

liked 4 models about 1 month ago

liked a model about 2 months ago

vidore/colqwen2-v0.1

Updated 15 days ago • 61.6k • 139

updated 3 models about 2 months ago

macadeliccc/magistrate-3.2-3b-it-GGUF

Text Generation • Updated Oct 1 • 107 • 1

macadeliccc/magistrate-3.2-3b-it

Text Generation • Updated Oct 1 • 353

macadeliccc/magistrate-3.2-3b-base

Text Generation • Updated Oct 1 • 37 • 1

New activity in macadeliccc/US-SupremeCourtVerdicts about 2 months ago

Choosing the Category for a Supreme Court Case

1

#2 opened about 2 months ago by

itsjjpowell

liked 2 models 2 months ago

stepfun-ai/GOT-OCR2_0

Image-Text-to-Text • Updated Sep 18 • 588k • 1.21k

arcee-ai/Llama-3.1-SuperNova-Lite

Text Generation • Updated Oct 2 • 9.98k • 173

liked a dataset 2 months ago

HFforLegal/laws

Viewer • Updated Sep 13 • 163k • 147 • 10

posted an update 3 months ago

Post

949

My tool calling playgrounds repo has been updated again to include the use of flux1-schnell or dev image generation. This functionality is similar to using Dall-E 3 via the @ decorator in ChatGPT. Once the function is selected, the model will either extract or improve your prompt (depending on how you ask).

I have also included 2 notebooks that cover different ways to access Flux for your specific use case. The first method covers how to access flux via LitServe from Lightning AI. LitServe is a bare-bones inference engine with a focus on modularity rather than raw performance. LitServe supports text generation models as well as image generation, which is great for some use cases, but does not provide the caching mechanisms from a dedicated image generation solution.

Since dedicated caching mechanisms are so crucial to performance, I also included an example for how to integrate SwarmUI/ComfyUI to utilize a more dedicated infrastructure that may already be running as part of your tech stack. Resulting in a Llama-3.1 capable of utilizing specific ComfyUI JSON configs, and many different settings.

Lastly, I tested the response times for each over a small batch request to simulate a speed test.

It becomes clear quickly how efficient caching mechanisms can greatly reduce the generation time, even in a scenario where another model is called. An average 4.5 second response time is not bad at all when you consider that an 8B model is calling a 12B parameter model for a secondary generation.

Repo: https://github.com/tdolan21/tool-calling-playground
LitServe: https://github.com/Lightning-AI/LitServe
SwarmUI: https://github.com/mcmonkeyprojects/SwarmUI

liked a dataset 3 months ago

NousResearch/hermes-function-calling-v1

Viewer • Updated Aug 30 • 11.6k • 614 • 218

posted an update 3 months ago

Post

1596

Automated web scraping with playwright is becoming easier by the day. Now, using ollama tool calling, its possible to perform very high accuracy web scraping (in some cases 100% accurate) through just asking an LLM to scrape the content for you.

This can be completed in a multistep process similar to cohere's platform. If you have tried the cohere playground with web scraping, this will feel very similar. In my experience, the Llama 3.1 version is much better due to the larger context window. Both tools are great, but the difference is the ollama + playwright version is completely controlled by you.

All you need to do is wrap your scraper in a function:

async def query_web_scraper(url: str) -> dict:
    scraper = WebScraper(headless=False)
    return await scraper.query_page_content(url)

and then make your request:

# First API call: Send the query and function description to the model
response = ollama.chat(
    model=model,
    messages=messages,
    tools=[
        {
            'type': 'function',
            'function': {
                'name': 'query_web_scraper',
                'description': 'Scrapes the content of a web page and returns the structured JSON object with titles, articles, and associated links.',
                'parameters': {
                    'type': 'object',
                    'properties': {
                        'url': {
                            'type': 'string',
                            'description': 'The URL of the web page to scrape.',
                        },
                    },
                    'required': ['url'],
                },
            },
        },
    ]
)

To learn more:
Github w/ Playground: https://github.com/tdolan21/tool-calling-playground/blob/main/notebooks/ollama-playwright-web-scraping.ipynb
Complete Guide: https://medium.com/@tdolan21/building-an-llm-powered-web-scraper-with-ollama-and-playwright-6274d5d938b5

liked a model 3 months ago

nvidia/Minitron-4B-Base

Updated Aug 22 • 15 • 127

posted an update 3 months ago

Post

1272

Save money on your compute bill by using LMCache to share prefix KV between 2 different vllm instances. By deploying LMCache backend along with your vLLM containers, you can share a prefix KV Cache between 2 different containers and models. It is very simple to implement into your existing stack.

Step 1: Pull docker images

docker pull apostacyh/vllm:lmcache-0.1.0

Step 2: Start vLLM + LMCache

model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=0"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_TOKEN=<Your huggingface access token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.6 --port 8000 \
    --lmcache-config-file /lmcache/LMCache/examples/example-local.yaml

You can add another vLLM instance as long as its on a separate GPU by simply deploying another:

# The second vLLM instance listens at port 8001
model=mistralai/Mistral-7B-Instruct-v0.2    # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=1"' \
    -v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
    -p 8001:8001 \
    --env "HF_TOKEN=<Your huggingface token>" \
    --ipc=host \
    --network=host \
    apostacyh/vllm:lmcache-0.1.0 \
    --model $model --gpu-memory-utilization 0.7 --port 8001 \
    --lmcache-config-file /lmcache/LMCache/examples/example.yaml

This method supports local, remote or hybrid backends so whichever vLLM deployment method you are already using should work with the LMCache container (excluding BentoML).

LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm

liked a model 3 months ago

nvidia/Mistral-NeMo-Minitron-8B-Base

Text Generation • Updated Aug 22 • 17.6k • 160

Tim Dolan

AI & ML interests

Recent Activity

Articles

Deploy hundreds of open source models on one GPU using LoRAX

Organizations

macadeliccc's activity

macadeliccc/US-SupremeCourt-RAG

microsoft/OmniParser

bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF

nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

rhymes-ai/Aria

naver/efficient-splade-VI-BT-large-query

vidore/colqwen2-v0.1

macadeliccc/magistrate-3.2-3b-it-GGUF

macadeliccc/magistrate-3.2-3b-it

macadeliccc/magistrate-3.2-3b-base

Choosing the Category for a Supreme Court Case

stepfun-ai/GOT-OCR2_0

arcee-ai/Llama-3.1-SuperNova-Lite

HFforLegal/laws

NousResearch/hermes-function-calling-v1

nvidia/Minitron-4B-Base

nvidia/Mistral-NeMo-Minitron-8B-Base