Rajdeep Borgohain

rbgo

AI & ML interests

Solving language barriers.

Recent Activity

Reacted to m-ric's post with 👍 7 days ago
šŸ” Meta teams use a fine-tuned Llama model to fix production issues in seconds One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy. šŸ¤” 42%, is that not too low? āž”ļø Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack. šŸ’” So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds! How did they do it? šŸ”„ Two-step approach: ā€£ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set ā€£ Fine-tuned Llama 2 7B ranks the most likely culprits šŸŽ“ Training pipeline: ā€£ Continued pre-training on Meta's internal docs and wikis ā€£ Supervised fine-tuning on past incident investigations ā€£ Training data mimicked real-world constraints (2-20 potential changes per incident) šŸ”® Now future developments await: ā€£ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems) ā€£ Improvements in model reasoning should boost accuracy further Read it in full šŸ‘‰ https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
updated a dataset 16 days ago
rbgo/llm-inference-benchmark
liked a Space about 1 month ago
lj1995/GPT-SoVITS-v2

rbgo's activity

Reacted to m-ric's post with 👍 7 days ago
šŸ” Meta teams use a fine-tuned Llama model to fix production issues in seconds

One of Meta's engineering teams shared how they use a fine-tuned small Llama (Llama-2-7B, so not even a very recent model) to identify the root cause of production issues with 42% accuracy.

🤔 42%, is that not too low?
➡️ Usually, whenever there's an issue in production, engineers dive into recent code changes to find the offending commit. At Meta's scale (thousands of daily changes), this is like finding a needle in a haystack.
💡 So when the LLM-based suggestion is right, it cuts incident resolution time from hours to seconds!

How did they do it?

🔄 Two-step approach:
‣ Heuristics (code ownership, directory structure, runtime graphs) reduce thousands of potential changes to a manageable set
‣ Fine-tuned Llama 2 7B ranks the most likely culprits
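
To make the filter-then-rank idea concrete, here is a minimal Python sketch. It is not Meta's implementation: the `Change` fields, the ownership and directory heuristics, and the `toy_scores` lookup that stands in for the fine-tuned model are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Change:
    """Hypothetical record for one code change (fields are illustrative)."""
    change_id: str
    owner_team: str
    directory: str


def filter_candidates(changes: List[Change], incident_team: str,
                      incident_dirs: List[str],
                      max_candidates: int = 20) -> List[Change]:
    """Step 1 (heuristics): keep changes owned by the affected team
    or touching a directory involved in the incident."""
    kept = [
        c for c in changes
        if c.owner_team == incident_team
        or any(c.directory.startswith(d) for d in incident_dirs)
    ]
    return kept[:max_candidates]


def rank_candidates(candidates: List[Change],
                    score: Callable[[Change], float],
                    top_k: int = 5) -> List[Change]:
    """Step 2 (ranking): order the shortlist by score, highest first.
    In the post, a fine-tuned model plays the role of `score`."""
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Toy data: four recent changes, incident reported by the payments team.
changes = [
    Change("c1", "payments", "services/payments/api"),
    Change("c2", "search", "services/search/index"),
    Change("c3", "infra", "services/payments/db"),
    Change("c4", "ads", "services/ads"),
]
shortlist = filter_candidates(changes, "payments", ["services/payments"])

# Placeholder scores standing in for model output (assumed values).
toy_scores = {"c1": 0.2, "c3": 0.7}
ranked = rank_candidates(shortlist, lambda c: toy_scores.get(c.change_id, 0.0))
print([c.change_id for c in ranked])
```

The cheap heuristics run first so that the expensive ranking step only sees a short list.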

🎓 Training pipeline:
‣ Continued pre-training on Meta's internal docs and wikis
‣ Supervised fine-tuning on past incident investigations
‣ Training data mimicked real-world constraints (2-20 potential changes per incident)
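
One way to picture the last point is a fine-tuning example that pairs the known culprit with a handful of distractor changes, so each example holds between 2 and 20 candidates. The sketch below is only an assumption about how such data could be assembled; `build_example` and the ID strings are hypothetical, and the post does not describe the actual data format.

```python
import random
from typing import List, Tuple


def build_example(culprit_id: str, other_ids: List[str], rng: random.Random,
                  min_candidates: int = 2,
                  max_candidates: int = 20) -> Tuple[List[str], int]:
    """Return (candidates, label_index) for one past incident.

    The candidate list contains the true culprit plus randomly sampled
    distractors, with a total size between min_candidates and max_candidates.
    """
    if not other_ids:
        raise ValueError("need at least one distractor change")
    # Cap the list size by how many distractors are actually available.
    upper = min(max_candidates, len(other_ids) + 1)
    size = rng.randint(min_candidates, upper)
    candidates = rng.sample(other_ids, size - 1) + [culprit_id]
    rng.shuffle(candidates)  # avoid leaking the label through position
    return candidates, candidates.index(culprit_id)


rng = random.Random(0)  # seeded for reproducibility
other_ids = [f"d{i}" for i in range(30)]
candidates, label = build_example("culprit", other_ids, rng)
print(len(candidates), candidates[label])
```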

🔮 Now future developments await:
‣ Language models could handle more of the incident response workflow (runbooks, mitigation, post-mortems)
‣ Improvements in model reasoning should boost accuracy further

Read it in full 👉 https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
liked a Space about 1 month ago
Reacted to abhishek's post with 👍 about 1 month ago
updated a Space about 2 months ago
liked a Space about 2 months ago
updated a collection about 2 months ago
updated a collection about 2 months ago