# InternLM
👋 join us on Discord and WeChat
## Introduction

InternLM3 has open-sourced an 8-billion-parameter instruction model, InternLM3-8B-Instruct, designed for general-purpose usage and advanced reasoning. The model has the following characteristics:

- **Enhanced performance at reduced cost**: State-of-the-art performance on reasoning and knowledge-intensive tasks, surpassing models like Llama3.1-8B and Qwen2.5-7B. Remarkably, InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale.
- **Deep thinking capability**: InternLM3 supports both a deep thinking mode for solving complicated reasoning tasks via long chain-of-thought and a normal response mode for fluent user interactions.

## InternLM3-8B-Instruct

### Performance Evaluation

We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capability: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Some of the results are shown below; visit the [OpenCompass leaderboard](https://rank.opencompass.org.cn) for more evaluation results.

| Benchmark    |                                  | InternLM3-8B-Instruct | Qwen2.5-7B-Instruct | Llama3.1-8B-Instruct | GPT-4o-mini (closed source) |
| ------------ | -------------------------------- | --------------------- | ------------------- | -------------------- | --------------------------- |
| General      | CMMLU (0-shot)                   | **83.1**              | 75.8                | 53.9                 | 66.0                        |
|              | MMLU (0-shot)                    | 76.6                  | **76.8**            | 71.8                 | 82.7                        |
|              | MMLU-Pro (0-shot)                | **57.6**              | 56.2                | 48.1                 | 64.1                        |
| Reasoning    | GPQA-Diamond (0-shot)            | **37.4**              | 33.3                | 24.2                 | 42.9                        |
|              | DROP (0-shot)                    | **83.1**              | 80.4                | 81.6                 | 85.2                        |
|              | HellaSwag (10-shot)              | **91.2**              | 85.3                | 76.7                 | 89.5                        |
|              | KOR-Bench (0-shot)               | **56.4**              | 44.6                | 47.7                 | 58.2                        |
| MATH         | MATH-500 (0-shot)                | **83.0\***            | 72.4                | 48.4                 | 74.0                        |
|              | AIME2024 (0-shot)                | **20.0\***            | 16.7                | 6.7                  | 13.3                        |
| Coding       | LiveCodeBench (2407-2409 Pass@1) | **17.8**              | 16.8                | 12.9                 | 21.8                        |
|              | HumanEval (Pass@1)               | 82.3                  | **85.4**            | 72.0                 | 86.6                        |
| Instruction  | IFEval (Prompt-Strict)           | **79.3**              | 71.7                | 75.2                 | 79.7                        |
| Long Context | RULER (4-128K Average)           | 87.9                  | 81.4                | **88.5**             | 90.7                        |
| Chat         | AlpacaEval 2.0 (LC WinRate)      | **51.1**              | 30.3                | 25.0                 | 50.7                        |
|              | WildBench (Raw Score)            | **33.1**              | 23.3                | 1.5                  | 40.3                        |
|              | MT-Bench-101 (Score 1-10)        | **8.59**              | 8.49                | 8.37                 | 8.87                        |

- The evaluation results were obtained with [OpenCompass](https://github.com/internLM/OpenCompass/); results marked with \* were evaluated in Thinking Mode. The evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
- Numbers may differ slightly across [OpenCompass](https://github.com/internLM/OpenCompass/) versions, so please refer to the latest OpenCompass results.

**Limitations:** Although we have made efforts to ensure the safety of the model during training and to encourage it to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.

### Requirements

```
transformers >= 4.48
```

### Conversation Mode

#### Transformers inference

To load the InternLM3 8B Instruct model using Transformers, use the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "internlm/internlm3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded in float32 and may cause OOM errors.
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save GPU memory.
# InternLM3 8B in 4-bit will cost nearly 8GB of GPU memory.
#   pip install -U bitsandbytes
#   8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
#   4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()

system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]
prompt = tokenizer.batch_decode(tokenized_chat)[0]
print(prompt)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
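Note that recent `transformers` releases deprecate the bare `load_in_8bit` / `load_in_4bit` flags shown in the comments above in favor of an explicit `BitsAndBytesConfig`. The snippet below is a minimal sketch of 4-bit loading in that style; it assumes `bitsandbytes` is installed and a CUDA GPU is available, and is not part of the original example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_dir = "internlm/internlm3-8b-instruct"

# 4-bit quantization with float16 compute (assumption: bitsandbytes + CUDA available).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quant_config,
).eval()
```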
#### LMDeploy inference

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

```bash
pip install lmdeploy
```

You can run batch inference locally with the following Python code:

```python
import lmdeploy

model_dir = "internlm/internlm3-8b-instruct"
pipe = lmdeploy.pipeline(model_dir)
response = pipe("Please tell me five scenic spots in Shanghai")
print(response)
```

Or you can launch an OpenAI-compatible server with the following command:

```bash
lmdeploy serve api_server internlm/internlm3-8b-instruct --model-name internlm3-8b-instruct --server-port 23333
```

Then you can send a chat request to the server:

```bash
curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm3-8b-instruct",
    "messages": [
        {"role": "user", "content": "Please tell me five scenic spots in Shanghai"}
    ]
    }'
```

Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.io/en/latest/).
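Because the server exposes an OpenAI-compatible API, you can also query it with the official `openai` Python client instead of `curl`. A minimal sketch, assuming the server is running on `localhost:23333` as started above and no API key is enforced (any placeholder string works in that case):

```python
from openai import OpenAI

# Point the client at the local LMDeploy server (OpenAI-compatible endpoint).
# The api_key is a placeholder; supply a real key only if your server requires one.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internlm3-8b-instruct",
    messages=[
        {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
    ],
)
print(response.choices[0].message.content)
```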
#### Ollama inference

TODO

#### vLLM inference

We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please install vLLM manually from the PR branch:

```bash
git clone -b support-internlm3 https://github.com/RunningLeon/vllm.git
cd vllm
pip install -e .
```

Then run the following inference code:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="internlm/internlm3-8b-instruct")
sampling_params = SamplingParams(temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)

system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""

prompts = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
outputs = llm.chat(prompts, sampling_params=sampling_params, use_tqdm=False)
print(outputs)
```

### Thinking Mode

#### Thinking Demo