

This is a quantized version of taide/Llama3-TAIDE-LX-8B-Chat-Alpha1, created using llama.cpp.

Model Description

  • The TAIDE project aims to develop a generative AI dialogue engine model that is tailored to the linguistic and cultural characteristics of Taiwan, while also establishing a trustworthy AI environment. By combining academic, industrial, and research resources, the project seeks to advance the development of trustworthy generative AI, enhancing Taiwan's international competitiveness, promoting industrial development, and reducing dependence on foreign technologies.
  • The Llama3 TAIDE series models are based on Meta's released LLaMA3-8b model, incorporating text and training materials from various fields in Taiwan to enhance the model's ability to respond in Traditional Chinese and perform specific tasks. The publicly released models are as follows:
    • Llama3-TAIDE-LX-8B-Chat-Alpha1: Based on LLaMA3-8b, continuously pretrained on Traditional Chinese data, and enhanced for office tasks and multi-turn dialogue through instruction tuning. Suitable for scenarios involving chat dialogue or task assistance. Llama3-TAIDE-LX-8B-Chat-Alpha1 is also available as a 4-bit quantized model. The quantized model is offered primarily for user convenience; quantization may degrade performance and introduce unforeseen issues, and users should take this into account.

Model Parameters

  • Parameters: 8B
  • Max context length: 8K
  • Training tokens in Traditional Chinese: 43B
  • Training time: 2336 H100 GPU Hours
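
As a rough guide to what quantization buys, the weight storage of an 8B-parameter model can be estimated from the number of bits per weight. The sketch below is a back-of-the-envelope calculation, not a measurement: `weight_size_gb` is a hypothetical helper, the parameter count is rounded to the 8B listed above, and real GGUF files are typically somewhat larger because quantization formats also store per-block scales and metadata. Runtime memory for the KV cache is not included.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # assumption: rounded parameter count from "Parameters: 8B"

for bits in (16, 8, 4):
    # Idealized bit widths; actual quantization types average slightly more.
    print(f"{bits}-bit weights: ~{weight_size_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts storage to about a quarter.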


  • Strictly vetting training data for the model to improve its trustworthiness and applicability.
  • Enhancing the model's performance for tasks such as summarization, writing articles, writing letters, and translating between Chinese and English, which are commonly used in office settings.
  • Strengthening the model's understanding of Taiwan-specific culture, terminology, and context.
  • Equipping the model with multi-turn conversation capabilities.


Task User input Model output
Writing articles 請以以下內容為基礎,寫一篇文章:撰寫一篇作文,題目為《一張舊照片》,內容要求為:選擇一張令你印象深刻的照片,說明令你印象深刻的原因,並描述照片中的影像及背後的故事。記錄成長的過程、與他人的情景、環境變遷和美麗的景色。 一張舊照片,承載著時光的痕跡和回憶的溫度。我還記得那張照片是怎麼出現的,那是一張泛黃的黑白相片,裱在牆上,訴說著一段屬於我家族的故事。







Writing letters 請以品牌經理的身份,給廣告公司的創意總監寫一封信,提出對於新產品廣告宣傳活動的創意建議。 [您的姓名]













Summarization 請將這篇文章精簡條理化:「產業創新條例第10條之2及第72條條文修正案」俗稱「台版晶片法」,針對半導體、電動車、5G等技術創新且居國際供應鏈關鍵地位公司,提供最高25%營所稅投抵優惠,企業適用要件包含當年度研發費用、研發密度達一定規模,且有效稅率達一定比率。

- 「台版晶片法」或《產業創新條例第10條之2及第72條條文修正案》,旨在鼓勵關鍵技術創新企業,特別是那些在半導體、電動車、5G技術等領域且在國際供應鏈中扮演關鍵角色者。

- 企業若符合特定條件,包括當年度研發費用及研發密度達標,可享最高25%的營所稅抵減優惠。研發密度目標設為6%,而企業為取得抵減,必須證明其有效稅率達到特定比率,起初設定為112年的12%,113年提高至15%,但可能會依國際稅制的實施情形進行調整。

- 政府在制定政策時,考量到OECD最低稅負制的調整,因此會視情況調整有效稅率門檻。

- 政府部門已就法案內容與企業利益相關者進行協商,確定一些細節,例如企業投資先進製程設備的門檻設為100億元新台幣以供抵減。

- 研發費用門檻經過討論,可能從最初提議的100億元下調至60億至70億元之間,以激勵更多企業符合資格並投入研發,維持台灣在國際供應鏈中的關鍵地位。

- 經濟部官員表示,調降門檻的目的是讓更多企業覺得能夠達成標準,從而獲得應用租稅優惠的動力,並可能藉此提高在台投資金額,進而增加政府稅收。

- 由於IC設計業者的研發費用普遍較低,若門檻設得過高,符合條件的企業將很少,因此有必要調整。

- 在各國推動供應鏈自主化並增加對半導體產業的補助之際,經濟部正努力確保更多企業能從《產業創新條例》第10條之2中受益,以增強投資和鞏固台灣的技術地位。

- 政府強調租稅優惠的制定應符合獎勵目標,在此同時,也將繼續提供其他租稅優惠給在國內經營的企業,以促進台灣經濟成長。

- 最終的研發費用門檻將由經濟部和財政部共同確定,在預告相關子法後,將有約30天的時間供各界討論並提出調整建議,目標是讓這項新政策在6月上路。
Translation 以下提供英文內容,請幫我翻譯成中文。Dongshan coffee is famous for its unique position, and the constant refinement of production methods. The flavor is admired by many caffeine afficionados. 東山咖啡以獨特的位置,加上不斷精進的製作手法而聞名,風味更是讓許多咖啡癡趨之若鶩。


  • Sample code

    • The sample code and documentation will be released on GitHub later.
  • Prompt template

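    • Normal QA
      from transformers import AutoTokenizer
      # Tokenizer of the original (unquantized) model named at the top of this card
      tokenizer = AutoTokenizer.from_pretrained("taide/Llama3-TAIDE-LX-8B-Chat-Alpha1")
      chat = [
          {"role": "user", "content": "{question}"},
      ]
      prompt = tokenizer.apply_chat_template(chat)
      • Replace {question} with the user input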
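    • QA with system prompt
      chat = [
          {"role": "system", "content": "{sys}"},
          {"role": "user", "content": "{question}"},
      ]
      prompt = tokenizer.apply_chat_template(chat)
      • Replace {sys} with the system prompt, e.g. 你是一個來自台灣的AI助理,你的名字是 TAIDE,樂於以台灣人的立場幫助使用者,會用繁體中文回答問題。 (English: "You are an AI assistant from Taiwan named TAIDE. You are happy to help users from a Taiwanese perspective and answer questions in Traditional Chinese.")
      • Replace {question} with the user input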
    • Multi-turn conversation
      chat = [
          {"role": "system", "content": "{sys}"},
          {"role": "user", "content": "{question1}"},
          {"role": "assistant", "content": "{model_answer_1}"},
          {"role": "user", "content": "{question2}"},
      ]
      prompt = tokenizer.apply_chat_template(chat)
      • Replace {sys} with the system prompt, e.g. 你是一個來自台灣的AI助理,你的名字是 TAIDE,樂於以台灣人的立場幫助使用者,會用繁體中文回答問題。 (English: "You are an AI assistant from Taiwan named TAIDE. You are happy to help users from a Taiwanese perspective and answer questions in Traditional Chinese.")
      • Replace {question1} with user input 1
      • Replace {model_answer_1} with model response 1
      • Replace {question2} with user input 2
    • For more details, please refer to the Llama 3 documentation
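
When the GGUF file is used in a runtime where `tokenizer.apply_chat_template` is not available, the same message lists can be rendered into a prompt string by hand. The following is a minimal sketch, assuming this model keeps the stock Llama 3 Instruct chat format (header tokens around each role, `<|eot_id|>` after each turn); `render_llama3_prompt` is a hypothetical helper, not part of any library, and the template stored with the model should be treated as authoritative if it differs.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the stock Llama 3 Instruct format (assumed)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "{sys}"},
    {"role": "user", "content": "{question}"},
]
print(render_llama3_prompt(chat))
```

As in the templates above, {sys} and {question} are placeholders to be replaced before rendering.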

Training methods

  • Software / hardware spec
    • GPU: H100
    • Training Framework: PyTorch
  • Data preprocessing
    • Character normalization
    • Deduplication
    • Denoising
      • HTML tags and JavaScript in web content
      • Non-standard characters or garbage characters
      • Posts with an insufficient number of characters
      • Specific formats, such as extra line breaks added for formatting purposes
    • Removing personal information such as emails and phone numbers
    • Removing inappropriate content such as gambling and pornography
  • Continuous pretraining (CP)
    • Supplementing the model with a large amount of reliable Traditional Chinese knowledge.
    • Hyperparameters
      • optimizer: AdamW
      • learning rate: 1e-4
      • batch size: 1M tokens
      • epoch: 1
  • Fine-tuning (FT)
    • Enabling the model to answer questions in Traditional Chinese.
    • Hyperparameters
      • optimizer: AdamW
      • learning rate: 5e-5
      • batch size: 256K tokens
      • epoch: 3
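
The data preprocessing steps listed above (character normalization, deduplication, denoising, and removal of personal information) can be illustrated with a small pipeline. This is only a sketch of the general technique, not the TAIDE team's actual implementation: the regular expressions, the `[EMAIL]` and `[PHONE]` placeholders, the `MIN_CHARS` threshold, and the use of NFKC normalization with exact-match hashing are all assumptions chosen for illustration.

```python
import hashlib
import re
import unicodedata

# All patterns and thresholds below are illustrative assumptions.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"(?<!\d)0\d{1,3}[-\s]?\d{3,4}[-\s]?\d{3,4}(?!\d)")
MIN_CHARS = 20  # hypothetical minimum length for keeping a post


def clean(text):
    """Normalize and denoise one document; return None if it is too short."""
    text = unicodedata.normalize("NFKC", text)      # character normalization
    text = SCRIPT_RE.sub("", text)                  # drop JavaScript / CSS blocks
    text = TAG_RE.sub("", text)                     # drop remaining HTML tags
    text = EMAIL_RE.sub("[EMAIL]", text)            # mask email addresses
    text = PHONE_RE.sub("[PHONE]", text)            # mask phone numbers
    text = re.sub(r"\n{3,}", "\n\n", text).strip()  # collapse extra line breaks
    return text if len(text) >= MIN_CHARS else None


def preprocess(docs):
    """Clean documents and drop exact duplicates of the cleaned text."""
    seen = set()
    kept = []
    for doc in docs:
        cleaned = clean(doc)
        if cleaned is None:
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(cleaned)
    return kept
```

Note that NFKC also converts full-width punctuation and digits to their half-width forms, which may or may not be desirable for Traditional Chinese text; a production pipeline would likely use near-duplicate detection rather than exact hashing.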

Training Data

  • Continuous pre-training data (about 140GB)
    Dataset | Description
    Litigation Data | Civil litigation data from judicial rulings of courts at various levels, covering 2013/01 to 2023/12.
    CNA news | Daily CNA news articles from June 1993 to June 2023, spanning 30 years. The content covers domains such as domestic and international politics, society, economy, culture, education, and lifestyle.
    ETtoday news | ETtoday news data from 2011/10 to 2023/12.
    Legislative Yuan Gazette | Legislative Yuan Gazette data from the 1st session of the 8th term to the 7th session of the 10th term.
    Publisher Website Book Introduction | Book introduction data from the websites of the publishers SunColor and Gotop.
    Abstracts of GRB research projects | GRB is an information system that compiles research projects funded by government grants and their outcome reports. This dataset primarily includes research project abstracts from 1993 to 2023, in Chinese together with their English counterparts.
    Academic conference proceedings abstracts | Abstracts from the proceedings of academic conferences held in Taiwan from 1988 to 2009.
    Taiwan Panorama magazine | Taiwan Panorama magazine articles from July 1993 to June 2023, spanning 30 years. The content focuses on Taiwanese culture, tourism, and local customs.
    樂詞網 | 《樂詞網》 covers approximately 187,000 academic terms in the humanities and social sciences, along with their translations.
    Data from various ministries and commissions | Partial data from government department websites such as the Executive Yuan's "National Overview", the Ministry of Culture's "National Cultural Memory Bank", the National Development Council's "Archives Support Teaching Network", and the Ministry of Transportation's "Traffic Safety Portal".
    Business Today | Business Today Magazine is a weekly magazine focused on finance. The dataset includes articles from 2008/01 to 2023/07.
    Mandarin and idiom dictionary from the Ministry of Education | Dataset including: (1) Idiom Dictionary: 5,338 idioms, with definitions, original stories, usage explanations, and example sentences. (2) Revised Mandarin Dictionary: Chinese words and various vocabulary, with pronunciation, radicals, definitions, and other information, totaling approximately 165,539 entries. (3) Concise Mandarin Dictionary: a condensed version of the "Revised Mandarin Dictionary", containing a total of 45,247 entries.
    SCITechVista | Science news and popular science articles from the SCITechVista website.
    iKnow | The iKnow platform provides information on market trends, strategic analysis, patent knowledge, and technology transactions for Taiwan and the global technology industry. The dataset includes data from 2005/01 to 2023/07.
    Science Development Monthly Magazine | Science Development Monthly Magazine is a popular science publication published by the National Science Council (NSC) to promote science education. The dataset includes articles from 2004/10 to 2020/12. In 2021, the magazine was relaunched as the quarterly "CharmingSCITech", providing new knowledge on international technology issues.
    Legislation Database | The latest central regulations, rules, draft bills, and local regulations issued by government agencies as of 2023/10.
    Local Government Tourism Websites | Partial data from the tourism websites of local county and city governments in Taiwan.
    Curriculum Guidelines from the National Institute of Education | Curriculum guidelines for different subjects at various levels of education.
    CNA's English and Chinese Name Translation Database | The English and Chinese Name Translation Database of the Central News Agency (CNA) collects translations of foreign and Chinese surnames, personal names, organizations, and place names used in news.
    Fairy tales | A total of 20 fairy tale books, including "Tom Sawyer", "Peter Pan", "Alice's Adventures in Wonderland", "Daddy-Long-Legs", and more.
    RedPajama-Data-V2 | English data extracted from the RedPajama-Data-v2 multilingual dataset.
    MathPile-commercial | A mathematics-focused dataset obtained from MathPile-commercial.
    Traditional Chinese Wikipedia Articles | The content of all articles in Traditional Chinese Wikipedia, up to January 2023.
    github-code-clean | An open-source code dataset from GitHub, with unlicensed code and documents removed.
  • Fine-tuning data
    • The TAIDE team trained LLaMA2 series models to generate the fine-tuning data, which consists of single- or multi-turn conversations on topics such as world knowledge, creative writing, general knowledge, translation, summarization, programming, and Taiwanese values. The fine-tuning data comprises 128K prompt-response pairs and will be released publicly later.


  • taide-bench
    • Data
      • Tasks include writing articles, writing letters, summarizing articles, translating from English to Traditional Chinese, translating from Traditional Chinese to English. There are 500 questions in total.
      • data link: taide-bench
    • Evaluation method
    • Scores
      Model | Translating from Traditional Chinese to English | Translating from English to Traditional Chinese | Summarization | Writing articles | Writing letters | Average
      Llama3-TAIDE-LX-8B-Chat-Alpha1 | 7.770 | 8.280 | 8.495 | 9.605 | 8.950 | 8.620
      GPT3.5 | 8.880 | 8.810 | 7.450 | 9.490 | 8.750 | 8.676
      TAIDE-LX-7B-Chat | 7.165 | 7.685 | 7.720 | 9.635 | 9.110 | 8.263
      LLAMA2 7B | 6.075 | 4.475 | 5.905 | 2.625 | 3.040 | 4.424
      LLAMA2 13B | 6.480 | 6.135 | 6.110 | 2.565 | 3.000 | 4.858
      LLAMA2 70B | 6.975 | 6.375 | 6.795 | 2.625 | 2.990 | 5.152
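
The Average column in the scores above appears to be the unweighted mean of the five task scores. The short check below recomputes it from the reported numbers; the dictionary layout and the `average` helper are illustrative only, and the card does not state how the average was actually produced.

```python
# Task scores copied from the table, in column order:
# zh->en, en->zh, summarization, writing articles, writing letters
scores = {
    "Llama3-TAIDE-LX-8B-Chat-Alpha1": [7.770, 8.280, 8.495, 9.605, 8.950],
    "GPT3.5": [8.880, 8.810, 7.450, 9.490, 8.750],
    "TAIDE-LX-7B-Chat": [7.165, 7.685, 7.720, 9.635, 9.110],
    "LLAMA2 7B": [6.075, 4.475, 5.905, 2.625, 3.040],
    "LLAMA2 13B": [6.480, 6.135, 6.110, 2.565, 3.000],
    "LLAMA2 70B": [6.975, 6.375, 6.795, 2.625, 2.990],
}


def average(task_scores):
    """Unweighted mean of the task scores, rounded to three decimals."""
    return round(sum(task_scores) / len(task_scores), 3)


for model, task_scores in scores.items():
    print(f"{model}: {average(task_scores):.3f}")
```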



  • Due to limitations in its design architecture and the inevitable biases in its training data, any response from the LLM does not represent the stance of TAIDE. Responses may also contain incorrect information, so additional security measures should be implemented before use, and users are advised not to fully trust the responses.

Development Team

Useful links


Model size: 8.03B params






