What Should Truly Korean-Capable LLM Development Prioritize?
I want to start by saying I genuinely support the progress of K-AI and I really appreciate seeing major players also contribute back to the open-source community. Healthy competition + open collaboration is exactly what helps the whole ecosystem move faster.
That said, when we talk about building a truly Korean-capable model beyond simply reusing Qwen (or any strong existing base), I think “good Korean understanding” shouldn’t be reduced to “high Korean benchmark scores.” In practice, the path that looks most effective on the surface often amounts to plugging in an external open-source encoder, or tuning/patching a subset of layers with a Korean-specific dataset to boost headline numbers. Those approaches can be useful, but they don’t necessarily advance the foundational capabilities we’ll need long-term.
What matters more, in my view, is doing the hard work that actually makes a model robust in Korean: designing a tokenizer that behaves reliably for Korean morphology and real-world text distributions, and then proposing training recipes, modeling strategies, and technical insights built on top of that tokenizer. Even more valuable is using those foundations to suggest scalable, extensible research directions that others can build upon—rather than only optimizing for short-term evaluation wins.
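To illustrate why tokenizer design is nontrivial for Korean: each Hangul syllable is a single precomposed code point that occupies 3 UTF-8 bytes and decomposes into 2–3 jamo (letters), so a byte-level BPE tokenizer with weak Korean coverage can fragment one syllable into several tokens. A minimal stdlib sketch (not tied to any particular model’s tokenizer):

```python
import unicodedata

def hangul_stats(text: str) -> dict:
    """Compare the three 'sizes' of a Korean string a tokenizer must reconcile."""
    nfd = unicodedata.normalize("NFD", text)  # decompose syllables into jamo
    return {
        "chars": len(text),                       # precomposed syllable blocks
        "jamo": len(nfd),                         # underlying letters
        "utf8_bytes": len(text.encode("utf-8")),  # what a byte-level BPE sees
    }

# '학교' (school): 2 syllables, 5 jamo, 6 UTF-8 bytes.
print(hangul_stats("학교"))
```

A vocabulary built without Korean-aware coverage ends up operating on the 6-byte view, while Korean morphology lives at the syllable/jamo level, which is one reason tokenizer choices ripple into everything trained on top.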
So I don’t think the core issue is whether a Pearson correlation coefficient looks a certain way, or whether we argue about the strict definition of “from scratch.” The unfortunate part is that the company in Korea with arguably the largest capacity to invest in research is exactly the one that should be pushing deeper on fundamental techniques, and it’s disappointing when that kind of foundational work doesn’t seem to be prioritized or shared.
(Again, I’m rooting for K-AI’s growth and open contributions—this is just my perspective on what would make the work most meaningful for the Korean NLP community in the long run.)
Trying out different approaches seems like a good idea, but I’m a bit confused. When you fine-tune Google’s Gemma model, it falls under the Gemma license. So if you fine-tune a Qwen model, wouldn’t it also need to be released under the Qwen license?
@kimhyunwoo I’m not the developer of this model, so I can’t be certain how it was actually implemented or trained, but if it is a Qwen-based fine-tuned/derivative model, I think it’s worth checking whether the original license (and required notices) were omitted. Here is how I think about the licensing side. To be clear, the scenarios below are only a problem if Qwen-based modeling was used “as is”; if Qwen was referenced only at the level of ideas, the scenarios below may not apply, so please read them with that caveat!
Before I summarize the scenarios, I want to point out one detail about this repo’s current licensing:
Current LICENSE file (in this repo): the HyperCLOVA X SEED 32B Think Model License Agreement, i.e., a custom license.
With that in mind, here are the scenarios I’m considering:
Scenarios
- Fine-tuned weights derived from Qwen
  - Assume a model was fine-tuned from Qwen-2.5 (VL) 32B (Apache-2.0) or from Qwen-2.5 (VL) 72B (Qwen LICENSE AGREEMENT).
  - The publisher uploads the fine-tuned weights to Hugging Face, but does not include the original base license (Apache-2.0 or Qwen LICENSE AGREEMENT), and instead attaches only a custom license (e.g., the LICENSE above).
- Architecture-only (no Qwen weights, no Qwen code copied)
  - The publisher does not use Qwen weights at all, and instead trains from scratch (or uses other weights), but claims the model “uses Qwen architecture” / is “Qwen-like” / “Qwen-structured”.
  - In other words: only the model structure/idea is reused, without distributing Qwen Materials (weights/code).
My understanding... :) (please correct me if I’m wrong)
1) If the base was Qwen-2.5 32B (Apache-2.0)
Publishing the fine-tuned model only under a custom license (and omitting Apache-2.0) seems risky because Apache-2.0 redistribution requirements generally include:
- Providing a copy of the Apache-2.0 license
- Preserving copyright / attribution notices
- Keeping NOTICE content (if applicable)
- Marking modified files with prominent change notices
So even if extra terms are added, they cannot replace Apache-2.0 entirely: the release still needs to keep Apache-2.0 compliance, and can then optionally add extra terms on top (as long as they don’t conflict).
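As a rough illustration of the checklist above, a redistribution of an Apache-2.0 base would typically ship the license and notice files alongside the weights. A minimal sketch (file names here are conventional, not mandated verbatim by the license; Apache-2.0 requires the content, not these exact paths):

```python
from pathlib import Path

# Files an Apache-2.0-compliant redistribution conventionally includes.
# NOTICE is only required if the base repo actually ships one.
REQUIRED = ["LICENSE", "NOTICE"]

def missing_apache2_files(repo_dir: str) -> list[str]:
    """Return the conventional compliance files absent from a model repo."""
    repo = Path(repo_dir)
    return [name for name in REQUIRED if not (repo / name).is_file()]
```

Under this reading, a custom license could live alongside these (e.g., as an additional-terms file), rather than replacing them.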
2) If the base was Qwen-2.5 72B (Qwen LICENSE AGREEMENT)
This seems even stricter because the Qwen license explicitly requires (when redistributing or making available as part of a product/service):
- Providing recipients a copy of the Qwen LICENSE AGREEMENT
- Keeping a required Notice file statement
- Marking modified files with prominent change notices
- And if Qwen (or its outputs) were used to create/train/fine-tune/improve another model that is distributed/made available, to prominently display “Built with Qwen” or “Improved using Qwen” in related documentation
So publishing a fine-tuned model while omitting these required notices/agreements and using only a custom license looks like a potential license violation.
3) If it’s architecture-only (no Qwen weights, no Qwen code copied)
My question here is: does either Apache-2.0 / Qwen LICENSE AGREEMENT impose attribution / notice obligations if only the “architecture idea” is reused, without distributing Qwen Materials?
I think the answer depends on whether any copyrightable material was actually reused:
- If it’s purely a re-implementation from scratch (no code copied, no weights derived, no Qwen outputs used to train), then it might not be a “derivative work” of the Qwen Materials, and the Qwen/Apache notice obligations may not apply.
- But if any Qwen code is copied/adapted (even partially), or if Qwen outputs are used for distillation / synthetic data training, then it likely becomes subject to the original license obligations (especially under the Qwen LICENSE AGREEMENT, which explicitly mentions outputs in its rules-of-use clause).
Additional concern
If the custom license text is written as an agreement between a different vendor (e.g., NAVER) and the user, attaching it to a Qwen-derived model may also be confusing or misleading about the true rights holder / origin of the weights.
Why do people care about 'from scratch' versus 'derived from other models'? Because without that distinction, one cannot identify the explicit contribution, in either academia or industry. Honestly, rewarding 'effort' rather than contribution is a student's attitude, in my opinion, so the naive binary between 'from scratch' and 'from others' should be abandoned. However, if anyone wants to be funded under a 'you did your best' culture of encouragement, they should clearly disclose what was built from scratch and what was not.
Qwen 3 Coder is better and newer; if anything, please base the model on that.
As a side note, IQuest utilises a new approach to training for coding, and though it's new, it looks even more promising than Qwen.
I think time to first token, generation speed, and (thinking) efficiency could be the Korean priority. A ppalli-ppalli ("hurry hurry") LLM with 2-ping latency sounds like it has potential. 🙂