Lin Tan
lin-tan
4
followers
·
7 following
AI & ML interests
AI-Software Synergy. LLM4Code (binary and source code).
Mary J. Elmore New Frontiers Professor
Purdue University
Recent Activity
Reacted to
their
post
with 🔥
about 17 hours ago
Can language models replace developers? #RepoCod says “Not Yet”, because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks.
- Leaderboard https://lt-asset.github.io/REPOCOD/
- Dataset: https://huggingface.co/datasets/lt-asset/REPOCOD
@jiang719 @shanchao @Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks are
- General code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues.
- With 2.6X more tests per task (313.5 compared to SWE-Bench’s 120.8).
Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with
- Whole function generation
- Repository-level context
- Validation with test cases, and
- Real-world complex tasks: longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00)
Introducing hashtag #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod with:
- 67 repository-level, 67 file-level, and 66 self-contains tasks
- Detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens)
- GPT-4o and other LLMs have < 10% accuracy/pass@1 on RepoCod-Lite tasks.
- Dataset: https://huggingface.co/datasets/lt-asset/REPOCOD_Lite
#LLM4code #LLM #CodeGeneration #Security
Reacted to
their
post
with 🤗
about 17 hours ago
Can language models replace developers? #RepoCod says “Not Yet”, because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks.
- Leaderboard https://lt-asset.github.io/REPOCOD/
- Dataset: https://huggingface.co/datasets/lt-asset/REPOCOD
@jiang719 @shanchao @Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks are
- General code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues.
- With 2.6X more tests per task (313.5 compared to SWE-Bench’s 120.8).
Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with
- Whole function generation
- Repository-level context
- Validation with test cases, and
- Real-world complex tasks: longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00)
Introducing hashtag #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod with:
- 67 repository-level, 67 file-level, and 66 self-contains tasks
- Detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens)
- GPT-4o and other LLMs have < 10% accuracy/pass@1 on RepoCod-Lite tasks.
- Dataset: https://huggingface.co/datasets/lt-asset/REPOCOD_Lite
#LLM4code #LLM #CodeGeneration #Security
posted
an
update
about 17 hours ago
Can language models replace developers? #RepoCod says “Not Yet”, because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks.
- Leaderboard https://lt-asset.github.io/REPOCOD/
- Dataset: https://huggingface.co/datasets/lt-asset/REPOCOD
@jiang719 @shanchao @Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks are
- General code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues.
- With 2.6X more tests per task (313.5 compared to SWE-Bench’s 120.8).
Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with
- Whole function generation
- Repository-level context
- Validation with test cases, and
- Real-world complex tasks: longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00)
Introducing hashtag #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod with:
- 67 repository-level, 67 file-level, and 66 self-contains tasks
- Detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens)
- GPT-4o and other LLMs have < 10% accuracy/pass@1 on RepoCod-Lite tasks.
- Dataset: https://huggingface.co/datasets/lt-asset/REPOCOD_Lite
#LLM4code #LLM #CodeGeneration #Security
View all activity
Organizations
view post
Can language models replace developers? #RepoCod says “Not Yet”, because GPT-4o and other LLMs have <30% accuracy/pass@1 on real-world code generation tasks. - Leaderboard https://lt-asset.github.io/REPOCOD/ - Dataset:
lt-asset/REPOCOD
@jiang719
@shanchao
@Yiran-Hu1007
Compared to #SWEBench, RepoCod tasks are - General code generation tasks, while SWE-Bench tasks resolve pull requests from GitHub issues. - With 2.6X more tests per task (313.5 compared to SWE-Bench’s 120.8). Compared to #HumanEval, #MBPP, #CoderEval, and #ClassEval, RepoCod has 980 instances from 11 Python projects, with - Whole function generation - Repository-level context - Validation with test cases, and - Real-world complex tasks: longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) Introducing hashtag #RepoCod-Lite 🐟 for faster evaluations: 200 of the toughest tasks from RepoCod with: - 67 repository-level, 67 file-level, and 66 self-contains tasks - Detailed problem descriptions (967 tokens) and long canonical solutions (918 tokens) - GPT-4o and other LLMs have < 10% accuracy/pass@1 on RepoCod-Lite tasks. - Dataset:
lt-asset/REPOCOD_Lite #LLM4code #LLM #CodeGeneration #Security