humaneval

We introduce a new model designed for code generation. Its test accuracy on the HumanEval base dataset surpasses that of GPT-4 Turbo (April 2024) and GPT-4o.

Python · 841 stars · updated 9 months ago

Run evaluation on LLMs using the human-eval benchmark (a minimal usage sketch follows this entry).

Python · 407 stars · updated 2 years ago
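
As a rough illustration of what such an evaluation looks like, here is a minimal sketch built on the OpenAI human-eval package (read_problems, write_jsonl, and the evaluate_functional_correctness CLI). The generate_one_completion function is a placeholder standing in for whatever model is being evaluated; it is not part of this repository.

```python
# Minimal HumanEval harness sketch: generate one completion per problem and
# write them to a JSONL file that `evaluate_functional_correctness`
# (the CLI shipped with the human-eval package) can score.
from human_eval.data import read_problems, write_jsonl


def generate_one_completion(prompt: str) -> str:
    """Placeholder: call the model under evaluation here and return only the
    code that continues the given function signature/docstring."""
    return "    pass\n"  # dummy completion; replace with a real model call


problems = read_problems()  # task_id -> {"prompt", "entry_point", "test", ...}

samples = [
    {"task_id": task_id, "completion": generate_one_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]

write_jsonl("samples.jsonl", samples)
# Then, from the shell:
#   evaluate_functional_correctness samples.jsonl
```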

SkyCode is a multilingual open-source programming large language model built on the GPT-3 architecture. It supports Java, JavaScript, C, C++, Python, Go, shell, and other mainstream programming languages, and can understand Chinese comments. The model completes code and has strong problem-solving ability, freeing you from routine programming so you can concentrate on more important problems. (A code-completion sketch follows this entry.)

394 stars · updated 2 years ago
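
As a sketch only: code completion with a GPT-style causal LM through Hugging Face transformers. The model id below is an assumption for illustration (the actual published checkpoint name may differ), and the prompt and generation parameters are arbitrary.

```python
# Hypothetical code-completion sketch with a GPT-style causal LM via
# Hugging Face transformers; the model id is an assumption, not taken
# from this listing.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SkyWork/SkyCode"  # assumed Hub id; substitute the real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# A Chinese comment plus a function header, completed greedily by the model.
prompt = "# 计算斐波那契数列\ndef fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```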

Evaluate LLM-generated COBOL

Python · 35 stars · updated 1 year ago

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Python · 9 stars · updated 6 months ago

Performance analysis of LLMs on CPU and GPU: execution time and energy usage.

Java · 0 stars · updated 1 year ago

JetBrains Task: Leveraging software evolution data with LLMs

0 stars · updated 1 year ago

llm_benchmark is a benchmarking tool for evaluating the performance of various Large Language Models (LLMs) across a range of natural language processing tasks. It provides a standardized framework for comparing models on accuracy, speed, and efficiency (a generic sketch of such a comparison loop follows this entry).

0 stars · updated 2 months ago
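
For illustration, a generic comparison loop in the spirit of that description: per-model accuracy on a fixed task list plus mean latency. Every name here is hypothetical and not taken from the llm_benchmark repository.

```python
# Generic model-comparison sketch: run every (prompt, expected) task through
# every model, recording accuracy and mean wall-clock latency per model.
# All names are illustrative placeholders.
import time
from typing import Callable, Dict, List, Tuple


def benchmark(models: Dict[str, Callable[[str], str]],
              tasks: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    results: Dict[str, Dict[str, float]] = {}
    for name, model_fn in models.items():
        correct, elapsed = 0, 0.0
        for prompt, expected in tasks:
            start = time.perf_counter()
            answer = model_fn(prompt)
            elapsed += time.perf_counter() - start
            correct += int(answer.strip() == expected)
        results[name] = {
            "accuracy": correct / len(tasks),
            "mean_latency_s": elapsed / len(tasks),
        }
    return results


if __name__ == "__main__":
    # Toy usage with stand-in "models" (plain functions) and two tasks.
    toy_models = {"echo": lambda p: p, "upper": lambda p: p.upper()}
    toy_tasks = [("hello", "hello"), ("world", "WORLD")]
    print(benchmark(toy_models, toy_tasks))
```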