
Custom Grading Script Examples

Complete, ready-to-use examples for building custom autograding scripts in Vocareum labs, covering test-based grading, AI-powered grading, and hybrid approaches.

Written by Mary Gordanier
Updated this week

For Teachers and Admins

This guide covers Vocareum Notebook, VS Code, and JupyterLab assignments. For setup instructions and an explanation of how the grading environment works, see Using Custom Grading Scripts in Vocareum Labs. For Vocareum Notebook-specific context — including when to use a custom script vs. the built-in nbgrader-based autograder — see Using Custom Grading Scripts in Vocareum Notebook.

The key facts you need to use the examples below:

  • Student files are at /voc/work/ ($VOC_HOME_DIR) — read submissions from here

  • Your scripts and support files go in /voc/scripts/ — reference them with $SCRIPT_DIR

  • Write scores to $vocareumGradeFile (CSV: Criterion Name, Score)

  • Write feedback to $vocareumReportFile (free-form text or HTML)
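Put together, a minimal grade.sh using these variables might look like the sketch below. It assumes a rubric with a single criterion named "Score"; the :- fallbacks only matter when running the script by hand, since Vocareum sets both variables before calling grade.sh.

```shell
#!/bin/bash
# Minimal sketch: write one criterion score and one line of feedback.
# Assumes a rubric criterion named "Score". Vocareum sets the two
# vocareum* variables; the fallbacks are for hand-testing only.
GRADE_FILE="${vocareumGradeFile:-./grade.csv}"
REPORT_FILE="${vocareumReportFile:-./report.txt}"
echo "Score, 8" > "$GRADE_FILE"
echo "All tests passed." > "$REPORT_FILE"
```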

File Layout

Every autograding setup requires at minimum one file:

File              Location                       Required  Purpose
grade.sh          /voc/scripts/grade.sh          Yes       Shell entry point Vocareum calls
grade.py          /voc/scripts/grade.py          No        Python grading logic (recommended for complex grading)
grade_prompt.txt  /voc/scripts/grade_prompt.txt  No        AI grading prompt (for AI-based grading)

All files under /voc/scripts/ are placed via Configure Workspace in the assignment settings.


The Grade File and Report File

Grade File ($vocareumGradeFile)

A CSV file where each line maps a rubric criterion to a score:

<Rubric Criterion Name>,<Score>

Rules:

  • Criterion names must exactly match (case-sensitive) the names defined in your Part rubric settings.

  • One criterion per line.

  • Scores are numeric (integer or decimal), up to the max score defined for that criterion.

Example (rubric has "Correctness" with max 7 and "Style" with max 3):

Correctness, 5
Style, 3

Report File ($vocareumReportFile)

Free-form text displayed to the student as feedback. Anything you write here appears in their submission report.

There are two patterns for producing report content — choose one per script:

Pattern 1: Print-based. Write report content to stdout using print() statements. Vocareum automatically appends all stdout from grade.sh to the report file. This is the simpler approach and works well when all script output is student-facing.

Pattern 2: File-based with VOC_NO_REPORT_OUTPUT. Write VOC_NO_REPORT_OUTPUT as the first line of the report file, then write all report content explicitly using file operations. Vocareum suppresses stdout when this line is present, giving you full control over what students see. Use this pattern when your script produces internal status output (such as progress messages or debug logging) that you do not want students to see.

Avoid mixing both patterns in the same script — if VOC_NO_REPORT_OUTPUT is present, print statements will not reach the student.
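A sketch of Pattern 2 in Python follows. It reads the report path from the environment (with a local fallback for hand-testing, which is an assumption of this sketch, not Vocareum behavior):

```python
import os

# Pattern 2 sketch: the VOC_NO_REPORT_OUTPUT sentinel on the first line
# tells Vocareum to suppress stdout, so only this file's contents reach
# the student.
report_path = os.environ.get("vocareumReportFile", "./report.txt")
with open(report_path, "w") as f:
    f.write("VOC_NO_REPORT_OUTPUT\n")
    f.write("Passed 4/5 tests. See details below.\n")

print("debug: grading finished")  # hidden from the student under this pattern
```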


Test-Based Grading

Test-based grading runs the student's code against known inputs and checks the outputs. This is the most common and deterministic approach.


Simple Output Comparison

Assignment: Students write a function add(a, b) in submit.py that returns the sum of two numbers.

Rubric: One criterion — Score with max 10.

/voc/scripts/grade.sh

#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
python3 "$SCRIPT_DIR/grade.py" "$vocareumGradeFile" "$vocareumReportFile"

/voc/scripts/grade.py

import sys
import importlib.util

grade_file = sys.argv[1]
report_file = sys.argv[2]

# Load the student's module
spec = importlib.util.spec_from_file_location("student", "submit.py")
student = importlib.util.module_from_spec(spec)

try:
    spec.loader.exec_module(student)
except Exception as e:
    # Student code fails to load — give 0 and report the error
    with open(grade_file, "w") as f:
        f.write("Score, 0\n")
    with open(report_file, "w") as f:
        f.write(f"ERROR: Could not load submit.py\n{e}\n")
    sys.exit(0)

# Define test cases: (a, b, expected_result)
test_cases = [
    (1, 2, 3),
    (0, 0, 0),
    (-1, 1, 0),
    (100, 200, 300),
    (-5, -10, -15),
]

passed = 0
total = len(test_cases)
report_lines = []

for i, (a, b, expected) in enumerate(test_cases, 1):
    try:
        result = student.add(a, b)
        if result == expected:
            passed += 1
            report_lines.append(f"Test {i}: add({a}, {b}) = {result}  PASSED")
        else:
            report_lines.append(f"Test {i}: add({a}, {b}) = {result}, expected {expected}  FAILED")
    except Exception as e:
        report_lines.append(f"Test {i}: add({a}, {b}) raised {type(e).__name__}: {e}  ERROR")

score = round(passed / total * 10)

with open(grade_file, "w") as f:
    f.write(f"Score, {score}\n")

with open(report_file, "w") as f:
    f.write(f"Passed {passed}/{total} tests\n\n")
    f.write("\n".join(report_lines))

Unit Testing with Multiple Criteria

Assignment: Students implement a Calculator class in submit.py with add, subtract, multiply, and divide methods.

Rubric: Four criteria — Addition (max 3), Subtraction (max 3), Multiplication (max 2), Division (max 2).

/voc/scripts/grade.sh

#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
python3 "$SCRIPT_DIR/grade.py" "$vocareumGradeFile" "$vocareumReportFile"

/voc/scripts/grade.py

import sys
import importlib.util

grade_file = sys.argv[1]
report_file = sys.argv[2]

# Load student module
spec = importlib.util.spec_from_file_location("student", "submit.py")
student = importlib.util.module_from_spec(spec)

try:
    spec.loader.exec_module(student)
    calc = student.Calculator()
except Exception as e:
    with open(grade_file, "w") as f:
        f.write("Addition, 0\nSubtraction, 0\nMultiplication, 0\nDivision, 0\n")
    with open(report_file, "w") as f:
        f.write(f"ERROR: Could not create Calculator instance\n{e}\n")
    sys.exit(0)


def run_tests(method_name, test_cases, max_score):
    """Run tests for a single method. Returns (score, report_lines)."""
    passed = 0
    lines = [f"--- {method_name} ---"]
    method = getattr(calc, method_name, None)

    if method is None:
        lines.append(f"  Method '{method_name}' not found")
        return 0, lines

    for args, expected in test_cases:
        try:
            result = method(*args)
            if abs(result - expected) < 1e-9:
                passed += 1
                lines.append(f"  {method_name}{args} = {result}  PASSED")
            else:
                lines.append(f"  {method_name}{args} = {result}, expected {expected}  FAILED")
        except Exception as e:
            lines.append(f"  {method_name}{args} raised {type(e).__name__}: {e}  ERROR")

    score = round(passed / len(test_cases) * max_score)
    lines.append(f"  Score: {score}/{max_score}")
    return score, lines


# Define tests per method
results = {}
report = []

s, r = run_tests("add", [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)], max_score=3)
results["Addition"] = s
report.extend(r)

s, r = run_tests("subtract", [((5, 3), 2), ((0, 0), 0), ((1, 5), -4)], max_score=3)
results["Subtraction"] = s
report.extend(r)

s, r = run_tests("multiply", [((3, 4), 12), ((0, 5), 0)], max_score=2)
results["Multiplication"] = s
report.extend(r)

s, r = run_tests("divide", [((10, 2), 5), ((7, 3), 7/3)], max_score=2)
results["Division"] = s
report.extend(r)

# Write grades
with open(grade_file, "w") as f:
    for criterion, score in results.items():
        f.write(f"{criterion}, {score}\n")

# Write report
total = sum(results.values())
with open(report_file, "w") as f:
    f.write(f"Total: {total}/10\n\n")
    f.write("\n".join(report))

Testing with pytest

Assignment: Students implement sorting functions in submit.py.

Rubric: Score with max 10.

This approach uses pytest to run a test file and parses the results.

/voc/scripts/grade.sh

#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Copy test file to student workspace so pytest can find the student module
cp "$SCRIPT_DIR/test_solution.py" .

# Run pytest with verbose output, capturing it for parsing
python3 -m pytest test_solution.py -v --tb=short > /tmp/pytest_output.txt 2>&1
PYTEST_EXIT=$?

# Parse results and write grades
python3 "$SCRIPT_DIR/parse_results.py" "$vocareumGradeFile" "$vocareumReportFile" /tmp/pytest_output.txt $PYTEST_EXIT

/voc/scripts/test_solution.py

from submit import bubble_sort, merge_sort

class TestBubbleSort:
    def test_basic(self):
        assert bubble_sort([3, 1, 2]) == [1, 2, 3]

    def test_empty(self):
        assert bubble_sort([]) == []

    def test_single(self):
        assert bubble_sort([1]) == [1]

    def test_duplicates(self):
        assert bubble_sort([3, 1, 3, 2]) == [1, 2, 3, 3]

    def test_already_sorted(self):
        assert bubble_sort([1, 2, 3, 4]) == [1, 2, 3, 4]

class TestMergeSort:
    def test_basic(self):
        assert merge_sort([3, 1, 2]) == [1, 2, 3]

    def test_empty(self):
        assert merge_sort([]) == []

    def test_large(self):
        import random
        data = list(range(100))
        random.shuffle(data)
        assert merge_sort(data) == sorted(data)

    def test_negative(self):
        assert merge_sort([-3, -1, -2]) == [-3, -2, -1]

    def test_mixed(self):
        assert merge_sort([5, -2, 0, 3, -1]) == [-2, -1, 0, 3, 5]

/voc/scripts/parse_results.py

import sys
import re

grade_file = sys.argv[1]
report_file = sys.argv[2]
output_file = sys.argv[3]
exit_code = int(sys.argv[4])

with open(output_file) as f:
    output = f.read()

# Count passed and failed tests from pytest verbose output
passed = len(re.findall(r" PASSED", output))
failed = len(re.findall(r" FAILED", output))
errors = len(re.findall(r" ERROR", output))
total = passed + failed + errors

score = round(passed / total * 10) if total > 0 else 0

with open(grade_file, "w") as f:
    f.write(f"Score, {score}\n")

with open(report_file, "w") as f:
    f.write(f"Passed {passed}/{total} tests (Score: {score}/10)\n\n")
    f.write(output)
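As a sanity check on this parsing approach, here is how the marker counts map to a score on a short, hypothetical fragment of pytest -v output:

```python
import re

# Hypothetical captured `pytest -v` output (abridged for illustration).
output = """test_solution.py::TestBubbleSort::test_basic PASSED
test_solution.py::TestBubbleSort::test_empty PASSED
test_solution.py::TestMergeSort::test_basic FAILED"""

passed = len(re.findall(r" PASSED", output))  # 2
failed = len(re.findall(r" FAILED", output))  # 1
score = round(passed / (passed + failed) * 10)
```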

AI-Powered Grading

AI-powered grading sends the student's code to one or more AI models for evaluation. This is ideal for assessing aspects that are hard to test mechanically — code quality, readability, documentation, design choices, and adherence to best practices.

Prerequisites: GenAI Gateway

Before using AI grading, your Vocareum organization must have:

  1. AI services configured in the GenAI Gateway (Control Center > GenAI > Services). See the GenAI Gateway documentation for setup instructions.

  2. GenAI enabled for your course (Control Center > Course > Course Resources > GenAI).

  3. GenAI API Key Generation enabled for the assignment part (Part settings > Resources).

  4. Python SDKs installed in the grading environment (see Installing Dependencies).

The Vocareum GenAI Gateway provides proxy endpoints so you use a single Vocareum API key (voc-...) to access all AI providers:

Provider            SDK
Claude (Anthropic)  anthropic
ChatGPT (OpenAI)    openai
Gemini (Google)     google-genai

The per-provider proxy endpoint is supplied to the grading environment through the corresponding *_BASE_URL environment variable.

Important: The Gemini endpoint uses Google's native API, not the OpenAI-compatible format. You must use the google-genai SDK for Gemini.


Single-Model AI Grading (Claude)

The simplest AI grading setup — send student code to one AI model and use its evaluation directly.

Rubric: Score with max 10.

/voc/scripts/grade.sh

#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
python3 "$SCRIPT_DIR/grade.py" "$vocareumGradeFile" "$vocareumReportFile"

/voc/scripts/grade_prompt.txt

You are an expert code grader. You will be given a student's Python code submission. Evaluate the code thoroughly based on the following criteria:

1. **Correctness**: Does the code produce the expected output? Does it handle edge cases?
2. **Code Quality**: Is the code well-structured, readable, and following Python best practices?
3. **Efficiency**: Is the solution reasonably efficient in terms of time and space complexity?
4. **Error Handling**: Does the code handle potential errors gracefully?
5. **Documentation**: Are there appropriate comments or docstrings where needed?

Provide a detailed evaluation report and assign a score from 0 to 10.

You MUST respond with a valid JSON object in exactly this format and nothing else:
{"report": "<detailed evaluation text>", "score": <integer 0-10>}

/voc/scripts/grade.py

import json
import sys
import os

import anthropic

# Configuration
# ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL are set automatically by Vocareum
# when GenAI is enabled for your course — no hardcoded key is needed.
api_key = os.environ.get("ANTHROPIC_API_KEY")
base_url = os.environ.get("ANTHROPIC_BASE_URL")
MODEL = "claude-sonnet-4-5-20250514"
RUBRIC_CRITERION = "Score"


def read_file(path):
    with open(path, "r") as f:
        return f.read()


def parse_json_response(text):
    """Extract and parse a JSON object from model response text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    start = text.find("{")
    end = text.rfind("}") + 1
    if start != -1 and end > start:
        raw = text[start:end]
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            pass
        fixed = raw.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
        try:
            return json.loads(fixed)
        except json.JSONDecodeError:
            pass
    raise ValueError(f"Could not parse JSON from response:\n{text}")


def main():
    script_dir = os.path.dirname(os.path.abspath(__file__))
    grade_file = sys.argv[1]
    report_file = sys.argv[2]

    student_code = read_file("submit.py")
    grade_prompt = read_file(os.path.join(script_dir, "grade_prompt.txt"))

    client = anthropic.Anthropic(api_key=api_key, base_url=base_url)
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=grade_prompt,
        messages=[
            {"role": "user", "content": f"Here is the student's code:\n\n{student_code}"}
        ],
    )
    result = parse_json_response(response.content[0].text)

    with open(grade_file, "w") as f:
        f.write(f"{RUBRIC_CRITERION}, {result['score']}\n")

    with open(report_file, "w") as f:
        f.write(result["report"])


if __name__ == "__main__":
    main()

Hybrid Grading (Tests + AI)

Combine test-based grading for objective correctness with AI grading for subjective code quality. This gives you the best of both approaches — deterministic scoring for functionality and nuanced feedback for style.


Tests for Correctness, AI for Code Quality

Assignment: Students implement a top_k_words(text, k) function in submit.py.

Rubric: Two criteria — Correctness (max 6) and Code Quality (max 4).

/voc/scripts/grade.sh

#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
python3 "$SCRIPT_DIR/grade.py" "$vocareumGradeFile" "$vocareumReportFile"

/voc/scripts/quality_prompt.txt

You are an expert Python code reviewer. You will be given a student's Python code. Evaluate ONLY the code quality (not correctness — that is tested separately).

Assess these aspects:
1. **Readability**: Clear variable names, logical structure, appropriate whitespace
2. **Pythonic style**: Use of Python idioms, built-in functions, list comprehensions where appropriate
3. **Error handling**: Input validation, graceful handling of edge cases
4. **Documentation**: Docstrings, type hints, comments where needed

Assign a score from 0 to 4.

You MUST respond with a valid JSON object in exactly this format and nothing else:
{"report": "<code quality evaluation>", "score": <integer 0-4>}

/voc/scripts/grade.py

import json
import sys
import os
import importlib.util

import anthropic

# ANTHROPIC_API_KEY and ANTHROPIC_BASE_URL are set automatically by Vocareum
# when GenAI is enabled for your course — no hardcoded key is needed.
api_key = os.environ.get("ANTHROPIC_API_KEY")
base_url = os.environ.get("ANTHROPIC_BASE_URL")
MODEL = "claude-sonnet-4-5-20250514"


def read_file(path):
    with open(path, "r") as f:
        return f.read()


def parse_json_response(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    start = text.find("{")
    end = text.rfind("}") + 1
    if start != -1 and end > start:
        raw = text[start:end]
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            pass
        fixed = raw.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
        try:
            return json.loads(fixed)
        except json.JSONDecodeError:
            pass
    raise ValueError(f"Could not parse JSON from response:\n{text}")


# ── Part 1: Test-Based Correctness Grading ────────────────────────────

def run_correctness_tests():
    """Run functional tests on student code. Returns (score, report_lines)."""
    spec = importlib.util.spec_from_file_location("student", "submit.py")
    student = importlib.util.module_from_spec(spec)

    try:
        spec.loader.exec_module(student)
    except Exception as e:
        return 0, [f"ERROR: Could not load submit.py — {e}"]

    func = getattr(student, "top_k_words", None)
    if func is None:
        return 0, ["ERROR: Function 'top_k_words' not found in submit.py"]

    tests = [
        {
            "name": "Basic functionality",
            "args": ("the cat sat on the mat the cat", 2),
            "check": lambda r: r[0][0] == "the" and r[0][1] == 3 and len(r) == 2,
        },
        {
            "name": "k larger than unique words",
            "args": ("hello world", 5),
            "check": lambda r: len(r) == 2,
        },
        {
            "name": "Empty string",
            "args": ("", 3),
            "check": lambda r: r == [],
        },
        {
            "name": "Single word repeated",
            "args": ("go go go", 1),
            "check": lambda r: r[0][0] == "go" and r[0][1] == 3,
        },
        {
            "name": "Punctuation handling",
            "args": ("Hello, hello! HELLO.", 1),
            "check": lambda r: r[0][1] == 3,
        },
        {
            "name": "k equals zero",
            "args": ("some words here", 0),
            "check": lambda r: r == [],
        },
    ]

    passed = 0
    lines = ["=== Correctness Tests ==="]

    for t in tests:
        try:
            result = func(*t["args"])
            if t["check"](result):
                passed += 1
                lines.append(f"  {t['name']}: PASSED")
            else:
                lines.append(f"  {t['name']}: FAILED (got {result})")
        except Exception as e:
            lines.append(f"  {t['name']}: ERROR — {type(e).__name__}: {e}")

    score = round(passed / len(tests) * 6)
    lines.append(f"\n  Correctness score: {score}/6 ({passed}/{len(tests)} tests passed)")
    return score, lines


# ── Part 2: AI-Based Code Quality Grading ─────────────────────────────

def run_quality_review(student_code, prompt):
    """Send code to Claude for quality review. Returns (score, report_lines)."""
    client = anthropic.Anthropic(api_key=api_key, base_url=base_url)
    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system=prompt,
        messages=[
            {"role": "user", "content": f"Here is the student's code:\n\n{student_code}"}
        ],
    )
    result = parse_json_response(response.content[0].text)
    lines = [
        "\n=== Code Quality Review (AI) ===",
        result["report"],
        f"\n  Code quality score: {result['score']}/4",
    ]
    return result["score"], lines


# ── Main ──────────────────────────────────────────────────────────────

def main():
    script_dir = os.path.dirname(os.path.abspath(__file__))
    grade_file = sys.argv[1]
    report_file = sys.argv[2]

    student_code = read_file("submit.py")
    quality_prompt = read_file(os.path.join(script_dir, "quality_prompt.txt"))

    # Run tests
    correctness_score, correctness_report = run_correctness_tests()

    # Run AI review
    quality_score, quality_report = run_quality_review(student_code, quality_prompt)

    total = correctness_score + quality_score

    # Write grades (two separate rubric criteria)
    with open(grade_file, "w") as f:
        f.write(f"Correctness, {correctness_score}\n")
        f.write(f"Code Quality, {quality_score}\n")

    # Write combined report
    with open(report_file, "w") as f:
        f.write(f"Total Score: {total}/10\n\n")
        f.write("\n".join(correctness_report))
        f.write("\n")
        f.write("\n".join(quality_report))


if __name__ == "__main__":
    main()

Writing Effective AI Grading Prompts

The quality of AI grading depends heavily on the system prompt. A well-crafted prompt produces consistent, fair, and detailed evaluations. A vague prompt leads to inconsistent scores and unhelpful feedback.

Prompt Structure

Every grading prompt should have four parts:

  1. Role — Tell the AI what it is ("You are an expert code grader").

  2. Input — Describe what it will receive ("You will be given a student's Python code submission").

  3. Criteria — List specific evaluation criteria with clear descriptions.

  4. Output format — Require structured JSON output so it can be parsed programmatically.

Here is a minimal example that includes all four parts:

You are an expert code grader. You will be given a student's Python code submission. Evaluate the code based on correctness, code quality, and efficiency. Assign a score from 0 to 10.

You MUST respond with a valid JSON object in exactly this format and nothing else:
{"report": "<detailed evaluation text>", "score": <integer 0-10>}

Be Specific About Criteria

The more specific your criteria, the more consistent and useful the grading will be.

Vague prompt:

Evaluate the code and give a score.

Better prompt:

Evaluate the code based on:
1. Correctness: Does it handle empty input, negative numbers, and
   duplicate values?
2. Efficiency: Is it O(n log n) or better for sorting?
3. Style: Does it use descriptive variable names and avoid magic numbers?

Include Assignment Context

Tell the AI what the student was asked to do. Without this context, the AI cannot judge whether the student met the requirements.

The student was asked to implement a binary search function that:
- Takes a sorted list and a target value
- Returns the index if found, -1 otherwise
- Must use iterative (not recursive) approach

If there are constraints (e.g., "must not use built-in sort"), include them so the AI can check for violations.

Define the Scoring Scale

Without guidance, different AI models interpret "0 to 10" differently. Define what each range means:

Scoring guide:
- 9-10: Fully correct, clean code, handles all edge cases, well documented
- 7-8: Mostly correct with minor issues, reasonable code quality
- 5-6: Partially correct, some edge cases missed, code works but has
       quality issues
- 3-4: Significant correctness problems, poor structure
- 1-2: Barely functional, major errors
- 0: No meaningful attempt or code does not run

Enforce JSON Output

Your grading script needs to parse the AI's response programmatically. Always end the prompt with a strict output format requirement:

You MUST respond with a valid JSON object in exactly this format and nothing else:
{"report": "<detailed evaluation text>", "score": <integer 0-10>}

Key points:

  • "and nothing else" prevents the AI from wrapping JSON in markdown code blocks or adding commentary.

  • Specify the exact field names (report, score) so your parsing code can rely on them.

  • Specify the data types (string for report, integer for score) to avoid getting floats or string scores.
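A quick validation step after parsing, using the field names and score range assumed in the prompts above, might look like this sketch (the reply string here is a made-up example):

```python
import json

# Validate the parsed reply before writing grades: reject anything that
# does not match the contract the prompt demands.
reply = '{"report": "Clean solution; minor naming issues.", "score": 8}'
data = json.loads(reply)

if not isinstance(data.get("report"), str):
    raise ValueError("report must be a string")
if not isinstance(data.get("score"), int) or not 0 <= data["score"] <= 10:
    raise ValueError("score must be an integer from 0 to 10")
```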

Reconciliation Prompts

When using multiple AI models, you need a separate prompt for the reconciliation step:

You are an expert code grading reconciler. You will be given three independent evaluation reports of a student's Python code submission. Each report includes a detailed evaluation and a score from 0 to 10.

Your task:
1. Review all three evaluations carefully.
2. Identify points of agreement and disagreement.
3. Produce a final consolidated report.
4. Assign a final score from 0 to 10 that fairly reflects the student's
   work, considering all three evaluations.

If the scores differ significantly, explain your reasoning for the final score.

You MUST respond with a valid JSON object in exactly this format and nothing else:
{"report": "<final consolidated evaluation text>", "score": <integer 0-10>}
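If the reconciliation call itself fails, one simple programmatic fallback (an assumption of this sketch, not a Vocareum feature) is to take the median of the individual model scores:

```python
from statistics import median

# Fallback reconciliation: the median is robust to one outlier evaluator.
scores = [7, 9, 6]  # e.g. scores from Claude, ChatGPT, Gemini
final_score = round(median(scores))
```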

Prompt Tips

  • Test your prompt with a few sample submissions before deploying. Check that scores are consistent and feedback is useful.

  • Include percentage weights for each criterion so the AI understands relative importance.

  • Ask for specific feedback ("Reference specific lines") rather than generic comments.

  • Keep prompts under 1000 words. Excessively long prompts can dilute the AI's focus.

  • Iterate on your prompt if scores seem too generous or too harsh across test submissions.

  • Use the same prompt for all evaluator models (Claude, ChatGPT, Gemini) to ensure fair comparison during reconciliation.


Installing Dependencies

Many packages are already pre-configured in Vocareum lab environments. If you need to install any specialized packages for grading, refer to: Installing Packages in Vocareum Labs.

For Test-Based Grading

No additional dependencies are needed — Python's standard library is sufficient. Pytest is also pre-installed in the lab environment for Vocareum Notebook, VS Code, and JupyterLab assignments.

For AI-Powered Grading

Ensure the SDK is installed for each AI provider you use. Most are pre-installed in Vocareum Notebook, VS Code, and JupyterLab assignments.


Testing Your Grading Script

Test Locally in Configure Workspace

Configure Workspace is available for both VS Code and Vocareum Notebook labs. The steps and file paths below are the same for both lab types; the interface will reflect the IDE for your lab type.

  1. Open Configure Workspace for your assignment part.

  2. Place a sample student submission at /voc/work/submit.py.

  3. Run the grading script manually:

cd /voc/work
bash /voc/scripts/grade.sh
  4. Check the generated grade and report files:

cat "$vocareumGradeFile"    # Should show: Score, 7  (or your criteria)
cat "$vocareumReportFile"   # Should show the feedback report

Note that $vocareumGradeFile and $vocareumReportFile may not be set when you run the script by hand. Either export them to temporary paths first, or add fallbacks in grade.sh (for example, GRADE_FILE="${vocareumGradeFile:-./grade.csv}") so the script writes to local files during manual testing.

Test with a Real Student Submission

  1. Log in as a test student.

  2. Submit a sample file.

  3. Trigger grading (automatically on submit, or manually from Dashboard > Control > Auto Grade).

  4. Check that the grade and report appear in the student's submission view.


Tips and Best Practices

General

  • Always read student files from the working directory (/voc/work/, available as $VOC_HOME_DIR), not from the script directory (/voc/scripts/). The grading script runs in the student's workspace.

  • Read your own support files from the script directory using os.path.dirname(os.path.abspath(__file__)) to get the path to /voc/scripts/.

  • Handle errors gracefully. If the student's code fails to load or crashes, still write a grade (usually 0) and a helpful error message rather than letting the grading script crash.

  • Use timeouts when running student code to prevent infinite loops from hanging the grader:

    result = subprocess.run(["python3", "submit.py"], capture_output=True, text=True, timeout=10)
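Catching the timeout explicitly turns a hung submission into a graded failure instead of a grader crash. A sketch (using an inline command as a stand-in for the student's file):

```python
import subprocess

# Sketch: the inline `python3 -c` command stands in for running the
# student's submission; a real grader would pass ["python3", "submit.py"].
try:
    result = subprocess.run(
        ["python3", "-c", "print('ok')"],
        capture_output=True, text=True, timeout=10,
    )
    output = result.stdout
except subprocess.TimeoutExpired:
    output = ""  # treat a hang as no output; grade and report accordingly
```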

Test-Based Grading

  • Test edge cases — empty input, single elements, very large inputs, negative numbers, None values.

  • Use importlib to load student modules rather than import, so you can catch import-time errors.

  • Isolate test execution — run student code in a subprocess if you're concerned about side effects or infinite loops.

  • Provide clear feedback — tell students which tests passed, which failed, and what the expected output was.

AI-Based Grading

  • Set the grading timeout to 120+ seconds for scripts that make AI API calls, which can take 15–30 seconds per call.

  • Use parse_json_response with fallback logic — AI models occasionally return JSON wrapped in markdown or with extra whitespace.

  • Consider cost — each grading run makes one or more API calls. Use lighter models (e.g., claude-sonnet-4-5-20250514) to reduce cost and latency.

  • Read API keys from environment variables. When GenAI is enabled for your course, Vocareum automatically sets ANTHROPIC_API_KEY, OPENAI_API_KEY, and the corresponding *_BASE_URL variables in the grading environment. Use os.environ.get("ANTHROPIC_API_KEY") rather than hardcoding a key in your script.

  • Each AI provider uses a different SDK through Vocareum; see the SDK table under Prerequisites: GenAI Gateway.

Hybrid Approach

  • Use tests for objectively verifiable criteria (does it produce the right output?) and AI for subjective criteria (is the code clean and well-documented?).

  • Write to separate rubric criteria (e.g., Correctness and Code Quality) so students see how they scored in each area.

  • Run tests first — if the code doesn't load or fails all tests, you can skip the AI review and save API costs.
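The "run tests first" tip can be sketched as a simple gate around the AI call. The function and callable names here are illustrative, not part of any Vocareum API:

```python
# Sketch: skip the AI review (and its API cost) when nothing passed.
def grade_submission(correctness_score, ai_review):
    """ai_review is a zero-argument callable returning (score, report)."""
    if correctness_score == 0:
        return 0, "All correctness tests failed; AI quality review skipped."
    return ai_review()

quality_score, note = grade_submission(0, lambda: (4, "Nice code."))
```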
