CodeInstructionGenerator
2025-11-10
CodeInstructionGenerator is an operator that randomly samples few-shot examples from a data pool and uses a large language model (LLM) to generate instructions of similar difficulty. This serves as the first step in a 'self-instruct' style data synthesis pipeline for the code domain.
__init__
class CodeInstructionGenerator(OperatorABC):
    def __init__(self, llm_serving: LLMServingABC, prompt_template=None, num_few_shot: int = 3, num_generate: int = 10):

| Parameter | Type | Default | Description |
|---|---|---|---|
| llm_serving | LLMServingABC | Required | Large language model serving instance for executing inference. |
| prompt_template | PromptABC / str | None | The prompt template object used to construct the input. Accepts a custom template as a string or a DiyCodePrompt; the built-in template is CodeInstructionGeneratePrompt, described below. |
| num_few_shot | int | 3 | The number of few-shot examples to sample. |
| num_generate | int | 10 | The number of similar instructions to generate. |
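The interplay of num_few_shot and num_generate can be illustrated with a minimal, self-contained sketch. It is not the operator's actual implementation: the prompt wording, the helper names (build_few_shot_prompt, generate_instructions, echo_llm), the seed argument, and the choice to draw a fresh sample for each generated instruction are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional


def build_few_shot_prompt(examples: List[str]) -> str:
    """Join sampled examples into one prompt (hypothetical wording)."""
    header = "Here are some example programming instructions:\n\n"
    body = "\n\n".join(
        f"### Example {i}\n{ex}" for i, ex in enumerate(examples, start=1)
    )
    footer = "\n\nWrite one new instruction of similar difficulty."
    return header + body + footer


def generate_instructions(
    pool: List[str],
    llm: Callable[[List[str]], List[str]],
    num_few_shot: int = 3,
    num_generate: int = 10,
    seed: Optional[int] = None,
) -> List[str]:
    """Build num_generate prompts, each from a random few-shot sample."""
    rng = random.Random(seed)
    # Assumption: clamp the sample size so a small pool does not raise.
    k = min(num_few_shot, len(pool))
    prompts = [
        build_few_shot_prompt(rng.sample(pool, k)) for _ in range(num_generate)
    ]
    # The callable stands in for an LLM serving backend: prompts in, texts out.
    return llm(prompts)


def echo_llm(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in model that returns one placeholder per prompt."""
    return [f"generated #{i}" for i, _ in enumerate(prompts)]


pool = ["def a(): ...", "def b(): ...", "def c(): ...", "def d(): ..."]
results = generate_instructions(pool, echo_llm, num_few_shot=3, num_generate=5, seed=0)
```

In this sketch, each of the five prompts contains three examples drawn without replacement from the pool, so the model sees a different mix of examples on each call.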
Prompt Template Descriptions
| Prompt Template Name | Primary Use | Applicable Scenarios | Feature Description |
|---|---|---|---|
| CodeInstructionGeneratePrompt | Generate new code instructions | Create new programming problems of similar style based on a few examples | Generates stylistically consistent instructions based on few-shot examples, maintaining similar difficulty and complexity, ensuring instructions are clear, specific, and solvable. |
run
def run(self, storage: DataFlowStorage, input_key: str = "prompt", output_key: str = "generated_instruction")

| Parameter | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
| input_key | str | "prompt" | Input column name, corresponding to the example instruction field. |
| output_key | str | "generated_instruction" | Output column name, corresponding to the generated instruction field. |
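To make the role of input_key and output_key concrete, the sketch below reads one column from JSON Lines text and writes generated strings back under another column. It uses only the standard library rather than DataFlowStorage; read_column and write_column are hypothetical helpers, and emitting records that carry only the output column is an assumption that mirrors the shape of the Example Output record on this page.

```python
import json
from typing import List


def read_column(jsonl_text: str, input_key: str = "prompt") -> List[str]:
    """Collect the input_key value from each non-empty JSON Lines record."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [row[input_key] for row in rows]


def write_column(values: List[str], output_key: str = "generated_instruction") -> str:
    """Serialize each generated string as one JSON Lines record under output_key."""
    return "\n".join(json.dumps({output_key: value}) for value in values)


# Two example instructions stored under the default input column.
raw = "\n".join(
    json.dumps({"prompt": text})
    for text in ["def add(a, b): ...", "def sub(a, b): ..."]
)
example_pool = read_column(raw, input_key="prompt")

# A placeholder string stands in for model output here.
serialized = write_column(["def mul(a, b): ..."], output_key="generated_instruction")
```

Changing input_key or output_key in this sketch only changes which JSON field is read or written, which is the same role the two parameters play in the table above.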
🧠 Example Usage
🧾 Default Output Format
| Field | Type | Description |
|---|---|---|
| prompt | str | The input instruction. |
| generated_instruction | str | The instruction generated by the model. |
Example Input:
{"prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n \"\"\"\n"}
{"prompt": "from typing import List\n\n\ndef separate_paren_groups(paren_string: str) -> List[str]:\n \"\"\" Input to this function is a string containing multiple groups of nested parentheses. Your goal is to\n separate those group into separate strings and return the list of those.\n Separate groups are balanced (each open brace is properly closed) and not nested within each other\n Ignore any spaces in the input string.\n >>> separate_paren_groups('( ) (( )) (( )( ))')\n ['()', '(())', '(()())']\n \"\"\"\n"}
{"prompt": "\n\ndef truncate_number(number: float) -> float:\n \"\"\" Given a positive floating point number, it can be decomposed into\n and integer part (largest integer smaller than given number) and decimals\n (leftover part always smaller than 1).\n\n Return the decimal part of the number.\n >>> truncate_number(3.5)\n 0.5\n \"\"\"\n"}
Example Output:
{"generated_instruction": "from typing import List\n\n\ndef parse_nested_parens(paren_string: str) -> List[int]:\n \"\"\" Input to this function is a string represented multiple groups for nested parentheses separated by spaces.\n For each of the group, output the deepest level of nesting of parentheses.\n E.g. (()()) has maximum two levels of nesting while ((())) has three.\n\n >>> parse_nested_parens('(()()) ((())) () ((())()())')\n [2, 3, 1, 3]\n \"\"\"\n"}
