Extended Backus-Naur Form (EBNF) is a syntax notation used to express context-free grammars, which are essential in defining the syntax of programming languages. Creating an interpreter that can read an EBNF description of a language and execute programs written in that language involves several key components:
This guide provides a comprehensive ANSI C implementation that accomplishes these tasks, enabling the interpretation of programs written in any language described by an EBNF grammar.
The first step is to read and parse the EBNF grammar file. This involves:
Once the grammar is parsed, the input program written in the defined language needs to be tokenized:
The tokenized input is then used to build an AST, which represents the hierarchical structure of the program:
The final step involves traversing the AST to execute the program:
To represent grammar rules and the AST, we'll define several structures:
typedef struct Rule {
char name[50];
char definition[500];
struct Rule *next;
} Rule;
typedef struct Grammar {
Rule *rules;
} Grammar;
These structures store grammar rules as a linked list, where each rule has a name and its corresponding definition.
typedef struct ASTNode {
char type[20]; // Type of node (e.g., "Number", "Add")
int value; // Value for numbers
struct ASTNode* left;
struct ASTNode* right;
} ASTNode;
The ASTNode structure represents nodes in the Abstract Syntax Tree, with fields for the node type, value, and pointers to child nodes.
Parsing the EBNF involves reading the grammar file and constructing the Grammar structure:
Grammar* parseEBNF(const char *filename) {
FILE *file = fopen(filename, "r");
if (!file) {
perror("Error opening EBNF file");
return NULL;
}
Grammar *grammar = (Grammar *)malloc(sizeof(Grammar));
grammar->rules = NULL;
char line[550];
while (fgets(line, sizeof(line), file)) {
trimWhitespace(line);
if (strlen(line) == 0 || line[0] == '#') continue; // Ignore empty lines or comments
char *delimiter = strstr(line, "::=");
if (!delimiter) {
fprintf(stderr, "Invalid EBNF syntax: %s\n", line);
continue;
}
char name[50];
char definition[500];
strncpy(name, line, delimiter - line);
name[delimiter - line] = '\0';
strcpy(definition, delimiter + 3);
trimWhitespace(name);
trimWhitespace(definition);
addRule(grammar, name, definition);
}
fclose(file);
return grammar;
}
This function reads the EBNF file, extracts rule names and definitions, trims whitespace, and adds them to the Grammar structure using the addRule
function.
void addRule(Grammar *grammar, const char *name, const char *definition) {
Rule *newRule = (Rule *)malloc(sizeof(Rule));
strcpy(newRule->name, name);
strcpy(newRule->definition, definition);
newRule->next = grammar->rules;
grammar->rules = newRule;
}
This function creates a new Rule and prepends it to the grammar's rules linked list.
void trimWhitespace(char *str) {
char *end;
// Trim leading space
while(isspace((unsigned char)*str)) str++;
if(*str == 0) return; // All spaces
// Trim trailing space
end = str + strlen(str) - 1;
while(end > str && isspace((unsigned char)*end)) end--;
// Write new null terminator
*(end+1) = '\0';
}
This function removes leading and trailing whitespaces from a string.
Tokenization involves breaking the input program into tokens based on the grammar rules:
// Structure for parser state
typedef struct {
const char *input;
int position;
} ParserState;
// Function to skip whitespace
void skipWhitespace(ParserState *state) {
while (state->input[state->position] && isspace(state->input[state->position])) {
state->position++;
}
}
// Function to get the next token
char* getNextToken(ParserState *state) {
skipWhitespace(state);
if (!state->input[state->position]) return NULL;
char *token = (char*)malloc(MAX_TOKEN_LENGTH);
int i = 0;
while (state->input[state->position] && !isspace(state->input[state->position]) && i < MAX_TOKEN_LENGTH - 1) {
token[i++] = state->input[state->position++];
}
token[i] = '\0';
return token;
}
The ParserState
structure maintains the current position in the input. The getNextToken
function extracts tokens sequentially, skipping any whitespace.
The AST represents the hierarchical structure of the program. Here's how to construct it:
ASTNode* createNode(const char* type, int value) {
ASTNode* node = (ASTNode*)malloc(sizeof(ASTNode));
strcpy(node->type, type);
node->value = value;
node->left = NULL;
node->right = NULL;
return node;
}
This function initializes a new AST node with the given type and value.
ASTNode* parseExpression(const char* input) {
// For simplicity, assume the input is a single number or a binary operation
if (strchr(input, '+')) {
ASTNode* node = createNode("Add", 0);
node->left = createNode("Number", atoi(input));
node->right = createNode("Number", atoi(strchr(input, '+') + 1));
return node;
} else if (strchr(input, '-')) {
ASTNode* node = createNode("Subtract", 0);
node->left = createNode("Number", atoi(input));
node->right = createNode("Number", atoi(strchr(input, '-') + 1));
return node;
} else {
return createNode("Number", atoi(input));
}
}
This simplified parser creates AST nodes based on the presence of '+' or '-' operators in the input.
Interpretation involves traversing the AST and executing operations:
int interpret(ASTNode* node) {
if (strcmp(node->type, "Number") == 0) {
return node->value;
} else if (strcmp(node->type, "Add") == 0) {
return interpret(node->left) + interpret(node->right);
} else if (strcmp(node->type, "Subtract") == 0) {
return interpret(node->left) - interpret(node->right);
}
return 0; // Default case
}
The interpret
function recursively evaluates the AST nodes to compute the final result of the program.
The main function brings together all components to form a complete interpreter:
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "Usage: %s <EBNF file> <program file>\n", argv[0]);
return 1;
}
const char *ebnfFile = argv[1];
const char *programFile = argv[2];
Grammar *grammar = parseEBNF(ebnfFile);
if (!grammar) {
return 1;
}
// Read the program file
FILE *progFile = fopen(programFile, "r");
if (!progFile) {
perror("Error opening program file");
freeGrammar(grammar);
return 1;
}
char inputProgram[1000];
fread(inputProgram, 1, sizeof(inputProgram)-1, progFile);
inputProgram[strcspn(inputProgram, "\n")] = 0; // Remove newline
fclose(progFile);
// Parse the expression
ASTNode* root = parseExpression(inputProgram);
// Interpret the AST
int result = interpret(root);
printf("Result: %d\n", result);
// Free memory
// (Free AST nodes and grammar here)
freeGrammar(grammar);
return 0;
}
This main function performs the following:
The current implementation utilizes a simplified parsing strategy suitable for basic arithmetic expressions. To handle more complex grammars defined by EBNF, consider implementing the following:
To support a wider range of language features, the AST can be extended with additional node types and properties:
Efficient memory management is crucial for performance and stability:
Below is the complete ANSI C code integrating all the components discussed:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define MAX_TOKEN_LENGTH 256
#define MAX_RULES 100
// Structure to represent a grammar rule
typedef struct Rule {
char name[50];
char definition[500];
struct Rule *next;
} Rule;
// Structure for the grammar
typedef struct Grammar {
Rule *rules;
} Grammar;
// Structure for parser state
typedef struct {
const char *input;
int position;
} ParserState;
// Structure for AST nodes
typedef struct ASTNode {
char type[20]; // Type of node (e.g., "Number", "Add")
int value; // Value for numbers
struct ASTNode* left;
struct ASTNode* right;
} ASTNode;
// Function prototypes
Grammar* parseEBNF(const char *filename);
void addRule(Grammar *grammar, const char *name, const char *definition);
void trimWhitespace(char *str);
ASTNode* createNode(const char* type, int value);
ASTNode* parseExpression(const char* input);
int interpret(ASTNode* node);
char* getNextToken(ParserState *state);
void skipWhitespace(ParserState *state);
void freeGrammar(Grammar *grammar);
void freeAST(ASTNode* node);
// Function to trim leading and trailing whitespace
void trimWhitespace(char *str) {
char *end;
// Trim leading space
while(isspace((unsigned char)*str)) str++;
if(*str == 0) return; // All spaces
// Trim trailing space
end = str + strlen(str) - 1;
while(end > str && isspace((unsigned char)*end)) end--;
// Write new null terminator
*(end+1) = '\0';
}
// Function to add a rule to the grammar
void addRule(Grammar *grammar, const char *name, const char *definition) {
if (grammar->rules == NULL) {
grammar->rules = (Rule *)malloc(sizeof(Rule));
strcpy(grammar->rules->name, name);
strcpy(grammar->rules->definition, definition);
grammar->rules->next = NULL;
} else {
Rule *current = grammar->rules;
while (current->next != NULL) {
current = current->next;
}
current->next = (Rule *)malloc(sizeof(Rule));
strcpy(current->next->name, name);
strcpy(current->next->definition, definition);
current->next->next = NULL;
}
}
// Function to parse the EBNF file and construct the grammar
Grammar* parseEBNF(const char *filename) {
FILE *file = fopen(filename, "r");
if (!file) {
perror("Error opening EBNF file");
return NULL;
}
Grammar *grammar = (Grammar *)malloc(sizeof(Grammar));
grammar->rules = NULL;
char line[550];
while (fgets(line, sizeof(line), file)) {
trimWhitespace(line);
if (strlen(line) == 0 || line[0] == '#') continue; // Ignore empty lines or comments
char *delimiter = strstr(line, "::=");
if (!delimiter) {
fprintf(stderr, "Invalid EBNF syntax: %s\n", line);
continue;
}
char name[50];
char definition[500];
strncpy(name, line, delimiter - line);
name[delimiter - line] = '\0';
strcpy(definition, delimiter + 3);
trimWhitespace(name);
trimWhitespace(definition);
addRule(grammar, name, definition);
}
fclose(file);
return grammar;
}
// Function to create a new AST node
ASTNode* createNode(const char* type, int value) {
ASTNode* node = (ASTNode*)malloc(sizeof(ASTNode));
strcpy(node->type, type);
node->value = value;
node->left = NULL;
node->right = NULL;
return node;
}
// Function to parse an expression and build the AST
ASTNode* parseExpression(const char* input) {
// For simplicity, assume the input is a single number or a binary operation
if (strchr(input, '+')) {
ASTNode* node = createNode("Add", 0);
node->left = createNode("Number", atoi(input));
node->right = createNode("Number", atoi(strchr(input, '+') + 1));
return node;
} else if (strchr(input, '-')) {
ASTNode* node = createNode("Subtract", 0);
node->left = createNode("Number", atoi(input));
node->right = createNode("Number", atoi(strchr(input, '-') + 1));
return node;
} else {
return createNode("Number", atoi(input));
}
}
// Function to interpret the AST
int interpret(ASTNode* node) {
if (strcmp(node->type, "Number") == 0) {
return node->value;
} else if (strcmp(node->type, "Add") == 0) {
return interpret(node->left) + interpret(node->right);
} else if (strcmp(node->type, "Subtract") == 0) {
return interpret(node->left) - interpret(node->right);
}
return 0; // Default case
}
// Function to skip whitespace in the input
void skipWhitespace(ParserState *state) {
while (state->input[state->position] && isspace(state->input[state->position])) {
state->position++;
}
}
// Function to get the next token from the input
char* getNextToken(ParserState *state) {
skipWhitespace(state);
if (!state->input[state->position]) return NULL;
char *token = (char*)malloc(MAX_TOKEN_LENGTH);
int i = 0;
while (state->input[state->position] && !isspace(state->input[state->position]) && i < MAX_TOKEN_LENGTH - 1) {
token[i++] = state->input[state->position++];
}
token[i] = '\0';
return token;
}
// Function to free the grammar
void freeGrammar(Grammar *grammar) {
Rule *current = grammar->rules;
while (current) {
Rule *temp = current;
current = current->next;
free(temp);
}
free(grammar);
}
// Function to free the AST
void freeAST(ASTNode* node) {
if (node == NULL) return;
freeAST(node->left);
freeAST(node->right);
free(node);
}
// Main function
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "Usage: %s <EBNF file> <program file>\n", argv[0]);
return 1;
}
const char *ebnfFile = argv[1];
const char *programFile = argv[2];
// Parse the EBNF grammar
Grammar *grammar = parseEBNF(ebnfFile);
if (!grammar) {
return 1;
}
// Read the program file
FILE *progFile = fopen(programFile, "r");
if (!progFile) {
perror("Error opening program file");
freeGrammar(grammar);
return 1;
}
char inputProgram[1000];
fread(inputProgram, 1, sizeof(inputProgram)-1, progFile);
inputProgram[strcspn(inputProgram, "\n")] = 0; // Remove newline
fclose(progFile);
// Parse the expression into an AST
ASTNode* root = parseExpression(inputProgram);
if (!root) {
fprintf(stderr, "Failed to parse the input program.\n");
freeGrammar(grammar);
return 1;
}
// Interpret the AST
int result = interpret(root);
printf("Result: %d\n", result);
// Free allocated memory
freeAST(root);
freeGrammar(grammar);
return 0;
}
The parseEBNF
function reads the EBNF file, ignores comments and empty lines, and extracts rules using the ::=
delimiter. Each rule is added to the Grammar
structure using the addRule
function.
Tokenization is handled by the getNextToken
function, which iterates through the input program, skipping whitespace and extracting tokens based on the defined grammar. This simplistic approach assumes that tokens are separated by spaces.
The AST is constructed by the parseExpression
function. For demonstration purposes, it handles basic arithmetic expressions with addition and subtraction. Each expression results in an AST node that can be recursively interpreted.
The interpret
function traverses the AST recursively. Depending on the node type, it performs the corresponding operation. For example, an "Add" node will interpret its left and right children and return their sum.
Proper memory management is crucial. The freeGrammar
and freeAST
functions ensure that all dynamically allocated memory is freed once it's no longer needed, preventing memory leaks.
The current implementation supports only basic arithmetic expressions. To handle more complex languages, the parser needs to be extended to support additional EBNF constructs like nested expressions, different data types, control structures, and more.
Improving error detection and providing meaningful error messages will make the interpreter more robust and user-friendly. Implementing mechanisms to recover from parsing errors without terminating the program abruptly is essential.
Optimizing the parser and interpreter for performance can be achieved through techniques like memoization in parsing, efficient memory allocation, and possibly introducing Just-In-Time (JIT) compilation for frequently executed code paths.
Developing an ANSI C program that reads an EBNF description and interprets programs written in that language involves a methodical approach encompassing grammar parsing, tokenization, AST construction, and interpretation. The provided implementation serves as a foundational framework that can be extended and optimized to support more complex language features and enhance performance. Proper memory management and error handling are critical to building a robust interpreter.
These references provide additional insights and examples for understanding and extending the EBNF parser and interpreter implemented in ANSI C.