As of November 2024, state-of-the-art large language models (LLMs) demonstrate varying levels of proficiency across different programming languages. While they have achieved remarkable success in many areas, certain languages continue to pose significant challenges. These challenges stem from a combination of factors, including the unique characteristics of the languages themselves, the availability of training data, and the inherent limitations of current LLM architectures. This analysis delves into the specific languages where LLMs perform worst, providing detailed insights into the reasons behind these struggles.
Functional programming languages, characterized by their emphasis on immutability, higher-order functions, and declarative style, present a considerable hurdle for LLMs. These languages often require a different mode of reasoning compared to imperative languages, which are more prevalent in LLM training data.
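The contrast between the two modes of reasoning can be sketched in a few lines of Python (used here only as a neutral host language; the function names are illustrative):

```python
from functools import reduce

# Imperative style: a mutable accumulator updated step by step.
def sum_of_squares_imperative(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

# Functional style: no mutation; the result is composed from
# higher-order functions (map, reduce) over the unchanged input.
def sum_of_squares_functional(xs):
    return reduce(lambda acc, x: acc + x, map(lambda x: x * x, xs), 0)

print(sum_of_squares_imperative([1, 2, 3]))  # 14
print(sum_of_squares_functional([1, 2, 3]))  # 14
```

Both compute the same value, but the functional version expresses *what* is computed rather than *how* state evolves, and it is this declarative framing that is comparatively rare in LLM training corpora.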
Haskell, a purely functional language with a strong static type system, is particularly challenging for LLMs. Advanced features such as lazy evaluation, type classes, and monadic abstractions, combined with its terse syntax, contribute to its difficulty.
Racket, another functional language, shares similar challenges with Haskell, particularly in its use of higher-order functions and abstract functional constructs. The terse and symbolic syntax also makes it difficult for LLMs to predict the next token or understand the intent behind the code.
Logic programming languages, which focus on declarative programming and logical inference, also pose significant difficulties for LLMs. These languages require a different approach to problem-solving, which is not well-represented in the training data of many LLMs.
Prolog, a prominent logic programming language, presents several challenges: its declarative semantics, unification, and backtracking-based execution differ sharply from the step-by-step control flow that dominates LLM training corpora.
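The gap between Prolog's style and imperative code can be illustrated with a toy fact-and-rule query, sketched here in Python (the facts and names are made up for illustration; real Prolog engines use unification rather than an explicit search loop):

```python
# Facts: parent(X, Y) pairs, as they would appear in a Prolog database.
PARENT = {("tom", "bob"), ("bob", "ann"), ("bob", "pat")}

# Rule, roughly: ancestor(X, Y) :- parent(X, Y).
#                ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
# Prolog derives answers declaratively by backtracking; here the
# same search is spelled out as an explicit recursive generator.
def ancestors_of(person):
    for (x, child) in PARENT:
        if child == person:
            yield x
            yield from ancestors_of(x)

print(sorted(set(ancestors_of("ann"))))  # ['bob', 'tom']
```

In Prolog the two rule clauses *are* the program; there is no loop to write. An LLM trained mostly on imperative code tends to reach for the explicit-loop formulation and mishandles the relational one.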
Low-level languages, which require a deep understanding of hardware operations and memory management, also present significant challenges for LLMs. These languages often require precise syntax and semantics, leaving little room for error.
Assembly languages, which are human-readable representations of processor instructions, are particularly difficult for LLMs: instruction sets differ across architectures, and correct programs hinge on register usage, calling conventions, and exact memory addressing.
While LLMs perform reasonably well on basic C and C++ tasks, they struggle with advanced features such as template metaprogramming, pointer arithmetic, and code whose correctness depends on undefined or implementation-defined behavior.
Languages with strict typing and advanced type systems, which require a deep understanding of memory safety and concurrency, also pose challenges for LLMs.
Rust, a systems programming language focused on safety and concurrency, presents several challenges: its ownership and borrowing model, explicit lifetimes, and strict compile-time checks require reasoning about how values move through a program, not merely about what the syntax looks like.
Scala's advanced type system, which includes features like type variance, implicits, and higher-kinded types, is difficult for LLMs to reason about. These features demand a high level of abstraction and understanding of type theory. Scala-related queries scored 25.3% in one study.
Languages with ambiguous or dynamic features, which can lead to uncertainty in code generation, also present challenges for LLMs.
Perl's highly flexible syntax and "There's more than one way to do it" philosophy lead to ambiguity in code generation. LLMs often produce syntactically valid but semantically incorrect Perl code. LLMs achieved an average accuracy of 35.6% on Perl tasks in one study.
PHP's dynamic typing and inconsistent function naming conventions pose challenges for LLMs, especially when dealing with legacy codebases. LLMs achieved an average accuracy of 42.3% on PHP tasks in one study.
Languages with specialized domains, which require domain-specific knowledge, also pose challenges for LLMs.
MATLAB's focus on numerical computing and matrix operations requires domain-specific knowledge, which is often lacking in LLMs. MATLAB-related tasks scored 31.4% in one study.
R's statistical and data analysis features, combined with its unique syntax, make it difficult for LLMs to generate efficient and correct code. R-related tasks scored 29.8% in one study. R also has two distinct coding styles, Tidyverse and base R, which significantly affect LLM performance: models trained on either style individually perform better than those trained on a combined dataset. R codebases often rely heavily on project-specific contexts and specialized libraries, which complicates cross-project training and further reduces model performance. Even large LLMs show significantly lower Pass@K scores for R than for Python, indicating a substantial performance gap.
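For readers unfamiliar with the Pass@K metric mentioned above: it estimates the probability that at least one of k sampled generations passes the tests, given n generations of which c are correct. A minimal implementation of the standard unbiased estimator (as popularized in the code-generation evaluation literature) looks like this:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    n: total generations sampled, c: how many passed the tests,
    k: samples drawn per problem.  pass@k = 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 generations and 3 correct, pass@1 is the plain success rate:
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```

Lower Pass@K for R than for Python at the same n and k means fewer of the sampled generations survive the tests, which is exactly the gap the paragraph describes.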
COBOL's verbose and English-like syntax can lead to ambiguity in token prediction. COBOL is predominantly used in financial and legacy systems, which are underrepresented in public datasets. LLMs scored ~50% on COBOL code understanding tasks in one study.
Esoteric languages are intentionally designed to be difficult to read and write. These languages lack conventional programming paradigms, making them nearly impossible for LLMs to interpret. LLMs like GPT-4 achieved <20% accuracy on esoteric language tasks in one study.
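To see why esoteric languages offer token prediction so little to work with, consider Brainfuck, one of the best-known examples: the entire language is eight single-character commands operating on a byte tape. A minimal interpreter sketch in Python (the function name is ours) makes the semantics concrete:

```python
def run_brainfuck(code, input_bytes=b""):
    """Interpret the eight Brainfuck commands > < + - . , [ ] ;
    every other character is a comment."""
    tape = [0] * 30000          # the byte tape, zero-initialized
    ptr = pc = inp = 0
    out = bytearray()
    # Pre-match brackets so [ and ] can jump in O(1).
    jump, stack = {}, []
    for i, ch in enumerate(code):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i
    while pc < len(code):
        ch = code[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(tape[ptr])
        elif ch == ",":
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif ch == "[" and tape[ptr] == 0:
            pc = jump[pc]       # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jump[pc]       # repeat the loop body
        pc += 1
    return bytes(out)

# 8 * 8 = 64, plus one more, is 65: the ASCII code for 'A'.
print(run_brainfuck("++++++++[>++++++++<-]>+."))  # b'A'
```

Because every program is an undifferentiated stream of the same eight symbols, there are almost no lexical cues (keywords, identifiers, structure) for a next-token predictor to exploit, which helps explain the very low accuracy reported above.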
Several underlying factors recur across these languages: scarcity of training data, paradigms that diverge from the imperative mainstream, strict syntactic and semantic requirements that leave little room for error, and the need for domain-specific knowledge.
In summary, as of November 2024, state-of-the-art LLMs perform worst on languages that are underrepresented in training data, have complex or esoteric features, or require deep domain-specific knowledge. Languages like Haskell, Prolog, Assembly, Rust, and R consistently challenge LLMs due to their unique paradigms, strict requirements, or specialized use cases. Addressing these weaknesses will require more diverse training data, improved reasoning capabilities, and better handling of domain-specific contexts.