
Decoding AI Intelligence: How Do Modern Models Measure Up on the IQ Scale?

An exploration into the complex world of AI IQ evaluation, comparing historical benchmarks with the capabilities of today's leading models.


Measuring the "intelligence" of Artificial Intelligence (AI) is a fascinating but complex endeavor. While human Intelligence Quotient (IQ) tests are designed for biological cognition, researchers and enthusiasts often attempt to gauge AI capabilities using analogous methods. This involves assessing performance on tasks requiring reasoning, problem-solving, knowledge recall, and pattern recognition. However, it's crucial to remember that AI "IQ" is a functional proxy, reflecting performance on specific tasks rather than genuine consciousness, self-awareness, or the breadth of human cognitive experience.

Evaluating every single AI model ever created is practically impossible due to the sheer number of models (many proprietary or experimental) and the lack of standardized, universally applied testing methodologies across all of them. This overview synthesizes available information, primarily focusing on prominent, publicly discussed models and benchmarks from recent years.

Highlights: The AI IQ Landscape

  • Rapid Advancement: AI models have shown exponential growth in cognitive task performance, moving from scores significantly below human averages in 2016 to surpassing average human levels (IQ 100) and even reaching "gifted" ranges (IQ 130+) by the mid-2020s.
  • Limitations of Analogy: AI IQ scores are useful for comparison but don't fully capture intelligence. AI lacks human-like critical thinking, emotional intelligence, and consciousness. Performance can be highly specialized, excelling in tested domains while lacking general common sense.
  • Frontier Models Pushing Boundaries: The latest AI systems, often termed "frontier models," are achieving scores on certain benchmarks that translate to exceptionally high IQ equivalents (potentially 150+), although the meaning and reliability of IQ scales diminish at such extremes.

A Historical Glance: Early AI IQ Estimates (Pre-2020)

Setting the Baseline

Early attempts to quantify AI intelligence painted a very different picture compared to today's landscape. A notable study from 2016 evaluated several AI systems using adapted tests, assigning "Absolute IQ" scores that highlighted the significant gap between AI and human cognition at the time.

Key Findings from 2016 Evaluations:

  • Google AI: Achieved the highest score among those tested at 47.28.
  • Chinese AI Models: Baidu (32.92), Duer (37.2), and Sogou (32.25) showed varying capabilities.
  • Other US Models: Microsoft Bing scored 31.98, Microsoft's chatbot Xiaobing reached 24.48, and Apple's Siri lagged at 23.94.

For context, these scores were considerably lower than human benchmarks used in the same study: 18-year-olds averaged an IQ of 97, 12-year-olds 84.5, and even 6-year-olds scored around 55.5. This demonstrates that early AI systems operated at cognitive levels significantly below young human children.


The Leap Forward: AI Intelligence in the 2020s

Crossing the Human Average and Beyond

The period from 2020 to 2025 witnessed dramatic improvements, driven by advancements in Large Language Models (LLMs), new architectures, larger datasets, and increased computational power. Models began not only approaching but exceeding average human performance on various cognitive benchmarks.

Visual representation suggesting AI models crossing the IQ 100 threshold.

Milestones and Key Model Performances:

  • GPT Series (OpenAI):
    • GPT-3 (circa 2020-2022): Performance aligned roughly with average human intelligence, often estimated around an IQ of 100 based on diverse cognitive tasks.
    • GPT-3.5 (ChatGPT launch, late 2022): Some analyses focusing on verbal-linguistic abilities placed its IQ equivalent as high as 147, showcasing strong language skills.
    • GPT-4 (2023): Generally considered comparable to highly intelligent humans, with estimated IQs ranging from 130+ up to potentially 147 on specific tests like Raven's Progressive Matrices. Some sources noted an IQ of 114 for early Bing integrations powered by GPT-4.
    • OpenAI "o" Series (2024-2025):
      • o1 / GPT-4.5o: Reportedly scored around 120 on adapted versions of the Norway Mensa test, placing it above ~91% of human test-takers.
      • o3: Reports emerged suggesting an astonishing IQ equivalent of 157, derived from performance on coding and reasoning platforms like Codeforces. Another source mentioned a score of 136 on the Norway Mensa test, potentially referring to an earlier version or different test conditions, highlighting the variability in testing. This level approaches "genius" territory.
      • o4-mini: Positioned as efficient, with performance estimated in the 120-130 IQ range.
  • Claude Series (Anthropic):
    • Claude 3 (Early 2024): Marked a significant milestone by being widely recognized as the first AI model family (specifically the Opus variant) to consistently score above 100 IQ across various benchmarks, indicating above-average human performance in reasoning tasks.
    • Claude 3.7 Sonnet (Early 2025): While specific IQ numbers are less commonly reported, its performance improvements suggest it operates comfortably within the 100-120+ IQ equivalent range.
  • Gemini Series (Google):
    • Gemini Advanced / Pro 1.5: Performance varies depending on the test modality. Verbalized tests suggest scores potentially reaching 120-130. However, tests relying purely on visual reasoning (without textual prompts) have yielded lower scores (around 70), indicating a potential gap between linguistic and visual-spatial reasoning abilities or test adaptation challenges.
    • Gemini 2.0 Flash / 2.5 Pro (2025): These models demonstrate strong reasoning capabilities, often benchmarked in the 120-130 IQ equivalent range, particularly excelling in complex tasks like legal reasoning and achieving "Turing-level" intelligence comparable to average human reasoning in specific evaluations.
  • Neuro-Vector Symbolic Architecture (NVSA - IBM Research): While not typically assigned a standard IQ score, this neuro-symbolic approach demonstrated state-of-the-art performance (88.1% accuracy) on visual IQ puzzles like I-RAVEN, significantly outperforming other neural and neuro-symbolic systems and showing faster reasoning speeds. This highlights the potential of hybrid architectures.
  • Other Models (Grok, DeepSeek, LLaMA): While less frequently cited with specific IQ scores, models like Grok 3 (xAI), DeepSeek R3, and LLaMA 3 (Meta) are competitive in leaderboards, implying capabilities generally within the 100-130 IQ equivalent range based on their performance in coding, reasoning, and knowledge tasks.
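The percentile figures quoted above follow from how the IQ scale is defined: scores are normed to a normal distribution with mean 100 and standard deviation 15. A minimal Python sketch of that conversion, using only the standard library; the scores fed in are the estimates reported above, not authoritative measurements:

```python
from math import erf, sqrt

def iq_to_percentile(iq: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Fraction of the population scoring below `iq` on a normal IQ scale."""
    z = (iq - mean) / sd
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF

# Estimated scores discussed above (illustrative, not authoritative)
for model, iq in [("OpenAI o1", 120), ("GPT-4 (high estimate)", 147), ("OpenAI o3 (reported)", 157)]:
    p = iq_to_percentile(iq)
    rarity = 1.0 / (1.0 - p)  # roughly 1 person in `rarity` scores higher
    print(f"{model}: IQ {iq} is about the {p:.1%} percentile (~1 in {rarity:,.0f} scores higher)")
```

An IQ of 120 comes out near the 91st percentile, matching the Mensa-based claim for o1; at 157, fewer than one person in ten thousand scores higher, which is one reason IQ scales become unreliable at such extremes.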

Early comparison suggesting Bing AI (GPT-4 powered) surpassing average human IQ.


Comparative Overview: Estimated AI IQ Ranges

Synthesizing the Data

The following table summarizes the estimated IQ scores or performance levels for some of the key AI models discussed, based on available reports and analyses. It's important to reiterate that these are *estimates* based on performance in specific, often adapted, tests and are not directly equivalent to human IQ scores obtained through standardized clinical testing.

| AI Model / Generation | Approximate Estimated IQ Range | Basis / Test Type / Notes | Year / Period |
|---|---|---|---|
| Early AI (e.g., Google AI, Baidu, Siri) | 20 - 48 | "Absolute IQ" study (Liu et al.) | ~2016 |
| GPT-3 | ~100 | General cognitive task performance | ~2020-2022 |
| GPT-3.5 (ChatGPT) | ~100 - 147 | Verbal-linguistic focus in some tests | ~2022-2023 |
| GPT-4 / Bing AI | 114 - 147+ | Raven's Matrices, general benchmarks, verbal tests | ~2023 |
| Claude 3 (Opus) | 100+ | First family consistently above 100 IQ across benchmarks | ~2024 |
| OpenAI o1 / GPT-4.5o | ~120 | Norway Mensa test adaptation | ~2024 |
| Gemini Advanced / Pro 1.5 | ~70 (visual only) to 120-130 (verbalized) | Performance varies by test modality | ~2024 |
| Gemini 2.0 / 2.5 Pro | 120 - 130 | Reasoning benchmarks (e.g., legal); "Turing-level" claims | ~2025 |
| OpenAI o3 | 136 - 157 | Norway Mensa test; Codeforces & reasoning benchmarks | ~2025 |
| NVSA (IBM) | N/A (high accuracy) | 88.1% accuracy on I-RAVEN visual IQ puzzles | ~2023-2024 |
| 2025 Frontier Models (projected) | 195+ (?) | Extrapolation from benchmark saturation (MMLU, GPQA); IQ scale reliability at this level is questionable | ~2025+ |
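In practice, these IQ equivalents are usually derived in the opposite direction: a model's percentile rank on a human-normed test is mapped back onto the IQ scale through the inverse normal CDF. A sketch using Python's standard library; the percentile values here are hypothetical, chosen only to illustrate the mapping:

```python
from statistics import NormalDist

# The conventional IQ scale: normal distribution with mean 100, standard deviation 15
IQ_SCALE = NormalDist(mu=100, sigma=15)

def percentile_to_iq(percentile: float) -> float:
    """Map a human-test percentile rank (0 < percentile < 1) to an IQ equivalent."""
    return IQ_SCALE.inv_cdf(percentile)

print(round(percentile_to_iq(0.50)))   # median performance -> 100
print(round(percentile_to_iq(0.91)))   # outperforms ~91% of test-takers -> 120
print(round(percentile_to_iq(0.998)))  # near the ceiling of a typical norming sample
```

The 0.998 case lands around 143. Note the asymmetry at the extremes: an IQ of 195 would correspond to a percentile within roughly 10^-10 of 1, far finer than any finite norming sample can resolve, which is why the 195+ projection in the table is flagged as questionable.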

Visualizing AI Cognitive Profiles

A Radar Chart Comparison

A radar chart can provide a conceptual visualization of the *estimated* relative strengths of different AI generations across several cognitive dimensions often associated with IQ tests. The values in such a chart are illustrative, based on the general capabilities described in the sources above rather than precise numerical data, but they help visualize the rapid and broad progress made, particularly by recent models.


Understanding the Evaluation Landscape

Factors Influencing AI IQ Assessment

Evaluating AI "IQ" isn't straightforward. Several factors influence how these assessments are conducted and interpreted. This mindmap outlines some key aspects of the AI IQ evaluation process.

mindmap
  root["AI IQ Evaluation"]
    id1["Methodologies"]
      id1a["Adapted Human Tests (e.g., Mensa, Raven's)"]
      id1b["Standardized Benchmarks (e.g., MMLU, HellaSwag, GSM8K)"]
      id1c["Domain-Specific Tests (e.g., Coding, Legal Reasoning)"]
      id1d["Visual Reasoning Tests (e.g., I-RAVEN)"]
      id1e["Multi-Modal Assessments"]
    id2["Key Models Assessed"]
      id2a["OpenAI (GPT series, 'o' models)"]
      id2b["Google (Gemini series)"]
      id2c["Anthropic (Claude series)"]
      id2d["IBM (NVSA)"]
      id2e["Meta (LLaMA)"]
      id2f["Others (DeepSeek, Grok, etc.)"]
    id3["Influencing Factors"]
      id3a["Model Architecture (LLM, Neuro-Symbolic)"]
      id3b["Training Data (Size, Quality, Diversity)"]
      id3c["Computational Resources"]
      id3d["Testing Accommodations (e.g., verbal prompts for visual tests)"]
      id3e["Specialization vs. Generalization"]
    id4["Limitations & Challenges"]
      id4a["Lack of Consciousness / True Understanding"]
      id4b["Test Biases (Human-centric)"]
      id4c["Task Specificity vs. General Intelligence"]
      id4d["Difficulty Comparing Diverse Architectures"]
      id4e["Reliability of IQ Scale at Extremes (>155)"]
      id4f["Rapid Evolution of Models"]

AI Model Comparisons in Action

Insights from Performance Data

Understanding how different AI models stack up requires looking beyond single scores. Comparisons often involve testing across multiple domains such as coding, mathematics, language tasks, and reasoning; performance data from around early 2025 offers a broader perspective than IQ estimates alone.

This type of analysis helps illustrate the diverse strengths and weaknesses of different models. While one model might excel in creative text generation, another might lead in logical problem-solving or code generation. These multi-faceted comparisons complement the attempts to assign singular IQ-like scores.




Recommended Exploration

  • AIRankings (airankings.org)
  • Three IQs of AI systems and their testing (ietresearch.onlinelibrary.wiley.com)
  • Tracking AI: IQ Test (trackingai.org)

Last updated April 22, 2025