Computing and Visualizing the Average Number of White Stroke Pixels in Handwritten Digit Images

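Overview

This guide shows how to use Python to process a large set of grayscale handwritten-digit images, compute for each digit the average number of near-white stroke pixels, and visualize the result as a bar chart. The steps are image loading and preprocessing, pixel counting, per-digit aggregation, and plotting.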

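Required Tools and Libraries

The task uses the following Python libraries:

OpenCV: reads and decodes the images.
NumPy: provides fast array computation.
Matplotlib: draws the chart.
os: handles file and directory operations (part of the standard library, so no installation is needed).

Make sure the third-party libraries are installed. If they are not, install them with pip: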


pip install opencv-python numpy matplotlib

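Step-by-Step Walkthrough

1. Loading and Preprocessing the Images

First, load every .jpg image in a given folder as grayscale. OpenCV returns each image as a NumPy array, which is the form the later steps operate on.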


import os
import cv2
import numpy as np
import matplotlib.pyplot as plt

def load_images(folder_path):
    """
    Load every .jpg image in a folder as a grayscale image.

    Args:
        folder_path (str): Path to the image folder.

    Returns:
        tuple: (images, filenames), where images is a list of grayscale
            NumPy arrays and filenames is the list of matching file names.
    """
    images = []
    filenames = []
    for filename in os.listdir(folder_path):
        if filename.lower().endswith('.jpg'):
            img_path = os.path.join(folder_path, filename)
            # cv2.imread returns None instead of raising when a file cannot be decoded.
            img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            if img is not None:
                images.append(img)
                filenames.append(filename)
            else:
                print(f"Warning: could not read image {img_path}")
    return images, filenames

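2. Extracting the Digit Label

The digit label (0-9) is taken from the last digit character in the file name. This assumes a naming convention in which the label comes last, for example "image_00001_5.jpg" for the digit 5.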


def extract_digit(filename):
    """
    Extract the last digit character in the file name as the label.

    Args:
        filename (str): Name of the image file.

    Returns:
        int: The extracted digit label, or None if the name contains no digit.
    """
    digits = [char for char in filename if char.isdigit()]
    if digits:
        return int(digits[-1])
    else:
        return None
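The character scan above returns the last digit found anywhere in the name, so a file with no label, such as "image_00001.jpg", would silently be labeled 1. A stricter variant is sketched below. It assumes the label is always a single digit preceded by an underscore and placed right before the extension; the name extract_digit_strict and the pattern are illustrative and not part of the original script.

```python
import os
import re

# Assumed naming scheme: "<anything>_<single digit>.<ext>", e.g. "image_00001_5.jpg".
_LABEL_PATTERN = re.compile(r"_(\d)$")

def extract_digit_strict(filename):
    """Return the digit label at the end of the file stem, or None if absent."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    match = _LABEL_PATTERN.search(stem)
    return int(match.group(1)) if match else None

print(extract_digit_strict("image_00001_5.jpg"))  # 5
print(extract_digit_strict("image_00001.jpg"))    # None: "_00001" is not a one-digit label
```

Whether the stricter rule is appropriate depends on how the files were actually named; if the convention differs, the pattern needs to change with it.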

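3. Counting White Stroke Pixels

For each image, count the pixels whose grayscale value is greater than 200. These pixels are treated as the white stroke, which assumes light strokes on a dark background. The threshold is a parameter and can be adjusted.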


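def count_white_pixels(image, threshold=200):
    """
    Count the pixels whose grayscale value is greater than the threshold.

    Args:
        image (numpy.ndarray): Grayscale image.
        threshold (int): Pixel threshold. Defaults to 200.

    Returns:
        int: Number of pixels strictly greater than the threshold.
    """
    # image > threshold is a boolean array; count its True entries.
    white_pixels = int(np.count_nonzero(image > threshold))
    return white_pixels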
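To make the comparison concrete, the following small example applies the same counting rule to a hand-written 3x3 array. The array values are made up for illustration; the point is that the comparison is strict, so a pixel equal to the threshold is not counted.

```python
import numpy as np

def count_white_pixels(image, threshold=200):
    """Count pixels whose grayscale value is strictly greater than threshold."""
    return int(np.count_nonzero(image > threshold))

# A hypothetical 3x3 "image": dark background with a bright vertical stroke.
sample = np.array(
    [[0, 255, 10],
     [30, 201, 20],
     [5, 200, 0]],
    dtype=np.uint8,
)

mask = sample > 200                               # True where the pixel counts as white
print(int(mask.sum()))                            # 2: only 255 and 201 exceed 200
print(count_white_pixels(sample))                 # 2
print(count_white_pixels(sample, threshold=199))  # 3: the pixel at 200 now counts too
```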

4. Grouping and Averaging

Group the images by digit label and compute the average white-pixel count within each group.


def compute_average_white_pixels(images, filenames):
    """
    Compute the average white-pixel count for each digit label.

    Args:
        images (list): Grayscale images as NumPy arrays.
        filenames (list): File names matching the images, in the same order.

    Returns:
        dict: Maps each digit (0-9) to its average white-pixel count.
    """
    pixel_counts = {i: [] for i in range(10)}

    for img, fname in zip(images, filenames):
        digit = extract_digit(fname)
        if digit is not None and 0 <= digit <= 9:
            count = count_white_pixels(img)
            pixel_counts[digit].append(count)
        else:
            print(f"Warning: no valid digit label could be extracted from file {fname}")

    average_counts = {}
    for digit, counts in pixel_counts.items():
        if counts:
            average = np.mean(counts)
            average_counts[digit] = average
        else:
            # No images for this digit: report 0 and warn.
            average_counts[digit] = 0
            print(f"Warning: no pixel data for digit {digit}.")

    return average_counts
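Reporting 0 for a digit with no images can be misleading, because 0 looks like a measured value. One possible alternative, sketched here with standard-library tools only, is to mark missing digits with NaN and to keep the sample size next to each mean. The function name average_by_label and the input format (pairs of label and count) are assumptions made for this illustration.

```python
import math
from collections import defaultdict
from statistics import fmean

def average_by_label(labeled_counts):
    """Group (label, count) pairs by label and return {digit: (mean, n_samples)}."""
    groups = defaultdict(list)
    for label, count in labeled_counts:
        groups[label].append(count)

    summary = {}
    for digit in range(10):
        counts = groups.get(digit, [])
        mean = fmean(counts) if counts else math.nan  # NaN marks "no data"
        summary[digit] = (mean, len(counts))
    return summary

# Hypothetical per-image results: (digit label, white-pixel count).
pairs = [(1, 60), (1, 80), (8, 150), (8, 170), (8, 160)]
summary = average_by_label(pairs)
print(summary[1])  # (70.0, 2)
print(summary[8])  # (160.0, 3)
print(summary[0])  # (nan, 0)
```

Keeping the sample size makes it easier to judge how much weight an average deserves when some digits have far fewer images than others.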

5. Visualizing the Results

Use Matplotlib to draw a bar chart with one bar per digit, showing the average white-pixel count for each.


def plot_histogram(average_counts):
    """
    Draw a bar chart of the average white-pixel count for each digit.

    Args:
        average_counts (dict): Maps each digit to its average pixel count.
    """
    digits = list(average_counts.keys())
    averages = list(average_counts.values())

    plt.figure(figsize=(10, 6))
    bars = plt.bar(digits, averages, color='#388278', edgecolor='black')

    plt.xlabel('Digit', fontsize=14)
    plt.ylabel('Average white-pixel count', fontsize=14)
    plt.title('Average White Stroke Pixels per Handwritten Digit', fontsize=16)
    plt.xticks(digits, fontsize=12)
    plt.yticks(fontsize=12)

    # Add a value label above each bar.
    for bar in bars:
        height = bar.get_height()
        plt.annotate(f'{height:.2f}',
                     xy=(bar.get_x() + bar.get_width() / 2, height),
                     xytext=(0, 3),  # 3 points vertical offset
                     textcoords="offset points",
                     ha='center', va='bottom', fontsize=10)

    plt.tight_layout()
    plt.show()

Complete Python Script

The following script combines the steps above into a single runnable file:


import os
import cv2
import numpy as np
import matplotlib.pyplot as plt

def load_images(folder_path):
    images = []
    filenames = []
    for filename in os.listdir(folder_path):
        if filename.lower().endswith('.jpg'):
            img_path = os.path.join(folder_path, filename)
            img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            if img is not None:
                images.append(img)
                filenames.append(filename)
            else:
                print(f"Warning: could not read image {img_path}")
    return images, filenames

def extract_digit(filename):
    digits = [char for char in filename if char.isdigit()]
    if digits:
        return int(digits[-1])
    else:
        return None

def count_white_pixels(image, threshold=200):
    return int(np.count_nonzero(image > threshold))

def compute_average_white_pixels(images, filenames):
    pixel_counts = {i: [] for i in range(10)}
    
    for img, fname in zip(images, filenames):
        digit = extract_digit(fname)
        if digit is not None and 0 <= digit <= 9:
            count = count_white_pixels(img)
            pixel_counts[digit].append(count)
        else:
            print(f"Warning: no valid digit label could be extracted from file {fname}")
    
    average_counts = {}
    for digit, counts in pixel_counts.items():
        if counts:
            average = np.mean(counts)
            average_counts[digit] = average
        else:
            average_counts[digit] = 0
            print(f"Warning: no pixel data for digit {digit}.")
    
    return average_counts

def plot_histogram(average_counts):
    digits = list(average_counts.keys())
    averages = list(average_counts.values())
    
    plt.figure(figsize=(10, 6))
    bars = plt.bar(digits, averages, color='#388278', edgecolor='black')
    
    plt.xlabel('Digit', fontsize=14)
    plt.ylabel('Average white-pixel count', fontsize=14)
    plt.title('Average White Stroke Pixels per Handwritten Digit', fontsize=16)
    plt.xticks(digits, fontsize=12)
    plt.yticks(fontsize=12)
    
    for bar in bars:
        height = bar.get_height()
        plt.annotate(f'{height:.2f}',
                     xy=(bar.get_x() + bar.get_width() / 2, height),
                     xytext=(0, 3),
                     textcoords="offset points",
                     ha='center', va='bottom', fontsize=10)
    
    plt.tight_layout()
    plt.show()

def main():
    folder_path = 'path/to/your/images'  # replace with the actual image folder path
    images, filenames = load_images(folder_path)
    if not images:
        print("Error: no images were loaded. Check the folder path and the image format.")
        return

    average_counts = compute_average_white_pixels(images, filenames)
    print("Average white-pixel count per digit:")
    for digit, avg in average_counts.items():
        print(f"Digit {digit}: {avg:.2f}")
    
    plot_histogram(average_counts)

if __name__ == "__main__":
    main()

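Detailed Explanation

1. Data Loading and Preprocessing

The function load_images walks the given folder and reads every .jpg file as a grayscale image. Each image is a NumPy array, which makes the later numeric steps straightforward. The function also collects the names of the files it read successfully, so that digit labels can be extracted from them.

2. Digit Label Extraction

The function extract_digit returns the last digit character in the file name as the image's label, or None if the name contains no digit. It does not print anything itself; the warning for a missing or invalid label is issued by compute_average_white_pixels.

3. White Stroke Pixel Counting

The function count_white_pixels counts the pixels whose grayscale value is greater than 200, which are taken to be the white handwritten stroke. It relies on NumPy's vectorized comparison, which is much faster than looping over pixels in Python.

4. Averaging

The function compute_average_white_pixels groups the images by digit label, collects the per-image white-pixel counts for each group, and averages them. If a digit has no images, its average is set to 0 and a warning is printed.

5. Data Visualization

The function plot_histogram uses Matplotlib to draw a bar chart of the average white-pixel count per digit. For readability, each bar is annotated with its value.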

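Notes

Make sure the image folder path is correct and that all images are 28x28 grayscale .jpg files.
The last digit character in each file name must be the digit label. For example, "digit_0001_3.jpg" is the digit 3.
Install all required Python libraries before running the script.
The dataset is large (20,000+ images), so the run can take a while. Memory is a modest concern at this image size: a 28x28 8-bit image is 784 bytes as a NumPy array, so 20,000 of them occupy roughly 16 MB.
If processing takes too long, consider processing the images in batches or in parallel.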
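One way to process the images in a streaming fashion, rather than loading them all into a list first, is to keep only a running sum and a sample count per digit. The sketch below shows just the accumulation step. The function name streaming_averages and the stand-in generator are hypothetical; in a real run the generator would read each file, count its white pixels, and yield the label together with the count.

```python
def streaming_averages(labeled_counts):
    """Consume (label, count) pairs one at a time and return {label: mean}.

    Only a running sum and a sample count are kept per label, so the
    images themselves never need to be held in memory together.
    """
    totals = {}   # label -> sum of white-pixel counts seen so far
    samples = {}  # label -> number of images seen so far
    for label, count in labeled_counts:
        totals[label] = totals.get(label, 0) + count
        samples[label] = samples.get(label, 0) + 1
    return {label: totals[label] / samples[label] for label in sorted(totals)}

def fake_results():
    """Stand-in generator; a real one would read and threshold one file per step."""
    yield 3, 90
    yield 7, 50
    yield 3, 110

print(streaming_averages(fake_results()))  # {3: 100.0, 7: 50.0}
```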

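Optimizations and Extensions

Depending on your needs, the following directions may be worth exploring:

1. Parallel Processing

For a large number of images, multithreading or multiprocessing can use several CPU cores to speed up the work, for example with Python's multiprocessing library.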
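As a minimal sketch of the multithreading option, the example below uses the standard library's ThreadPoolExecutor. The function count_many and its load_fn parameter are assumptions introduced here so that the example runs without image files. With real data, load_fn could be something like lambda p: cv2.imread(p, cv2.IMREAD_GRAYSCALE).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def count_white_pixels(image, threshold=200):
    return int(np.count_nonzero(image > threshold))

def count_many(paths, load_fn, max_workers=4):
    """Load and count each path in a worker thread; results keep input order.

    load_fn(path) must return a grayscale array, or None on failure.
    A failed load produces None in the result list.
    """
    def work(path):
        image = load_fn(path)
        return None if image is None else count_white_pixels(image)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, paths))

# Stand-in loader: the "path" is simply the number of white pixels
# to place in an otherwise black 28x28 image.
def fake_loader(path):
    if path < 0:
        return None  # simulate an unreadable file
    image = np.zeros(28 * 28, dtype=np.uint8)
    image[:path] = 255
    return image.reshape(28, 28)

print(count_many([10, 0, -1, 784], fake_loader))  # [10, 0, None, 784]
```

Threads tend to help most when the time is dominated by file reading. For work that is dominated by computation, a process pool is the usual alternative. Which one is faster for a given dataset is something to measure rather than assume.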

2. Data Storage

If the same analysis will be run repeatedly, consider saving intermediate results to a file or database to avoid recomputing them.
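A simple form of this is to write the per-digit averages to a JSON file and read them back on later runs. The sketch below uses only the standard library. The helper names save_averages and load_averages and the file name are illustrative choices, not something the original script defines.

```python
import json
import os
import tempfile

def save_averages(average_counts, path):
    """Write {digit: mean} to a JSON file; JSON object keys are always strings."""
    serializable = {str(d): float(v) for d, v in average_counts.items()}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(serializable, fh, indent=2)

def load_averages(path):
    """Read a file written by save_averages, or return None if it does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return {int(d): v for d, v in json.load(fh).items()}

# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "average_counts.json")
    print(load_averages(cache_path))   # None: nothing cached yet
    save_averages({0: 131.5, 1: 62.25}, cache_path)
    restored = load_averages(cache_path)
    print(restored)                    # {0: 131.5, 1: 62.25}
```

The float() conversion is there because NumPy scalar types such as those returned by np.mean are not guaranteed to be JSON-serializable as they are.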

3. Richer Visualization

Besides the bar chart, other plots such as heat maps or box plots can show the distribution of the data more fully.
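As an illustration of the box-plot idea, the sketch below draws one box per digit from the individual per-image counts instead of their averages. The function name plot_count_boxplot and the sample numbers are made up for this example. It assumes the counts have been kept per digit, as in the pixel_counts dictionary built inside compute_average_white_pixels.

```python
import matplotlib.pyplot as plt

def plot_count_boxplot(counts_by_digit):
    """Draw one box per digit from {digit: [white-pixel counts]}; return the Figure."""
    # Skip digits without data, since a box needs at least one value.
    digits = sorted(d for d, counts in counts_by_digit.items() if counts)
    if not digits:
        raise ValueError("no counts to plot")
    data = [counts_by_digit[d] for d in digits]
    positions = list(range(len(digits)))

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.boxplot(data, positions=positions)
    ax.set_xticks(positions)
    ax.set_xticklabels([str(d) for d in digits])
    ax.set_xlabel("Digit")
    ax.set_ylabel("White-pixel count per image")
    ax.set_title("Distribution of white-pixel counts by digit")
    fig.tight_layout()
    return fig

# Hypothetical counts for three digits.
fig = plot_count_boxplot({
    1: [55, 60, 72, 64],
    8: [140, 155, 170, 162],
    0: [120, 131, 128],
})
```

Call plt.show() to display the figure interactively, or fig.savefig with a file name to write it to disk.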

4. Error Handling and Logging

In a production environment, more thorough error handling and a logging system make problems easier to trace and debug.
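A first step in that direction is to replace the print-based warnings with Python's standard logging module, which adds timestamps and severity levels and can later be redirected to a file. The following is a minimal sketch; the logger name and the helper label_or_none are illustrative.

```python
import logging

logger = logging.getLogger("digit_stats")  # the logger name is arbitrary

def configure_logging(level=logging.INFO):
    """Send timestamped messages to stderr; call once at program start."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )

def label_or_none(filename):
    """Return the last digit in filename as an int, logging a warning if absent."""
    digits = [ch for ch in filename if ch.isdigit()]
    if not digits:
        logger.warning("No digit label found in file name %r; skipping", filename)
        return None
    return int(digits[-1])

configure_logging()
print(label_or_none("image_00001_5.jpg"))  # 5
print(label_or_none("readme.jpg"))         # None, plus a WARNING line on stderr
```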

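Conclusion

With the steps above, you can count and analyze white stroke pixels across a large set of handwritten digit images and compare the per-digit averages in a bar chart. The result helps in understanding the data distribution and can serve as a reference for later machine learning or image processing work.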
