Have I Been Pwned (HIBP) is a widely used service that lets people check whether their email addresses have appeared in known data breaches. HIBP offers an official API for programmatic access to breach data, but developers sometimes need to retrieve this information without it. This guide covers two ways to interact with the HIBP website from Python: web scraping with the requests library and BeautifulSoup, and browser automation with Selenium.
The most straightforward way to query HIBP without the official API is to send HTTP requests to the website and parse the responses for breach information. The requests library handles the HTTP traffic, and BeautifulSoup parses the returned HTML.
Below is a Python script that demonstrates how to check if an email has been pwned by scraping the HIBP website:
import requests
from urllib.parse import quote
from bs4 import BeautifulSoup


def _text(parent, *args, **kwargs):
    """Return the stripped text of the first matching child, or '' if absent."""
    node = parent.find(*args, **kwargs)
    return node.text.strip() if node else ''


def check_email_breach(email):
    """
    Check if the provided email has been pwned using HIBP website scraping.

    Args:
        email (str): The email address to check.

    Returns:
        dict: A dictionary containing breach information or a message indicating no breaches.
    """
    # URL-encode the address so characters such as '/', '?' or '#' cannot alter the path
    url = "https://haveibeenpwned.com/unifiedsearch/" + quote(email, safe="")
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
        )
    }
    try:
        # A timeout keeps the script from hanging on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Check for breach notifications. The tag and class names below depend on
        # the page's current markup and may need adjusting.
        breaches_section = soup.find('div', class_='breaches')
        if breaches_section:
            breach_list = []
            for breach in breaches_section.find_all('div', class_='breach'):
                breach_list.append({
                    'Title': _text(breach, 'h3'),
                    'Domain': _text(breach, 'p', class_='domain'),
                    'BreachDate': _text(breach, 'span', class_='breach-date'),
                })
            return {'Breaches': breach_list}
        else:
            return {'Message': f"The email '{email}' has not been pwned."}
    except requests.exceptions.RequestException as e:
        return {'Error': f"An error occurred: {e}"}


# Example usage
if __name__ == "__main__":
    email_to_check = "example@example.com"
    result = check_email_breach(email_to_check)
    if 'Breaches' in result:
        print(f"Breaches found for {email_to_check}:")
        for breach in result['Breaches']:
            print(f"- {breach['Title']}: {breach['Domain']} ({breach['BreachDate']})")
    elif 'Message' in result:
        print(result['Message'])
    else:
        print(result['Error'])
Selenium automates a real web browser, so a script can interact with a site much as a person would. This makes it useful when a website loads content dynamically with JavaScript and that content is not present in the response to a plain HTTP request.
The following Python script uses Selenium to check whether an email has been pwned:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def check_email_breach_selenium(email):
    """
    Check if the provided email has been pwned using HIBP website automation with Selenium.

    Args:
        email (str): The email address to check.

    Returns:
        str: A message indicating the breach status of the email.
    """
    # Initialize the Chrome WebDriver (Selenium 4.6+ can fetch a matching driver
    # automatically; older versions need ChromeDriver installed and in PATH)
    driver = webdriver.Chrome()
    try:
        driver.get("https://haveibeenpwned.com/")

        # Locate the email input field and enter the email address.
        # The element IDs used here depend on the page's current markup.
        email_input = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "Account"))
        )
        email_input.send_keys(email)

        # Locate and click the search button
        search_button = driver.find_element(By.ID, "searchPwnage")
        search_button.click()

        # Wait for the results to load instead of sleeping for a fixed time.
        # If no breach section appears within 10 seconds, report no breaches.
        try:
            breach_section = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "breaches"))
            )
        except TimeoutException:
            return "Good news: no pwnage found!"

        # Collect one summary line per breach entry
        breach_list = []
        for breach in breach_section.find_elements(By.CLASS_NAME, "breach"):
            title = breach.find_element(By.TAG_NAME, "h3").text.strip()
            domain = breach.find_element(By.CLASS_NAME, "domain").text.strip()
            breach_date = breach.find_element(By.CLASS_NAME, "breach-date").text.strip()
            breach_list.append(f"{title}: {domain} ({breach_date})")
        return "\n".join(breach_list)
    except Exception as e:
        # Includes missing elements inside a breach entry, which now surface
        # as an error instead of being reported as "no pwnage"
        return f"An error occurred: {e}"
    finally:
        driver.quit()


# Example usage
if __name__ == "__main__":
    email_to_check = "example@example.com"
    result = check_email_breach_selenium(email_to_check)
    print(f"Results for {email_to_check}:")
    print(result)
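The choice between HTTP requests with BeautifulSoup and browser automation with Selenium depends on your project's requirements and constraints. Plain HTTP requests are the more straightforward option, while Selenium is the better fit when the site loads its content dynamically with JavaScript.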
Regardless of the method chosen, implementing robust error handling is crucial for creating reliable scripts. Consider the following strategies:
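One common strategy is to retry transient failures with exponential backoff rather than giving up on the first error. The sketch below is illustrative only: `retry_with_backoff` and its parameters are hypothetical names, not part of requests or Selenium, and it assumes the wrapped call is safe to repeat.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits roughly base_delay * 2**attempt seconds between attempts,
    plus a small random jitter so repeated clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

With requests, this might be called as `retry_with_backoff(lambda: requests.get(url, timeout=10), retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout))`, so that only network-level failures are retried and other errors propagate immediately.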
To avoid being blocked or violating usage policies, adhere to the following guidelines:
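One guideline that translates directly into code is spacing requests out. The following is a minimal sketch of a throttle that enforces a minimum interval between calls; the `Throttle` class and its parameters are hypothetical, and a suitable interval depends on the site's usage policy, which this example does not know.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to honor min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request keeps the spacing logic in one place, whichever of the two access methods is in use.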
When handling breach data, it is imperative to maintain data privacy and security:
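For example, scripts can avoid writing raw email addresses to logs or result files. The sketch below shows one possible approach using only the standard library; `mask_email` and `email_fingerprint` are hypothetical helper names, and the secret key is assumed to be stored securely outside the source code.

```python
import hashlib
import hmac


def mask_email(email):
    """Return a display-safe form such as 'e***@example.com'."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"  # not a recognizable address: reveal nothing
    return f"{local[0]}***@{domain}"


def email_fingerprint(email, key):
    """Return a keyed SHA-256 digest usable as a stable, non-reversible lookup key.

    key must be bytes; the address is trimmed and lowercased first so that
    case variants of one address map to the same fingerprint.
    """
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()
```

The masked form is meant for log lines and console output, while the fingerprint lets stored results be matched to an address later without keeping the address itself.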
Some websites employ CAPTCHA challenges to prevent automated access. To handle such scenarios:
To further reduce the risk of being blocked, consider rotating your IP addresses using proxies:
Maintaining logs and monitoring the performance of your scripts is essential for troubleshooting and ensuring smooth operation:
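A minimal sketch with the standard logging module might look like the following. The `get_logger` and `timed_call` helpers and the logger name are hypothetical, chosen only for illustration; a real script would pick its own log destination and format.

```python
import logging
import time


def get_logger(name="hibp_check", level=logging.INFO):
    """Return a logger that writes timestamped lines to the console."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def timed_call(logger, label, func):
    """Run func(), log its duration and outcome, and re-raise any exception."""
    start = time.perf_counter()
    try:
        result = func()
    except Exception:
        logger.exception("%s failed after %.2fs", label, time.perf_counter() - start)
        raise
    logger.info("%s succeeded in %.2fs", label, time.perf_counter() - start)
    return result
```

Wrapping each lookup as `timed_call(logger, "lookup", lambda: check_email_breach(email))` records how long every check takes and captures a traceback whenever one fails.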
Accessing Have I Been Pwned without the official API is feasible through web scraping with requests and BeautifulSoup or through browser automation with Selenium. The first approach is the more straightforward of the two, while the second can handle content that is loaded dynamically with JavaScript.
Regardless of the chosen method, it is imperative to respect ethical considerations, implement robust error handling, and adhere to best practices to ensure reliable and responsible access to breach data.