Exploring Web Scraping & RDF Querying
In today’s data-driven world, the ability to collect, analyze, and interpret data is invaluable, and Python is often the tool of choice for that work. In this article, we’ll delve into a Python script that combines data analysis and web scraping to fetch and analyze information related to meetings and attendees. We’ll also take a deeper dive into the technologies and concepts underpinning the script.
The Libraries
The Python script provided in this article employs several libraries and techniques to achieve its goals.
from bs4 import BeautifulSoup as bs # for parsing scraped HTML
from urllib.request import urlopen
import numpy as np
import SPARQLWrapper as sp # SPARQL TIME!
import json
import matplotlib.pyplot as plt
Web Crawling Time
Web pages are structured using HTML (Hypertext Markup Language). The script leverages the BeautifulSoup library to parse HTML documents.
If you feel a little adventurous today and are thinking of parsing HTML documents on your own, you may want to keep in mind that HTML is not a regular language :) .
Parsing involves breaking the HTML document down into a structured, tree-like format known as the Document Object Model (DOM).
from bs4 import BeautifulSoup as bs
# Parse an HTML document
soup = bs(html_content, 'html.parser')
Selecting Elements
To interact with specific elements within the DOM, BeautifulSoup provides methods for element selection. For instance, we can find all elements with a particular HTML tag or retrieve elements with specific attributes.
# Find all <a> tags in the HTML
links = soup.find_all('a')
Data Extraction
Once we’ve selected the desired elements, data can be extracted from them. This involves accessing the element’s attributes or text content. Functions like get_text_from_inside_tag and get_text_from_list are used in the script to extract text data from HTML elements.
# Extract text content from an HTML element
text = element.getText()
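Attributes work much the same way; for instance, reading the destination of a link element (a standard BeautifulSoup call):
# Extract an attribute value from an HTML element
href = element.get('href')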
Navigating Web Pages
Web scraping often involves navigating through multiple web pages by following links or interacting with forms. The script utilizes the urlopen function from the urllib.request module to open web pages, fetch their content, and then parse them with BeautifulSoup.
from urllib.request import urlopen
# Open a web page and fetch its content
response = urlopen(link)
html_content = response.read()
Practice Makes Perfect
The primary goal of the script is to fetch data from web pages related to meetings and compare it with data from a database. Here’s an abstracted extract of the code.
def get_html_page_from_link(link):
    response = urlopen(link)
    soup = bs(response, 'html.parser')
    return soup

# example: property="besluit:heeftAanwezigeBijStart"
def get_tags_from_property(html, property):
    # Match tags by their 'property' attribute rather than by tag name
    tags = html.find_all(attrs={'property': property})
    # Some pre-and-post processing
    return tags

def get_text_from_inside_tag(tag):
    person_names = tag.getText()
    # Additional code for name parsing and formatting
    return person_names

def get_text_from_list(list_of_tags):
    people = []
    amt_present = len(list_of_tags)
    if amt_present >= 1:
        for tag in list_of_tags:
            tag_text = get_text_from_inside_tag(tag)
            people.append(tag_text)
    return people

def comparing(nested_list_of_names_client, nested_list_of_names_db, mandaatinfo):
    for sitting in list(mandaatinfo.keys()):
        temp = [x for x in nested_list_of_names_client[sitting] if x not in nested_list_of_names_db[sitting]]
        # Additional code for finding missing attendees
        missing_persons = get_dict_of_manda(temp, sitting, mandaatinfo)
        mandaatinfo[sitting][2] = missing_persons
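To give a feel for how these helpers fit together, here is a hypothetical driver; the URL and the wiring are illustrative, not taken from the original script:
# Hypothetical example of chaining the helpers above (placeholder URL)
link = "https://example.com/meetings/123"
soup = get_html_page_from_link(link)
tags = get_tags_from_property(soup, "besluit:heeftAanwezigeBijStart")
attendees = get_text_from_list(tags)
print(attendees)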
Querying RDF Data
Back to the fun part: querying RDF data. RDF represents data as a graph, with data structured as triples (subject, predicate, object). Triples describe relationships between resources identified by a URI (Uniform Resource Identifier).
# example of an RDF triple
<http://example.com/person/john> <http://example.com/vocab/name> "John Doe" .
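The same triple can be written more compactly in Turtle syntax, where a prefix stands in for the long vocabulary URI:
@prefix ex: <http://example.com/vocab/> .
<http://example.com/person/john> ex:name "John Doe" .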
SPARQL (SPARQL Protocol and RDF Query Language) is a query language for RDF data. It allows us to query RDF graphs to retrieve specific information. SPARQL queries are designed to match patterns in RDF data and return results in a structured format.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://example.org/vocab/>
SELECT ?name ?email
WHERE {
  ?person ex:name ?name .
  ?person ex:email ?email .
}
As can be seen above, we declared prefixes to create semantic namespaces, abbreviating long URIs and improving readability.
Example Queries
Our first query demonstrates how we might retrieve multiple pieces of information about a person:
SELECT ?name ?email ?age
WHERE {
  ?person ex:name ?name .
  ?person ex:email ?email .
  OPTIONAL { ?person ex:age ?age }
}
This query tells a story of flexible information gathering. We’re searching for people’s names and email addresses, with an interesting twist: the age is optional. If an age exists in the database, it will be returned, but its absence won’t prevent the query from matching. It’s like casting a net that can catch different types of information without getting tangled.
Sometimes, we want to be more specific in our data exploration:
SELECT ?name ?email
WHERE {
  ?person ex:name ?name .
  ?person ex:email ?email .
  FILTER (CONTAINS(?name, "John"))
}
Here, we’re hunting for all individuals named John. The CONTAINS filter acts like a precise search tool, finding any name that includes “John”. So “Johnny”, “Johnson”, or “John Smith” would all be caught in this query’s embrace. It’s similar to how you might use a search function, but with the precision of a well-crafted database query.
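One caveat: CONTAINS is case-sensitive. If you want to match regardless of case, a REGEX filter with the "i" flag is a common alternative:
FILTER (REGEX(?name, "john", "i"))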
Data can be overwhelming, so sometimes we need a more controlled approach:
SELECT ?name ?email
WHERE {
  ?person ex:name ?name .
  ?person ex:email ?email .
}
ORDER BY ?name
LIMIT 10
This query is like a librarian organizing books. It retrieves names and emails, carefully arranging them alphabetically and then presenting only the first ten results.
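And if you later need results eleven through twenty, SPARQL’s OFFSET clause skips a given number of rows before returning; combined with ORDER BY and LIMIT it gives simple pagination:
SELECT ?name ?email
WHERE {
  ?person ex:name ?name .
  ?person ex:email ?email .
}
ORDER BY ?name
LIMIT 10
OFFSET 10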
Query Execution
SPARQL queries are executed against RDF data stores or endpoints. In the script, the SPARQLWrapper library is used to send SPARQL queries to a specific RDF endpoint and retrieve results. The results are typically returned in a structured format like JSON.
import SPARQLWrapper as sp
sparql = sp.SPARQLWrapper("https://example.com/sparql_endpoint")
sparql.setQuery("SELECT ?subject ?predicate ?object WHERE {?subject ?predicate ?object.}")
sparql.setReturnFormat(sp.JSON)  # ask the endpoint for JSON results
result = sparql.query().convert()
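The JSON a SPARQL endpoint returns follows a standard shape: the rows live under results → bindings, and each variable is a small dict with a value key. A minimal sketch of walking over it:
# Each row maps variable names to {'type': ..., 'value': ...} dicts
for binding in result["results"]["bindings"]:
    subject = binding["subject"]["value"]
    predicate = binding["predicate"]["value"]
    obj = binding["object"]["value"]
    print(subject, predicate, obj)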
Practice Makes Perfect
Once again, here’s an abstracted extract of the code. One has to be careful with security when working with dynamic query generation: string-based formatting in database and RDF queries can expose systems to injection attacks, where malicious actors manipulate query parameters to gain unauthorized access or extract sensitive information, the classic example being SQL injection.
def build_dynamic_query(parameters):
    """
    Construct a SPARQL query with flexible filtering
    """
    query_base = """
    PREFIX ex: <http://example.org/vocab/>
    SELECT ?resource
    WHERE {
        # Dynamic filters based on input parameters
        ?resource ex:type ex:Meeting .
        %s
    }
    %s
    """
    # Dynamic filter generation
    filters = []
    limit_clause = ""
    if parameters.get('type'):
        # WARNING: interpolating raw input like this is exactly the
        # injection risk described above; validate or escape it first
        filters.append(f"?resource ex:meetingType '{parameters['type']}'")
    if parameters.get('limit'):
        limit_clause = f"LIMIT {parameters['limit']}"
    # Combine filters and construct final query
    final_query = query_base % (" . ".join(filters), limit_clause)
    return final_query
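As a minimal hardening sketch (illustrative, not from the original script): coerce numeric parameters and reject suspicious string values before they ever reach the query.
def build_dynamic_query_safe(parameters):
    """Like build_dynamic_query, but validates inputs before interpolation."""
    query_base = """
    PREFIX ex: <http://example.org/vocab/>
    SELECT ?resource
    WHERE {
        ?resource ex:type ex:Meeting .
        %s
    }
    %s
    """
    filters = []
    limit_clause = ""
    meeting_type = parameters.get('type')
    if meeting_type:
        # Reject quotes, backslashes and newlines instead of trusting the input
        if any(c in meeting_type for c in "'\"\\\n"):
            raise ValueError("suspicious characters in 'type' parameter")
        filters.append(f"?resource ex:meetingType '{meeting_type}'")
    if parameters.get('limit'):
        # int() guarantees only digits reach the LIMIT clause
        limit_clause = f"LIMIT {int(parameters['limit'])}"
    return query_base % (" . ".join(filters), limit_clause)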
Example Output
Let’s take a look at an example of the script’s output. It returns a dictionary containing information about meetings, including URLs to ‘notulen’ (Dutch for the meeting minutes), URLs to associated governing bodies, and a dictionary of attendees with their names and corresponding URLs.
{
"Meeting URL": ["Notulen URL", "Governing Body URL", {
"Attendee Name": "Attendee URL",
// Additional attendees...
}],
// Additional meeting entries...
}
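As a hypothetical illustration of consuming this structure (the results variable and key names below are placeholders):
# Walk over the result dictionary and print each meeting's attendees
for meeting_url, (notulen_url, governing_body_url, attendees) in results.items():
    print(f"Meeting: {meeting_url}")
    for name, attendee_url in attendees.items():
        print(f"  {name}: {attendee_url}")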