Web Crawler With RDF Querying

Exploring Web Scraping & RDF Querying

In today’s data-driven world, the ability to collect, analyze, and interpret data is invaluable, and Python is often the tool of choice for the job. In this article, we’ll delve into a Python script that combines web scraping and data analysis to fetch and analyze information about meetings and attendees. We’ll also take a deeper dive into the technologies and concepts underpinning the script.

The Libraries

The Python script provided in this article employs several libraries and techniques to achieve its goals.

from bs4 import BeautifulSoup as bs # for web crawling
from urllib.request import urlopen
import numpy as np
import SPARQLWrapper as sp # SPARQL TIME!
import json
import matplotlib.pyplot as plt

Web Crawling Time

Web pages are structured using HTML (Hypertext Markup Language). The script leverages the BeautifulSoup library to parse HTML documents. If you’re feeling a little adventurous today and thinking of parsing HTML documents on your own, keep in mind that HTML is not a regular language, so regular expressions alone won’t do the job :) . Parsing breaks the HTML document down into a structured, tree-like format known as the Document Object Model (DOM).

from bs4 import BeautifulSoup as bs

# Parse an HTML document
soup = bs(html_content, 'html.parser')
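
Once parsed, the tree can be navigated directly; a minimal sketch, assuming the page has a <title> tag and at least one paragraph inside <body>:

# Tags can be reached as attributes of the parsed tree
page_title = soup.title.get_text()  # text inside the <title> tag
first_paragraph = soup.body.p       # first <p> inside <body>, or None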

Selecting Elements

To interact with specific elements within the DOM, BeautifulSoup provides methods for element selection. For instance, we can find all elements with a particular HTML tag or retrieve elements with specific attributes.

# Find all <a> tags in the HTML
links = soup.find_all('a')
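
Selecting by attribute works similarly; here is a sketch using the property attribute that the script relies on later:

# Find all elements carrying a specific attribute value
present = soup.find_all(attrs={'property': 'besluit:heeftAanwezigeBijStart'})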

Data Extraction

Once we’ve selected the desired elements, data can be extracted from them. This involves accessing the element’s attributes or text content. Functions like get_text_from_inside_tag and get_text_from_list are used in the script to extract text data from HTML elements.

# Extract text content from an HTML element
text = element.getText()
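
Attributes are read in much the same way; a minimal sketch for a link element:

# Extract an attribute value instead of the text content
url = element.get('href')  # returns None if the attribute is absent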

Web scraping often involves navigating through multiple web pages by following links or interacting with forms. The script utilizes the urlopen function from the urllib.request module to open web pages, fetch their content, and then parse them with BeautifulSoup.

from urllib.request import urlopen

# Open a web page and fetch its content
response = urlopen(link)
html_content = response.read()
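
Combining the two, a crawler can follow each link it finds on a page; a minimal sketch with no error handling (a polite crawler should also respect robots.txt):

from urllib.parse import urljoin

for a_tag in soup.find_all('a'):
    href = a_tag.get('href')
    if href:
        next_url = urljoin(link, href)  # resolve relative links against the current page
        next_page = bs(urlopen(next_url).read(), 'html.parser')
        # ... extract data from next_page ...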

Practice Makes Perfect

The primary goal of the script is to fetch data from web pages related to meetings and compare it with data from a database. Here’s an abstracted extract of the code.

def get_html_page_from_link(link):
    response = urlopen(link)
    soup = bs(response, 'html.parser')
    return soup

# example: property="besluit:heeftAanwezigeBijStart"
def get_tags_from_property(html, property):
    # match on the 'property' attribute rather than the tag name
    tags = html.find_all(attrs={"property": property})
    # Some pre-and-post processing
    return tags

def get_text_from_inside_tag(tag):
    person_names = tag.getText()
    # Additional code for name parsing and formatting
    return person_names 

def get_text_from_list(list_of_tags):
    people = []
    for tag in list_of_tags:  # an empty list simply yields no names
        tag_text = get_text_from_inside_tag(tag)
        people.append(tag_text)
    return people

def comparing(nested_list_of_names_client, nested_list_of_names_db, mandaatinfo):
    for sitting in list(mandaatinfo.keys()):
        temp = [x for x in nested_list_of_names_client[sitting] if x not in nested_list_of_names_db[sitting]]
        # Additional code for finding missing attendees
        missing_persons = get_dict_of_manda(temp, sitting, mandaatinfo)
        mandaatinfo[sitting][2] = missing_persons
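
To see how these helpers fit together, here is a hypothetical usage sketch (the URL is a placeholder; the property value comes from the comment above):

soup = get_html_page_from_link("https://example.com/zitting/1")  # placeholder URL
tags = get_tags_from_property(soup, "besluit:heeftAanwezigeBijStart")
attendees = get_text_from_list(tags)  # list of attendee names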

Querying RDF Data

Back to the fun part: querying RDF data. RDF (Resource Description Framework) represents data as a graph, with data structured as triples (subject, predicate, object). Triples describe relationships between resources, each identified by a URI (Uniform Resource Identifier).


# example of an RDF Triple
<http://example.com/person/john> <http://example.com/vocab/name> "John Doe" .


SPARQL (SPARQL Protocol and RDF Query Language) is a query language for RDF data. It allows us to query RDF graphs to retrieve specific information. SPARQL queries are designed to match patterns in RDF data and return results in a structured format.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex: <http://example.org/vocab/>

SELECT ?name ?email
WHERE {
    ?person ex:name ?name .
    ?person ex:email ?email .
}

As can be seen above, we declared some prefixes so that long URIs can be abbreviated into short, namespaced names. This helps to improve readability.
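
For example, with the ex: prefix declared, the two triple patterns below are equivalent:

# without a prefix
?person <http://example.org/vocab/name> ?name .

# with the ex: prefix
?person ex:name ?name .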

Example Queries

Our first query demonstrates how we might retrieve multiple pieces of information about a person:

SELECT ?name ?email ?age
WHERE {
    ?person ex:name ?name .
    ?person ex:email ?email .
    OPTIONAL { ?person ex:age ?age }
}

This query tells a story of flexible information gathering. We’re searching for people’s names and email addresses, with an interesting twist: the age is optional. If an age exists in the database it will be returned, but its absence won’t prevent the query from matching that person. It’s like casting a net that can catch different types of information without getting tangled.

Sometimes, we want to be more specific in our data exploration:

SELECT ?name ?email
WHERE {
    ?person ex:name ?name .
    ?person ex:email ?email .
    FILTER (CONTAINS(?name, "John"))
}

Here, we’re hunting for all individuals named John. The CONTAINS filter acts like a precise search tool, finding any name that includes “John”. So “Johnny”, “Johnson”, or “John Smith” would all be caught in this query’s embrace. It’s similar to how you might use a search function, but with the precision of a well-crafted database query.

Data can be overwhelming, so sometimes we need a more controlled approach:

SELECT ?name ?email
WHERE {
    ?person ex:name ?name .
    ?person ex:email ?email .
}
ORDER BY ?name
LIMIT 10

This query is like a librarian organizing books. It retrieves names and emails, carefully arranging them alphabetically and then presenting only the first ten results.

Query Execution

SPARQL queries are executed against RDF data stores or endpoints. In the script, the SPARQLWrapper library is used to send SPARQL queries to a specific RDF endpoint and retrieve results. The results are typically returned in a structured format like JSON.


import SPARQLWrapper as sp

sparql = sp.SPARQLWrapper("https://example.com/sparql_endpoint")
sparql.setQuery("SELECT ?subject ?predicate ?object WHERE { ?subject ?predicate ?object . }")
sparql.setReturnFormat(sp.JSON)  # request JSON results instead of the default XML
result = sparql.query().convert()
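
Because the return format is set to JSON, the result follows the standard SPARQL query results format, and the rows can be walked like nested dictionaries:

# Each row binds the query's variables to concrete values
for row in result["results"]["bindings"]:
    print(row["subject"]["value"], row["predicate"]["value"], row["object"]["value"])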


Practice Makes Perfect

Once again, here’s an abstracted extract of the code. One has to be careful with security when working with dynamic query generation: string-based formatting in database and RDF queries can expose systems to injection attacks (e.g. SQL injection), where malicious actors manipulate query parameters to gain unauthorized access or extract sensitive information.


def build_dynamic_query(parameters):
    """
    Construct a SPARQL query with flexible filtering
    """
    query_base = """
    PREFIX ex: <http://example.org/vocab/>
    
    SELECT ?resource
    WHERE {
        # Dynamic filters based on input parameters
        ?resource ex:type ex:Meeting .
        %s
    }
    %s
    """
    
    # Dynamic filter generation
    filters = []
    limit_clause = ""
    
    if parameters.get('type'):
        # WARNING: interpolating raw input like this is exactly the injection
        # risk described above; validate or escape the value in real code
        filters.append(f"?resource ex:meetingType '{parameters['type']}'")
    
    if parameters.get('limit'):
        limit_clause = f"LIMIT {int(parameters['limit'])}"  # coercing to int is a minimal safeguard
    
    # Combine filters and construct final query
    final_query = query_base % (" . ".join(filters), limit_clause)
    return final_query
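
A usage sketch (the parameter values are hypothetical); as noted above, real code should validate or escape these values before they reach the query string:

query = build_dynamic_query({'type': 'council', 'limit': 10})  # hypothetical parameters
sparql.setQuery(query)
result = sparql.query().convert()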


Example Output

Let’s take a look at an example of the script’s output. It returns a dictionary structure containing information about meetings, including URLs to ‘notulen’ (the meeting minutes), URLs to associated governing bodies, and a dictionary of attendees with their names and corresponding URLs.

{
    "Meeting URL": ["Notulen URL", "Governing Body URL", {
        "Attendee Name": "Attendee URL",
        // Additional attendees...
    }],
    // Additional meeting entries...
}
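
Consuming that structure in Python is then straightforward; a minimal sketch, where meetings stands for the returned dictionary:

for meeting_url, (notulen_url, body_url, attendees) in meetings.items():
    print(meeting_url, notulen_url, body_url)
    for name, person_url in attendees.items():
        print("  ", name, "->", person_url)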