Python: HTTP requests with the urllib module

In this article we will cover the fundamentals of HTTP requests in Python using urllib, a module from the standard library.

When making an HTTP request, the first step is to check that the scheme specified in the URL is HTTP or HTTPS. In this example we will use HTTPS, though plain HTTP may be required when the remote site has no TLS certificate.
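
As a minimal sketch of that check, the standard library function urllib.parse.urlsplit() can extract the scheme from a URL. The helper name has_http_scheme is hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def has_http_scheme(url):
    """Return True if the URL uses the http or https scheme (hypothetical helper)."""
    # urlsplit() lowercases the scheme, so 'HTTPS://...' is accepted as well
    return urlsplit(url).scheme in ('http', 'https')


print(has_http_scheme('https://gabrieleromanato.com'))
print(has_http_scheme('ftp://gabrieleromanato.com'))
```

A URL without any scheme, such as 'gabrieleromanato.com', is rejected too, because its parsed scheme is an empty string.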

urllib.request provides the Request class, which the documentation describes as an abstraction of a URL request; the module as a whole is mostly aimed at opening HTTP URLs.

Our HTTP request begins by creating an instance of this class and passing it a URL.

from urllib.request import urlopen, Request

request_url = 'https://gabrieleromanato.com'
request = Request(request_url)
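
Creating a Request object does not contact the server; it only describes the request. The sketch below inspects a few of its attributes and sets a custom header. The User-Agent value is an arbitrary placeholder, not something the original example uses.

```python
from urllib.request import Request

request_url = 'https://gabrieleromanato.com'
# headers is optional; the User-Agent value here is a placeholder
request = Request(request_url, headers={'User-Agent': 'my-script/1.0'})

print(request.full_url)      # the URL passed to the constructor
print(request.host)          # host part of the URL
print(request.get_method())  # 'GET' when no data is attached
# Request stores header names capitalized, hence 'User-agent'
print(request.get_header('User-agent'))
```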

Now we actually need to open the remote URL with the urlopen() function, which returns a file-like response object. Since that object holds an open connection, the recommended way to handle it is as a context manager in a with statement, so that it is closed automatically.

with urlopen(request) as response:
    pass
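
To try the with pattern without a network connection, one option is a data: URL, which urlopen() also supports. This is only an illustrative stand-in: the URL below is made up, and the object returned for a data: URL is a simpler file-like object than the one returned for HTTP.

```python
from urllib.request import urlopen

# A data: URL embeds its content, so no server is contacted
with urlopen('data:text/plain;charset=utf-8,hello') as response:
    # read() returns bytes, which we decode into a string
    body = response.read().decode('utf-8')

print(body)
```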

For HTTP and HTTPS URLs, response is an http.client.HTTPResponse instance. Its status attribute holds the HTTP status code returned by the server as an integer, and headers holds the response headers as an http.client.HTTPMessage, a subclass of email.message.Message. Its items() method returns a list of tuples, each containing an HTTP header name and its value. We can also convert this list of tuples into a list of dictionaries using the map() function.

def create_single_header(data):
    # data is a (name, value) tuple as produced by headers.items()
    key, value = data
    d = {}
    d[key] = value
    return d

request_url = 'https://gabrieleromanato.com'
request = Request(request_url)

with urlopen(request) as response:
    status_code = response.status
    res_headers = list(map(create_single_header, response.headers.items()))
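
If a list of one-item dictionaries is not needed, an alternative is to build a single dictionary from items(). This sketch fills an http.client.HTTPMessage by hand with made-up header values so that it runs offline; with a real response you would use response.headers instead.

```python
from http.client import HTTPMessage

# Stand-in for response.headers, filled with sample values
headers = HTTPMessage()
headers['Content-Type'] = 'text/html; charset=utf-8'
headers['Set-Cookie'] = 'a=1'
headers['Set-Cookie'] = 'b=2'

# One dictionary for all headers; for repeated names the last value wins
headers_dict = dict(headers.items())
print(headers_dict['Content-Type'])

# get_all() returns every value of a repeated header
cookies = headers.get_all('Set-Cookie')
print(cookies)
```

A plain dictionary keeps only one value per header name, so repeated headers such as Set-Cookie are better read with get_all().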

To retrieve the body of the response returned by the server, that is the HTML document, keep in mind that inside the context manager we are working with a stream of bytes. We can therefore use the same approach as with files opened in binary mode: read() to get the bytes and decode() to turn them into a string.

request_url = 'https://gabrieleromanato.com'  
request = Request(request_url)

with urlopen(request) as response:
    status_code = response.status
    res_headers = list(map(create_single_header, response.headers.items()))  
    body = response.read().decode('utf-8')
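
Hard-coding 'utf-8' works for many sites, but a server may declare a different encoding in its Content-Type header. A possible refinement, sketched here with a hypothetical decode_body() helper, reads the charset through get_content_charset() and falls back to UTF-8. The headers object is built by hand with sample values so the example runs offline.

```python
from http.client import HTTPMessage


def decode_body(raw, headers, default='utf-8'):
    """Decode raw bytes using the charset declared in the headers, if any."""
    # get_content_charset() returns None when no charset is declared
    charset = headers.get_content_charset() or default
    # errors='replace' avoids an exception on malformed byte sequences
    return raw.decode(charset, errors='replace')


headers = HTTPMessage()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
raw = b'citt\xe0'  # the word 'citta' with an accented final a, in ISO-8859-1

print(decode_body(raw, headers))
```

With a real response, the call would look like decode_body(response.read(), response.headers).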

Our code, however, lacks a fundamental part: exception handling. The URL may be invalid, the server may return an HTTP error, or the connection may time out. We can add exception handling as follows.

from urllib.error import HTTPError, URLError
from urllib.request import urlopen, Request


def create_single_header(data):
    key, value = data
    d = {}
    d[key] = value
    return d


def send_http_request(request_url):
    if not request_url.startswith('https://'):
        return False
    try:
        request = Request(request_url)
        with urlopen(request) as response:
            status_code = response.status
            res_headers = list(map(create_single_header, response.headers.items()))
            body = response.read().decode('utf-8')
            return {
                'status': status_code,
                'headers': res_headers,
                'body': body
            }
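    except HTTPError as error:
        # HTTPError is a subclass of URLError, so it must be caught first
        return error
    except URLError as err:
        return err.reason
    except TimeoutError as tm_err:
        # A timeout while reading the response may not be wrapped in URLError
        return tm_err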
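
The order of the except clauses matters because HTTPError is a subclass of URLError, which in turn derives from OSError. The following sketch uses a hypothetical describe_error() helper and hand-built exceptions, instead of real network failures, to show the more specific class being checked first.

```python
import io
from http.client import HTTPMessage
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Return a short label for an exception that urlopen() could raise."""
    try:
        raise exc
    except HTTPError as error:
        # Must come before URLError, its parent class
        return f'HTTP error {error.code}'
    except URLError as error:
        return f'URL error: {error.reason}'
    except TimeoutError:
        return 'timeout'


# Hand-built sample exception; the URL and status are placeholders
not_found = HTTPError('https://example.com', 404, 'Not Found',
                      HTTPMessage(), io.BytesIO())

print(issubclass(HTTPError, URLError))
print(describe_error(not_found))
print(describe_error(URLError('unknown host')))
print(describe_error(TimeoutError()))
```

urlopen() also accepts a timeout argument in seconds, for example urlopen(request, timeout=10); without it, the global default socket timeout applies.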

Finally, we can use our code as follows:

def main():
    print(send_http_request('https://gabrieleromanato.com'))


if __name__ == '__main__':
    main()