
spaCy

Author: Lukas Brandt

TL;DR

SpaCy offers NLP capabilities through pre-trained pipelines. These pipelines process text and extract information from it, which can be used for further processing or for training LLMs or chatbots. The pre-trained pipelines can also be customized and trained with additional data.

This article aims to provide a brief overview of spaCy without delving into excessive detail. Relevant links for further information are included.

Introduction

SpaCy is an open-source Python package developed by ExplosionAI GmbH and is considered 'production ready'.

It provides pre-trained pipelines for many languages, including components for dependency parsing, sentence segmentation, lemmatization, named entity recognition, tokenization, rule-based matching, and more. When provided with text, it returns a Doc object containing all the information the pipeline extracted from the text. This information can then be used, for example, in machine learning, LLMs, or chatbots.
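
As a brief illustration, here is a minimal sketch of working with the Doc object, assuming the small English pipeline en_core_web_sm is installed (the sample sentence is made up):

import spacy

# Load a pre-trained pipeline (must be downloaded first, see the Example section)
nlp = spacy.load("en_core_web_sm")

# Processing text returns a Doc object
doc = nlp("Apple is looking at buying a U.K. startup. The deal could close in May.")

# Sentence segmentation
for sent in doc.sents:
    print(sent.text)

# Named entities detected by the pipeline
for ent in doc.ents:
    print(ent.text, ent.label_)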

Pipelines

The pre-trained pipelines follow a naming convention: lang_type_genre_size.

Part  | Values          | Description
------|-----------------|------------------------------
lang  | en, de, fi, ... | Language
type  | core, dep, ...  | Capabilities of the pipeline
genre | news, web, ...  | Type of training text
size  | sm, md, lg, trf | Size of the pipeline

The transformer-based pipelines (trf) are the largest but also the most accurate.
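
For example, a few real package names read against this convention:

  • en_core_web_sm: English, core capabilities, web text, small
  • de_core_news_lg: German, core capabilities, news text, large
  • en_core_web_trf: English, core capabilities, web text, transformer-based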

Pre-trained pipelines

What these pre-trained pipelines look like in more detail:


Name       | Component         | Creates                                            | Description
-----------|-------------------|----------------------------------------------------|--------------------------------------------------
tokenizer  | Tokenizer         | Doc                                                | Segment text into tokens.
tagger     | Tagger            | Token.tag                                          | Assign part-of-speech tags.
parser     | DependencyParser  | Token.head, Token.dep, Doc.sents, Doc.noun_chunks  | Assign dependency labels.
ner        | EntityRecognizer  | Doc.ents, Token.ent_iob, Token.ent_type            | Detect and label named entities.
lemmatizer | Lemmatizer        | Token.lemma                                        | Assign base forms.
textcat    | TextCategorizer   | Doc.cats                                           | Assign document labels.
custom     | custom components | Doc._.xxx, Token._.xxx, Span._.xxx                 | Assign custom attributes, methods or properties.
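
The custom row refers to spaCy's extension attributes on the underscore namespace. A minimal sketch of registering one, with a hypothetical attribute name:

import spacy
from spacy.tokens import Doc

# Register a custom attribute on the Doc class; it becomes available as doc._.has_number
Doc.set_extension("has_number", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("This sentence contains the number 42.")

# Fill the custom attribute based on token-level information
doc._.has_number = any(token.like_num for token in doc)
print(doc._.has_number)  # True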

For further reading on Doc objects and pipelines, see the spaCy documentation (https://spacy.io/usage/processing-pipelines).

Training and customization

The existing pipelines can be customized or trained further. To customize a pipeline, certain components can, for example, be disabled:

import spacy

nlp_ner = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger",
                                                "parser", "attribute_ruler", "lemmatizer"])

This example disables every component in the pipeline that is not needed for named entity recognition.
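
Continuing from the snippet above, the trimmed pipeline is used like any other; a short sketch (the exact output depends on the model):

print(nlp_ner.pipe_names)  # ['ner']

doc = nlp_ner("Tim Cook visited Berlin in October.")
for ent in doc.ents:
    print(ent.text, ent.label_)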

If further training is desired, it is necessary to convert the training and validation data to spaCy's binary format: .spacy. To begin training, a config needs to be created. This can be done either with the quickstart widget on the spaCy website or via the spacy init config config.cfg command. This config contains everything needed, apart from the paths to the training and validation data.
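
A sketch of this conversion using spaCy's DocBin, with made-up example texts and entity offsets:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization

# Hypothetical training examples: (text, list of (start_char, end_char, label))
TRAIN_DATA = [
    ("Berlin is a city in Germany.", [(0, 6, "GPE"), (20, 27, "GPE")]),
    ("Apple released a new iPhone.", [(0, 5, "ORG")]),
]

doc_bin = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # None if the offsets do not align with token boundaries
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)

# Write the binary training data to disk; repeat analogously for dev.spacy
doc_bin.to_disk("./train.spacy")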

Training is then started either via Python:

from spacy.cli.train import train

train("./config.cfg", overrides={"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"})

or in the shell:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

To use the newly trained pipeline, the command to load it changes to nlp = spacy.load("./output/model-best").

For further reading on training, see the spaCy documentation (https://spacy.io/usage/training).

Large-Language Models and spaCy

SpaCy can be used in conjunction with LLMs by utilizing the spacy-llm Python package. This package supports self-hosted models built with PyTorch or TensorFlow, OpenAI's API, and models from Hugging Face.

The configuration file defines how to use these models. Here is an example configuration using OpenAI's GPT-3.5 model:

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.TextCat.v2"
labels = ["COMPLIMENT", "INSULT"]

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.0}
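
Assuming this config is saved as config.cfg and the OPENAI_API_KEY environment variable is set, the pipeline can then be assembled and used like a regular one. A minimal sketch using spacy-llm's assemble helper:

from spacy_llm.util import assemble

# Build the pipeline from the config file shown above
nlp = assemble("config.cfg")

doc = nlp("You look great today!")
print(doc.cats)  # e.g. {'COMPLIMENT': 1.0, 'INSULT': 0.0}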

For further reading on LLMs and spaCy, see the spacy-llm documentation (https://spacy.io/usage/large-language-models).

Example

# Installation
python -m venv .venv
source .venv/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy

# Download a pre-trained pipeline
python -m spacy download en_core_web_sm

# The following runs in Python
import spacy

# Creating the NLP object
nlp = spacy.load("en_core_web_sm")

# Example text and using nlp
doc = nlp("This is an example sentence with numbers 234.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

# Output:
###
# This this PRON DT nsubj Xxxx True True
# is be AUX VBZ ROOT xx True True
# an an DET DT det xx True True
# example example ADJ JJ compound xxxx True False
# sentence sentence NOUN NN attr xxxx True False
# with with ADP IN prep xxxx True True
# numbers number NOUN NNS pobj xxxx True False
# 234 234 NUM CD nummod ddd False False
# . . PUNCT . punct . False False
###

Key Takeaways

  • spaCy is a Python package used for NLP tasks.
  • It offers many text-processing capabilities that prepare text for further usage.
  • Pre-trained pipelines are available in many languages and sizes.
  • Pre-trained pipelines can be customized and trained further.
  • It is possible to use spaCy in conjunction with LLMs.
  • The documentation is the best first place to look for answers, as it is vast and contains many examples.
  • Projects that use spaCy can be found in the spaCy Universe (https://spacy.io/universe).
