Demystifying LLMs: The Hidden Gems of GPT, Cohere, and More

Published On Sun May 18 2025

The tools you know, the upgrades you didn't see coming

Building GenAI infra sounds cool—until it’s 3am and your LLM is down. This free guide helps you avoid the pitfalls. Learn the hidden costs, real-world tradeoffs, and decision framework to confidently answer: build or buy? Includes battle-tested tips from Checkr, Convirza & more.

Here's what's happening in the world of AI, which has been buzzing with groundbreaking developments! This week, we're tracking OpenAI's global partnerships for democratic AI, the transparency debate sparked by Anthropic's Claude 3.7 prompt leak, and Google's powerful Gemini 2.5 Pro debut alongside a fresh 'G' logo. We also explore the intersection of tech and Saudi investment, a surprising Microsoft-Google collaboration for AI agent interoperability, Anthropic's real-time web search integration into Claude, and OpenAI's practical guide for enterprise AI adoption. Ready to explore the cutting edge? Let's dive into the most captivating stories making headlines right now.

LLM Expert Insights, Packt

With the dominance of LLMs, it may seem like we’ve acquired a magic wand capable of solving nearly any task — from checking the weather to writing code for the next enterprise solution. In this context, one might wonder: are our favorite Python libraries, which we've long relied on, still relevant?

Today, we’ll talk about one such library, spaCy. Despite the rise of LLMs, spaCy remains highly relevant in the NLP landscape. However, its role has evolved. It now serves as a faster, more efficient, and lightweight alternative to large language models for many practical use cases.

Consider, for example, an HR screening system at a Fortune 500 company. spaCy can extract information such as names, skills, experience, and other relevant details from resumes, and even flag profiles that best match a particular job description. Now imagine the cost per resume if, instead of spaCy, an LLM handled these tasks. spaCy excels at tokenization, part-of-speech (POS) tagging, named entity recognition (NER), dependency parsing, and even building custom components using rule-based or machine learning-based annotators.
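To make the HR screening example above concrete, here is a minimal, illustrative sketch. The skill list, resume text, and "SKILL" label are made up for this example; a real system would load skills from a curated taxonomy. It combines spaCy's pretrained NER component with a rule-based PhraseMatcher:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_md")

# Hypothetical skill list for this sketch; a real system would load one from a taxonomy.
skills = ["Python", "SQL", "machine learning"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("SKILL", [nlp.make_doc(skill) for skill in skills])

# Made-up resume snippet for illustration only.
resume = "Jane Doe has 5 years of experience in Python and machine learning at Acme Corp."
doc = nlp(resume)

# Names and organizations come from the pretrained NER component.
print([(ent.text, ent.label_) for ent in doc.ents])
# Skills come from the rule-based matcher.
print([doc[start:end].text for _, start, end in matcher(doc)])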

In this issue, we’ll briefly explore the spaCy NLP pipeline, as detailed in the Packt book, Mastering spaCy, Second Edition, by Déborah Mesquita and Duygu Altinok. Here’s a high-level overview of the spaCy processing pipeline, which includes a tokenizer, tagger, parser, and entity recognizer.


1. Tokenization

Tokenization refers to splitting a sentence into its individual tokens. A token is the smallest meaningful unit of a piece of text — it could be a word, number, punctuation mark, currency symbol, or any other element that serves as a building block of a sentence. Tokenization can be complex, as it requires handling special characters, punctuation, whitespace, numbers, and more. spaCy’s tokenizer uses language-specific rules to perform this task effectively.

Consider the following piece of code:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I forwarded you an email.")
print([token.text for token in doc])
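If the model is installed, this should print each token, with the final punctuation split off as its own token:

['I', 'forwarded', 'you', 'an', 'email', '.']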

2. POS tagging

Part-of-speech (POS) tags help us identify verbs, nouns, and other grammatical categories in a sentence. They also contribute to tasks such as word sense disambiguation (WSD). Each word is assigned a POS tag based on its context, the surrounding words, and their respective POS tags.


To display the POS tags for the sentence in the previous example, you can iterate through each token as follows:

for token in doc:print(token.text, "tag:", token.tag_)

3. Dependency parser

While POS tags provide insights into the grammatical roles of neighboring words, they do not reveal the relationships between words that are not directly adjacent in a sentence. Dependency parsing, on the other hand, analyzes a sentence's syntactic structure by labeling the relations between tokens and linking those that are grammatically connected.

Let’s look at how dependency relationships appear in this sentence:

for token in doc:print(token.text, "dep:", token.dep_)

4. Named Entity Recognition (NER)

A named entity is any real-world object such as a person, a place (e.g., city, country, landmark, or famous building), an organization, a company, a product, a date, a time, a percentage, a monetary amount, a drug, or a disease name.

Let’s see how spaCy recognizes the entities in a sentence in the following code snippet:

doc = nlp("I forwarded you an email from Microsoft.")print(doc.ents)token = doc[6]print(token.ent_type_, spacy.explain(token.ent_type_))

This was just a quick peek into spaCy pipelines — but there’s much more to explore.


For instance, the spacy-transformers extension integrates pretrained transformer models directly into your spaCy pipelines, enabling state-of-the-art performance. Additionally, the spacy-llm plugin allows you to incorporate LLMs such as GPT and Cohere for inference and prompt-based NLP tasks.
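As a quick sketch (assuming spacy-transformers and the en_core_web_trf model are installed), swapping in a transformer-backed pipeline is just a different load call; spacy-llm components are added via nlp.add_pipe with an LLM task and model configuration, which is best copied from its documentation:

import spacy

# Requires: pip install "spacy[transformers]" and python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
doc = nlp("I forwarded you an email from Microsoft.")
print([(ent.text, ent.label_) for ent in doc.ents])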

The book Mastering spaCy, Second Edition by Déborah Mesquita and Duygu Altinok is your comprehensive guide to building end-to-end NLP pipelines with spaCy.

Join Packt’s Accelerated Agentic AI Bootcamp this June and learn to design, build, and deploy autonomous agents using LangChain, AutoGen, and CrewAI. Hands-on training, expert guidance, and a portfolio-worthy project—delivered live, fast, and with purpose.

OpenAI Launches Global AI Partnership Initiatives

OpenAI has launched "OpenAI for Countries," a global initiative aimed at assisting nations in developing AI infrastructure aligned with democratic values. It is partnering with the US government in these projects. Through these infrastructure collaborations, the program seeks to promote AI development that upholds principles like individual freedom, market competition, and the prevention of authoritarian control. This effort is part of OpenAI's broader mission to ensure AI benefits are widely distributed and to provide a democratic alternative to authoritarian AI models.

Claude 3.7 System Prompt Leak Sparks Debate on AI Transparency and Security

A leak revealed the 24,000-token system prompt of Anthropic's Claude 3.7 Sonnet. System prompts are the foundational instructions that guide an AI's behavior, tools, and filtering mechanisms, essentially its rulebook. While showcasing Anthropic's commitment to transparency and constitutional AI, the exposure raises security concerns about potential manipulation. The incident highlights tensions between openness and system integrity as AI models increasingly influence information access and decision-making across sectors.