Text Data in PostgreSQL

Use regular expressions and text analysis functions on India’s Independence Day speeches.

Categories: SQL, PostgreSQL, DBI, RPostgres, NLP

Published: May 1, 2022

After demonstrating the principles of database design in my last post, I’m moving on to the next major focus of the PostgreSQL for Everybody specialization: working with text data, which is again convenient to explore with my dataset of Indian Independence Day speeches.

Filter by regular expression

I already knew about the LIKE operator for searching a column with simple patterns, but the specialization showed that PostgreSQL also supports full regular expressions via the ~ operator (and its case-insensitive cousin ~*).
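
A quick sketch of the difference (the search terms are purely illustrative, reusing the corpus and text tables from the queries below):

-- LIKE handles fixed wildcard patterns; ~ and ~* accept full regular expressions
SELECT year
FROM corpus
JOIN text ON corpus.text_id = text.id
WHERE text.speech LIKE '%Pakistan%';        -- fixed pattern

SELECT year
FROM corpus
JOIN text ON corpus.text_id = text.id
WHERE text.speech ~* 'pakistan|partition';  -- case-insensitive regex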

For example, I could find all of the references to the US, using a regex that captures the most common spelling variations while a negative lookahead keeps the USSR out of the results.

SELECT year, pm.name,
  regexp_matches(text.speech, 
  '(United States|United States of America|USA|U\.S\.A\.|U\.S\.|US(?!SR))', 
  'g') AS us_occurrence
FROM corpus
JOIN pm ON corpus.pm_id = pm.id
JOIN text ON corpus.text_id = text.id
ORDER BY year
LIMIT 11;
Displaying records 1 - 10
year name us_occurrence
1951 Jawaharlal Nehru {"United States of America"}
1959 Jawaharlal Nehru {"United States"}
1970 Indira Gandhi {U.S.}
1994 P.V. Narasimha Rao {USA}
2003 A.B. Vajpayee {US}
2005 Manmohan Singh {"United States"}
2005 Manmohan Singh {"United States"}
2006 Manmohan Singh {"United States of America"}
2009 Manmohan Singh {"United States"}
2015 Narendra Modi {"United States"}

The US actually only comes up about ten times across the whole corpus, most consistently during the Manmohan Singh era.
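
To double-check that impression, the matches could also be aggregated per prime minister; a sketch along these lines (same regex as above, results not shown):

-- count regex matches per prime minister
SELECT pm.name, COUNT(*) AS us_mentions
FROM corpus
JOIN pm ON corpus.pm_id = pm.id
JOIN text ON corpus.text_id = text.id
CROSS JOIN LATERAL regexp_matches(text.speech,
  '(United States|United States of America|USA|U\.S\.A\.|U\.S\.|US(?!SR))',
  'g') AS m
GROUP BY pm.name
ORDER BY us_mentions DESC;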

Find speech lengths

If I wanted to do some text analysis on the speeches, the length of speeches might be a basic place to start. char_length() returns the number of characters in a string.
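
A quick illustration on a literal string:

-- char_length() counts characters, not bytes
SELECT char_length('Jai Hind');  -- 8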

-- the longest speeches by character length
SELECT year, pm.name AS pm, party.name AS party, 
  char_length(text.speech) AS speech_characters
FROM corpus
JOIN pm ON corpus.pm_id = pm.id
JOIN party ON corpus.party_id = party.id
JOIN text ON corpus.text_id = text.id
ORDER BY char_length(text.speech) DESC
LIMIT 5;
5 records
year pm party speech_characters
2016 Narendra Modi BJP 62178
2023 Narendra Modi BJP 54020
2019 Narendra Modi BJP 52982
2020 Narendra Modi BJP 50356
2015 Narendra Modi BJP 49399

All of the longest speeches come from Narendra Modi. Instead of character lengths, though, I’d usually be more interested in counting words, which requires a different table structure.

Create one row per word

I’ve previously done this kind of basic text analysis work in R using the {tidytext} package. However, the third course in the specialization demonstrated how PostgreSQL can handle routine NLP tasks like tokenization, stemming, and removing stopwords.

Instead of using tidytext::unnest_tokens() to produce a table of one token per row, I used Postgres’ string_to_array() and unnest() functions, along with a lateral join to split the speeches on white space and create one row per word.
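
A minimal illustration of the mechanics on a literal string:

-- each array element becomes its own row
SELECT s.word
FROM unnest(string_to_array('Fellow countrymen, it has been', ' ')) AS s(word);
-- 'Fellow', 'countrymen,', 'it', 'has', 'been'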

-- https://stackoverflow.com/questions/29419993/split-column-into-multiple-rows-in-postgres
SELECT c.year, s.word
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(word)
ORDER BY c.year
LIMIT 5;
5 records
year word
1947 it
1947 has
1947 Fellow
1947 countrymen,
1947 been

Find the most frequent words

I can then use a simple GROUP BY to find the most frequent words in the entire corpus, or add the year to the GROUP BY clause to find the most frequent words per speech (a sketch of that variant follows the output below).

-- most popular words in entire corpus
SELECT s.token, COUNT(*)
FROM corpus c
JOIN text t
ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(token)
GROUP BY s.token
ORDER BY COUNT(*) DESC
LIMIT 5;
5 records
token count
the 12955
of 8981
to 8375
and 7338
6624
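
The per-speech variant mentioned above just needs the year added to the GROUP BY; a sketch that keeps the single most frequent token per speech with a window function (results not shown):

-- most frequent token per speech
SELECT year, token, n
FROM (
  SELECT c.year, s.token, COUNT(*) AS n,
    ROW_NUMBER() OVER (PARTITION BY c.year ORDER BY COUNT(*) DESC) AS rnk
  FROM corpus c
  JOIN text t ON c.text_id = t.id
  CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(token)
  GROUP BY c.year, s.token
) ranked
WHERE rnk = 1
ORDER BY year;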

Find distinct words

Without any preprocessing, this isn’t too interesting yet. At a minimum, I need to convert all text to lowercase, remove stop words, and, depending on the exact purpose, apply stemming.

I can see that the entire corpus of speeches contains about 268,000 words, but only about 34,000 of them are distinct.

-- total and distinct words in the corpus
SELECT COUNT(*) AS total_words, 
  COUNT(DISTINCT s.token) AS distinct_words
FROM corpus c
JOIN text t
ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(LOWER(t.speech), ' ')) AS s(token);
1 records
total_words distinct_words
267937 33816

This difference, as explained in the specialization, is one reason you don’t want to store every word occurrence in an index: the set of distinct terms is far smaller than the full text.
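
For instance, PostgreSQL’s full-text machinery (used more below) indexes each document by its distinct lexemes rather than every word occurrence; a sketch, assuming the text table above:

-- a GIN index over the tsvector stores distinct lexemes, not every repetition
CREATE INDEX IF NOT EXISTS speech_tsv_idx
  ON text USING GIN (to_tsvector('english', speech));

-- the index can then back full-text searches like this one
SELECT id
FROM text
WHERE to_tsvector('english', speech) @@ to_tsquery('english', 'independence');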

Counts without stop words

Even among these distinct words, though, many don’t carry much meaning on their own. To deal with that, I need to remove stop words. PostgreSQL has built-in functions for all of this, but I can also do it manually. First, creating the table structure:

DROP TABLE IF EXISTS stop_words;
CREATE TABLE IF NOT EXISTS stop_words (word TEXT UNIQUE);

And then copying in a simple list found on GitHub.

-- wget https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/h2o-world-2017/nlp/stopwords.csv
\copy stop_words (word) FROM 'stopwords.csv' WITH DELIMITER ',' CSV HEADER;
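
If psql’s \copy isn’t convenient (say, from a GUI client), a handful of rows can also be inserted directly; an illustrative subset only:

-- the UNIQUE constraint on word lets ON CONFLICT skip duplicates
INSERT INTO stop_words (word)
VALUES ('the'), ('of'), ('to'), ('and'), ('a'), ('in'), ('is'), ('i')
ON CONFLICT (word) DO NOTHING;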

I can then filter out stop words with a subquery before counting word frequencies again. Now the query returns words with some value (though the list of stop words could be expanded).

SELECT s.word, 
  COUNT(s.word)
FROM corpus c
JOIN text t
ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(LOWER(t.speech), ' ')) AS s(word)
WHERE s.word NOT IN (SELECT word FROM stop_words)
GROUP BY s.word
ORDER BY COUNT(s.word) DESC
LIMIT 10;
Displaying records 1 - 10
word count
6624
people 1048
country 1017
india 990
also 785
us 689
new 681
would 612
one 595
want 481

*The most frequent “word” is actually an empty string: string_to_array() splits on a single space, so every run of consecutive spaces in the text produces an empty token.
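
One possible fix, as a sketch: split on runs of whitespace with regexp_split_to_table() instead of a single space, and drop any empty tokens that remain:

-- splitting on \s+ avoids the empty tokens produced by repeated spaces
SELECT s.word, COUNT(*)
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL regexp_split_to_table(LOWER(t.speech), '\s+') AS s(word)
WHERE s.word <> ''
  AND s.word NOT IN (SELECT word FROM stop_words)
GROUP BY s.word
ORDER BY COUNT(*) DESC
LIMIT 10;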

Count stemmed keywords

Another step forward would be to apply stemming to the keywords before counting them. This is a particularly important step if creating an index.

The specialization demonstrated how stemming works by building a small stem-to-keyword dictionary by hand, but for my own implementation it’s obviously better to use PostgreSQL’s built-in features.
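
The built-in Snowball stemmer behind the 'english' configuration can be queried directly with ts_lexize():

-- ask the english_stem dictionary for the stem of a single word
SELECT ts_lexize('english_stem', 'countries');   -- {countri}
SELECT ts_lexize('english_stem', 'celebrating'); -- {celebr}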

PostgreSQL’s to_tsvector() function turns a text string (like a speech) into a tsvector: a sorted list of stemmed, lowercased lexemes with their positions, and with stop words already removed.
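
A quick look at what that produces for a made-up sentence:

-- lowercased, stemmed lexemes with their positions; stop words dropped
SELECT to_tsvector('english', 'The people of India are celebrating independence');
-- 'celebr':6 'independ':7 'india':4 'peopl':2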

I’ll use this function to represent each speech as a set of lexemes, and then apply ts_stat() to pull out statistics like the most common keywords and the number of documents they appear in.

SELECT * FROM ts_stat(
  $$SELECT to_tsvector('english', t.speech) 
  FROM corpus c
  JOIN text t
  ON c.text_id = t.id$$)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
Displaying records 1 - 10
word ndoc nentry
countri 75 2395
india 75 1924
peopl 75 1616
year 74 1197
us 75 1041
also 73 965
today 75 959
nation 75 925
govern 68 900
world 74 871

Not surprisingly, patriotic word stems referring to “country”, “India”, “people”, and “nation” are the most common.