I already knew about the LIKE operator for searching a column for a specified pattern, but the specialization introduced the ~ operator, which lets you match regular expressions as well.
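As a quick sketch of the operator itself, using the same tables as the queries below (the pattern is purely illustrative):

-- sketch: years of speeches whose text matches a regular expression
SELECT corpus.year
FROM corpus
JOIN text ON corpus.text_id = text.id
WHERE text.speech ~ 'freedom|liberty'
ORDER BY corpus.year;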
For example, I could find all of the references to the US (using a regex to capture the most common spelling variations—while excluding the USSR from the results).
SELECT year, pm.name,
       regexp_matches(text.speech,
         '(United States|United States of America|USA|U.S.A.|U\.S\.|US(?!SR))', 'g') AS us_occurrence
FROM corpus
JOIN pm ON corpus.pm_id = pm.id
JOIN text ON corpus.text_id = text.id
ORDER BY year
LIMIT 11;
Displaying records 1 - 10

| year | name               | us_occurrence                |
|------|--------------------|------------------------------|
| 1951 | Jawaharlal Nehru   | {"United States of America"} |
| 1959 | Jawaharlal Nehru   | {"United States"}            |
| 1970 | Indira Gandhi      | {U.S.}                       |
| 1994 | P.V. Narasimha Rao | {USA}                        |
| 2003 | A.B. Vajpayee      | {US}                         |
| 2005 | Manmohan Singh     | {"United States"}            |
| 2005 | Manmohan Singh     | {"United States"}            |
| 2006 | Manmohan Singh     | {"United States of America"} |
| 2009 | Manmohan Singh     | {"United States"}            |
| 2015 | Narendra Modi      | {"United States"}            |
The US actually only comes up in 10 speeches—most consistently during the Manmohan Singh era.
Find speech lengths
If I wanted to do some text analysis on the speeches, the length of speeches might be a basic place to start. char_length() returns the number of characters in a string.
-- the longest speeches by character length
SELECT year, pm.name AS pm, party.name AS party,
       char_length(text.speech) AS speech_characters
FROM corpus
JOIN pm ON corpus.pm_id = pm.id
JOIN party ON corpus.party_id = party.id
JOIN text ON corpus.text_id = text.id
ORDER BY char_length(text.speech) DESC
LIMIT 5;
5 records

| year | pm            | party | speech_characters |
|------|---------------|-------|-------------------|
| 2016 | Narendra Modi | BJP   | 62178             |
| 2023 | Narendra Modi | BJP   | 54020             |
| 2019 | Narendra Modi | BJP   | 52982             |
| 2020 | Narendra Modi | BJP   | 50356             |
| 2015 | Narendra Modi | BJP   | 49399             |
All of the longest speeches come from Narendra Modi. Instead of character lengths, though, I'd usually be more interested in counting words, which requires a different table structure.
Create one row per word
I’ve previously done this kind of basic text analysis work in R using the {tidytext} package. However, the third course in the specialization demonstrated how PostgreSQL can handle routine NLP tasks like tokenization, stemming, and removing stopwords.
Instead of using tidytext::unnest_tokens() to produce a table of one token per row, I used Postgres’ string_to_array() and unnest() functions, along with a lateral join to split the speeches on white space and create one row per word.
-- https://stackoverflow.com/questions/29419993/split-column-into-multiple-rows-in-postgres
SELECT c.year, s.word
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(word)
ORDER BY c.year
LIMIT 5;
5 records

| year | word        |
|------|-------------|
| 1947 | it          |
| 1947 | has         |
| 1947 | Fellow      |
| 1947 | countrymen, |
| 1947 | been        |
Find the most frequent words
I can then use a simple GROUP BY to find the most frequent words in the entire corpus, or change the GROUP BY clause to find the most frequent words per speech (a sketch of the per-speech version follows the results below).
-- most popular words in entire corpus
SELECT s.token, COUNT(*)
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(token)
GROUP BY s.token
ORDER BY COUNT(*) DESC
LIMIT 5;
5 records

| token | count |
|-------|-------|
| the   | 12955 |
| of    | 8981  |
| to    | 8375  |
| and   | 7338  |
|       | 6624  |
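For the per-speech variant mentioned above, the grouping just needs the year (or text id) added; keeping only the top few words per year would additionally need a window function like row_number(). A sketch:

-- sketch: word counts per speech, keyed by year
SELECT c.year, s.token, COUNT(*)
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(token)
GROUP BY c.year, s.token
ORDER BY c.year, COUNT(*) DESC;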
Find distinct words
Without any basic preprocessing, this isn't too interesting yet. Depending on the exact purpose, I first need to convert all text to lowercase, remove stop words, and apply stemming.
I can see that the entire corpus of speeches contains roughly 270k words, but only about 34k of them are distinct.
-- total words in corpus
SELECT COUNT(*) AS total_words,
       COUNT(DISTINCT s.token) AS distinct_words
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(LOWER(t.speech), ' ')) AS s(token);
1 records

| total_words | distinct_words |
|-------------|----------------|
| 267937      | 33816          |
This difference, as explained in the specialization, is one reason you don’t want to store every word in your index!
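In PostgreSQL that index would typically be a GIN index built over a tsvector, which stores stemmed, stop-word-free lexemes rather than every raw word. A sketch against this schema (the index name is my own):

-- sketch: full-text search index on the speech text
CREATE INDEX IF NOT EXISTS speech_fts_idx
ON text USING GIN (to_tsvector('english', speech));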
Counts without stop words
Even these distinct words, though, include many that don't carry any real meaning, so I need to remove stop words. PostgreSQL has built-in functions for this, but I can also do it manually, first creating the table structure:
DROP TABLE IF EXISTS stop_words;
CREATE TABLE IF NOT EXISTS stop_words (word TEXT UNIQUE);
And then copying in a simple list found on GitHub.
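The exact list isn't reproduced here, but loading a one-word-per-line file from psql would look something like this (the file name is a placeholder):

-- load a one-word-per-line file into the stop_words table (run from psql)
\copy stop_words FROM 'stop_words.txt'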
I can then filter out stop words with a subquery before counting word frequencies again. Now it returns words of some value (though the list of stop words could be expanded).
SELECT s.word, COUNT(s.word)
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(LOWER(t.speech), ' ')) AS s(word)
WHERE s.word NOT IN (SELECT word FROM stop_words)
GROUP BY s.word
ORDER BY COUNT(s.word) DESC
LIMIT 10;
Displaying records 1 - 10

| word    | count |
|---------|-------|
|         | 6624  |
| people  | 1048  |
| country | 1017  |
| india   | 990   |
| also    | 785   |
| us      | 689   |
| new     | 681   |
| would   | 612   |
| one     | 595   |
| want    | 481   |
I'm actually not sure why an empty string is the most frequent word…
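My best guess (an assumption, not verified against this corpus): string_to_array() splits on a single space, so runs of consecutive spaces, such as double spaces after a full stop, produce empty-string elements. A sketch that splits on any whitespace run instead and drops empty tokens:

-- sketch: tokenize on any whitespace and exclude empty strings
SELECT s.word, COUNT(*)
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL regexp_split_to_table(LOWER(t.speech), '\s+') AS s(word)
WHERE s.word <> ''
  AND s.word NOT IN (SELECT word FROM stop_words)
GROUP BY s.word
ORDER BY COUNT(*) DESC
LIMIT 10;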
Count stemmed keywords
Another step forward would be to apply stemming to the keywords before counting them. This is a particularly important step when creating an index.
The specialization showed how to build a kind of stem-to-keyword dictionary to demonstrate how stemming works, but to implement it on my own it's much better to use PostgreSQL's built-in features.
PostgreSQL has a function, to_tsvector(), that turns a text string (like a speech) into a tsvector: a sorted list of normalized lexemes, with stemming and stop word removal already applied.
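For example, on a short string the stemming and stop word removal are easy to see (output shown approximately):

SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dogs');
-- returns something like: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2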
I’ll use this function to create a list of words that represent the speech document, and then apply ts_stat() to extract some information, like the most common keywords.
SELECT *
FROM ts_stat($$
  SELECT to_tsvector('english', t.speech)
  FROM corpus c
  JOIN text t ON c.text_id = t.id
$$)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
Displaying records 1 - 10

| word    | ndoc | nentry |
|---------|------|--------|
| countri | 75   | 2395   |
| india   | 75   | 1924   |
| peopl   | 75   | 1616   |
| year    | 74   | 1197   |
| us      | 75   | 1041   |
| also    | 73   | 965    |
| today   | 75   | 959    |
| nation  | 75   | 925    |
| govern  | 68   | 900    |
| world   | 74   | 871    |
Not surprisingly, patriotic word stems referring to “country”, “india”, “people”, and “nation” are the most common.