Use regular expressions and text analysis functions on India’s Independence Day speeches.
After covering the principles of database design in my last post, I moved on to the next major focus of the PostgreSQL for Everybody specialization: working with text data, which is again convenient to explore with my dataset of Indian Independence Day speeches.
I already knew about the `LIKE` operator for searching for patterns in a column, but the specialization introduced the `~` operator, which lets you match full regular expressions instead.
For example, I could find all of the references to the US (using a regex to capture the most common spelling variations—while excluding the USSR from the results).
SELECT year, pm.name,
regexp_matches(text.speech, '(United States of America|United States|USA|U\.S\.A\.|U\.S\.|US(?!SR))',
'g') AS us_occurrence
FROM corpus
JOIN pm ON corpus.pm_id = pm.id
JOIN text ON corpus.text_id = text.id
ORDER BY year
LIMIT 11;
year | name | us_occurrence |
---|---|---|
1951 | Jawaharlal Nehru | {"United States of America"} |
1959 | Jawaharlal Nehru | {"United States"} |
1970 | Indira Gandhi | {U.S.} |
1994 | P.V. Narasimha Rao | {USA} |
2003 | A.B. Vajpayee | {US} |
2005 | Manmohan Singh | {"United States"} |
2005 | Manmohan Singh | {"United States"} |
2006 | Manmohan Singh | {"United States of America"} |
2009 | Manmohan Singh | {"United States"} |
2015 | Narendra Modi | {"United States"} |
The US actually comes up only 10 times in the whole corpus (note that the 2005 speech matched twice), most consistently during the Manmohan Singh era.
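Postgres’ regex flavor (ARE) supports the same negative lookahead construct as most other engines, so the USSR-exclusion logic can be sanity-checked outside the database with Python’s `re` module (a quick sketch; note that Python’s alternation is leftmost-first while Postgres prefers the longest match, which is why I only test the lookahead here):

```python
import re

# The same negative lookahead used in the SQL query: match "US"
# only when it is NOT immediately followed by "SR" (i.e. skip "USSR").
pattern = re.compile(r"US(?!SR)")

print(pattern.findall("the USSR and the US signed"))  # -> ['US']
print(pattern.findall("relations with the USSR"))     # -> []
```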
If I wanted to do some text analysis on the speeches, the length of each speech is a basic place to start: `char_length()` returns the number of characters in a string.
-- the longest speeches by character length
SELECT year, pm.name AS pm, party.name AS party,
char_length(text.speech) AS speech_characters
FROM corpus
JOIN pm ON corpus.pm_id = pm.id
JOIN party ON corpus.party_id = party.id
JOIN text ON corpus.text_id = text.id
ORDER BY char_length(text.speech) DESC
LIMIT 5;
year | pm | party | speech_characters |
---|---|---|---|
2016 | Narendra Modi | BJP | 62178 |
2019 | Narendra Modi | BJP | 52982 |
2020 | Narendra Modi | BJP | 50356 |
2015 | Narendra Modi | BJP | 49399 |
2018 | Narendra Modi | BJP | 47444 |
All of the longest speeches come from Narendra Modi. Instead of character lengths, though, I’d usually be more interested in counting words, which requires a different table structure.
I’ve previously done this kind of basic text analysis work in R using the {tidytext} package. However, the third course in the specialization demonstrated how PostgreSQL can handle routine NLP tasks like tokenization, stemming, and removing stopwords.
Instead of using `tidytext::unnest_tokens()` to produce a table of one token per row, I used Postgres’ `string_to_array()` and `unnest()` functions, along with a lateral join, to split the speeches on white space and create one row per word.
-- https://stackoverflow.com/questions/29419993/split-column-into-multiple-rows-in-postgres
SELECT c.year, s.word
FROM corpus c
JOIN text t ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(word)
ORDER BY c.year
LIMIT 5;
year | word |
---|---|
1947 | it |
1947 | has |
1947 | Fellow |
1947 | countrymen, |
1947 | been |
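Outside the database, this split-and-explode step is just a nested loop; a minimal Python sketch of what the lateral join produces (with toy data standing in for the real `corpus` and `text` tables):

```python
# Toy stand-in for the joined corpus/text tables.
speeches = [
    (1947, "Fellow countrymen, it has been"),
    (1948, "A year ago"),
]

# Equivalent of CROSS JOIN LATERAL unnest(string_to_array(speech, ' ')):
# one output row per (year, word) pair, splitting on single spaces.
rows = [(year, word) for year, speech in speeches for word in speech.split(" ")]

print(rows[:3])  # the first few (year, word) rows
```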
I can then use a simple `GROUP BY` on the token to find the most frequent words in the entire corpus, or add the year to the `GROUP BY` clause to find the most frequent words per speech.
-- most popular words in entire corpus
SELECT s.token, COUNT(*)
FROM corpus c
JOIN text t
ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(t.speech, ' ')) AS s(token)
GROUP BY s.token
ORDER BY COUNT(*) DESC
LIMIT 5;
token | count |
---|---|
the | 11971 |
of | 8325 |
to | 7954 |
and | 6974 |
 | 6624 |
Without any preprocessing, this isn’t very interesting yet: I first need to convert all text to lowercase, remove stop words, and, depending on the exact purpose, apply stemming.
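Those three steps can be sketched as a tiny pipeline. This is purely an illustration: the stop-word set is a made-up fragment, and the “stemmer” is a naive suffix-stripper, not a real algorithm like Snowball (which Postgres’ full-text search uses internally):

```python
STOP_WORDS = {"the", "of", "to", "and", "a", "is"}  # made-up fragment

def naive_stem(word: str) -> str:
    # Crude stand-in for a real stemmer: strip a couple of common suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = text.lower().split()                         # lowercase + tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
    return [naive_stem(t) for t in tokens]                # stem what remains

print(preprocess("The people of India and the nations watching"))
```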
I can see that the entire corpus of speeches contains about 250k words, but only about 32k are distinct.
-- total words in corpus
SELECT COUNT(*) AS total_words,
COUNT(DISTINCT s.token) AS distinct_words
FROM corpus c
JOIN text t
ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(LOWER(t.speech), ' ')) AS s(token);
total_words | distinct_words |
---|---|
251971 | 32204 |
This difference, as explained in the specialization, is one reason you don’t want to store every word in your index!
Even among these distinct words, many carry no real meaning, so the next step is to remove stop words. PostgreSQL has built-in support for this, but I can also do it manually. First, creating the table structure:
DROP TABLE IF EXISTS stop_words;
CREATE TABLE stop_words (word TEXT UNIQUE);
And then copying in a simple list found on GitHub.
-- wget https://raw.githubusercontent.com/h2oai/h2o-tutorials/master/h2o-world-2017/nlp/stopwords.csv
\copy stop_words (word) FROM 'stopwords.csv' WITH DELIMITER ',' CSV HEADER
I then can filter out stop words with a subquery before counting word frequencies again. Now it returns words of some value (though the list of stop words could be expanded).
SELECT s.word,
COUNT(s.word)
FROM corpus c
JOIN text t
ON c.text_id = t.id
CROSS JOIN LATERAL unnest(string_to_array(LOWER(t.speech), ' ')) AS s(word)
WHERE s.word NOT IN (SELECT word FROM stop_words)
GROUP BY s.word
ORDER BY COUNT(s.word) DESC
LIMIT 10;
word | count |
---|---|
 | 6624 |
people | 1003 |
country | 917 |
india | 902 |
also | 746 |
us | 659 |
new | 625 |
would | 600 |
one | 564 |
want | 455 |
*I’m actually not sure why an empty string is the most frequent “word”…
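A likely culprit (my assumption, not something the course covered): `string_to_array(speech, ' ')` splits on single spaces only, so consecutive spaces or newlines produce empty-string tokens. Python’s `str.split` shows the same contrast between splitting on a literal space and splitting on any run of whitespace:

```python
text = "Fellow  countrymen,\nit has been"

# split(' ') mirrors string_to_array(text, ' '): the double space yields
# an empty token, and the newline leaves two words glued together.
print(text.split(" "))   # ['Fellow', '', 'countrymen,\nit', 'has', 'been']

# split() with no argument splits on any run of whitespace instead,
# similar to regexp_split_to_table(text, '\s+') in Postgres.
print(text.split())      # ['Fellow', 'countrymen,', 'it', 'has', 'been']
```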
Another step forward would be to apply stemming to the keywords before counting them. This is a particularly important step if creating an index.
The specialization showed how to create a kind of stem-to-keyword dictionary to demonstrate how it works. But to implement it on my own, it’s obviously much better to use the built-in features of PostgreSQL.
PostgreSQL has a function, `to_tsvector()`, that turns a text string (like a speech) into a normalized list of lexemes, handling natural language tasks like stemming and stop-word removal along the way. I’ll use this function to create the list of lexemes that represents each speech document, and then apply `ts_stat()` to extract some statistics, like the most common keywords.
SELECT * FROM ts_stat(
$$SELECT to_tsvector('english', t.speech)
FROM corpus c
JOIN text t
ON c.text_id = t.id$$)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
word | ndoc | nentry |
---|---|---|
countri | 73 | 2191 |
india | 73 | 1753 |
peopl | 73 | 1548 |
year | 72 | 1112 |
us | 73 | 993 |
also | 71 | 922 |
nation | 73 | 869 |
govern | 66 | 859 |
today | 73 | 844 |
new | 67 | 781 |
Not surprisingly, patriotic word stems referring to “country”, “India”, “people”, and “nation” are the most common.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".