Saturday, April 15, 2023

Get the text in TG4 Subtitles

I have been watching Ros na Rún, mainly because I am a messy bitch who lives for drama, but also to learn Irish. The subtitles are really high quality. Here is how to extract them.

Go to the episode you want and open the developer tools so they are visible at the bottom of the browser.


Also make sure the subtitles are turned on.



We want the Network tab in the developer tools.



Search for VTT (in the second search box) and open that file in a new browser tab.

This can look slightly funny, as the Irish text is encoded in UTF-8 and your browser might treat it as plain ASCII, which has no fadas. But if you save the page and open it in a text editor it should look fine.
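
To see why the fadas can look wrong, here is a small Python sketch. It assumes the browser falls back to Latin-1, which is only a guess at what a given browser does; the point is just that the same bytes read with the wrong encoding come out garbled.

```python
# The Irish line as it is stored in the file: UTF-8 bytes.
raw = "Tá siad ag ceapadh anois".encode("utf-8")

# Assumption: the browser guesses a single-byte encoding (Latin-1 here).
# Each fada is two bytes in UTF-8, so it shows up as two wrong characters.
garbled = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 restores the text.
correct = raw.decode("utf-8")

print(garbled)
print(correct)
```

Nothing is lost in the file itself; only the way it is displayed changes.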

WEBVTT
X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:0

01:21.800 --> 01:24.560
Tá siad ag ceapadh anois
gur seipsis atá i gceist.

01:24.640 --> 01:26.680
Ach tiocfaidh sé tríd,
nach dtiocfaidh?
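
Once the file is saved, the text can be pulled out with a few lines of Python. This is a rough sketch, not a full WebVTT parser: it assumes the file looks like the sample above (a header, then timing lines followed by text), and the function name `vtt_to_lines` and the file name `episode.vtt` are made up for illustration.

```python
import re

# A cue timing line such as "01:21.800 --> 01:24.560" (the hours part is optional).
TIMING = re.compile(
    r"^(?:\d+:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d+:)?\d{2}:\d{2}\.\d{3}"
)

def vtt_to_lines(data):
    """Return one string of subtitle text per cue, given the raw file bytes."""
    # Decode explicitly as UTF-8 so the fadas survive ("utf-8-sig" also drops a BOM).
    text = data.decode("utf-8-sig")
    cues = []
    current = None  # lines of the cue being read, or None when outside a cue
    for line in text.splitlines():
        line = line.strip()
        if TIMING.match(line):
            current = []
            cues.append(current)
        elif not line:
            current = None  # a blank line ends the cue
        elif current is not None:
            current.append(line)
        # anything else (WEBVTT header, X-TIMESTAMP-MAP, cue ids) is skipped
    return [" ".join(cue) for cue in cues if cue]

sample = (
    "WEBVTT\n"
    "X-TIMESTAMP-MAP=LOCAL:00:00:00.000,MPEGTS:0\n"
    "\n"
    "01:21.800 --> 01:24.560\n"
    "Tá siad ag ceapadh anois\n"
    "gur seipsis atá i gceist.\n"
    "\n"
    "01:24.640 --> 01:26.680\n"
    "Ach tiocfaidh sé tríd,\n"
    "nach dtiocfaidh?\n"
).encode("utf-8")

for sentence in vtt_to_lines(sample):
    print(sentence)
```

For a saved file, something like `vtt_to_lines(open("episode.vtt", "rb").read())` should do the same job, with `episode.vtt` standing in for whatever name you saved it under.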


Why do this? One of the best ways to learn a language seems to be to listen to stories and learn each new word as it comes along. LingQ uses this approach, but Irish is not popular enough for them to have it in their options. Soaps and plays are also good as they are much closer to how people actually speak than literature is.

TG4 have gone to the trouble of making great subtitles, and they want people to learn the language. It would be good to turn this resource into something that helps people even more. For learners, watching a TV episode and having a list of the new words and their meanings might really help.