Workshop in Text Processing (Warsaw University, November 26, 2010)
Texts
Belarusian: Якуб Колас (Yakub Kolas), “У палескай глушы”
English: Martin Luther King, “I Have a Dream”
Programs
- ActivePerl 5.8.9.827, Windows Installer (MSI) (version 5.10 also works)
- Notepad++
Stuff to read
Regular Expressions
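“Регулярные выражения на пальцах” (“Regular Expressions Made Simple”; in Russian, with PHP examples)
Denisov Yu. A., lecture “Регулярные выражения” (“Regular Expressions”), from the course “Программирование для гуманитариев” (“Programming for Humanities Students”; in Russian)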
How-to for Perl concordancer
- Download the Simple concordancer (Perl script stool.pl): click and save.
- Unzip the downloaded file into any folder, for example into the folder stool on drive D:.
- Go to D:\stool and open the file text.txt with Notepad++, or create your own text file. Make sure that the text file is encoded as “UTF-8 without BOM”:
  - Open your text in a browser or in Word.
  - Copy the entire content of the document.
  - Paste it into your text file opened in Notepad++ and save it. Check that the copy succeeded and that all characters were copied correctly. If you have problems, make sure that the keyboard layout is switched to the language of the document you are copying.
- Prepare the text for processing: remove clutter and any information that is not part of the text under investigation (author name, part/chapter numbers, page numbers, etc.).
- Start the Command Prompt and go to D:\stool.
- Open stool.pl from D:\stool with Notepad++.
- Change the settings in stool.pl to suit your needs and save the changes.
- In the Command Prompt window, run the commands:
cd /d D:\stool
perl stool.pl
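The “UTF-8 without BOM” requirement from the steps above can also be checked programmatically before running the script. The following is a minimal sketch in Python (not Perl, and not part of stool.pl); the function names are hypothetical and chosen only for illustration.

```python
import codecs


def check_utf8_without_bom(data: bytes) -> str:
    """Classify raw file bytes as 'ok', 'has BOM', or 'not UTF-8'."""
    # A UTF-8 byte order mark is the three bytes EF BB BF at the start.
    if data.startswith(codecs.BOM_UTF8):
        return "has BOM"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Typical for text saved in a legacy code page such as Windows-1251.
        return "not UTF-8"
    return "ok"


def check_file(path: str) -> str:
    """Read a file in binary mode and classify its encoding."""
    with open(path, "rb") as handle:
        return check_utf8_without_bom(handle.read())


# Demonstration on in-memory bytes, so no file is needed to try it out.
print(check_utf8_without_bom("Якуб Колас".encode("utf-8")))
print(check_utf8_without_bom(codecs.BOM_UTF8 + b"text"))
```

To check a real file, call check_file("text.txt") from the folder that contains it. A result other than "ok" suggests re-saving the file in Notepad++ with the encoding set to UTF-8 without BOM.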
If you see the message
“perl” is not recognized as an internal or external command, operable program or batch file
or, on a Russian-language Windows system,
“perl” не является внутренней или внешней командой, исполняемой программой или пакетным файлом
run the command
C:\perl\bin\perl stool.pl
If the problem persists, make sure that Perl is installed (that is, that the folder Perl exists on drive C:).
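The troubleshooting logic above (try perl from the search path first, then a fixed install folder) can be sketched as follows. This is an illustrative Python snippet, not something the workshop materials provide; find_program is a hypothetical helper, and the fallback path is an assumption based on the install folder mentioned above.

```python
import os
import shutil


def find_program(name, fallback):
    """Return a usable path to the program, or None if it cannot be found."""
    # shutil.which searches the folders listed in the PATH variable,
    # which is what the command prompt does when you type "perl".
    found = shutil.which(name)
    if found:
        return found
    # Otherwise try the explicit location (assumed install folder).
    if os.path.isfile(fallback):
        return fallback
    return None


perl = find_program("perl", r"C:\perl\bin\perl.exe")
print(perl or "Perl not found: check that it is installed")
```

If the function returns None, neither the search path nor the assumed folder contains Perl, which corresponds to the last case described above.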
If everything went well, you will see:
done: freq dict
done: all
or
done: freq dict
done: concordance
done: all
Which output you get depends on your settings in stool.pl: to get a concordance, you must set the word form to search for.
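For readers unfamiliar with the two outputs, here is a minimal sketch of what a frequency dictionary and a concordance are. It is written in Python and is not the code of stool.pl; the tokenization rule, the function names, and the context width are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenization: runs of word characters, optionally joined by a
# hyphen or apostrophe. \w is Unicode-aware in Python 3, so Cyrillic
# words match too (it also matches digits and underscores).
WORD = re.compile(r"\w+(?:[-'’]\w+)*")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return WORD.findall(text.lower())


def frequency_dictionary(tokens):
    """Return (word form, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, word_form, width=2):
    """Return each occurrence of word_form with `width` tokens of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word_form:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Made-up sample sentence, used only to demonstrate the functions.
tokens = tokenize("The cat sat on the mat and the dog sat too")
print(frequency_dictionary(tokens)[:3])
for line in concordance(tokens, "sat"):
    print(line)
```

The frequency dictionary needs no input beyond the text, while the concordance needs a word form to look up, which mirrors why the concordance line appears in the output only when a search word is set.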