Workshop in Text Processing (Warsaw University, November 26, 2010)

Texts

Belarusian: Якуб Колас — У палескай глушы

English: Martin Luther King — I have a Dream

Programs


Stuff to read

Regular Expressions

Introduction

Perl Regular Expressions

“Регулярные выражения на пальцах” (“Regular expressions made simple”; in Russian, with PHP examples)

Yu. A. Denisov. Lecture “Регулярные выражения” (“Regular Expressions”), from the course “Программирование для гуманитариев” (“Programming for Humanities Scholars”; in Russian)

Регулярные выражения (“Regular Expressions”; in Russian)

perlcheat


How-to for Perl concordancer

  1. Download the Simple concordancer (Perl script stool.pl): click the link and save the file.
  2. Unzip the downloaded file into any folder, for example into the folder stool on disk D: (D:\stool).
  3. Go to D:\stool and open the file text.txt with Notepad++, or create your own text file.

    Make sure that the text file is encoded as “UTF-8 without BOM”:

    (Screenshot: Notepad++ with a UTF-8-encoded text file)

  4. Open your text in a browser or in Word.
  5. Copy the entire content of the document.
  6. Paste it into your text file open in Notepad++ and save it. Check that the copy succeeded and that all characters were copied correctly.
  7. If you have problems, make sure that the keyboard layout is switched to the language of the document you are copying.

  8. Prepare the text for processing: remove clutter and anything that is not part of the text under investigation (author name, part/chapter numbers, page numbers, etc.).
  9. Start the Command Prompt and go to D:\stool:

    cd /d D:\stool

  10. Open stool.pl from D:\stool with Notepad++.
  11. Change the settings in stool.pl to suit your needs and save the changes.
  12. In the Command Prompt window, run the command:

    perl stool.pl

    If you see the message:

    “perl” is not recognized as an internal or external command, operable program or batch file

    or, on a Russian-language system,

    “perl” не является внутренней или внешней командой, исполняемой программой или пакетным файлом

    run the command with the full path to Perl:

    C:\perl\bin\perl stool.pl

    If the problem persists, make sure that Perl is installed (that is, that the folder Perl exists on disk C:).

    If everything went well, you will see:

    done: freq dict
    done: all

    or

    done: freq dict
    done: concordance
    done: all

    Which of the two you get depends on your settings in stool.pl: a concordance is produced only if you set the word form to search for.
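
To make the two outputs reported above ("freq dict" and "concordance") more concrete, here is a minimal sketch of what a frequency dictionary and a keyword-in-context concordance are. It is written in Python purely for illustration and is not how stool.pl works internally; the sample text and the function names tokenize, frequency_dictionary and concordance are assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical sample text; in the workshop setup this would be the contents of text.txt.
TEXT = "I have a dream that one day this nation will rise up. I have a dream today."


def tokenize(text):
    """Split text into lowercase word forms.

    \\w matches Unicode letters (and digits/underscore), so Cyrillic text works too.
    Note that apostrophes inside words will split them; this is a simplification.
    """
    return re.findall(r"\w+", text.lower())


def frequency_dictionary(tokens):
    """Return (word form, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def concordance(tokens, word, width=3):
    """Return every occurrence of `word` with up to `width` tokens of context on each side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = tokenize(TEXT)
freq = frequency_dictionary(tokens)
kwic = concordance(tokens, "dream")

# Print the three most frequent word forms, then the concordance lines.
for form, count in freq[:3]:
    print(form, count)
for line in kwic:
    print(line)
```

The concordance function needs a word form to look for, which mirrors the behaviour described above: without a search word there is nothing to build a concordance from, so only the frequency dictionary is available.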

