|
| | | | T O K E N D U M P
| | | | |
tokendump-0.1
--------------
---
This program was written mainly to create dictionaries from ordinary text
files. By default [0-9], [a-z], [A-Z] and Latin-1 characters form tokens,
but this can be changed by specifying DELIMCHARS.
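The idea can be sketched like this (a minimal C sketch of table-driven
tokenization, not the actual tokendump source: every byte that may not
form a token is translated to 0x20, then the line is split on spaces):

```c
#include <stdio.h>
#include <string.h>

static unsigned char delim[256];

/* Default rule from above: [0-9] [a-z] [A-Z] and Latin-1 (0xC0-0xFF)
 * form tokens; every other byte is a delimiter. */
static void init_delims(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        int token = (c >= '0' && c <= '9') ||
                    (c >= 'a' && c <= 'z') ||
                    (c >= 'A' && c <= 'Z') ||
                    (c >= 0xC0 && c <= 0xFF);
        delim[c] = (unsigned char)!token;
    }
}

/* Translate delimiters to 0x20, then print one token per line. */
static void dump_tokens(char *line)
{
    unsigned char *p;
    char *tok;
    for (p = (unsigned char *)line; *p; p++)
        if (delim[*p])
            *p = 0x20;
    for (tok = strtok(line, " "); tok; tok = strtok(NULL, " "))
        puts(tok);
}
```

With the defaults, a line like "foo, bar-baz" would come out as the
three tokens foo, bar and baz, one per line.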
---
NOTES:
[*]
Requires a 68020+ (no FPU needed), OS 2.04+ (theoretically), and 16+ KB of free memory.
[*]
For your convenience I have added an ISO 8859 character map, so you can
easily pick DELIMCHARS values. The table on the right contains the chars
that will form tokens by default.
0 1 2 3 4 5 6 7 8 9 a b c d e f 0 1 2 3 4 5 6 7 8 9 a b c d e f
0 0
1 1
2 ! " # $ % & ' ( ) * + , - . / 2
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 3 0 1 2 3 4 5 6 7 8 9
4 @ A B C D E F G H I J K L M N O 4 A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _ 5 P Q R S T U V W X Y Z
6 ` a b c d e f g h i j k l m n o 6 a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ 7 p q r s t u v w x y z
8 8
9 9
a ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ a
b ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ b
c À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï c À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
d Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß d Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
e à á â ã ä å æ ç è é ê ë ì í î ï e à á â ã ä å æ ç è é ê ë ì í î ï
f ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ f ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
[*]
After tokenization, pass the resulting file to 'dupfilter' to get rid of
any duplicate tokens.
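A duplicate-removal pass like that might look as follows (assumed
behaviour only; the actual dupfilter tool is not part of this archive's
sources shown here): sort the token list, then emit each token once.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Sort toks[0..n-1], compact equal neighbours to the front, and
 * return the number of distinct tokens. */
static size_t unique(const char **toks, size_t n)
{
    size_t i, out = 0;
    qsort(toks, n, sizeof *toks, cmp);
    for (i = 0; i < n; i++)
        if (i == 0 || strcmp(toks[i], toks[i - 1]) != 0)
            toks[out++] = toks[i];
    return out;
}
```

Sorting first means duplicates land next to each other, so one linear
pass is enough to drop them.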
---
HELP:
> tokendump ?
TEXTFILE/A,ND=NODIGITS/S,NL=NOLATIN/S,DELIMCHARS
TEXTFILE/A     - Text file to be wordified/tokenized. Max line size is
                 16 KB.
ND=NODIGITS/S  - Tokens must not contain digits (0x30-0x39).
NL=NOLATIN/S   - Tokens must not contain Latin-1 characters (0xC0-0xFF).
DELIMCHARS     - Chars that will be treated as delimiters. You can
                 specify them in hexadecimal or decimal notation,
                 separated by commas. Passing a negative value means
                 that the char must NOT be a delimiter. By default
                 0x01 - 0x1F, 0x21 - 0x2F, 0x3A - 0x40, 0x5B - 0x60
                 and 0x7B - 0xBF will be translated to 0x20
                 (delimiter).
---
USAGE:
; Tokenize text file allowing dashes and underscores
tokendump <textfile> -0x2D,-0x5F >tokens.txt
---
megacz
| |
| | | | |
|