Wednesday 25 April 2012

Computer language mystery solved by humans


Computers have languages, too. According to an article in the American Scientist, even the experts do not agree on how many programming languages there are – estimates range from 2,500 to over 8,500.

One recent example which highlighted this variety was the mystery of the programming language used in the creation of “Duqu”, a computer Trojan which has been studied by heavyweight anti-virus companies like Symantec, Kaspersky Labs and F-Secure. These IT giants were able to see the code which this Trojan consisted of, but they were not able to identify which programming language it had been written in.

Why didn’t they ask a computer?
To me, as a mere computer user without a programming background, the solution appears simple. It is a computer language, and a computer is obviously able to follow the instructions in the code (otherwise the Trojan would be of no use to the crooks who created it). So a computer should be able to identify what language it is. This seems to be an obvious logical conclusion.

But it is not so. Igor Soumenkov, a Kaspersky Lab expert, wrote a blog article “The Mystery of the Duqu Framework”. The article outlines the history of the study of Duqu and the structure of the threat which it poses, and it ends with an appeal which amazed me: “We would like to make an appeal to the programming community and ask anyone who recognizes the framework, toolkit or the programming language that can generate similar code constructions, to contact us or drop us a comment in this blogpost.”

Digital guesswork?
Soumenkov received a flood of blog comments and e-mail responses, and the mystery of the programming language has now been solved. But it is interesting to check out the wording of the 159 comments on the original blog article. They are peppered with phrases like:
That code looks familiar
It may be a tool developed by ...
I think it's a ...
What about ...?
Just a guess ... the first thing that pops to my mind is ...
Sounds a lot like ...
I am not a specialist but I would say it could be ...
One more guess ...
This does smell to me a little bit like ...
I'm gonna take a wild guess ...
Plus a generous sprinkling of words like might, perhaps, maybe, probably, similar, clue, feel, remember, possibility and similar vague terms.

Data or brains?
For me, this throws an interesting light on the use of computers in natural language processing. The human guesswork in the comments on Duqu included many ideas that turned out to be wrong, but the brainstorming process was helpful to the computer experts involved, and the fuzzy process of human thinking led to a solution which evidently was not possible with the computer alone. And all of this for a language which is only useful in computers and has no meaning for human communication (when did you last _class_2.setup_class13[esi]?).

The situation in translation between human languages is comparable. Automatic translation programs from Google, Microsoft, IBM and others can achieve a certain amount of pattern recognition and sometimes come up with plausible solutions. But only a competent human being can evaluate whether this solution is really accurate or appropriate. So these programs can be a useful tool in the hands of an expert, but there is a distinct risk that they may get the wrong end of the stick.

Friday 2 March 2012

Would I advise my grandchildren to translate?

Bang, bang, bang.
Is this another nail in the coffin of freelance translation as a career?
A recent article on the blog of the Translation Automation User Society (TAUS) does not hold out much hope for specialist translators. The title of the article is “Who gets paid for translation in 2020?”. I would love to quote the author of this article by name, but no name is given. Perhaps this is a model article, generated by a computer, untouched by human hand. This would graphically illustrate the creed which underlies the article:
“In 2020 words are ‘free’. Almost every word has already been translated before. Our words will be stored somewhere and used again, legitimately in the eyes of the law or not. .... Even today ‘robots’ are crawling websites to retrieve billions of words that help to train machine translation engines. The latent demand for translation created by unprecedented globalization is making piracy an act of common sense.”
The TAUS vision paints a glowing picture of a completely automated future, with instant computerised translation in every hand-held device, every computer application and on every website, without any need for specialist intervention. To achieve this, TAUS aims to build up a database of all the translation work done in the world. It seems to envisage three methods to do this:
BEG, SCAVENGE and STEAL
BEG: In conference lectures, blog articles and other publications, TAUS calls on translators to donate their translations to its central database. The reward for doing this is to know that we are contributing to the BRAVE NEW WORLD of global computerised translation. There may be some payback in the form of access to databases provided by others, but the rhetoric of the begging prose is that we should contribute for free to the ideal of a humanity without language barriers.
SCAVENGE: The above quote speaks of the “robots” which are retrieving billions of translated words to train machine translation engines. But a scavenger takes everything it can find; it cannot afford to be fussy about quality. Two experts in the industry have important things to say about this. The first is Kirti Vashee, in his blog eMpTy Pages. Kirti is an ardent advocate of machine translation, but he insists that the data used to train the translation engines must be of extremely high quality. The danger of the TAUS vision of innumerable robots scavenging for more and more data is that the haul can include lots of low-quality data, so the resulting translations will be inherently problematic. The other expert is Miguel Llorens, a highly insightful freelance translator who ridicules many of the assumptions of the machine translation gurus and elegantly criticises buzzwords such as the “content tsunami” and “crowdsourcing”.
As an aside: Kirti and Miguel disagree on many things - I suppose it is not often that they are recommended as two leading experts in the debate on machine translation.
STEAL: It has often been suggested that Internet giants such as Google and Facebook are in fact data-gobbling monsters which think nothing of violating data protection standards. But at least in their public statements, they usually claim to respect the privacy of their users and to comply with data protection laws. Not so TAUS. In the above quotation, TAUS explicitly suggests that piracy is “an act of common sense”. I wonder if the similarity to the confiscation of private assets in the ideology of Marx, Stalin and others is merely accidental. Brave new world indeed!
Translation and my grandchildren
By the time the brave new world predicted by TAUS comes to pass (2020), my own translation career will be drawing to a close, or perhaps already over. But what about my wonderful grandchildren? They will be on the threshold of their working lives (and some will still be in primary school). What should I tell them if they ask about translation as a career?
I will say: “Why not - if that is what you are really good at.” Of course I will point out the general principles of working in a career like translation: real language expertise in two languages, realistic self-appraisal and self-management, translating skills, the need for solid specialisation, how to use the tools of the trade (including computer-aided translation and various forms of machine translation), how to advertise and find customers and much more.
This is because essentially I do not accept the TAUS creed that “Almost every word has already been translated before.” Even at the word level, in my work I regularly come across newly created terms or compound words (German legal and architectural prose has an amazing level of inventiveness in this respect). And at the sentence level, every language on earth has an incredible potential for creative new combinations of ideas and even new linguistic structures - after all, I believe that we are still building the tower and city of Babel.

Tuesday 17 January 2012

12 facts, hints and ideas on databases in DVX2

Déjà Vu X2 is a “Translation Memory” (TM) program. It does not come with pre-packaged language content. Instead, it remembers your own work, i.e. it acts as a “memory” for what you have “already seen” (= “déjà vu” in French).

1. There are three types of memory:
The TM (Translation Memory), the TB (Termbase) and the lexicon for each project.
  • The TM is a database where you can save the sentences from your source text together with your finished translation.
  • The TB is a terminology database which you can use for single words or whole phrases.
  • The lexicon is a database which only applies to the individual project. For every project file you can create a new lexicon.
When you then work on your project, DVX2 combines the content of these three database types to suggest translations and help you in your work. The methods which DVX2 uses to make these suggestions are known as “Pretranslate”, “Assemble” and “AutoAssemble” – but that is another topic for another day.
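
Purely as an illustration of this division of labour, here is a minimal Python sketch of the three stores and a combined lookup. All of the names, the sample entries and the order of precedence are my own assumptions for the sake of the example; they say nothing about how DVX2 actually organises its data internally.

    # Illustrative sketch only - invented structures, not DVX2's internals.
    tm = {"Der Vertrag tritt am 1. Januar in Kraft.":
          "The contract comes into force on 1 January."}  # sentence pairs
    tb = {"Grünfläche": "green space"}                     # terms and phrases
    lexicon = {"Muster GmbH": "Muster GmbH"}               # this project only

    def suggest(segment):
        """Combine the three stores; here the lexicon is assumed to win."""
        if segment in lexicon:
            return lexicon[segment]
        if segment in tm:
            return tm[segment]
        # Otherwise assemble word by word from the termbase (much simplified).
        return " ".join(tb.get(word, word) for word in segment.split())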

2. Big Mama and Big Papa:
You can keep all of your work in just one TM (“Big Mama”) and one TB (“Big Papa”). If you are careful to give your entries the appropriate subject and client codes, DVX2 will take these codes into account when suggesting translations from your databases. My main TM contains about 40,000 sentence pairs accumulated over 12 years, and my main TB has about 55,000 entries.

3. Separate TMs and TBs:
In DVX2 Professional you can have up to 5 TMs and 5 TBs open in any project, and DVX2 Workgroup has no limitation. So you can use your Big Mama/Papa together with external databases, e.g. a TM or terminology list provided by the client, general reference material such as the EU DGT database, or terminology lists from major enterprises such as Microsoft, SAP or from various banks. Or you may even decide to keep separate databases for different subjects or clients instead of a Big Mama or Big Papa. You may feel that this is safer if you work on texts for competing engineering or IT firms which deliberately use different terminology for their own brands. The problem is that it may be more difficult to access all of your reference material, for example if you know that you have dealt with a term or sentence in DVX2, but you can’t remember which database you were using at the time.

4. Fuzzy matching:
You can allow DVX2 to find matching material which is not quite exact. Under Tools>Options>General you can set a percentage figure for the variants which DVX2 is allowed to find (= “Minimum Score”). The default setting is 75%, but depending on the type of inflections which occur in your languages it may be useful to set it to 50% or less. The percentage applies to both the TM and TB. It does not apply to the lexicon – only exact matches are found in the lexicon. And the “minimum score” does not affect the performance of the DVX2 functions DeepMiner and AutoWrite.
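
DVX2's scoring formula is not published, so the sketch below only imitates the general idea of a percentage match, using the standard Python library difflib. The 75 mirrors the default Minimum Score described above; the sentences and everything else are invented for the example.

    from difflib import SequenceMatcher

    MINIMUM_SCORE = 75  # the default setting described above

    def score(a, b):
        """A rough similarity percentage - not DVX2's actual formula."""
        return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

    old = "Die Lieferung erfolgt innerhalb von 14 Tagen."
    new = "Die Lieferung erfolgt innerhalb von 30 Tagen."
    s = score(old, new)  # about 96, so well above the threshold
    print("fuzzy match" if s >= MINIMUM_SCORE else "no match")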


5. Adding new entries:
This is very quick and easy in DVX2. For the TM you enable AutoSend (either with the tick box at Tools>Options>Environment, or via the icons at the bottom of the DVX2 window – AutoSend is the second icon from the right). Then all you need to do is press CTRL-DownArrow when you have finished each segment. For the lexicon you have to highlight the word or phrase in the source and target text, then hit the F10 key. For the TB you again highlight the word or phrase in the source and target text, then hit F11. This brings up a window for editing the new termbase entry.

Here you can edit the term in either language to add or remove declensions, correct spelling problems etc. You can check that the terms are marked with the right subject and client codes. There are additional fields, too (Definition, Part of Speech, Gender, Number, and you may also see a field called Context). I have not yet seen any reason to use any of these fields, although some users may have found ways to do so.
The termbase (TB) is one of the keys to productivity in DVX2. It is advisable to add words, and even whole phrases, as often as you can. Some users have the principle of adding an entry to the TB in every single sentence they translate. Steven Marzuola’s article about using the terminology database was based on the previous version of DVX (now often called DVX1), but it offers great advice which is also relevant to DVX2.

6. Subject and client codes:
These are important, because DVX2 refers to them when it decides what material to offer to help you with your current translation. When you first install DVX2, you will see a suggested list of subjects, but you can easily delete this and create your own list if you think this is better for your work. Each subject consists of a short index code (for example, 435) and a descriptive text (Regional planning/ecology). When DVX2 decides how close the subject is to your current project, it works hierarchically, so in this example it would consider that entries with my subject codes 43 (Urban planning) and 4 (Building) are closely related. You can use letters instead of numbers if this suits your work.
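
The hierarchical idea can be pictured in a few lines of Python. The prefix-based scoring below is my own stand-in, not the program's real logic, but it shows why an entry coded 43 or 4 counts as closely related to a project coded 435.

    def subject_affinity(entry_code, project_code):
        """Length of the shared prefix - a stand-in for the real hierarchy."""
        shared = 0
        for a, b in zip(entry_code, project_code):
            if a != b:
                break
            shared += 1
        return shared

    for code in ["435", "43", "4", "7"]:
        print(code, subject_affinity(code, "435"))
    # 435 -> 3, 43 -> 2, 4 -> 1, 7 -> 0: the longer the shared prefix,
    # the closer the subject.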


7. Build lexicon:
This is a function which you can find in the “Lexicon” menu, and which is sometimes useful in preparation for a job which is heavy on terminology. I use it for between 5% and 10% of my jobs. My procedure is as follows:
  • First I call up “Build lexicon” and define the maximum number of words (usually 4). The program then takes a couple of minutes to find solutions.
  • Then I open the lexicon (with the Project Explorer), click on the heading over the left-hand column and define the sort criteria: 1. Number of words (descending), 2. Frequency (descending).
  • Then I go through the list manually from the top, deciding which four-word phrases are worth a lexicon entry. This is usually only worthwhile for phrases which are meaningful in themselves and which occur frequently.
  • When I get down to phrases which appear three times or less, I use the scroll bar to move down to the most frequent three-word phrases, and so on, until I have defined a number of lexicon entries.
  • Finally, I select “Remove entries” from the Lexicon menu, click on “Entries with empty targets” and OK.
Typically, this gives me between 30 and 50 lexicon entries for a job consisting of several hundred segments, but they are entries which occur frequently and require consistency, so this preliminary process improves the results achieved by Pretranslate or Assemble as I work on the job.

This function (Build lexicon) can also be used to identify terms that can be used for a terminology list to be delivered to the client if this is part of the client’s instructions for the job. Over the years I have only had one such project, but this may be relevant for translators who often work in highly technical fields.
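
For the curious, the counting and sorting at the heart of such a function can be sketched in a few lines of Python. The real “Build lexicon” does far more; this only mirrors the idea of collecting phrases of up to four words and the sort order described above (number of words descending, then frequency descending). The sample segments are invented.

    from collections import Counter

    def lexicon_candidates(segments, max_words=4, min_freq=2):
        """Count all phrases of 1 to max_words words across the segments."""
        counts = Counter()
        for seg in segments:
            words = seg.lower().split()
            for n in range(1, max_words + 1):
                for i in range(len(words) - n + 1):
                    counts[" ".join(words[i:i + n])] += 1
        # Sort: number of words descending, then frequency descending.
        return sorted(
            ((p, c) for p, c in counts.items() if c >= min_freq),
            key=lambda pc: (-len(pc[0].split()), -pc[1]),
        )

    segments = ["Das öffentliche Gebäude wird saniert.",
                "Das öffentliche Gebäude bleibt geöffnet."]
    for phrase, freq in lexicon_candidates(segments):
        print(freq, phrase)  # "das öffentliche gebäude" comes out on top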

8. Names, places and proprietary titles:
These are the classic elements which should be added to the lexicon. If you have a product name or number, this is normally only relevant to the job in hand. You do not usually want this term to occur in jobs for other clients. The same applies to the names of the people who work for the client. Therefore, such elements should only be sent to the lexicon, and not to the termbase. But some names occur so often that they may be useful in the TB. My general principle here: if names could be confused with actual words in the language, they are not suitable for the TB. So the common German name Helmut is not in my TB because, depending on the level of fuzzy matching, it could be confused with the word Helm=helmet (and the declined forms Helme/Helmen/Helmes). Similarly, the surname Kohl is not in the TB to avoid confusion with Kohl=cabbage (and the near-match Kohle=coal). But the two names together are in the TB – i.e. the former German Chancellor Helmut Kohl. And other famous politicians are there too with the spelling in German and English, such as Gorbatschow/Gorbachev.
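
The rough similarity measure from point 4 makes the risk visible. The exact scores in DVX2 will differ, but the principle is the same: the lower the Minimum Score, the more such collisions you invite.

    from difflib import SequenceMatcher

    def score(a, b):
        return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

    print(score("Helmut", "Helme"))  # about 73: a match at a 50-70% threshold
    print(score("Kohl", "Kohle"))    # about 89: a match even near the default 75%
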
9. Adapting your use of the databases to your languages:
In some cases, your language pair and translation direction will influence the way you use the different databases because of issues such as word order and inflection. One example of this is the English phrase “public green spaces”. In French the words come in a different order, e.g. “espaces verts publics”, and alternative wordings are possible, e.g. “espaces verts des lieux publics”, “espaces verts ouverts au public”, “espaces verts pour le public” etc. (Thanks to Dave Turner for providing these and other examples). In German the first translation that comes to mind is “öffentliche Grünflächen”, although the first word could also be declined as “öffentlichen”.

If you are translating from French to English, you will probably want to enter each and every French phrase as a lexical unit, especially if it occurs frequently in the type of text you deal with. Merely entering the elements does not help very much, because the order of the words must be changed. Depending on your type of work and the frequency of such phrases, you may decide to store them in the lexicon, the TB or the TM.
If you are translating from German, in this case it is sufficient to add the two words to the termbase and let DVX2 handle the endings as “fuzzy matches”. Even if we consider phrases with a greater number of inflected variants such as “public building”, (“öffentliche Gebäude”, “öffentliches Gebäude”, “öffentlichen Gebäudes”, “öffentlichem Gebäude”), it is still possible to enter just one version of each word and use fuzzy matching. The advantage here is that although the German source is inflected, the English target phrase is not.
Translating from a largely uninflected language into inflected languages like French and German can be more complicated, so you will have to find a strategy which fits the languages that you work with. There is no single solution which will work for all languages and all subject areas, but DVX2 offers flexibility in the use of the databases.

10. Looking things up in the database:
There are various ways to access the information in your databases. The first is that DVX2 uses it to compile its suggested translation (when you use the functions “Pretranslate”, “Assemble” or “AutoAssemble”). The second is via the words or phrases in the suggested translation which are underlined in blue; these are terms for which your databases contain several possibilities, and right-clicking on the word or phrase will show you the other suggestions, which you can examine and select with the mouse or by using the number shown. The third way to see the relevant content of your databases is by looking at the “Portions” window or windows. The fourth way is to use Scan (CTRL-S) to call up a concordance from the TM, or Lookup (CTRL-L) to see entries from the TB.


11. Moving databases to another computer:
If you need to move your work to a different computer, e.g. to work on a laptop while you are travelling, you will need to copy certain files to the other computer. The first file is your project file, which has the extension .dvprj. The project file contains the lexicon, so no special steps are needed to transfer the lexicon. The termbase is a single file with the extension .dvtdb. The TM consists of at least four files. The main content is in a file with the extension .dvmdb. Then there is an index file for each of your languages; my index files have the extensions en.dvmdi and de.dvmdi (for English and German). There is also a file with the extension .dvmdx. When you open the project on the other computer, DVX2 may complain that it cannot find the databases. But this is not a problem – when the project is open, you can select them with Project>Properties>Databases.

Another file which is worth moving to the other computer is the settings file with the extension .dvset. This contains your subject and client lists and various other settings. And don’t forget your dongle, or if you use an electronic licence key, make sure that the key will apply to the other computer.
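
If you move machines regularly, a few lines of Python can gather all of these files in one go. The folder paths below are placeholders, and the extension list simply repeats the one above - check it against the files your own installation actually creates.

    import shutil
    from pathlib import Path

    SOURCE = Path("C:/DVX2")             # placeholder - your working folder
    TARGET = Path("E:/laptop-transfer")  # placeholder - e.g. a USB stick

    # Project, termbase, TM (main file, per-language indexes, .dvmdx), settings.
    PATTERNS = ["*.dvprj", "*.dvtdb", "*.dvmdb", "*.dvmdi", "*.dvmdx", "*.dvset"]

    TARGET.mkdir(parents=True, exist_ok=True)
    for pattern in PATTERNS:
        for f in SOURCE.glob(pattern):
            shutil.copy2(f, TARGET / f.name)
            print("copied", f.name)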

12. How to find out more:
For more detailed information it is worth looking at the DVX2 User Guide for DVX2 Professional or DVX2 Workgroup. The link is at the bottom of the page, and the user guides are PDF files with over 600 pages. On the website http://www.atril.com there are also links to various videos, webinars and training courses, and also to the mailing list dejavu-l (under Support>Technical forum).

I already mentioned Steven Marzuola’s article on terminology databases. It is also worth looking at Nelson Laterman’s collection of tips and tricks for DVX1 (and even its predecessor DV3).
I am sure there are plenty of tips and questions which I have not covered, so I am looking forward to reading comments by my readers.