Regular Expressions: An Introduction for Translators

Regular expressions (also known as RegEx) are a very powerful resource and open a full range of possibilities in different programs, including some computer-assisted translation (CAT) tools. You can think of regular expressions as a search-and-replace function on steroids. Regular expressions can assist our translation work by allowing us to search, replace, and filter text in ways that would otherwise be impossible in our software tools.

Have you ever wondered how much easier it would be if you could:

  • Perform two or more separate searches at the same time (e.g., searching for different forms of the same term, or perhaps for different words altogether)?
  • Filter text in your CAT tool to display only those segments that are capitalized differently between the source and target?
  • Search a glossary for all capitalized headwords and change them to lowercase while leaving acronyms and other terms that are in all uppercase unchanged—and do all this in a single operation?
  • Filter text in your CAT tool to find the segments where the end punctuation differs between the source and target?
  • Filter text in your CAT tool to display only the segments that contain certain words in the source, or the segments that don’t contain a specific word in the target?

In short, have you ever wondered how much easier it would be if you could do something beyond what the normal search-and-replace function can do? If yes, then regular expressions may help you.

At first, regular expressions may appear cryptic, but once you’ve learned the basics and seen how useful they can be, you’ll be able to decide how much time you want to invest to become more proficient at using them. The following will focus on using regular expressions for searching, replacing, and filtering text in CAT tools such as SDL Trados Studio, memoQ, or Xbench. CAT tools also use regular expressions for creating segmentation and auto-translation rules, or for protecting tags.

Figure 1: Using a Single Regular Expression to Find Different Forms of the Same Term

What Are Regular Expressions and How Can They Help Translators?

A regular expression is a special sequence of characters or symbols that defines a search pattern. This pattern is then used to search for (or replace) specific instances of words or phrases in a text.1 Regular expressions are used by search engines, text editors, and text processing utilities, and for lexical analysis.

The simplest regular expressions use no symbols, just normal characters. For example, to find all instances of words that contain “able,” you would use the regular expression comprising the search string “able,” which would not only find the word “able,” but also “enable,” “able-bodied,” and “agreeable.” But if this was all that regular expressions could do, they wouldn’t be interesting, or particularly useful, or challenging to learn.

As we’ll see from the examples that follow, what makes regular expressions powerful are the various symbols and characters you can use with them.

Ways to Use Regular Expressions

Finding Different Forms of the Same Term: Here is how you can use a regular expression to ensure that the terms “gray” and “preamplifier” are spelled consistently in your translation:

  • gr(a|e)y or gr[ae]y: Finds the word “gray” spelled both with an “a” (“gray”) and with an “e” (“grey”).
  • pre[ -]?amplifier: Finds instances of “pre amplifier,” “pre-amplifier,” and “preamplifier.” (See Figure 1.)

Let’s see how these regular expressions work.

  • gr(a|e)y: This regular expression searches for the letters “gr” followed by a group (enclosed in parentheses) that contains an alternative spelling (marked by the pipe symbol “|”): either the letter “a” or “e” followed by the letter “y.”
  • gr[ae]y: Here we do the same, but this time using a set (enclosed in brackets) of the characters possible in that position: the letters “a” or “e.” (Note: both the group and the set in these examples represent searches for only one letter, but provide alternatives for which letter that could be.)
  • pre[ -]?amplifier: This expression uses a set (enclosed in brackets) containing a space and a hyphen, followed by a question mark. The question mark is a quantifier: it tells the regular expression how many times the preceding element or character should be matched, in this case “0 or 1” times. So, this regular expression searches for “pre,” followed by a space, a hyphen, or nothing at all (this is where the “zero times” comes into play), followed by “amplifier.” (Note: In Xbench, this regular expression would need to be changed to pre[ \-]?amplifier, with the hyphen escaped.)
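
The three patterns above can be tried out in a few lines of Python. This is only a sketch: Python’s re module stands in for the search engine of a CAT tool (an assumption, since each tool uses its own engine), and the sample sentence and the helper all_matches are invented for the illustration.

```python
import re

# Hypothetical sample sentence, invented for this illustration.
text = ("The gray cable and the grey cable both go to the pre amplifier; "
        "the manual says pre-amplifier, the label says preamplifier.")

def all_matches(pattern, s):
    """Return the full text of every match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]

# Group with alternatives: "a" or "e" between "gr" and "y".
group_hits = all_matches(r"gr(a|e)y", text)
# Set: a single character that is either "a" or "e".
set_hits = all_matches(r"gr[ae]y", text)
# Optional set: a space or a hyphen, zero or one time.
optional_hits = all_matches(r"pre[ -]?amplifier", text)

print(group_hits)     # ['gray', 'grey']
print(set_hits)       # ['gray', 'grey']
print(optional_hits)  # ['pre amplifier', 'pre-amplifier', 'preamplifier']
```

The group and the set find exactly the same words here; the set is simply the shorter way to write single-letter alternatives.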

Searching for Multiple Words at the Same Time: Let’s say you’ve just received a long translation to edit. Since it was done by translators from different countries, you notice that they used different words for the same term. You want to filter the target text to see all the segments that contain either “melocotón” or “durazno,” two alternative translations for the word “peach.” Using the simple regular expression (melocotón|durazno) does the trick. (Note that searching for alternatives is not limited to only two terms.)
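
To see what such a filter does, here is a minimal Python sketch. The three target segments are invented for the example, and Python’s re module is assumed as a stand-in for the filter of a CAT tool.

```python
import re

# Hypothetical target segments, invented for this illustration.
segments = [
    "El melocotón está maduro.",
    "Compré un durazno en el mercado.",
    "La manzana es roja.",
]

# One pattern, two alternatives separated by the pipe symbol.
either_term = re.compile(r"(melocotón|durazno)")

# Keep only the segments in which at least one alternative is found.
hits = [s for s in segments if either_term.search(s)]
print(hits)
# ['El melocotón está maduro.', 'Compré un durazno en el mercado.']
```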

Figure 2: Find All Segments Where Capitalization Doesn’t Match between the Source and Target Text

Finding All Segments Where Target Capitalization Doesn’t Match the Source: The following pair of regular expressions can be used to ensure that capitalization in your target document matches capitalization in your source:

  • ^[A-Z] (capitalization in the source)
  • ^[a-z] (capitalization in the target)

This works in the text filter of tools like memoQ, Studio 2017, or Xbench to find segments that are capitalized in the source text but not in the target. (In these tools, you would use the regular expression search mode and select “match case” or “case sensitive.” See Figure 2.)

In the examples above, the caret symbol (^) at the beginning of each regular expression signals the beginning of a string or segment. In each expression, the caret is followed by a set of letters in brackets. Each set contains a range: the hyphen between the letters marks the range within the set. The first set is the range of all uppercase letters and the second is the range of all lowercase letters. You can specify different ranges as necessary and have several ranges in a set. For example, you can use [A-G] to designate the range of all uppercase letters from “A” through “G,” and [0-9A-Za-z] to designate all digits and all uppercase or lowercase letters.
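
The logic of this paired filter can be sketched in Python as follows. The segment pairs are hypothetical, and Python’s re module (which is case-sensitive by default) is assumed in place of the filter of a CAT tool.

```python
import re

# Hypothetical (source, target) segment pairs, invented for this illustration.
pairs = [
    ("Open the file.", "abra el archivo."),
    ("Open the file.", "Abra el archivo."),
    ("click here", "haga clic aquí"),
]

source_re = re.compile(r"^[A-Z]")  # source starts with an uppercase letter
target_re = re.compile(r"^[a-z]")  # target starts with a lowercase letter

# A pair is reported only when both conditions hold at the same time.
mismatches = [(s, t) for s, t in pairs if source_re.search(s) and target_re.search(t)]
print(mismatches)  # [('Open the file.', 'abra el archivo.')]

# Several ranges in one set: digits, uppercase, and lowercase letters.
is_alphanumeric = re.fullmatch(r"[0-9A-Za-z]+", "Rev2b") is not None
print(is_alphanumeric)  # True
```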

Normalize Capitalization of Headwords in a Glossary: Let’s say you have a glossary in a tab-delimited format and it’s a mess: some headwords are capitalized, some are not, and some are acronyms in all uppercase. When preparing to import the glossary into your termbase, you decide you want to find all the capitalized headwords and replace them with lowercase while leaving the terms that are all in capital letters untouched. That is, you want to change this:

Acción correctiva           Corrective action
ácido nitrilotriacético     NTA, Nitrilotriacetic Acid
ADN recombinante            Recombinant DNA
bajada del nivel de agua    drawdown
Datos EMAP                  EMAP data
DDT                         DDT
Desperdicios domésticos     Household waste
Empaque a prueba de niños   CRP, Child-Resistant Packaging

into this:

acción correctiva           corrective action
ácido nitrilotriacético     NTA, Nitrilotriacetic Acid
ADN recombinante            recombinant DNA
bajada del nivel de agua    drawdown
datos EMAP                  EMAP data
DDT                         DDT
desperdicios domésticos     household waste
empaque a prueba de niños   CRP, Child-Resistant Packaging

You can do this in a text editor like Notepad++ using a pair of regular expressions: (^|\t)([A-Z])([a-z]) in the “Find what” field, and $1\L$2$3 in the “Replace with” field. (See Figure 3.) Let’s walk through the process.

In the first highlighted box in the “Find what” section in Figure 3, we start by telling Notepad++ to search for the beginning of a line (represented by the symbol “^”) or a tab character (\t), and then for a word that begins with any capital letter ([A-Z]) followed by any lowercase letter ([a-z]). Each of these items is enclosed in parentheses to form its own group. In the second highlighted box in the “Replace with” section, we tell Notepad++ to replace the first group, the beginning of the line or tab character matched by (^|\t), with itself ($1 refers to the first group). We then tell Notepad++ to replace the capital letter in the second group with the same letter in lowercase (\L$2), and to leave the third group unchanged ($3).2 The end result: Notepad++ finds words that begin with a capital letter followed by a lowercase one, lowercases the initial, and skips any acronyms that are all uppercase.
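
The same search can be reproduced outside Notepad++. The sketch below uses Python, whose re module does not support \L in the replacement string, so a small replacement function (lower_initial, a name invented here) does the lowercasing instead. The glossary rows are taken from the example above, with a tab between the two columns.

```python
import re

# A few rows from the sample glossary, tab-delimited.
glossary = "\n".join([
    "Acción correctiva\tCorrective action",
    "ADN recombinante\tRecombinant DNA",
    "Datos EMAP\tEMAP data",
    "DDT\tDDT",
])

# Same search pattern as in Notepad++; MULTILINE makes ^ match at every line start.
pattern = re.compile(r"(^|\t)([A-Z])([a-z])", re.MULTILINE)

def lower_initial(m):
    # Group 1 (line start or tab) and group 3 are kept; group 2 is lowercased.
    return m.group(1) + m.group(2).lower() + m.group(3)

normalized = pattern.sub(lower_initial, glossary)
print(normalized)
```

Only a capital letter that comes right after the start of a line or a tab, and that is followed by a lowercase letter, is changed. That is why “ADN,” “EMAP,” and “DDT” come through untouched.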

Figure 3: Normalize Glossary Capitalization in Notepad++

Figure 4: Finding Segments Where the End Punctuation Doesn’t Match between the Source Text and the Target

Finding All Segments Where the End Punctuation in the Target Text Doesn’t Match the Source: Here are two regular expressions you can use to ensure that punctuation in your target text matches the punctuation in the source:

  • \.$ (punctuation in the source)
  • [^.]$ (punctuation in the target)

These expressions find all segment pairs that end with a period in the source text but not in the target. In the first expression, the “$” signals the end of a string or segment, and the backslash (\) followed by the “.” signals the period. (The backslash is the escape character.3) This tells the regular expression to find all segments that end in a period. In the second expression, the caret inside the set marks negation, so [^.] indicates “any character that is not a period.” Therefore, [^.]$ will find all the segments that don’t end in a period. (See Figure 4.) You can modify this expression to search for other punctuation marks (e.g., \?$ and [^?]$ would find all segments ending with a question mark in the source but not in the target).
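
As a quick check of how the two expressions work together, here is a minimal Python sketch with invented segment pairs. Python’s re module is assumed as a stand-in for the filter of a CAT tool.

```python
import re

# Hypothetical (source, target) segment pairs, invented for this illustration.
pairs = [
    ("Save your work.", "Guarde su trabajo."),
    ("Save your work.", "Guarde su trabajo"),
    ("Are you sure?", "¿Está seguro?"),
]

source_re = re.compile(r"\.$")    # source ends with a period (escaped dot)
target_re = re.compile(r"[^.]$")  # target ends with anything but a period

mismatches = [(s, t) for s, t in pairs if source_re.search(s) and target_re.search(t)]
print(mismatches)  # [('Save your work.', 'Guarde su trabajo')]
```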

Finding All Terms Enclosed in Double Quotes: You can use these regular expressions to find all quoted terms in a document so you can add them to your termbase:

  • ("|“).*?("|”): This finds all items enclosed in double quotes, both straight and curly. First, the regular expression finds the opening double quotes (either straight or curly). Then it finds the content enclosed in the quotes, ending with the closing double quotes. In this expression the “.” means “any character that is not a paragraph mark (new line).” The asterisk “*” is another quantifier that means “between zero and any number of times,” while the question mark “?” that follows it means “but only until you find the first occurrence of what comes next.” Without the question mark, the regular expression would match everything up to the last closing double quotes in the segment. This regular expression works in both memoQ and Studio. (See Figure 5.)
  • ("|“)[^"”]*("|”): The [^"”]* means “any run of characters that are not closing quotes.” This regular expression is similar to the one above, but also works in Xbench (the first one doesn’t). Remember, not all tools use the same regular expression search engine, so what works in one tool may not work the same way in another.4
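
The difference the question mark makes is easy to see in a short Python sketch. The segment is invented for the example and Python’s re module is assumed in place of the engines used by memoQ, Studio, or Xbench. The third pattern shows what happens when the question mark is left out.

```python
import re

# Hypothetical segment with one straight-quoted and one curly-quoted term.
segment = 'Select "Save As" and then “Export” from the menu.'

def all_matches(pattern, s):
    """Return the full text of every match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]

lazy = all_matches(r'("|“).*?("|”)', segment)       # stops at the first closing quote
negated = all_matches(r'("|“)[^"”]*("|”)', segment)  # content may not contain a closing quote
greedy = all_matches(r'("|“).*("|”)', segment)       # runs on to the last closing quote

print(lazy)     # ['"Save As"', '“Export”']
print(negated)  # ['"Save As"', '“Export”']
print(greedy)   # ['"Save As" and then “Export”']
```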

Figure 5: Find All Terms Enclosed in Double Quotes

Figure 6: An Annotated List of Regular Expressions

How Do You Learn Regular Expressions?

A good way is to start with a tutorial. The best I know is online at Regular-Expressions.info (www.regular-expressions.info). Next, get into the habit of expressing in words what you want to do, and then try to convert that into a regular expression. There are also several tools and websites that can help you build and test regular expressions.

Expresso

www.ultrapico.com/expresso.htm
Expresso is free for use with .NET regular expressions only (i.e., with the regular expressions used in Studio and memoQ).

Regular Expressions 101

https://regex101.com
A free online tool that explains what each element of your regular expression does.

RegexBuddy

www.regexbuddy.com
RegexBuddy is a commercial tool that will integrate with your favorite searching and editing tools for instant access. It will also help you collect and document libraries of regular expressions for future reuse.

Regardless of the tool you use, a general suggestion is to keep a list of the regular expressions you use and write a brief description to remember what each does. No need for anything fancy: a simple text file will do. (See Figure 6 for an example.)

Notes
  1. Based on the definition provided by Wikipedia.
  2. This, unfortunately, doesn’t work for accented letters, which are left unchanged.
  3. The backslash is used because the dot has a special meaning in regular expressions. When you need to search for the period itself, you need to escape it. For certain regular expression engines, when within a set (but not elsewhere), the dot just indicates the period character.
  4. My thanks to Josep Condal of ApSIC for suggesting this regular expression for Xbench.

 

Regular Expression Cheat Sheet

(Note: These are some of the more important RegEx symbols. See the Additional References section, or your CAT tool help file, for more.)

RegEx Symbol    Explanation
( )             Group
[ ]             Set
|               Alternative
.               Any character that is not a paragraph mark
?               Quantifier: matches the previous element (a character, set, or group) zero or one time
+               Quantifier: matches the previous element one or more times
*               Quantifier: matches the previous element zero or more times
{n}             Exact quantifier: matches the previous element exactly n times
{n,m}           Exact range quantifier: matches the previous element between n and m times
^               Designates the beginning of a string or segment
$               Designates the end of a string or segment
-               Range operator (inside a set): for example, [A-D] is the range of all capital letters between A and D.
\t              Tab character
\d              The class of all digits, so any digit. The same as [0-9].
\s              The class of all white space, so space, non-breaking space, etc.
[^0-5]          Negated class: this means “any character that is not a digit between 0 and 5”
\               Escape character: used to search for a character that otherwise would mean something else in a regular expression. For example, to search for a question mark, we must escape it: \?
$1, $2, etc.    In a replacement operation, these represent the first group, the second group, etc.
\L              In a replacement operation, this converts the text that follows it to lowercase. Note that this will not work with accented characters.
\U              In a replacement operation, this converts the text that follows it to uppercase. This will not work with accented characters.

RegEx Example                                                     Explanation
pre[ -]?amplifier                                                 Finds “pre amplifier,” “pre-amplifier,” and “preamplifier” (Studio and memoQ).
pre[\- ]?amplifier                                                Finds “pre amplifier,” “pre-amplifier,” and “preamplifier” (Xbench).
(apple|orange)                                                    Finds all the segments containing either “apple” or “orange.” Note that this is not restricted to only two options: (apple|orange|banana) finds all segments that contain “apple,” “orange,” or “banana.”
(".*?")|(“.*?”)                                                   Finds all items enclosed in double quotes (both straight quotes and curly quotes).
("|“).*?("|”)                                                     Finds all items enclosed in double quotes (both straight quotes and curly quotes), and it finds them even when mismatched (e.g., opening straight quotes and closing curly quotes, and vice versa). Note that ("|“).*("|”) without the question mark will find items from the first opening quote to the last closing quote.
(["“«]).*?(["”»])                                                 Finds all items enclosed in three different types of double quotes: straight quotes, curly quotes, and guillemets (« »).
("|“)[^"”]*("|”)                                                  This works the same way, but for Xbench.
((?<=\s)|^)[-+\(]?((\d{1,3}(,\d{3})*)|\d+)(\.\d+)?\)?((?=\s)|$)   Finds all numbers with a dot (instead of a comma) as a decimal separator, and numbers with a comma as the thousands separator. This could be changed for different numeric patterns.
^((?!search string).)*$                                           Finds all segments that don’t contain the search string (works for memoQ and Studio, not for Xbench).
-"search string                                                   Same as above, but works for Xbench (with Power Search on).
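
Two of the longer patterns listed above can be tried out with the Python sketch below. It assumes Python’s re module as a stand-in for the .NET engine used by memoQ and Studio (the two engines agree on these constructs, but that is not guaranteed for every pattern). The segments, the sample sentence, and the word “red” used as the search string are invented for the illustration.

```python
import re

# Hypothetical segments; "red" stands in for the "search string" placeholder.
segments = ["Press the red button.", "Press the button.", "Red alert"]

# Matches a whole segment only if "red" cannot be found at any position in it.
not_containing = re.compile(r"^((?!red).)*$")
without_red = [s for s in segments if not_containing.search(s)]
print(without_red)  # ['Press the button.', 'Red alert'] (the search is case-sensitive)

# The number pattern: comma as thousands separator, dot as decimal separator.
number = re.compile(
    r"((?<=\s)|^)[-+\(]?((\d{1,3}(,\d{3})*)|\d+)(\.\d+)?\)?((?=\s)|$)"
)
sample = "Total: 1,234.56 units, not 1.234,56 units."
numbers = [m.group(0) for m in number.finditer(sample)]
print(numbers)  # ['1,234.56']
```

The second number in the sample uses the dot and comma the other way around, so the pattern passes over it.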

RegEx (Source)      RegEx (Target)    Explanation
^[A-Z]              ^[a-z]            Used in the text filter of a tool like memoQ, Studio 2017, or Xbench (with “Regular Expressions” and “Case sensitive” selected), this finds the segments that are capitalized in the source but not in the target.
\.$                 [^.]$             Finds mismatched closing punctuation (in this case, the period, but the same regular expression can be adapted to search for other punctuation).
"([\.\?,!;:]$)=1"   -@1$              (Xbench, with Power Search on) Finds mismatches in closing punctuation between the source and target. This checks several punctuation marks at a time. (Thanks to Oscar Martin of ApSIC for suggesting this pattern.)

Search Field           Replace Field   Explanation
^([A-Z])([a-z])        \L$1$2          (At the beginning of a line, with match case on) Finds an uppercase letter followed by a lowercase letter at the beginning of a line and converts the uppercase letter to lowercase. This skips words (such as acronyms) that are all uppercase.
(\t)([A-Z])([a-z])     $1\L$2$3        (After a tab) Same as above, but after a tab, instead of at the beginning of a line. These two RegEx search and replace strings are useful for lowercasing glossaries that were written with capitalized entries.
(^|\t)([A-Z])([a-z])   $1\L$2$3        This combines the two previous search and replace operations: it lowercases the initial letter of words at the beginning of a line and after a tab.

Which Translation Tools Use Regular Expressions?

  • CAT Tools
  • Translation Quality Assurance and Support Tools
  • Text Editors and Word Processors

Additional References

Multifarious
https://multifarious.filkin.com
Paul Filkin’s blog: he often writes about how to use regular expressions in Studio.

Translation Tribulations
http://www.translationtribulations.com/
Kevin Lossner’s blog is a great resource for memoQ, with various posts that explain how to use regular expressions to fine-tune memoQ.

Regular-Expressions.info
www.regular-expressions.info
A great tutorial and reference site that covers regular expressions in depth.

RegExLib.com
www.regexlib.com/Default.aspx
The internet’s first regular expression library.

Regular Expression Language—Quick Reference
http://bit.ly/Microsoft-regex
A reference to Microsoft’s .NET regular expressions, which are used in memoQ and Studio.


Riccardo Schiaffino has always worked in translation—first as a freelance translator, then as a partner in a translation agency, and later as a translator and translation manager for a major software company. He is particularly interested in using software tools to help improve translation quality. He currently works for Aliquantum, a company specializing in Italian and Spanish legal, medical, and IT translation and localization. He also teaches translation, translation tools, and localization at Denver University. Contact: RSchiaffino@aliquantuminc.com.

Remember, if you have any ideas and/or suggestions regarding helpful resources or tools you would like to see featured, please e-mail Jost Zetzsche at jzetzsche@internationalwriters.com.

The ATA Chronicle © 2019 All rights reserved.