Regex

Introduction

A regular expression, or regex, is used to find patterns in text. If you have ever searched for the word "hello" in Smali code, you opened the search menu and typed hello. MT Manager then looks for every occurrence of the letters h, e, l, l and o in sequence. Unless you check the Match case option, the results also include versions with capital letters, like Hello or HELLO.
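To make the Match case behaviour concrete, here is a minimal Java sketch of a plain word search. It is not how MT Manager is implemented; the class name LiteralSearch and the count helper are made up for illustration, and only the standard java.util.regex classes are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralSearch {
    // Counts occurrences of a plain word. matchCase plays the role of the
    // "Match case" checkbox (an assumption made for this illustration).
    static int count(String word, String text, boolean matchCase) {
        int flags = matchCase ? 0 : Pattern.CASE_INSENSITIVE;
        // Pattern.quote makes sure the word is searched literally, not as a regex.
        Matcher m = Pattern.compile(Pattern.quote(word), flags).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "hello Hello HELLO";
        System.out.println(count("hello", text, true));  // prints 1
        System.out.println(count("hello", text, false)); // prints 3
    }
}
```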

Now let's say you want to find the word gray, but it is sometimes written with the letter _e_ instead of the letter _a_, as in grey. How do you search for both spellings? This is where the third option in the search menu, called Regex, comes in. Enable it and type the following in the search field: gr(e|a)y. The details are discussed later.
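If you want to try the gr(e|a)y pattern outside MT Manager, the following Java sketch does the same search. The class GrayOrGrey and the findAll helper are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrayOrGrey {
    // Collects every substring of text that the regex matches, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "gray, grey and groy";
        // (e|a) accepts either letter, so "groy" is skipped.
        System.out.println(findAll("gr(e|a)y", text)); // prints [gray, grey]
    }
}
```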

So regex is an advanced way to search for a pattern. If you simply want to search for a word, you may not need regex, but if you have a pattern that you need to find, or replace with something else, then regex is a powerful tool for the job.
Regex is not a programming language. It has no if statements or loops. Instead, it is used from within programming languages such as Perl or Java.

Please note that there are often many ways to write a regex that gives the same results. Use the one you like the most. The best way to learn is to try it and practice. I will follow the basic tutorial from the Oracle documentation to explain it, and demonstrate it using MT Manager on Smali code.


Metacharacters

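Regex supports a number of special characters, called metacharacters, that affect the way a pattern is matched. For example, the regular expression cat. matches the input string cats, even though the input contains no dot: the dot is a metacharacter that stands for any character.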

The metacharacters supported by the Regex API are: <([{\^-=$!|]})?*+.>

Note: In certain situations the special characters listed above will not be treated as metacharacters.
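The cat. example can be checked with a few lines of Java. This is only a sketch; DotMetacharacter and found are illustrative names. Keep in mind that inside a Java string literal every backslash is doubled, so the regex cat\. is written "cat\\.".

```java
import java.util.regex.Pattern;

class DotMetacharacter {
    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("cat.", "cats"));   // true: the dot stands for 's'
        System.out.println(found("cat\\.", "cats")); // false: \. requires a real dot
        System.out.println(found("cat\\.", "cat.")); // true
        // Pattern.quote turns the whole text into a literal pattern.
        System.out.println(found(Pattern.quote("cat."), "cats")); // false
    }
}
```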


Character Classes

  • [abc] means search for the letter a, the letter b or the letter c. In the introduction we searched for the word gray or grey with _gr(e|a)y_. Here is another way to do the same: _gr[ea]y_. Notice that inside a character class you don't need to put | between the letters.
  • [^abc] is negation. Instead of listing all the letters or symbols you want, you search for any character except a, b or c.
  • [a-z] uses the symbol - between two characters to represent a range. So [0-3] means any digit between 0 and 3, the same as [0123]. Likewise [a-z] means any lowercase letter between a and z, and [A-Z] is the same for capital letters. [a-zA-Z] is a union: any letter between a and z, or between A and Z.
  • The same union can be written as [a-z[A-Z]]. I prefer to keep the regex simple, so I will use the version above for unions.
  • [a-z&&[def]] takes the intersection of the two parts (before and after the && symbols). In this case it is the same as [def].
  • To match a range except some characters, use the ^ symbol inside a nested character class. [1-6&&[^4]] is the same as [12356]; 4 is excluded from the range.
  • You can do the same between two ranges, like [1-9&&[^4-6]]. This is the same as [123789].
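The character classes above can be tried one by one with a small Java sketch. The class name CharacterClasses and the matches helper are made up for this example; Pattern.matches checks whether the whole input fits the regex.

```java
import java.util.regex.Pattern;

class CharacterClasses {
    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("gr[ea]y", "grey"));     // true
        System.out.println(matches("[^abc]", "a"));         // false: a is excluded
        System.out.println(matches("[^abc]", "d"));         // true
        System.out.println(matches("[a-zA-Z]", "Q"));       // true: union of two ranges
        System.out.println(matches("[a-z[A-Z]]", "Q"));     // true: same union, nested form
        System.out.println(matches("[a-z&&[def]]", "e"));   // true: e is in both parts
        System.out.println(matches("[a-z&&[def]]", "a"));   // false: a is not in [def]
        System.out.println(matches("[1-6&&[^4]]", "4"));    // false: 4 is removed
        System.out.println(matches("[1-9&&[^4-6]]", "7"));  // true
        System.out.println(matches("[1-9&&[^4-6]]", "5"));  // false: 5 is in the removed range
    }
}
```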

Predefined Character Classes

  • The dot symbol . stands for any single character. So to find every run of 3 consecutive characters, type three dots: ...
  • \d is exactly the same as [0-9], which we saw previously: it matches any digit.
  • \D is anything other than a digit. It is the negation of \d, so it matches every character except digits.
  • \s matches whitespace: not only the space character but also tabs and new lines.
  • \S matches anything except whitespace. It is the negation of \s.
  • \w matches a word character: a letter, a digit or the underscore.
  • \W matches every other symbol, like :, ! or ?. It is the negation of \w.
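One way to see what each predefined class matches is to replace every matching character with # and look at what is left. The sketch below does that in Java; PredefinedClasses, mask and the sample text are illustrative choices, not something taken from MT Manager.

```java
import java.util.regex.Pattern;

class PredefinedClasses {
    // Replaces every character matched by the regex with '#'.
    static String mask(String regex, String text) {
        return text.replaceAll(regex, "#");
    }

    public static void main(String[] args) {
        String text = "a1 _b!";
        System.out.println(mask("\\d", text)); // prints a# _b!
        System.out.println(mask("\\D", text)); // prints #1####
        System.out.println(mask("\\s", text)); // prints a1#_b!
        System.out.println(mask("\\S", text)); // prints ## ###
        System.out.println(mask("\\w", text)); // prints ## ##!
        System.out.println(mask("\\W", text)); // prints a1#_b#
        // Three dots need exactly three characters.
        System.out.println(Pattern.matches("...", "abc")); // true
        System.out.println(Pattern.matches("...", "ab"));  // false
    }
}
```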

Quantifiers

Quantifiers allow you to specify the number of occurrences to match against. The Pattern API specification describes three kinds of quantifiers: greedy, reluctant, and possessive. The ones described here are the greedy forms.

*, ? and + are quantifiers used for repetition.

  • ? means _optional_. Minimum zero matches and maximum one match.
  • * means _any amount is OK_. Minimum _zero_ matches and _no maximum limit_.
  • + means _at least one_. Minimum one match and _no maximum limit_.
  • {n} means exactly n repetitions.
  • {n,} means at least n repetitions, with no maximum limit.
  • {n,m} means at least n repetitions but at most m. Naturally, m must not be smaller than n.
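Here is a short Java sketch that exercises each quantifier from the list. The sample patterns (colou?r, ab*, and so on) and the names Quantifiers and matches are my own examples.

```java
import java.util.regex.Pattern;

class Quantifiers {
    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));   // true: u appears zero times
        System.out.println(matches("colou?r", "colour"));  // true: u appears once
        System.out.println(matches("colou?r", "colouur")); // false: two u is too many
        System.out.println(matches("ab*", "a"));           // true: zero b is fine
        System.out.println(matches("ab*", "abbb"));        // true
        System.out.println(matches("ab+", "a"));           // false: at least one b needed
        System.out.println(matches("a{3}", "aaa"));        // true: exactly three
        System.out.println(matches("a{3}", "aa"));         // false
        System.out.println(matches("a{2,}", "aaaa"));      // true: two or more
        System.out.println(matches("a{2,3}", "aaaa"));     // false: more than three
    }
}
```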

Capturing Groups

Until now, every element of a pattern matched a single letter, digit or symbol. If you want to treat several characters together, like "A3M", then you need to group them as one unit.

If you search for A3M, the engine first looks for A, then the digit 3 and finally the letter M, in sequence.

To group them together as one unit, use parentheses like this: _(A3M)_. Now the engine treats them as one group.

Let's make our example a little more complex. Say you want to find A3M appearing 3 times in a row in the text, like _A3MA3MA3M_. You can of course search for _A3MA3MA3M_. But the elegant way is to group the characters together and ask for the group to appear 3 times in a row, like this: _(A3M){3}_. Now you can also see the advantage of repetition.
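The difference the parentheses make can be seen in a quick Java sketch (GroupRepetition and matches are illustrative names). Without the group, the quantifier {3} applies only to the last character, M.

```java
import java.util.regex.Pattern;

class GroupRepetition {
    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("(A3M){3}", "A3MA3MA3M")); // true: the group repeats 3 times
        System.out.println(matches("(A3M){3}", "A3MA3M"));    // false: only 2 repetitions
        System.out.println(matches("A3M{3}", "A3MA3MA3M"));   // false: {3} applies to M only
        System.out.println(matches("A3M{3}", "A3MMM"));       // true: A, 3, then three M
    }
}
```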

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

  1. ((A)(B(C)))
  2. (A)
  3. (B(C))
  4. (C)
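The numbering above can be confirmed in Java by asking the matcher for each group in turn. This is a minimal sketch; the class GroupNumbering and the groups helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Returns the text captured by groups 1..n when the whole input matches.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match the regex");
        }
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            result.add(m.group(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        System.out.println(groups("((A)(B(C)))", "ABC")); // prints [ABC, A, BC, C]
    }
}
```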

Combining grouping with everything you learned before can make the search very powerful.

Grouping is also used in the replacement process. Each group is referred to as $group_number, like $1 for the first group, $2 for the second group, and so on.
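As a sketch of replacement with groups, the Java code below swaps the two registers of a Smali move instruction. The instruction line, the regex and the names GroupReplacement and swapRegisters are only an illustration I picked; any text with two groups would work the same way.

```java
class GroupReplacement {
    // Swaps the two registers in lines such as "move v1, v2".
    // Group 1 is the first register, group 2 is the second one.
    static String swapRegisters(String smali) {
        return smali.replaceAll("move (v\\d+), (v\\d+)", "move $2, $1");
    }

    public static void main(String[] args) {
        System.out.println(swapRegisters("move v1, v2")); // prints move v2, v1
    }
}
```

In the replacement text, $1 and $2 are filled in with whatever each group captured in that particular match.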


Conclusion

There are many more useful tips and tricks for regex, in particular for Smali. I will leave you to discover them by yourself.

I may add some examples to this article later if requested.

Notes

The ^ and $ metacharacters indicate the beginning and the end of a line, respectively.
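A small Java sketch of the anchors follows. One assumption to flag: in plain Java, ^ and $ refer to the whole input unless the Pattern.MULTILINE flag is set, so the sketch sets it to get the line-by-line behaviour. LineAnchors and countMatches are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineAnchors {
    // Counts matches with MULTILINE on, so ^ and $ work on every line.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\none two";
        System.out.println(countMatches("^one$", text)); // prints 1: only the first line is exactly "one"
        System.out.println(countMatches("^one", text));  // prints 2: lines 1 and 3 start with "one"
        System.out.println(countMatches("two$", text));  // prints 2: lines 2 and 3 end with "two"
    }
}
```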

(?:...) is a non-capturing group: it groups characters like normal parentheses, but it is not given a group number.

Putting \ before a symbol disables its metacharacter meaning. For example, \[ means the literal symbol [, not the metacharacter [.

.* means any sequence of characters, including spaces, symbols and numbers.
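The last three notes can be checked with the following Java sketch. NotesDemo, groupCount and matches are illustrative names, and the remark about line breaks describes Java's default behaviour.

```java
import java.util.regex.Pattern;

class NotesDemo {
    // Number of capturing groups declared in the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(A3M){3}"));                // prints 1
        System.out.println(groupCount("(?:A3M){3}"));              // prints 0: nothing is captured
        System.out.println(matches("(?:A3M){3}", "A3MA3MA3M"));    // true: grouping still works
        System.out.println(matches("\\[", "["));                   // true: \[ is a literal bracket
        System.out.println(matches("a.*z", "a 1!? z"));            // true: .* covers " 1!? "
        System.out.println(matches("a.*z", "az"));                 // true: .* may match nothing
        // By default, the dot does not match a line break in Java.
        System.out.println(matches("a.*z", "a\nz"));               // false
    }
}
```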