Locale parser with fparsec
Localizing an application consists of extracting out user directed text and managing it outside of hardcoded strings in your code. This lets you tweak strings without having to recompile, and if done properly, allows you to support multiple languages. Localizing is no easy task, it messes up spacing, formatting, name/date other cultural information, but thats a separate issue. The crux of localizing is text.
But, who just uses bare text to display things to the user? Usually you want to have text be a little dynamic. Something like
Hello {user}! Welcome!
Here, user
will be some sort of dynamic property. To support this, your locale files need a way to handle arguments.
One way of storing contents in a locale file is like this:
ExampleText = Some Text {argName:argType} other text etc
= This is on a seperate newline
UserLoginText = ...
This consists of an identifier, followed by an equals sign, followed by some text with arguments of the form {x:y}. To make a new line you have a new line with an equals sign and you continue your text. When you reach a string with an identifier, you have a new locale element.
But you can also have comments, like
# this is a comment, ignore me!
And to throw a monkey wrench in the problem, you can also have arguments with no types, of the form {argName}
.
The end goal, is to be able to reference your locale contents in code, something like
Locale.ExampleText ("foo");
But to get to the point where you can reference this you need to translate your locale files into something workable, kind of like a syntax tree. If you have a working syntax tree of your locale files you can generate strongly typed locale code for you to use in your application.
The data
To parse a locale file of this format I used fparsec. One reason was that it already handles lookaheads and backtracking, and another reason is that I wanted to play with it :)
Going with a data first design, I thought about what I wanted to my final output to be and came up with 3 discriminated unions that look like this:
type Arg =
| WithType of string \* string
| NoType of string
type LocaleContents =
| Argument of Arg
| Text of string
| Line of LocaleContents list
| Comment of string
| NewLine
type Locale =
| Entry of string \* LocaleContents
| IgnoreEntry of LocaleContents
Utilities
The next step was to build out some common utilities that I can use. I knew I’d need to be able to parse a phrase, a single word, and know when things are between brackets:
(\*
Utilities
\*)
let brackets = isAnyOf ['{';'}']
(\* non new line space \*)
let regSpace = manySatisfy (isAnyOf [' ';'\t'])
(\* any string literal that is charcaters \*)
let phrase = many1Chars (satisfy (isNoneOf ['{';'\n']))
let singleWord = many1Chars (satisfy isDigit \<|\> satisfy isLetter \<|\> satisfy (isAnyOf ['\_';'-']))
(\* utility method to set between parsers space agnostic \*)
let between x y p = pstring x \>\>. regSpace \>\>. p .\>\> regSpace .\>\> pstring y
Fparsec comes with a lot of great functions and parser combinators to create robust parsers. The idea is to combine parser functions from smaller parsers into larger parsers. I liked working with it because it felt like dealing directly with a grammar.
Arguments
Now that I was able to parse words, phrases, and I could seperate out newlines from spaces, lets tackle an argument:
(\*
Arguments of {a:b} or {a}
\*)
let argDelim = pstring ":"
let argumentNoType = singleWord |\>\> NoType
let argumentWithType = singleWord .\>\>.? (argDelim \>\>. singleWord) |\>\> WithType
let arg = (argumentWithType \<|\> argumentNoType) |\> between "{" "}" |\>\> Argument
The .>>.?
combinator says to apply both combinators results as a tuple, but if it fails to backtrack to the state of the previous parser. Also, the `` combinator lets you apply parsers as alternatives, so either of the parsers can be applied.
Text elements
Next up is text elements. This is the contents after the = of the identifier, but not including arguments. For example, if our locale entry is
UserLogin = Hey! Whats up?
= new lineezzz
We want to match on “Hey! Whats up?”, followed by an explict newline, followed by “new lineeezz”
(\*
Text Elements
\*)
let textElement = phrase |\>\> Text
let newLine = (unicodeNewline \>\>? regSpace \>\>? pstring "=") \>\>% NewLine
let line = many (regSpace \>\>? (arg \<|\> textElement \<|\> newLine)) |\>\> Line
Remembering that a phrase is any text except for a start bracket and a newline, we can parse all text up to an argument. New lines are a new line, followed by some space (maybe), followed by an equal sign. Since the newline doesn’t contain any data we care about from the parser we can ignore the output and just assign the result to the union type NewLine
using the >>%
operator.
But a line is an aggregation of new lines, arguments, and phrases, so we can use the fparsec many
operator, along with the 3 alternatives (arguments, text elements, and new lines) to build out an actual line.
An Entry
Since we have arguments, new lines, and text set up, we can finally put it all together. What I need now is to match when we have an identifier (“UserLogin”), an equals sign, followed by a line.
(\*
Entries
\*)
let delim = regSpace \>\>. pstring "=" .\>\> regSpace
let identifier = regSpace \>\>. singleWord .\>\> delim
let localeElement = unicodeSpaces \>\>? (identifier .\>\>. line .\>\> skipRestOfLine true) |\>\> Entry
This gives you a tuple of identifier * line, representing your entire locale element.
Comments
But we also have to account for comments. Thankfully those are pretty easy
(\*
Comments
\*)
let comment = pstring "#" \>\>. restOfLine false |\>\> Comment
let commentElement = unicodeSpaces \>\>? comment |\>\> IgnoreEntry
This says if you match a “#” then take the rest of the line (but leave the newline since other parsers will handle that). We might as well maintain the comment information so we can pipe that result to the IgnoreEntry
union type.
Running the parser
And now we just have to piece together comments, locale elements, and run the parser
(\*
Full Locale
\*)
let locale = many (commentElement \<|\> localeElement) .\>\> eof
let test input = match run locale input with
| Success(r,\_,\_) -\> r
| Failure(r,\_,\_) -\>
Console.WriteLine r
raise (Error(r))
Example
Lets try it out. Here is my sample locale:
UserLogin = {user}! Whats up!
= You rock, thanks for logging
UserLogout = {firstName:string}, {lastName:string}...why you gotta go? We were just getting to know you
And running it in fsi
\> test "UserLogin = {user}! Whats up!
= You rock, thanks for logging
UserLogout = {firstName:string}, {lastName:string}...why you gotta go? We were just getting to know you ";;
val it : Locale list =
[Entry ("UserLogin", Line [Argument (NoType "user"); Text "! Whats up!"; NewLine; Text "You rock, thanks for logging"]);
Entry ("UserLogout", Line
[Argument (WithType ("firstName","string")); Text ", ";
Argument (WithType ("lastName","string"));
Text "...why you gotta go? We were just getting to know you "])]
And now its easy to iterate and manipulate the data!
Source
Full source available at my github