Parsek

Parser combinators in Kotlin for Kotlin Multiplatform

Parsek provides parser combinators (kombinators, in fact) for Kotlin. With the library you can easily implement parsers and lexers in Kotlin. It’s a multiplatform library, so it can be used not only in JVM projects, but also with Kotlin/JS and Kotlin/Native.


Using

To use the library, simply add the dependency from Maven Central.

Gradle

dependencies {
    implementation("io.github.kkarnauk:parsek:0.1")
}

Maven

<dependency>
  <groupId>io.github.kkarnauk</groupId>
  <artifactId>parsek</artifactId>
  <version>0.1</version>
</dependency>

Examples

Parsing an integer

Let’s start with an easy task. For instance, you want to parse an integer. There are several ways to do that. The first one:

object IntegerGrammar : Grammar<Int>() {
    val digits by chars { it.isDigit() } // define a token for sequence of digits
    val minus by char('-') // define a token for a minus
    
    val positiveInteger by digits map { it.text.toInt() } // parser for a positive integer 
    val integer by (-minus seq positiveInteger map { -it }) alt positiveInteger // parser for an integer
    
    override val parser by integer
}

The second one is much simpler:

object IntegerGrammar : Grammar<Int>() {
    private val integer by regex("-?\\d+") // define a token for an entire integer
    
    override val parser by integer map { it.text.toInt() } // map a text into an integer
}

And now, if you want to parse an integer in your program, you can just do something like this:

val integer = IntegerGrammar.parse("-42")

Parsing an arithmetic expression

That was a rather simple example, wasn’t it? Let’s try something bigger. For instance, you want to parse an arithmetic expression with addition, multiplication, exponentiation and parentheses. So:

object ArithmeticExpressionGrammar : Grammar<ArithmeticExpression>() {
    val num by regex("-?\\d+")
    val sum by char('+')
    val mul by char('*')
    val pow by text("**")
    val lpar by char('(')
    val rpar by char(')')
    
    val whitespaces by chars { it.isWhitespace() }.ignored()
    
    val primary by (num map { Value(it.text.toInt()) }) alt (-lpar seq ref(::parser) seq -rpar)
    val pows by rightAssociative(primary, pow) map { cur, res -> Power(cur, res) }
    val mults by leftAssociative(pows, mul) map { res, cur -> Mul(cur, res) }
    val sums by leftAssociative(mults, sum) map { res, cur -> Sum(cur, res) }
    
    override val parser by sums
}
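For reference, the grammar above maps into an ArithmeticExpression hierarchy along these lines. This is only a sketch: the names Value, Sum, Mul and Power come from the grammar, but modeling them as a sealed interface of data classes is an assumption, not part of the library.

```kotlin
// A sketch of the expression tree the grammar builds.
// The exact shape is an assumption; only the class names come from the grammar.
sealed interface ArithmeticExpression

data class Value(val value: Int) : ArithmeticExpression
data class Sum(val left: ArithmeticExpression, val right: ArithmeticExpression) : ArithmeticExpression
data class Mul(val left: ArithmeticExpression, val right: ArithmeticExpression) : ArithmeticExpression
data class Power(val base: ArithmeticExpression, val exponent: ArithmeticExpression) : ArithmeticExpression
```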

A few details here may not be obvious, for example unary minuses: they are matched as part of the num token (note the -? in its regex).

After it’s all done, we can use the grammar in the following way:

val expression = ArithmeticExpressionGrammar.parse("1 + 2 ** 3 ** 4  * 12 + 3 * (1 + 2) ** 2 ** 3")

Tokens

A token represents a matched part of the input. A parser doesn’t consume the raw input; it consumes a sequence of tokens.

Each token can tell you its type and the part of the input it matched.

The important point is that you don’t have to produce tokens on your own. This task is handled by producers.

Types

A token type introduces a family for tokens.

For example, you may have the input Hello, my friend! and a token type matching any non-empty sequence of letters. Then there are three tokens with this type: Hello, my and friend.

Each token type can tell you its name and whether its tokens should be ignored by parsers.

Each token type has the match method with the signature:

fun match(input: CharSequence, fromIndex: Int): Int

It starts matching input from fromIndex and returns the number of matched chars. The match is considered successful if and only if the result is not 0.
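To illustrate the contract, here is a minimal sketch of a token type that matches a run of characters satisfying a predicate. CharsTokenType is a hypothetical class for illustration, not the library’s actual implementation.

```kotlin
// Hypothetical token type matching a run of characters satisfying a predicate.
class CharsTokenType(val name: String, val predicate: (Char) -> Boolean) {
    // Returns the number of matched chars starting at fromIndex; 0 means no match.
    fun match(input: CharSequence, fromIndex: Int): Int {
        var index = fromIndex
        while (index < input.length && predicate(input[index])) index++
        return index - fromIndex
    }
}

fun main() {
    val digits = CharsTokenType("digits") { it.isDigit() }
    println(digits.match("42abc", 0)) // 2: matched "42"
    println(digits.match("42abc", 2)) // 0: 'a' is not a digit, so the match fails
}
```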

The good news is that lots of token types are already implemented; you’ve seen several of them above: chars, char, text and regex.

Also, there is even more good news: you don’t have to name token types on your own. For convenience, there are token type providers!

Type providers

The purpose of TokenTypeProvider is simple: we don’t want to write extra information when creating new types. When we create a new token type, we assign it to a property, so, for example, the name of the token type is already there!

val digits by chars { it.isDigit() } 

In this example, we create a token type defined by a character predicate, and it automatically gets the name digits.

Also, we’ve talked about ignoring tokens by parsers. If you want tokens of the specific type to be ignored by a parser, you just go with one of the following ways:

val digits by regex("\\d+").ignored()
val digits by regex("\\d+", ignored = true)

So the providers help you create token types in a more convenient way.

Note: you must use delegation here in order to pass created tokens into a tokenizer.

For example, an already existing TokenType can be wrapped into a provider:

val myType: TokenType = getTokenType()
val myTokenType by tokenType(myType)

Each provider takes the name of the created token type from the name of the property.
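The name-from-property trick relies on Kotlin’s delegated properties: the delegate receives a KProperty describing the property it is assigned to. Here is a minimal standalone sketch of the mechanism (NamedThing and NamedThingProvider are made-up names for illustration; the library’s actual provider machinery is more involved).

```kotlin
import kotlin.reflect.KProperty

// A made-up value type that carries a name.
class NamedThing(val name: String)

// A made-up delegate provider: it reads the property's name at access time.
class NamedThingProvider {
    operator fun getValue(thisRef: Any?, property: KProperty<*>): NamedThing =
        NamedThing(property.name)
}

object Example {
    val digits by NamedThingProvider()
}

fun main() {
    println(Example.digits.name) // digits
}
```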

On each of those providers you can invoke .ignored() or pass ignored = true in order to make the produced tokens ignored by parsers!

Producers and tokenizers

We’ve talked a lot about tokens and token types, but you still don’t know how to turn a string into a collection of tokens.

There is an interface TokenProducer, which is responsible for providing tokens. The only interesting method it has is nextToken(): Token?: it either returns a new Token, or null if nothing more can be produced.

But how do you get a producer? For now, there is only one way: a Tokenizer. The main purpose of tokenizers is to take a string and turn it into a producer of tokens.

Usually, a tokenizer is initialized with a list of token types. These token types are collected automatically as you define them; this is exactly why you must use delegation when creating a token type.

For now, there are two different implementations of a tokenizer: longest-match and first-match. The default one is the longest-match tokenizer.

Note that parsers accept an IndexedTokenProducer, not a regular TokenProducer. The reason is that regular token producers are lazy and there is no way to reuse the produced tokens. Indexed producers memoize produced tokens and allow getting them by index.

Anyway, you don’t need to implement an indexed producer on your own: each producer can be turned into an indexed one by calling producer.indexed().
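The producer/indexed-producer relationship can be sketched as follows. The interfaces below are simplified stand-ins for illustration, not the library’s actual types, and IndexedTokenProducer here is a made-up memoizing wrapper.

```kotlin
// A simplified stand-in for a token and a lazy producer of tokens.
data class Token(val text: String)

fun interface TokenProducer {
    fun nextToken(): Token? // null means nothing more can be produced
}

// A made-up indexed wrapper: it pulls tokens lazily and memoizes them,
// so the same token can be retrieved by index any number of times.
class IndexedTokenProducer(private val producer: TokenProducer) {
    private val cache = mutableListOf<Token>()

    fun tokenAt(index: Int): Token? {
        while (cache.size <= index) {
            cache += producer.nextToken() ?: return null
        }
        return cache[index]
    }
}

fun main() {
    val words = listOf("1", "+", "2").iterator()
    val producer = TokenProducer { if (words.hasNext()) Token(words.next()) else null }
    val indexed = IndexedTokenProducer(producer)
    println(indexed.tokenAt(1)) // Token(text=+)
    println(indexed.tokenAt(1)) // same token again: it was memoized, not re-produced
}
```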

Tokenizer providers

As already mentioned above, tokenizers pull information about token types when you delegate to token type providers. If a tokenizer were initialized before all token types are defined, it would not see all of them. And it’s not convenient to make users write by lazy or something similar on each override of a tokenizer.

So there is one more abstraction: TokenizerProvider. You give it a list of token types, and it gives you a tokenizer.

Now the method for getting a tokenizer for your Grammar is final and implemented lazily. If you want to change the tokenizer for your grammar, override the method that returns a tokenizer provider.

There are implementations providing both default tokenizers: longest-match and first-match.

Parsers

TODO

Grammars

TODO

Inspiration

The project is inspired by better-parse, an interesting parser-combinator library. I really liked it and decided to implement parser combinators on my own. I’ll try to make parsers better :)