Building a back-end web stack in OCaml — Postgres, Caqti V2 & Opium

There are a couple of choices when looking to build web servers in OCaml; two commonly used options are Postgres and Caqti for the database layer, and Dream for running the HTTP server. Both have half-decent documentation and a couple of examples in their Git repos to work off. However, Caqti recently released a new major version (>= 2) and Dream relies on an older one (<= 1.9.0); some fairly hefty breaking changes in the new version preclude using the two together if we want the most recent versions of both. One other option for building the web layer is Opium, which has a very similar, fairly high-level feel to Dream. The official docs for Opium reference an excellent blog post, but again the version of Caqti used there is quite old; you can get the gist from it, but you'll run into issues if you try to lift and shift it onto Caqti >= 2, so I've put together this post using the latest version.

Setup

First create a new project

dune init project shorty

Once we've got our new project, let's add the dependencies we're going to need. Update your dune-project file so that its depends section contains the following:

 (depends ocaml dune
  lwt
  lwt_ppx
  core
  caqti
  caqti-lwt
  caqti-driver-postgresql
  ppx_inline_test
  ppx_deriving_yojson
  opium
)

To install all of these, run opam install ./ --deps-only from your base project directory. This can be re-run any time you add a dependency to the file and saves you running opam install xyz every time. Next, let's update our project structure a little. Rather than have a single lib directory, to me it makes more sense to split this up as we would in a real project.

mkdir lib/repo
mv lib/dune lib/repo

Our repo directory is going to contain everything we need for interacting with the db layer, so let’s update our library name and add the dependencies we’ll need:

(library
 (name repo)
 (libraries
  caqti
  caqti-driver-postgresql
  caqti-lwt.unix
  yojson
  core
  ppx_deriving_yojson.runtime)
 (preprocess
  (pps ppx_deriving_yojson)))

Next, let’s create a postgres instance we can interact with. In our base directory add a docker-compose.yml file containing the following:

version: '3'
services:
  flyway:
    image: flyway/flyway:6.3.1
    command: -configFiles=/flyway/conf/flyway.config -locations=filesystem:/flyway/sql -connectRetries=60 migrate
    volumes:
      - ${PWD}/sql_versions:/flyway/sql
      - ${PWD}/docker-flyway.config:/flyway/conf/flyway.config
    depends_on:
      - postgres
  postgres:
    image: postgres:12.2
    restart: always
    ports:
    - "5432:5432"
    environment:
    - POSTGRES_USER=example-username
    - POSTGRES_PASSWORD=pass
    - POSTGRES_DB=shorty-db

Note that we're using Flyway for our database migrations; we're mounting a config file and our migrations directly into this image, so let's create these now:

Create a file docker-flyway.config containing:

flyway.url=jdbc:postgresql://postgres:5432/shorty-db
flyway.user=example-username
flyway.password=pass
flyway.baselineOnMigrate=false

and add a simple migration to sql_versions/V1__Create_link_table.sql to get us started:

CREATE TABLE entry (
    short_url varchar(50),
    target_url varchar(50),
    PRIMARY KEY(short_url,target_url)
);

I've added the following to a file named nuke_docker_and_restart.sh so we can completely tear the db down when we're done, which makes it easier to write tests against a clean state.

docker-compose rm -f
docker-compose pull
docker-compose up

Running this we can see our database come up and Flyway apply our migrations, creating the table we'll test against.

Database layer and caqti

Before we add our code to interact with the db, I've created a Util.ml file containing some helper functions:

let get_uri () = "postgres://example-username:pass@localhost:5432/shorty-db"


let str_error promise =
  Lwt.bind promise (fun res ->
      res |> Result.map_error Caqti_error.show |> Lwt.return)

let connect () =
  let uri = get_uri () in
  Caqti_lwt_unix.connect (Uri.of_string uri)

(** Useful for `utop` interactions. See `README.md`. *)
let connect_exn () =
  let conn_promise = connect () in
  match Lwt_main.run conn_promise with
  | Error err ->
      let msg =
        Printf.sprintf "Abort! We could not get a connection. (err=%s)\n"
          (Caqti_error.show err)
      in
      failwith msg
  | Ok module_ -> module_

Obviously in the real world we would not want to pass in our database credentials like this, but it will do for this example. You can ignore connect_exn, I’ve included examples of how to use this in the repo README.org if you’d like to see how to interact with the db from utop. Next we need to create our Db.ml file, where we’ll house the bulk of our code for interacting with the db.

module Model = struct
  type entry = { short_url : string; target_url : string } [@@deriving yojson]
  type entry_list = entry list [@@deriving yojson]

  let entries_to_json entries = entry_list_to_yojson entries

  let tuple_to_entry tup =
    let a, b = tup in
    let entry : entry = { short_url = a; target_url = b } in
    entry

  let entry_to_json (a : entry) = entry_to_yojson a
end

module Q = struct
  open Caqti_request.Infix

  (*
    Caqti infix operators

    ->! decodes a single row
    ->? decodes zero or one row
    ->* decodes many rows
    ->. expects no row
  *)

  (* `add` takes 2 ints (as a tuple), and returns 1 int *)
  let add = Caqti_type.(t2 int int ->! int) "SELECT ? + ?"

  let insert =
    Caqti_type.(t2 string string ->. unit)
      {|
       INSERT INTO entry (short_url, target_url)
       VALUES (?, ?)
      |}

  let select =
    Caqti_type.(unit ->* t2 string string)
      {|
       SELECT short_url
            , target_url
       FROM entry 
      |}
end

let add (module Conn : Caqti_lwt.CONNECTION) a b = Conn.find Q.add (a, b)

let insert (module Conn : Caqti_lwt.CONNECTION) short_url target_url =
  Conn.exec Q.insert (short_url, target_url)

let find_all (module Conn : Caqti_lwt.CONNECTION) =
  let result_tuples = Conn.collect_list Q.select () in
  Lwt_result.bind result_tuples (fun xs ->
      let out = List.map Model.tuple_to_entry xs in
      Lwt_result.return out)

let resolve_ok_exn promise =
  match Lwt_main.run promise with
  | Error _ -> failwith "Oops, I encountered an error!"
  | Ok n -> n

Let's break this down a little. First up we have our Model module. In here we've housed a couple of basic types. Note that we've added [@@deriving yojson] to the back of these; this is a language extension which automatically generates functions for converting to and from JSON (e.g. entry_to_yojson), which is why there's nothing manually declared with these names!
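For instance, here's a quick hedged check of the derived converters from utop (the record values are just made up for illustration):

(* Exercise the generated converter; entry_to_yojson returns a Yojson.Safe.t *)
let json =
  Repo.Db.Model.entry_to_yojson
    { Repo.Db.Model.short_url = "hi"; target_url = "example.com" }

let () = print_endline (Yojson.Safe.to_string json)
(* prints: {"short_url":"hi","target_url":"example.com"} *)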

Next we’ve declared our Q module where we’re adding our queries. Let’s break one of our queries down to clarify exactly what’s going on (I’ve added the return type to the declaration so it’s a little clearer what we’re creating):

  let insert: (string * string, unit, [ `Zero ]) Caqti_request.t =
    Caqti_type.(t2 string string ->. unit)
      {|
       INSERT INTO entry (short_url, target_url)
       VALUES (?, ?)
      |}

One thing to note: Caqti_type.(stuff) is an example of OCaml's "local open" syntax; effectively all this is doing is

let open Caqti_type in 
stuff

to give us access to Caqti_type's scope. Within this scope we can access t2 and the type descriptors string and unit; combined with the ->. operator (which comes from Caqti_request.Infix, opened at the top of our Q module) this gives us a function

?oneshot:bool -> string -> (string * string, unit, [ `Zero ]) Caqti_request.t

which we then pass our SQL statement into. It's worth calling out here that the parameters we're passing to t2 and ->. (i.e. string, string and unit) are not the OCaml primitives; in this context string and unit refer to type descriptors declared within Caqti_type with specific meanings; the docs for these are here. Apparently this is an intentional design pattern, though I'll admit a wariness to it; to me it feels like it's going to create code that's more difficult to read. Our Caqti_request.t output is parameterised by (string * string, unit, [ `Zero ]), which gives us a nice clear description of how to use our insert request: it takes a tuple of two strings and returns unit. Again, it's worth noting that OCaml's syntax for type parameters is "backwards" compared to a lot of languages; where in something like Scala we'd write List[String], in OCaml this would be string list.
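As a quick hedged sketch (not part of the project code), here's how the type reads with a different multiplicity; ->? decodes zero or one row, so running this with Conn.find_opt would hand us back a string option:

open Caqti_request.Infix

(* Hypothetical lookup query, purely to illustrate how the request's type
   parameters read: one string parameter in, zero or one string row out *)
let find_target =
  Caqti_type.(string ->? string)
    "SELECT target_url FROM entry WHERE short_url = ?"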

In the next block we’re simply writing some functions which consume connections and some parameters and proxy these through to our queries.
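To give a feel for how these are called, here's a hedged utop session using the connect_exn and resolve_ok_exn helpers we defined above (the values are illustrative):

(* Grab a connection synchronously, then run each query and unwrap the result *)
let conn = Repo.Util.connect_exn ()
let three = Repo.Db.add conn 1 2 |> Repo.Db.resolve_ok_exn        (* 3 *)
let () = Repo.Db.insert conn "hello" "vinnie" |> Repo.Db.resolve_ok_exn
let entries = Repo.Db.find_all conn |> Repo.Db.resolve_ok_exn     (* entry list *)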

At this point we've got some queries we can use to interact with the database, so let's write some tests to make sure they work. We'll add a small sibling test library (e.g. under lib/repo_test) with its own dune file:

(library
 (name repo_test)
 (inline_tests)
 (libraries repo)
 (preprocess
  (pps ppx_inline_test ppx_assert)))

and our test module:

 open Repo.Db.Model

let str_error promise =
  Lwt.bind promise (fun res ->
      res |> Result.map_error Caqti_error.show |> Lwt.return)

let drop_id_from_entry triple = (triple.short_url, triple.target_url)

let%test_unit "PostgreSQL: add (asynchronously)" =
  let ( => ) = [%test_eq: (Base.int, Base.string) Base.Result.t] in
  let will_add a b =
    let ( let* ) = Lwt_result.bind in
    let* conn = Repo.Util.connect () |> str_error in
    Repo.Db.add conn a b |> str_error
  in
  Lwt_main.run (will_add 1 2) => Ok 3

let%test_unit "Able to add to the database" =
  let ( => ) =
    [%test_eq:
      ((Base.string * Base.string) Base.list, Base.string) Base.Result.t]
  in
  let input_url = "hello" in
  let target_url = "Arnie" in
  let add_entry =
    let ( let* ) = Lwt_result.bind in
    let* conn = Repo.Util.connect () |> str_error in
    let* _ = Repo.Db.insert conn input_url target_url |> str_error in
    Lwt_result.bind (Repo.Db.find_all conn) (fun res ->
        Lwt_result.return @@ List.map drop_id_from_entry res)
    |> str_error
  in
  Lwt_main.run add_entry => Ok [ (input_url, "Arnie") ]

To run these:

$ ./nuke_docker_and_restart.sh
# and in another window
$ dune runtest

At this point we've got everything we need up and running to interact with our little database, so now we're ready to add our Opium layer. This part is fairly simple; we could add a lib/controllers/ library, but for the sake of simplicity we're just going to add the requisite dependencies to bin/dune and put everything else in bin/main.ml. First bin/dune:

(executable
 (public_name shorty)
 (name main)
 (libraries repo lwt opium)
 (preprocess
  (pps lwt_ppx)))
and then bin/main.ml:

open Opium
open Repo
open Repo.Db

let convert_to_response entries =
  let as_json = Response.of_json @@ Model.entries_to_json entries in
  Lwt_result.return as_json

let find_all _ =
  Logs.info (fun m -> m "Finding all");
  let get_all =
    let ( let* ) = Lwt_result.bind in
    let* conn = Util.connect () in
    let entries = Db.find_all conn in
    Lwt_result.bind entries convert_to_response
  in
  Lwt.bind get_all (fun r ->
      match r with Ok r -> Lwt.return r | Error _ -> raise @@ Failure "")

let put_entry req =
  Logs.info (fun l -> l "adding entry");
  let insert =
    let open Lwt.Syntax in
    let+ json = Request.to_json_exn req in
    let entry = Model.entry_of_yojson json in
    (* explicitly bind let* to Lwt_result.bind, since a let* is also defined by Lwt.Syntax above *)
    let ( let* ) = Lwt_result.bind in
    let* conn = Util.connect () in
    match entry with
    | Ok e -> Db.insert conn e.short_url e.target_url
    | Error e -> raise @@ Failure e
  in
  let bind_insert insert_response =
    Lwt.bind insert_response (fun bind_response ->
        Lwt.return
        @@
        match bind_response with
        | Ok _ -> Response.of_plain_text "Hooray"
        | Error _ ->
          (* This isn't brilliant; ideally we'd handle different exceptions with specific, sensible HTTP codes *)
            Response.of_plain_text ~status:`Bad_request
              "Oh no something went terribly wrong")
  in

  Lwt.bind insert bind_insert

let _ =
  Logs.set_reporter (Logs_fmt.reporter ());
  Logs.set_level (Some Logs.Info);
  Logs.info (fun m -> m "Starting run");
  App.empty |> App.get "/entry" find_all |> App.put "/entry" put_entry
  |> App.run_command

Opium has a really simple API; App.get and App.put both have the signature

string -> (Request.t -> Response.t Lwt.t) -> App.t -> App.t

where the first parameter is the route we're binding to and the second is the handler function to call.

Spinning up our app we’re now able to add and view entries in our db over our new server:

# first window 
$ dune exec -- shorty
# second window
$ ./nuke_docker_and_restart.sh
# third window
$ http PUT 127.0.0.1:3000/entry short_url=hello target_url=vinnie
HTTP/1.1 200 OK
Content-Length: 6
Content-Type: text/plain

Hooray

$ http 127.0.0.1:3000/entry
HTTP/1.1 200 OK
Content-Length: 45
Content-Type: application/json

[
    {
        "short_url": "hello",
        "target_url": "vinnie"
    }
]

# send the same value again to test our exception handling
http PUT 127.0.0.1:3000/entry short_url=hello target_url=vinnie
HTTP/1.1 400 Bad Request
Content-Length: 26
Content-Type: text/plain

Oh no something went terribly wrong

Conclusion

Opium and Caqti have both proven to be really nice libraries to work with, albeit in this simple example. Both are extremely lightweight and easy to get up and running with quickly. I've pushed the changes up to GitHub; hopefully this provides an easy-to-use sample project for anyone looking to serve up an API in OCaml.

What the applicative? Optparse-applicative from the ground up (Part 2)

In this post we're going to follow on from the previous entry, which covered some of the foundations of how applicatives work, with a slightly more concrete example. We'll be breaking down an example usage of optparse-applicative quite slowly to explore applicatives; if you're just looking for examples of usage, I'd refer you to the already excellent documentation here.

Project Setup

Create a baseline project using stack:

$ stack new gh-cli

Once you’ve done this all we need to do is add optparse-applicative to our package.yaml file under dependencies:

dependencies:
- base >= 4.7 && < 5
- optparse-applicative

Setting up some working baseline code

Next let’s start with a very basic example that we’ve shamelessly stolen from the optparse-applicative docs and plug this into our Main.hs:

module Main (main) where

import Options.Applicative

data Sample = Sample
  { hello      :: String
  , quiet      :: Bool
  , enthusiasm :: Int }

sample :: Parser Sample
sample = Sample
      <$> strOption
          ( long "hello"
         <> metavar "TARGET"
         <> help "Target for the greeting" )
      <*> switch
          ( long "quiet"
         <> short 'q'
         <> help "Whether to be quiet" )
      <*> option auto
          ( long "enthusiasm"
         <> help "How enthusiastically to greet"
         <> showDefault
         <> value 1
         <> metavar "INT" )
          
main :: IO ()
main = greet =<< execParser opts
  where
    opts = info (sample <**> helper)
      ( fullDesc
     <> progDesc "Print a greeting for TARGET"
     <> header "hello - a test for optparse-applicative" )

greet :: Sample -> IO ()
greet (Sample h False n) = putStrLn $ "Hello, " ++ h ++ replicate n '!'
greet _ = return ()

Running using stack we can see that this is working:

$ stack exec -- gh-cli-exe
Missing: --hello TARGET

Usage: gh-cli-exe --hello TARGET [-q|--quiet] [--enthusiasm INT]

  Print a greeting for TARGET

$ stack exec -- gh-cli-exe --hello Pablo
Hello, Pablo!

NB: I had a bit of bother getting Emacs to connect to the Haskell LSP, having not run it in a while, due to some versioning issues. For me the fix was to use ghcup to get onto all the recommended versions. I also had to manually point lsp-haskell to the correct path. My init.el file now looks like this:

(use-package lsp-haskell
  :ensure t
  :config
 (setq lsp-haskell-server-path "haskell-language-server-wrapper")
 ;; gives a little preview on hover, useful for inspecting types. set to nil to remove. 
 ;; full list of options here https://emacs-lsp.github.io/lsp-mode/tutorials/how-to-turn-off/
 (setq lsp-ui-sideline-show-hover t)
 (setq lsp-haskell-server-args ())
 (setq lsp-log-io t)
)

Quick refactor before we get started

In this post we’re going to be breaking down exactly what’s going on here; specifically looking at how applicatives are used. Before we get started let’s do a quick refactor; this isn’t strictly necessary but the code in the examples is a little terse so it can be harder to tell exactly what each part is doing. Right at the start of main we’ve got:

main = greet =<< execParser opts
  where ...

=<< is just bind (>>=) with the arguments flipped (i.e. (a -> m b) -> m a -> m b), which is elegant, but let's refactor this to good old do notation in the interests of clarity:

main = do
  parsedOpts <- getInput
  greet parsedOpts
  where
    getInput = execParser opts
    opts = ...

Let's take this a step further and do a bit more of a tidy up: improve some variable names, add some type hints, and run fourmolu to fix the formatting:

data InputOptions = InputOptions
    { helloTarget :: String
    , isQuiet :: Bool
    , repeatN :: Int
    }

inputOptions :: Parser InputOptions
inputOptions =
    InputOptions
        <$> strOption
            ( long "helloTarget"
                <> metavar "TARGET"
                <> help "Target for the greeting"
            )
        <*> switch
            ( long "quiet"
                <> short 'q'
                <> help "Whether to be quiet"
            )
        <*> option
            auto
            ( long "enthusiasm"
                <> help "How enthusiastically to greet"
                <> showDefault
                <> value 1
                <> metavar "INT"
            )

main :: IO ()
main = do
    parsedInput <- getInput
    handleInput parsedInput
  where
    infoMod :: InfoMod a
    infoMod =
        fullDesc
            <> progDesc "Print a greeting for TARGET"
            <> header "hello - a test for optparse-applicative"

    parser :: Parser InputOptions
    parser = inputOptions <**> helper

    opts :: ParserInfo InputOptions
    opts = info parser infoMod

    getInput :: IO InputOptions
    getInput = execParser opts

handleInput :: InputOptions -> IO ()
handleInput (InputOptions h False n) = putStrLn $ "Hello, " ++ h ++ replicate n '!'
handleInput _ = return ()

We’ve gone a little overboard with extracting some of our variables here and we’ve added some redundant type declarations we don’t actually need, but this hopefully helps clarify some of the types.

What’s actually going on here?

Let's break down what's going on into chunks so we can see exactly how this works; for the sake of completeness we'll start with our main function. The first thing we do is set up our parser configuration, opts:

    infoMod :: InfoMod a
    infoMod = fullDesc
     <> progDesc "Print a greeting for TARGET"
     <> header "hello - a test for optparse-applicative"

    parser :: Parser InputOptions
    parser = inputOptions <**> helper

    opts :: ParserInfo InputOptions
    opts = info parser infoMod

The key function here (info) is a function from optparse-applicative; its type is

info :: Parser a -> InfoMod a -> ParserInfo a

All this is doing is creating a ParserInfo value from a Parser (in our case of type Parser InputOptions) and an InfoMod (which in this case we're just defining inline). In our case the Parser is inputOptions, defined above; we'll talk about this in some more detail below.

Next we’re calling execParser (execParser :: ParserInfo a -> IO a) from optparse-applicative with our ParserInfo to tell it how to handle our input:

    getInput :: IO InputOptions
    getInput = execParser opts

Once we’ve got our input, we simply pass this onto our handler function and we’re good to go:

main = do
  parsedInput <- getInput
  handleInput parsedInput
  where 
    -- etc etc

How do applicatives come into this?

The optparse-applicative library, unsurprisingly, makes use of applicative style quite widely; let's use our inputOptions as a specific example. For reference, the definition is here:

inputOptions :: Parser InputOptions
inputOptions = InputOptions
      <$> strOption
          ( long "helloTarget"
         <> metavar "TARGET"
         <> help "Target for the greeting" )
      <*> switch
          ( long "quiet"
         <> short 'q'
         <> help "Whether to be quiet" )
      <*> option auto
          ( long "enthusiasm"
         <> help "How enthusiastically to greet"
         <> showDefault
         <> value 1
         <> metavar "INT" )

If we remember from our last post, <$> is just an infix fmap operation; for the purposes of illustration we can start off by turning this:

InputOptions
      <$> strOption
          ( long "helloTarget"
         <> metavar "TARGET"
         <> help "Target for the greeting" )

into something that's maybe a bit more declarative, like this:

partialInput :: Parser (Bool -> Int -> InputOptions)
partialInput = fmap InputOptions $ strOption $ long "helloTarget" <> metavar "TARGET" <> help "Target for the greeting" 

Here, we've pulled out the first part of our Parser, which so far just "wraps" a function expecting the remaining parameters for our InputOptions object. If we remember from our last post, the signature for <*> is

(<*>) :: Applicative f => f (a -> b) -> f a -> f b

The important thing to remember here is that the function in our Parser, of type Bool -> Int -> InputOptions, can be thought of as a -> b -> c, or equivalently as a -> (b -> c): a function that returns another function. This allows us to use <*> to compose in our Bool parameter via switch and our Int parameter via option auto respectively. In each case <*> effectively "unwraps" the Bool and Int parameters (remember switch returns a Parser Bool, and option auto is in our case returning a Parser Int) to build up our InputOptions, leaving us with a Parser InputOptions when we're done. If we wanted to do this in the classic monadic style we'd need to do something like this:

withoutApp :: Parser InputOptions
withoutApp = do
  helloTarget <-  strOption $ long "helloTarget" <> metavar "TARGET" <> help "Target for the greeting"
  booleanParam <- switch
          ( long "quiet"
         <> short 'q'
         <> help "Whether to be quiet" )
  intParam <- option auto
          ( long "enthusiasm"
         <> help "How enthusiastically to greet"
         <> showDefault
         <> value 1
         <> metavar "INT" )
  SomeParserConstructorThatDoesntExist $ InputOptions helloTarget booleanParam intParam

This is not only much less readable, but there's also no public constructor for Parser, so it wouldn't work even if we wanted it to.
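To make the composition a little more concrete, here's a hedged sketch of the intermediate types as each <*> is applied (the names partialInput2 and fullInput, and the trimmed-down modifiers, are just for illustration):

-- String parameter applied via <$>: two parameters still to go
partialInput :: Parser (Bool -> Int -> InputOptions)
partialInput = InputOptions <$> strOption (long "helloTarget" <> metavar "TARGET")

-- Bool parameter composed in via <*>: one parameter to go
partialInput2 :: Parser (Int -> InputOptions)
partialInput2 = partialInput <*> switch (long "quiet" <> short 'q')

-- Int parameter composed in: we're left with a complete Parser InputOptions
fullInput :: Parser InputOptions
fullInput = partialInput2 <*> option auto (long "enthusiasm" <> value 1)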

If you're not used to reading applicative-style code, I think it can be a little confusing to work out how the composition fits together; however, once the penny drops it's actually a very readable pattern. I have a suspicion this post will prove to be more of an exercise in rubber-ducking, and that the actual writing of it was more useful than anything else, but hopefully the breakdown is a helpful illustration of how applicatives can be used in a more real-world scenario.

What the applicative? Optparse-applicative from the ground up (Part 1)

Functors and monads are both (relatively) well understood topics these days, even if they're not always labelled as such directly. Functors are everywhere, and many of the basic structures that imperative programmers use on a day-to-day basis form monads and can be composed accordingly. As such, once the penny drops on what a monad actually is, it's trivial to understand their application and widespread importance. Additionally, many languages have support specifically for this style of programming, making the leap very natural. Applicative functors, on the other hand, are not widely used by Johnny-Java-Developer, and because of this there's a temptation to view them as a little less intuitive and harder to find applications for immediately. In reality applicatives are certainly no more complex than many functional concepts and can be used to facilitate some really nice patterns.

In this short series, I'm going to cover the basics of what applicatives are, and then go over an example in a little more detail to show them in action (using Haskell's optparse-applicative library).

What are applicative functors?

As with many functional concepts, applicative functors are fairly simple to describe; the definition for applicatives looks something like:

class (Functor f) => Applicative f where
    pure  :: a -> f a
    (<*>) :: f (a -> b) -> f a -> f b

For reference the standard functor is below:

class Functor f where
    fmap :: (a -> b) -> f a -> f b

-- note, we'll mainly be using the infix operator <$> in this article, described here:
(<$>) :: Functor f => (a -> b) -> f a -> f b
(<$>) = fmap

The first addition in Applicative is the pure function. We're not going to go into this in too much detail here; for the time being just consider it a "wrapper" function (think Just in Maybe).
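A quick illustration in ghci (the type annotation picks which functor we wrap into):

ghci> pure 3 :: Maybe Int
Just 3
ghci> pure 3 :: [Int]
[3]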

The second function in Applicative is more interesting: <*> (pronounced "app"). From the signature we can see that it takes a function (of type a -> b) wrapped in a functor, and a functor of type a, then returns a functor of type b; the only real difference from the standard fmap is that the function we're passing in is wrapped in a functor of the same kind. At this point, if you come from an imperative background, it's tempting to think of a use-case that looks something like this:

someFunc :: Num a => a -> a
someFunc = \x -> x + 1

someMaybe = Just 1

result = someFunc <$> someMaybe

-- result == Just 2

Which we could write using the applicative functions as:

someFunc :: Num a => a -> a
someFunc = \x -> x + 1

someMaybe = Just 1

result = Just someFunc <*> someMaybe

-- result == Just 2

Personally, I struggled to see the point of this at first. In many languages it’s simply unlikely that you’d often write something like this — if we’re concerned about someMaybe being Nothing, we can still just pass in someFunc to fmap and this is handled just fine.

In all likelihood, if we had more parameters in someFunc we’d probably compose this monadically:

someFunc :: Num a => a -> a -> a -> a
someFunc a b c = a + b + c

someResult::Maybe Int
someResult = do
  a <- Just 1
  b <- Just 2
  c <- Just 3
  Just $ someFunc a b c

Again this feels very familiar. The power of applicatives really shines through, though, once you're working in a language with first-class support for partial application.

Let's take a really simple data type:

type Name = String
type Age = Int
type StreetName = String
type PostCode = String

data SimplePerson = SimplePerson {firstName:: Name} deriving (Show)

-- which we can create as 
sampleSimplePerson = SimplePerson "Harry"

Like all constructors, SimplePerson is just a function, which in this case consumes a Name and returns a SimplePerson. If we have some function that generates a Maybe Name for us, we can use our old friend fmap to generate a Maybe SimplePerson:

ghci> :t SimplePerson
SimplePerson :: String -> SimplePerson

ghci> fmap SimplePerson fetchName
Just (SimplePerson {firstName = "Harry"})

So far so good, but let’s take a type with a little more information, say:

data Person = Person
    { name :: Name
    , age :: Age
    , postCode :: PostCode
    , streetName :: StreetName
    }
    deriving (Show)
ghci> :t Person
Person :: Name -> Age -> PostCode -> StreetName -> Person

we can see that fmap isn’t going to cut it — it doesn’t take enough parameters. To allow us to use the same fmap style we used above for SimplePerson we’d really need something like:

magicMap:: (Name -> Age -> PostCode -> StreetName -> Person) -> Maybe Name -> Maybe Age -> Maybe PostCode -> Maybe StreetName -> Maybe Person

For obvious reasons this isn't going to be practical; what we really need is a generalised way of applying this kind of mapping, and this is where <*> comes in. The critical thing to note is that in a language like Haskell we get partial application out of the box, so if we don't apply all the parameters a function requires, that's fine; we just get back another function that consumes the rest. That means we can write this:

ghci> :t Person "Harry"
Person :: Age -> PostCode -> StreetName -> Person

Or like this:

ghci> :t fmap Person $ Just "Harry"
fmap Person $ Just "Harry"
  :: Maybe (Age -> PostCode -> StreetName -> Person)

Now we're getting somewhere: what we've ended up with is a function wrapped in some context (a Maybe) that takes the remaining parameters and returns our completed Person. In turn, this function can be called with any number of the remaining parameters, slowly working towards the final Person. If we stub out some extra functions to generate our required parameters wrapped in the same context, we can use these to build our example outputs:

fetchName :: Maybe Name
fetchName = Just "Harry"

fetchAge :: Maybe Age
fetchAge = Just 21

fetchPostCode :: Maybe PostCode
fetchPostCode = Just "SUR1"

fetchStreetName :: Maybe StreetName
fetchStreetName = Just "Privet Lane"

Obviously we could put these together monadically:

personUsingDoNotation :: Maybe Person
personUsingDoNotation = do
    name <- fetchName
    age <- fetchAge
    postCode <- fetchPostCode
    streetName <- fetchStreetName
    Just $ Person name age postCode streetName

-- or using bind

personWithoutUsingDoNotation::Maybe Person
personWithoutUsingDoNotation = fetchName >>= (\name -> fetchAge >>= (\age -> fetchPostCode >>= (\postCode -> fetchStreetName >>= (\streetName -> Just $ Person name age postCode streetName))))

Using do-notation feels a tad overkill here, and using bind directly is borderline unreadable even in this simple case. Using applicative style instead allows us to put these together by stringing <$> and <*> along as follows:

personUsingApp :: Maybe Person
personUsingApp = Person <$> fetchName <*> fetchAge <*> fetchPostCode <*> fetchStreetName

This is much nicer, hooray! Again, if we only pass in some of the parameters, we just get back a function that we can <*> away at as we go:

ghci> :t Person <$> fetchName <*> fetchAge <*> fetchPostCode -- ie were missing `<*> fetchStreetName`
Person <$> fetchName <*> fetchAge <*> fetchPostCode
  :: Maybe (StreetName -> Person)

In the next part we'll be taking a look at Haskell's optparse-applicative library, which makes extensive use of applicative style to build command line tools.

Currency arbitrage on Spark and Scala -> part 1

With all the interest over the last few years surrounding cryptocurrencies, paired with their naturally open nature, I thought this might be a nice topic for a series on currency arbitrage. Arbitrage, the practice of exploiting differences in prices of the same asset across multiple markets, is an area of finance particularly well suited to a computational approach. The theory is simple enough: we take an asset, say pork bellies, find someone selling at one price (£1) and someone buying at another (£2), and simply buy from the first to immediately sell to the second, profiting ourselves £1. Applying this to currencies, the natural extension is to apply the same idea across multiple pairs; say we have 3 currencies GBP, USD & JPY (with quotes GBP/USD = 1.5, USD/JPY = 1, JPY/GBP = 1.5). If we trade these simultaneously (or very quickly) GBP->USD->JPY->GBP, we'd end up with more GBP than we started with, risk-free.
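As a quick sanity check of that round trip, converting 1 GBP through each quote in turn:

scala> 1.0 * 1.5 * 1.0 * 1.5 // GBP->USD (1.5), USD->JPY (1.0), JPY->GBP (1.5)
res0: Double = 2.25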

In this series we're going to look at developing an approach to pull market data for cryptocurrencies and identify arbitrage opportunities within it. This is quite a big topic so I'm going to split it into a couple of posts: first laying down some theory and getting a starter implementation up and running, then writing a feed for the data and plugging it into our calculator.

Modeling the problem

We're going to be looking at currency pairs through the lens of graphs, where the vertices represent each currency and the edge weights represent the quote values. Visually, the example from the intro is a small directed graph: three vertices (GBP, USD, JPY) with the quoted rates as the edge weights.

At its core, if we model our currencies as a graph, we're able to identify arbitrage opportunities by finding cycles whose weights, when multiplied together, yield a result > 1.

Thankfully for us, graph theory is an extremely mature topic and there are a host of algorithms we can leverage to approach our problem; in this case we’re going to explore the Bellman-Ford algorithm. Bellman-Ford is a really elegant little algorithm and a surprisingly simple exercise in dynamic programming, identifying the shortest path from a specific node to each of the nodes within the graph by progressively building out a picture of potential shortcuts. Clocking in with a worst case complexity of O(|V||E|), it’s a fair bit slower than algorithms like Dijkstra’s [running in O(|E|+(|V|log|V|))] however, critically in this case, it’s able to cope with (and with a few small modifications, identify) negative weight cycles.

Take the following graph:

Bellman-Ford runs at most |V|-1 relaxation passes over the edges to find the shortest distance to each node (in this case we don't actually need that many). In our case we'd end up with something like [A->0, B->5, C->9, D->-1].

If we update this so that the weights in our cycle (B->C->D->B) sum to a negative figure, eg:

we can now run as many relaxation passes as we like and our algorithm will never settle at a fixed point; we've found a negative-weight cycle.

Attempting to apply this directly to an example using currency pairs, such as the following:

doesn't quite help us, because Bellman-Ford identifies negative-sum cycles and we're looking for positive multiplicative cycles; if we look at our example above, there's clearly an arbitrage opportunity (ETH -> USD -> BTC -> ETH), yet a standard Bellman-Ford won't pick up any negative-weight cycles. Fortunately there's an easy fix: by taking the log of each of our weights prior to processing we can sum them rather than multiplying, then exponentiate the result at a later point if we want to calculate our yield. In the above case this looks something like:

scala> val (a,b,c) = (log(1.6), log(0.7), log(1.1))
a: Double = 0.47000362924573563
b: Double = -0.35667494393873245
c: Double = 0.09531017980432493

scala> val logSum = a+b+c
logSum: Double = 0.20863886511132812

scala> math.pow(math.E, logSum)
res9: Double = 1.232

scala> val target = 1.6 * 0.7 * 1.1 // quick sense check that we're able to recover our product successfully
target: Double = 1.232

As Bellman-Ford allows us to identify where we have negative weight cycles, all we need to do at this point is take the negative log of our weights before we run our algorithm, and it will function as normal.
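Concretely, that preprocessing step is just a map over the edge weights; a minimal hedged sketch, assuming our quotes are already loaded into a GraphX Graph[String, Double]:

import org.apache.spark.graphx.Graph

// Replace each quote with its negative log, so that a profitable cycle
// (product of quotes > 1) becomes a negative-weight cycle under addition.
def toNegativeLogWeights(quotes: Graph[String, Double]): Graph[String, Double] =
  quotes.mapEdges(edge => -math.log(edge.attr))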

Show me the money

My eventual aim with this series is to apply a distributed approach to the problem using Spark, so I'm going to use the GraphX data structures provided by Spark from the start. Our initial implementation is just going to be a classic Bellman-Ford and looks like this:

import org.apache.spark.graphx.{Edge, Graph, VertexId}

import scala.annotation.tailrec

case class Reduction(distances: Map[VertexId, Double], previous: Map[VertexId, VertexId])

class BellmanFord {

  def relaxNonSpark(source: VertexId, graph: Graph[String, Double]) = {
    val verticesCount = graph.vertices.count()
    val edges = graph
      .edges
      .collect()

    @tailrec
    def relax(passes: Long, reduction: Reduction): Reduction =
      if (passes == 0) {
        reduction
      }
      else {
        relax(passes - 1, relaxOnce(edges, reduction))
      }

    relax(verticesCount, Reduction(Map(source -> 0d), Map[VertexId, VertexId]()))
  }

  def relaxOnce(edges: Array[Edge[Double]], reduction: Reduction) = {
    edges
      .foldLeft(reduction)((reduction, edge) => {
        val distances = reduction.distances
        val previous = reduction.previous
        (distances.get(edge.srcId), distances.get(edge.dstId)) match {
          case (None, _) => reduction
          case (Some(distanceToSrc), None) =>
            Reduction(distances + (edge.dstId -> (distanceToSrc + edge.attr)), previous + (edge.dstId -> edge.srcId))
          case (Some(distanceToSrc), Some(distanceToDst)) =>
            if (distanceToSrc + edge.attr < distanceToDst)
              Reduction(distances + (edge.dstId -> (distanceToSrc + edge.attr)), previous + (edge.dstId -> edge.srcId))
            else
              reduction
        }
      })
  }
}

And some tests to show it works:

import com.typesafe.scalalogging.Logger
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

import scala.math.log

class BellmanFordTest extends FunSuite {

  private val logger = Logger[BellmanFordTest]

  private def vertexId(string: String) = string.hashCode.toLong

  private def vertexIdPair(string: String) = (string.hashCode.toLong, string)

  private val spark = SparkSession.builder
    .master("local[*]")
    .appName("Sample App")
    .getOrCreate()

  private val sc = spark.sparkContext

  private val vertices: RDD[(VertexId, String)] =
    sc.parallelize(Seq(vertexIdPair("A"), vertexIdPair("B"), vertexIdPair("C"), vertexIdPair("D")))

  private val bellman = new BellmanFord()

  test("Graph without negative cycle doesn't change with extra relaxations") {

    val edgesNonNeg: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(vertexId("A"), vertexId("B"), 5),
      Edge(vertexId("A"), vertexId("D"), 3),
      Edge(vertexId("B"), vertexId("C"), 3),
      Edge(vertexId("C"), vertexId("D"), -10)
    ))

    val graphWithNoNegativeCycles = Graph(vertices, edgesNonNeg)
    val sourceVertexId = graphWithNoNegativeCycles.vertices.first()._1
    val reduction = bellman.relaxNonSpark(sourceVertexId, graphWithNoNegativeCycles)
    val additionalReduction = bellman.relaxOnce(edgesNonNeg.collect(), reduction)
    logger.info(s"Reduction distances: ${reduction.distances}")
    logger.info(s"Reduction previous map: ${reduction.previous}")
    assert(reduction === additionalReduction)
  }

  test("Graph containing negative cycle does change with extra relaxations") {
    val edgesWithNegativeCycle: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(vertexId("A"), vertexId("B"), 5),
      Edge(vertexId("A"), vertexId("D"), 3),
      Edge(vertexId("B"), vertexId("C"), 3),
      Edge(vertexId("C"), vertexId("D"), -10),
      Edge(vertexId("D"), vertexId("C"), 4)
    ))

    val graphWithNegativeCycles = Graph(vertices, edgesWithNegativeCycle)
    val sourceVertexId = graphWithNegativeCycles.vertices.first()._1
    val reduction = bellman.relaxNonSpark(sourceVertexId, graphWithNegativeCycles)
    val additionalReduction = bellman.relaxOnce(edgesWithNegativeCycle.collect(), reduction)
    assert(reduction !== additionalReduction)
  }

  // note this test isn't complete, there are graphs this would fail for; it's just to validate our test data
  test("Graph with negative cycles under multiplication, but not addition, does not change with extra relaxation without preprocessing") {
    val edgesWithNegativeCycle: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(vertexId("A"), vertexId("B"), 0.8),
      Edge(vertexId("B"), vertexId("C"), 0.7),
      Edge(vertexId("C"), vertexId("D"), 1.1),
      Edge(vertexId("D"), vertexId("A"), 1.2),
      Edge(vertexId("D"), vertexId("B"), 1.6)
    ))

    val graphWithNegativeCycles = Graph(vertices, edgesWithNegativeCycle)
    val sourceVertexId = graphWithNegativeCycles.vertices.first()._1
    val reduction = bellman.relaxNonSpark(sourceVertexId, graphWithNegativeCycles)
    val additionalReduction = bellman.relaxOnce(edgesWithNegativeCycle.collect(), reduction)
    assert(reduction === additionalReduction)
  }

  test("Graph with negative cycles under multiplication changes with extra relaxation when preprocessing applied") {
    val edgesWithNegativeCycle: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(vertexId("A"), vertexId("B"), -log(0.8)),
      Edge(vertexId("B"), vertexId("C"), -log(0.7)),
      Edge(vertexId("C"), vertexId("D"), -log(1.1)),
      Edge(vertexId("D"), vertexId("A"), -log(1.2)),
      Edge(vertexId("D"), vertexId("B"), -log(1.6))
    ))

    val graphWithNegativeCycles = Graph(vertices, edgesWithNegativeCycle)
    val sourceVertexId = graphWithNegativeCycles.vertices.first()._1
    val reduction = bellman.relaxNonSpark(sourceVertexId, graphWithNegativeCycles)
    val additionalReduction = bellman.relaxOnce(edgesWithNegativeCycle.collect(), reduction)
    assert(reduction !== additionalReduction)
  }
}

As you can see, we pass our graph into relaxNonSpark with a given sourceId, then recursively relax our edges once per vertex. As we reduce, we pass down a Reduction object, which contains a map of each node to its (current) distance from the source, and a map of each node to the predecessor from which it was last updated. Note: there are a number of optimisations we can make to our implementation straight off the bat, such as terminating early upon seeing no changes to our Reduction object between relaxations, however I've purposefully left these out for now.
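For reference, the early-termination tweak might look something like this; a hedged sketch (not in the repo code), assuming it lives inside the BellmanFord class above:

  @tailrec
  private def relaxUntilStable(passes: Long, edges: Array[Edge[Double]], reduction: Reduction): Reduction =
    if (passes == 0) reduction
    else {
      val next = relaxOnce(edges, reduction)
      // Reduction is a case class, so == is structural: an unchanged pass means we've hit a fixed point
      if (next == reduction) reduction
      else relaxUntilStable(passes - 1, edges, next)
    }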

Conclusion

At this point we've seen how we can approach currency exchange from the perspective of graphs, and how to leverage this approach to identify opportunities for arbitrage. In the next part we'll write a small web-scraper to pull some real cryptocurrency data and apply it to our implementation to identify arbitrage opportunities, before building on this to leverage Spark to distribute our problem.

Quickstart with Zeppelin, Scala, Spark & Docker

Recently, I've been working on a problem that required doing some exploratory analysis on some of our more high-throughput data sources. Where we've had to do this kind of work in the past, the approach has typically been to reach for the usual suspects: Python and pandas, probably running in Jupyter notebooks. Whilst I've written quite a bit of Python over the years and will occasionally reach for it to whip up a script for this or that, I find the lack of type safety for anything more involved than bashing together a simple automation script really frustrating, so I decided to try out Apache Spark using Scala. After a quick trial of running some code in both Apache Zeppelin and Jupyter (using the spylon-kernel), Zeppelin quickly emerged victorious (in my mind) and so far has been an absolute joy to work with.

The purpose of this post is not to provide yet another x vs y technology breakdown (there are plenty of those on the topic already), but rather to provide a brief intro on how to get up and running quickly with a Zeppelin/Spark/Scala stack.

Installation

For a full local installation of Zeppelin the official docs recommend that you download the binaries and run these against a (local or otherwise) Spark installation; however, for trialling out the stack, and for situations where you don't need a full-scale cluster, it's Docker to the rescue.

Let’s create our docker-compose.yml:

$ mkdir zeppelin && cd $_ && touch docker-compose.yml

In this we want to insert the following:

version: "3.3"
services:
  zeppelin:
    image: apache/zeppelin:0.9.0
    environment:
      - ZEPPELIN_LOG_DIR=/logs
      - ZEPPELIN_NOTEBOOK_DIR=/notebook
    ports:
      - 8080:8080
    volumes: # to persist our data once we've destroyed our containers
      - ./notebook:/notebook 
      - ./logs:/logs

Now all we need to do is start up our container:

$ docker-compose up

Wait a minute to let Docker start up our container, then navigate to 127.0.0.1:8080 and you should see that Zeppelin has started up:

Speaking to spark

The first thing we'll want to do is set up a SparkSession, which we're going to be using as our main entry point; by default Zeppelin injects a SparkContext into our environment in a variable named sc, from which we can create our session.

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.config(sc.getConf).getOrCreate()

From here let's start doing something more useful. Rather than going through yet another exercise in map-reduce-to-count-word-frequencies-in-a-string, let's pop across to the https://ourworldindata.org GitHub repo and grab some sample data; there's lots of data on the current COVID pandemic, so let's use this. I'll include the actual files used in the GitHub repo for this post, but their repo should get you the most up-to-date data.

Spark has multiple APIs we can use to interact with our data, namely Resilient Distributed Datasets (RDDs), DataFrames and Datasets. DataFrames and Datasets are the newer of these (unified under a single API as of Spark 2.0) and will be the ones we use here. To load our csv data and see our new schema we simply need to call:

val df = spark.read
         .format("csv")
         .option("header", "true") // first line in file is our headers
         .option("mode", "DROPMALFORMED")
         .option("inferSchema", "true") // automatically infers the underlying types, super helpful
         .load("/notebook/vaccinations.csv")

df.printSchema

Let's start with something really simple and look to see which country has issued the most vaccinations so far:

val totalVaccinations = df
    .groupBy("location")
    .sum("daily_vaccinations")
    .sort(desc("sum(daily_vaccinations)")) 

// alternatively if we want to use named fields we could also do something like this
df.groupBy("location")
    .agg(sum("daily_vaccinations") as "vaccinationsSum")
    .sort(desc("vaccinationsSum")) 

totalVaccinations.show(10)

This gives us a high-level picture of the sum of our efforts, but say we want to find out how this has developed over time; we might start by collecting the total vaccinations together and sorting them by date so we can track how they change, something like this:

val vaccinations = df.select("location", "date", "total_vaccinations")
        .withColumn("date", to_date(col("date"),"yyyy-MM-dd")) // convert our dates into a more usable format
        .sort("date")
        .sort("location")
vaccinations.createOrReplaceTempView("vaccinations") // we'll use this later

This is obviously rather a lot of data so we can filter this slightly to try to see what’s going on:

vaccinations
    .filter(col("location") === "United States" ||  col("location") === "United Kingdom").sort("location", "date")
    .show(100)
+--------------+----------+------------------+
|      location|      date|total_vaccinations|
+--------------+----------+------------------+
|United Kingdom|2020-12-20|            669674|
|United Kingdom|2020-12-21|              null|
|United Kingdom|2020-12-22|              null|
|United Kingdom|2020-12-23|              null|
|United Kingdom|2020-12-24|              null|
|United Kingdom|2020-12-25|              null|
|United Kingdom|2020-12-26|              null|
|United Kingdom|2020-12-27|            996616|
|United Kingdom|2020-12-28|              null|
|United Kingdom|2020-12-29|              null|
|United Kingdom|2020-12-30|              null|
|United Kingdom|2020-12-31|              null|
|United Kingdom|2021-01-01|              null|
|United Kingdom|2021-01-02|              null|
|United Kingdom|2021-01-03|           1389655|
|United Kingdom|2021-01-04|              null|
|United Kingdom|2021-01-05|              null|
|United Kingdom|2021-01-06|              null|
|United Kingdom|2021-01-07|              null|
|United Kingdom|2021-01-08|              null|
|United Kingdom|2021-01-09|              null|
|United Kingdom|2021-01-10|           2677971|
|United Kingdom|2021-01-11|           2843815|
|United Kingdom|2021-01-12|           3067541|
|United Kingdom|2021-01-13|           3356229|
|United Kingdom|2021-01-14|           3678180|
|United Kingdom|2021-01-15|           4006440|
|United Kingdom|2021-01-16|           4286830|
|United Kingdom|2021-01-17|           4514802|
|United Kingdom|2021-01-18|           4723443|
|United Kingdom|2021-01-19|           5070365|
|United Kingdom|2021-01-20|           5437284|
|United Kingdom|2021-01-21|           5849899|
|United Kingdom|2021-01-22|           6329968|
|United Kingdom|2021-01-23|           6822981|
|United Kingdom|2021-01-24|           7044048|
|United Kingdom|2021-01-25|           7325773|
|United Kingdom|2021-01-26|           7638543|
|United Kingdom|2021-01-27|           7953250|
|United Kingdom|2021-01-28|           8369438|
|United Kingdom|2021-01-29|           8859372|
|United Kingdom|2021-01-30|           9468382|
|United Kingdom|2021-01-31|           9790576|
|United Kingdom|2021-02-01|          10143511|
| United States|2020-12-20|            556208|
| United States|2020-12-21|            614117|
| United States|2020-12-22|              null|
| United States|2020-12-23|           1008025|
| United States|2020-12-24|              null|
| United States|2020-12-25|              null|
| United States|2020-12-26|           1944585|
| United States|2020-12-27|              null|
| United States|2020-12-28|           2127143|
| United States|2020-12-29|              null|
| United States|2020-12-30|           2794588|
| United States|2020-12-31|              null|
| United States|2021-01-01|              null|
| United States|2021-01-02|           4225756|
| United States|2021-01-03|              null|
| United States|2021-01-04|           4563260|
| United States|2021-01-05|           4836469|
| United States|2021-01-06|           5306797|
| United States|2021-01-07|           5919418|
| United States|2021-01-08|           6688231|
| United States|2021-01-09|              null|
| United States|2021-01-10|              null|
| United States|2021-01-11|           8987322|
| United States|2021-01-12|           9327138|
| United States|2021-01-13|          10278462|
| United States|2021-01-14|          11148991|
| United States|2021-01-15|          12279180|
| United States|2021-01-16|              null|
| United States|2021-01-17|              null|
| United States|2021-01-18|              null|
| United States|2021-01-19|          15707588|
| United States|2021-01-20|          16525281|
| United States|2021-01-21|          17546374|
| United States|2021-01-22|          19107959|
| United States|2021-01-23|          20537990|
| United States|2021-01-24|          21848655|
| United States|2021-01-25|          22734243|
| United States|2021-01-26|          23540994|
| United States|2021-01-27|          24652634|
| United States|2021-01-28|          26193682|
| United States|2021-01-29|          27884661|
| United States|2021-01-30|          29577902|
| United States|2021-01-31|          31123299|
| United States|2021-02-01|          32222402|
| United States|2021-02-02|          32780860|
+--------------+----------+------------------+

Still, this is hardly ideal, and it leads us nicely into the fantastic data visualisation support that Zeppelin gives us. In the block where we generated our dataframe containing the total vaccinations by date and country earlier, you'll see we also created a view using the createOrReplaceTempView function; this creates a view against which we can run good old-fashioned SQL to query and visualise our data. Simply add %spark.sql at the top of your block and you're good to go:

%spark.sql
/* note that we need to order by date here 
 even though we've already sorted our data frame, if we don't to this 
 zeppelin will group dates where we have values for each element in the group together 
 and our graphs will come out with the order messed up */
select location, date, total_vaccinations from vaccinations order by date

Note that you’ll need to click into settings and select which fields we’re going to be using and for what:

Whilst this output is naturally very busy, and a little overwhelming on my 13" MacBook, we're immediately getting a decent view into our data. Picking a few countries at random, we can slim this down to compare whomever we wish:

val filterList = Seq("United States", "United Kingdom", "France")
val filtered = vaccinations
                    .filter(col("location").isin(filterList : _*))
                    .sort("location", "date")
filtered.createOrReplaceTempView("filtered")
%spark.sql
select * from filtered sort by date

Easy!

One final thing I'd like to touch on. So far we've been dealing only with Spark DataFrames, the weakly typed cousin of the Dataset. Whilst I surprisingly rather like the stringly-typed nature of the DataFrame, particularly for the speed at which it allows us to play around with the data, for anything other than exploratory analysis or small-scale projects this is obviously inadequate. Fortunately, adding this type information in Spark is trivial: all we need to do is define the shape of our object and pass it to .as as the type parameter.

import sqlContext.implicits._ // import so we can use $ rather than col("some_col")

// note we're defining peopleVaccinated as an Optional here, preventing any nasty surprises
case class Details(
    location: String, 
    isoCode:String, 
    peopleVaccinated: Option[Int], 
    date: java.sql.Timestamp
)

val detailsDs = df.select(
        $"location", 
        $"iso_code" as "isoCode", 
        $"people_vaccinated" as "peopleVaccinated",
        $"date")
    .as[Details]

val ukResults = detailsDs.filter(details => details.location.equals("United Kingdom"))

Additionally, if we supply type information and use Datasets, Zeppelin provides us with some autocomplete functionality.

Whilst this support does move us in the right direction, it is a bit sluggish to be honest and isn’t really a substitute for a real IDE.

Conclusion

So far I've had a pretty good experience with Zeppelin and I really like what it's got to offer; the tooling does exactly what I want from it, and by running in Docker it's easy to get an environment spun up in minutes. Whilst the out-of-the-box experience isn't up to par with a proper IDE, if you've got an IntelliJ Ultimate licence the integration with Big Data Tools brings this in line. Even for simple data analysis, where we'd typically end up reaching for a Python/pandas stack, the Scala/Spark/Zeppelin combination is a great choice for most use-cases.