Streamlining Asynchronous Workflows with Oban

### Streamlining Asynchronous Workflows with Oban
           
            [@mtunski](https://tunski.co)

---
            
            ### Agenda

- (SEO) Audit in Surfer
            -  
              Apps/Services layout
            - 
              RabbitMQ as communication medium 
            - 
              The Old Way of doing things & its problems 
            - 
              Oban to the rescue 
            - 
              Good, bad & ugly of The Oban Way
            
            ---
            
            ### Disclaimer

- Code examples are simplified for the sake of presentation and might not compile
            - Screenshot quality is not the best
            
            
            ---

### (SEO) Audit in Surfer

---

Input: audited page's URL & targeted keyword(s)

---

Output: competition-based audit & recommendations

---

### The Workflow

```txt[1|2-5|6-7|1-7]
            Scrape the audited page 
            |-> Fetch the SERP for targeted keyword 
            |   |-> Scrape top 10 pages from SERP 
            |       |-> Analyze scraped pages (e.g. calculate Content Score)
            |           |-> Select 5 best (by Content Score) competitors           
            |*-> Fetch the SERP scoped to audited page's domain 
                 |*-> Analyze internal links
            ```
            
            \* optional steps, audit is still considered completed if they fail - they can be retried on demand

---
            
            ### Apps/Services
            
            ---
            
            ### The Monolith

* Surfer - Elixir & React web application 
              * 
                Coordinates the whole process 
              * 
                Does some processing/analysis on its own 
              *  
                Stores everything in the database 
              * 
                Serves the results to the user

---
            
            ### (Not-so-micro)Services
            
            * **Scraper** - Elixir app (in umbrella) for fetching & parsing SERPs 
              * 
                in essence, does an API call to one of our SERP data providers 
            *  
              **Crawler** - Node.js app for scraping individual pages, data extraction and initial analysis
                *  
                  uses Puppeteer to scrape pages

disclaimer: In Surfer, we call fetching the SERP "scraping" and scraping individual pages "crawling", so that's the
            nomenclature I'll be using from here on ¯\\\_(ツ)\_\/¯

---

### Async comms: RabbitMQ

<a href="https://www.rabbitmq.com" target="_blank">https://www.rabbitmq.com</a>
            
            ```txt
                    |-> scrape_requests -> Scraper -> scrape_results -|
            Surfer -|                                                 |-> Surfer
                    |-> crawl_requests --> Crawler -> crawl_results --|
            ``` 
        
            - 
              Easily scalable: spin up more Scrapers/Crawlers to handle higher load 
            - 
              Fault tolerant:
                - 
                  if Scraper/Crawler instance goes down, the message(s) will be picked up by another one 
                - 
                  if Surfer goes down, the message(s) will be picked up after restart

---

### The Old Way

---

### AMQP

<a href="https://github.com/pma/amqp" target="_blank">https://github.com/pma/amqp</a>
            
            Consumer & producer modules per service

* 
              `Surfer.GoogleScrapeScheduler` 
            * 
              `Surfer.ScrapedKeywordsConsumer`
            * 
              `Surfer.PageCrawlScheduler`
            * 
              `Surfer.CrawledPagesConsumer`
            
            ---

### DB Entities

```elixir[1-5|7-11|13-14|16-2137]
            # Surfer.AuditQuery
            schema "audit_queries" do
              has_one(:scrape, Scrape, where: [kind: "keyword"])
              has_one(:site_scrape, Scrape, where: [kind: "site"])
            end

# Surfer.Scrape
            schema "scrapes" do
              has_many(:search_results, SearchResult)
              has_many(:crawled_pages, CrawledPage)
            end

# Surfer.SearchResult
            schema "search_results" do #...

# Surfer.CrawledPage
            schema "crawled_pages" do #...
            ```

---

#### `Surfer.CreateAuditQuery`

```elixir[1,13|2|4-5|7|9-10|12]
            Repo.transaction(fn repo ->
              audit_query = repo.insert!(%AuditQuery{...})

scrape = repo.insert!(%Scrape{audit_query_id: audit_query.id, ...})
              site_scrape = repo.insert!(%Scrape{audit_query_id: audit_query.id, ...})
              
              audited_page = repo.insert!(%CrawledPage{scrape_id: scrape.id, ...})
              
              GoogleScrapeScheduler.schedule(scrape)
              GoogleScrapeScheduler.schedule(site_scrape)

              PageCrawlScheduler.schedule(audited_page)
            end)
            ```
            
            ---

#### `Surfer.GoogleScrapeScheduler`

```elixir
            def schedule(scrape) do
              # the message must contain scrape_id
              scrape |> prepare_payload() |> publish_scrape_request() 
            end 
            ```

---

#### `Surfer.ScrapedKeywordsConsumer`

```elixir[1,4|2|3|6,19|7|9-11,18|12-13|12-14|12-15|12-16|12-17]
            def handle_info({:basic_deliver, payload, meta}, {channel, connection}) do
              payload |> decode() |> process()
              ack(channel, connection, meta.delivery_tag)
            end
            
            defp process(scrape_results) do 
              scrape = load_scrape(scrape_results.scrape_id)
              
              # scrape can be failed when the initial crawl of audited page failed
              # it's a special case for audits only
              if scrape.state != "failed" do 
                scrape 
                |> update_state(scrape_results) # can be failed if the SERP was empty
                |> maybe_insert_search_results(scrape_results)
                |> maybe_insert_crawled_pages() #⬅️⬇️ for "keyword" scrapes only
                |> maybe_schedule_crawls() #❗when scrape is ready we publish crawl requests
                |> broadcast_finished()
              end
            end
            ```
            
            There are _a few_ more maybes in the original `process` function

---

#### `Surfer.PageCrawlScheduler`
            
            ```elixir
            def schedule(crawled_page) do
              # the message must contain crawled_page_id
              crawled_page |> prepare_payload() |> publish_crawl_request()
            end
            ```

---

#### `Surfer.CrawledPagesConsumer`

```elixir[1,11|2|4-5|4-6|4-8|4-9|4-10|13-16|18,36|19-26,33|27|29-32|35]
            defp process(crawl_results) do 
              crawled_page = load_crawled_page(crawl_results.crawled_page_id)
              
              crawled_page 
              |> update_state(crawl_results)
              |> maybe_fail_audit_query() 
              # |> maybe_fail_content_editor_query() 
              # |> ...() 
              |> maybe_analyze_scrape() 
              |> broadcast_finished()
            end

defp maybe_fail_audit_query(crawled_page) do 
              # code that checks if this was the initial crawl of audited page
              # and if so, fails the audit query
            end

defp maybe_analyze_scrape(crawled_page) do 
              # scrape is ready for analysis when all its pages are crawled,
              # so when processing every SINGLE crawled page, we have to load its scrape
              # with all its crawled pages and check their state!
              #
              # in our audit example, only the last (10th) crawl will satisfy this condition
              # we also have to guard against the audited page's crawl finishing
              # before the keyword scrape has scheduled its crawls! 🤯
              if scrape_ready_for_analysis?(crawled_page.scrape) do
                ScrapeAnalyzer.analyze_scrape(crawled_page.scrape)

crawled_page 
                |> do_something_if_content_planner()
                |> do_something_if_ai_article()
                # |> ...
              end

crawled_page
            end
            ```

There are _loads of_ maybes in the original `process` & `maybe_analyze_scrape` functions

---
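
#### Sketch: the readiness check

The comments in `maybe_analyze_scrape` describe the check but the deck never shows it. Below is a minimal sketch of what it could look like. The `crawls_scheduled?` flag and the state names are assumptions made for illustration, not Surfer's actual schema.

```elixir
defmodule ScrapeReadinessSketch do
  # Hypothetical terminal states; the real state names are not shown in the deck.
  @terminal_states ["completed", "failed"]

  # Guard: the audited page's crawl may finish before the keyword scrape
  # has scheduled its crawls, so a scrape without scheduled crawls is never ready.
  def scrape_ready_for_analysis?(%{crawls_scheduled?: false}), do: false

  # Otherwise the scrape is ready once every crawled page reached a terminal state.
  def scrape_ready_for_analysis?(%{crawled_pages: pages}) do
    pages != [] and Enum.all?(pages, &(&1.state in @terminal_states))
  end
end
```

Note that this runs for every single crawl result, which is exactly the repeated work the slide above complains about.

---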

### Problems

-  
              Difficult to follow the workflow from start to end 
            - 
              If/elsing in producers and consumers: 
              - 
                to handle "scrape finished" when all its pages are crawled 
              - 
                to handle different requirements for other Surfer modules 
            - 
              Difficult to handle errors 
            - 
            No observability beyond RabbitMQ dashboard 
            
            ---

### Can we do better?

- Keep the scalability
            - 
              Make it easier to follow the whole workflow 
            - 
              Reduce conditional logic per Surfer module 
            - 
              Maybe improve observability of the process 
            - 
              Maybe get some nice error handling

---
            
            ### The New Way

---

### Oban

<a href="https://getoban.pro" target="_blank">https://getoban.pro</a>
            
            <a href="https://hexdocs.pm/oban" target="_blank">https://hexdocs.pm/oban</a>

> Oban is a background job system built on modern PostgreSQL and SQLite3 with the primary goals of reliability,
            consistency and observability. Thousands of Elixir applications rely on Oban to coordinate their async workloads.

---

### Oban Web & Pro

<a href="https://getoban.pro/docs/pro" target="_blank">https://getoban.pro/docs/pro</a>
            <a href="https://getoban.pro/docs/web" target="_blank">https://getoban.pro/docs/web</a>

> (...) real-time monitoring with Oban Web, and complex
            workflow management with Oban Pro.

---

### Oban 101

> Worker modules do the work of processing a job.

>  
              Jobs are simply Ecto structs and are enqueued by inserting them into the database. For convenience and consistency all workers provide a new/2 function that converts an args map into a job changeset suitable for insertion.

>  
              When a job returns an error value, raises an error, or exits during execution the details are recorded within the errors array on the job. When the number of execution attempts is below the configured max_attempts limit, the job will automatically be retried in the future. The retry delay has an exponential backoff (...).

> 
              Workflow workers compose together with arbitrary dependencies between jobs, allowing sequential, fan-out, and fan-in execution workflows.

---
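
#### Sketch: retries with exponential backoff

To make the retry rule quoted above concrete, here is a small dependency-free sketch. The base delay and the cap are illustrative constants, not Oban's actual defaults; in Oban itself a worker can customize the delay through the `backoff/1` callback.

```elixir
defmodule BackoffSketch do
  # Illustrative constants, NOT Oban's real defaults.
  @base_seconds 15
  @max_seconds 3_600

  # Delay (in seconds) before retrying after the given 1-based failed attempt.
  # Doubles with every attempt and is capped at @max_seconds.
  def delay(attempt) when is_integer(attempt) and attempt >= 1 do
    min(@base_seconds * Integer.pow(2, attempt - 1), @max_seconds)
  end

  # A job is retried only while its attempt count is below max_attempts.
  def retry?(attempt, max_attempts), do: attempt < max_attempts
end
```

---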

#### `Surfer.CreateAuditQuery`
            
            ```elixir[9]
            Repo.transaction(fn repo ->
              audit_query = repo.insert!(%AuditQuery{...})
              
              scrape = repo.insert!(%Scrape{audit_query_id: audit_query.id, ...})
              repo.insert!(%Scrape{audit_query_id: audit_query.id, ...})
              
              repo.insert!(%CrawledPage{scrape_id: scrape.id, ...})
              
              InitializeAuditQueryWorkflow.run(audit_query)
            end)
            ```

---

#### `Surfer.InitializeAuditQueryWorkflow`
            
            ```elixir[1|3|4-8|9|10-12|10-13|10-14|10-15|10-16|17-19|17-20|21|24-26]
            use Surfer.Oban.Workflow # provides init_workflow/3, add_step/3, start_workflow/1

def run(query) do
              query
              |> init_workflow("initialize-audit-#{query.id}",
                queue: :audits,
                default_args: %{organization_id: query.organization.id}
              )
              |> add_step(&crawl_audited_page/1)
              # branch 1: "keyword" scrape for discovering competitors and preparing analysis
              # this branch determines query completion
              |> add_step(&fetch_serp/1, deps: [:crawl_audited_page])
              |> add_step(&crawl_top_10_pages_from_serp/1, deps: [:fetch_serp])
              |> add_step(&analyze_crawled_pages/1, deps: [:crawl_top_10_pages_from_serp])
              |> add_step(&select_competitors/1, deps: [:analyze_crawled_pages])
              |> add_step(&complete_query/1, deps: [:select_competitors])
              # branch 2: "site" scrape for discovering internal links
              # query can still be successful when internal links discovery fails
              |> add_step(&fetch_serp_for_internal_links/1, deps: [:crawl_audited_page])
              |> add_step(&check_internal_links/1, deps: [:fetch_serp_for_internal_links])
              |> start_workflow()
            end

defp crawl_audited_page(query) do
              CrawlAuditedPageWorker.new(%{query_id: query.id})
            end
            ```

---
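
#### Sketch: what `deps` buy us

The internals of `Surfer.Oban.Workflow` are not shown in this deck. As a rough mental model only, the sketch below keeps steps as plain data (instead of Oban jobs) and works out which steps may run once their dependencies are done. The module and function names are hypothetical.

```elixir
defmodule WorkflowSketch do
  # Hypothetical stand-in for the Surfer.Oban.Workflow helpers:
  # steps are plain maps here, not Oban jobs.
  def init_workflow(name), do: %{name: name, steps: []}

  def add_step(workflow, step_name, opts \\ []) do
    step = %{name: step_name, deps: Keyword.get(opts, :deps, [])}
    %{workflow | steps: workflow.steps ++ [step]}
  end

  # Steps that are not completed yet and whose dependencies all are.
  def runnable_steps(workflow, completed) do
    for %{name: name, deps: deps} <- workflow.steps,
        name not in completed,
        Enum.all?(deps, &(&1 in completed)),
        do: name
  end
end

workflow =
  "initialize-audit-1"
  |> WorkflowSketch.init_workflow()
  |> WorkflowSketch.add_step(:crawl_audited_page)
  |> WorkflowSketch.add_step(:fetch_serp, deps: [:crawl_audited_page])
  |> WorkflowSketch.add_step(:crawl_top_10_pages_from_serp, deps: [:fetch_serp])
  |> WorkflowSketch.add_step(:fetch_serp_for_internal_links, deps: [:crawl_audited_page])
```

Once `:crawl_audited_page` completes, both branches become runnable at the same time, which is the fan-out the workflow definition expresses.

---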
            
            ### Async comms: still RabbitMQ

-  
              `Surfer.Oban.RabbitClient` (producer & consumer)
            - 
              `Surfer.Oban.AsyncWorker` (behaviour)
            
            ---
            
            ### `Surfer.Oban.AsyncWorker` 
            
            - behaviour implemented by specific "async workers"
            -  
              builds a job that will be executed asynchronously (published & awaited) 
            -  
              executing jobs are moved to another queue (`async_work_awaits`) so they don't block other jobs

---

### `Surfer.Oban.RabbitClient`

<div style="font-size: 1.75rem;">
            
            - producer & consumer module
            -  
              publishes work requests to service queues with a `correlation_id` containing the job's id, to identify the source job when the results come back
            -  
              handles responses from **all** services, because every request is published with `reply_to: :async_work_results`
              -  
                thus, **must only do two things**:
                -  
                  update the job (parse, validate & save the results in the db); this must be done here, because the instance that receives the results message might differ from the one processing the job; the job process might also _not_ be running at all (it's orphaned when a node goes down mid-processing, and it takes a while to "rescue" it)
                -  
                  notify its process about completion; if it's running it should be stopped immediately
              -  
                any other processing of the results must be done in **separate** workers

---

### `Surfer.Oban.RabbitClient`

```elixir
            def schedule_work(queue, payload, job_id) do
              publish(exchange(), queue, payload, reply_to: :async_work_results, correlation_id: job_id)
            end

def handle_deliver(_consumer, %{payload: payload, meta: meta}) do
              AsyncWorker.finish_job(meta.correlation_id, payload)
            end
            ```

---

### `Surfer.Oban.AsyncWorker`

```elixir[1-15|17-25|27-37|39-61|46-61]
            defmacro __using__(opts) do
              quote do
                use Oban.Pro.Worker

@impl Oban.Pro.Worker
                def process(%Oban.Job{queue: "async_work_awaits"} = job) do
                  AsyncWorker.await_work_finished(job)
                end

@impl Oban.Pro.Worker
                def process(job) do
                  AsyncWorker.schedule_work(job)
                end
              end
            end

def schedule_work(job) do 
              worker_module = get_job_worker_module(job)

job
              |> worker_module.prepare_payload()
              |> worker_module.schedule_work(job)

put_job_in_awaits_queue(job)
            end

def await_work_finished(%Oban.Job{id: job_id}) do
              :ok = listen(oban(), [:async_workers])
              
              receive do
                {:notification, :async_workers, %{"work_completed" => ^job_id}} ->
                  :ok

{:notification, :async_workers, %{"work_failed" => ^job_id, "error" => error}} ->
                  {:error, error}
              end
            end

def finish_job(job_id, payload) do
              job = load_job(job_id)
              worker_module = get_job_worker_module(job)

results =
                payload
                |> worker_module.parse_results()
                |> worker_module.validate_results()
                |> case do
                  {:ok, results} -> %{results: encode64(results)}
                  {:error, error} -> %{work_error: error}
                end

job
              |> store_results(results)
              |> case do
                %{meta: %{results: _results}} ->
                  notify(oban(), :async_workers, %{work_completed: job.id})

%{meta: %{work_error: error}} ->
                  notify(oban(), :async_workers, %{work_failed: job.id, error: error})
              end
            end
            ```

---
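
#### Sketch: awaiting a notification

The await/notify handshake above can be tried out without Oban or RabbitMQ. In this sketch, plain `send/2` stands in for the notifier, and the `after` timeout is an added assumption (the simplified `receive` above has none).

```elixir
defmodule AwaitSketch do
  # Blocks until a completion or failure message for job_id arrives,
  # or gives up after timeout_ms (a hypothetical safety net).
  # Messages for other job ids stay in the mailbox untouched.
  def await_work_finished(job_id, timeout_ms \\ 5_000) do
    receive do
      {:notification, :async_workers, %{"work_completed" => ^job_id}} ->
        :ok

      {:notification, :async_workers, %{"work_failed" => ^job_id, "error" => error}} ->
        {:error, error}
    after
      timeout_ms -> {:error, :timeout}
    end
  end
end

# Simulate the results consumer notifying the waiting job process.
send(self(), {:notification, :async_workers, %{"work_completed" => 42}})
```

Pinning `^job_id` is what lets many awaiting jobs share one channel: each process only reacts to the message carrying its own id.

---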

### `Surfer.CrawlUrlWorker`

```elixir[1-2|4-7|9-12|14-15|17-18|20-21]
            alias Surfer.Oban.AsyncWorker
            use AsyncWorker, queue: :crawls

@impl AsyncWorker
            def prepare_payload(job) do
              prepare_rabbit_message(job)
            end

@impl AsyncWorker
            def schedule_work(payload, job) do
              RabbitClient.schedule_work(:crawl_requests, payload, job.id)
            end

@impl AsyncWorker
            def parse_results(raw_data), do: Msgpax.unpack!(raw_data)

@impl AsyncWorker
            def validate_results(%{error_message: error_message}), do: {:error, error_message}

@impl AsyncWorker
            def validate_results(data), do: {:ok, data}
            ```

---
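
#### Sketch: the behaviour contract

The four `@impl AsyncWorker` functions above imply a behaviour with four callbacks. The deck doesn't show its definition, so the typespecs below are guesses, and `EchoWorkerSketch` is a made-up implementation that talks to nothing. String keys are used on the assumption that decoded payloads carry string keys.

```elixir
defmodule AsyncWorkerBehaviourSketch do
  # Hypothetical callback specs inferred from the CrawlUrlWorker slide.
  @callback prepare_payload(job :: map()) :: binary()
  @callback schedule_work(payload :: binary(), job :: map()) :: :ok | {:error, term()}
  @callback parse_results(raw_data :: binary()) :: map()
  @callback validate_results(data :: map()) :: {:ok, map()} | {:error, term()}
end

defmodule EchoWorkerSketch do
  @behaviour AsyncWorkerBehaviourSketch

  @impl true
  def prepare_payload(job), do: "crawl:" <> job.args.url

  # A real worker would publish the payload to a service queue here.
  @impl true
  def schedule_work(_payload, _job), do: :ok

  @impl true
  def parse_results(raw_data), do: %{"body" => raw_data}

  # Results carrying an error message are rejected, everything else passes.
  @impl true
  def validate_results(%{"error_message" => message}), do: {:error, message}
  def validate_results(data), do: {:ok, data}
end
```

---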
            
            <img src="crawl-executing.png" class="r-stretch">
            
            ---
            
            <img src="crawl-failed.png" class="r-stretch">

---
            
            ### Problem

```elixir[6]
            # Surfer.InitializeAuditQueryWorkflow

def run(query) do
              # ...
              |> add_step(&fetch_serp/1, deps: [:crawl_audited_page])
              |> add_step(&crawl_top_10_pages_from_serp/1, deps: [:fetch_serp])
              |> add_step(&analyze_crawled_pages/1, deps: [:crawl_top_10_pages_from_serp])
              # ...
            end
            ```

The `crawl_top_10_pages_from_serp` step must add new crawl jobs to the workflow based on the results of the `fetch_serp` step, which aren't known when the workflow is created

---

### Solution

> Sometimes all jobs aren't known when the workflow is created. In that case, you can add more jobs with optional
            dependency checking using append_workflow/2.

Inside the `crawl_top_10_pages_from_serp` step, we perform a custom `prepend` operation, which adds the new crawl jobs as dependencies of the step itself.

Thanks to that, we avoid "append waterfalls" in respective workers (e.g. scrape -> append crawls -> append next steps) and have a beautiful workflow definition where all crucial steps are visible.

---

#### `Surfer.CrawlScrapePagesWorker`

```elixir[1-2|4-10|12-21]
            alias Surfer.Oban.WorkflowWorker
            use WorkflowWorker

@impl Oban.Pro.Worker
            def process(job) do
              scrape = load_scrape(job)
              crawl_jobs = build_crawl_jobs(scrape)

prepend(job, :crawl_pages, crawl_jobs)
            end

@impl WorkflowWorker
            def process_prepend(job, :crawl_pages) do
              crawl_jobs = load_prepend(job, :crawl_pages)

if enough_successfully_crawled_pages?(crawl_jobs) do
                :ok
              else
                {:cancel, :not_enough_successfully_crawled_pages}
              end
            end
            ```

The prepending mechanism is implemented in `Surfer.Oban.WorkflowWorker`, but we won't go into its details in this presentation

---
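
#### Sketch: deciding after the prepended crawls

`enough_successfully_crawled_pages?/1` is not shown above. A possible shape is sketched below, assuming each prepended job exposes an Oban-style `state` string and that 5 successful crawls are enough (a threshold guessed from the "select 5 best competitors" step, not taken from Surfer's code).

```elixir
defmodule PrependCheckSketch do
  # Hypothetical threshold, inferred from the "select 5 best competitors" step.
  @min_successful 5

  # Counts prepended crawl jobs that finished successfully.
  def enough_successfully_crawled_pages?(crawl_jobs) do
    Enum.count(crawl_jobs, &(&1.state == "completed")) >= @min_successful
  end

  # Mirrors the return values of process_prepend/2 from the slide above.
  def process_prepend_result(crawl_jobs) do
    if enough_successfully_crawled_pages?(crawl_jobs) do
      :ok
    else
      {:cancel, :not_enough_successfully_crawled_pages}
    end
  end
end
```

Returning `{:cancel, reason}` rather than an error means the step isn't retried: recrawling the same pages is unlikely to change the outcome.

---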
            
            ### Profit

- We still use RabbitMQ, so we can scale easily
            - 
              The whole workflow is defined in a single file, so it's super clear which steps need to be performed and in what order
            - 
              We have Surfer-module-agnostic workers, e.g. `CrawlUrlWorker`, that can be plugged into any workflow that requires crawling
            - 
              We have Oban Web dashboard where we can monitor the whole process
            - 
              We get error handling with retries and exponential backoff for free

---

### Bonus

- Synchronous, RPC-like calls thanks to `Oban.Pro.Relay` plugin
            - 
              It's now easy to send requests to staging/production machines from a dev environment by setting the `Surfer.Oban.AsyncWorker` results queue, e.g. `async_work_results/tuna-dev`

---

<img src="schedule.png" class="r-stretch">
        
---

### Not-so-great

- Quite a lot of custom code and macros 🙈
            -  
              Oban's dependency resolution mechanism isn't great 
              -  
                it basically performs periodic deps checks for all awaiting jobs and by default it blocks the queue
            -  
              There's no way to see an overview of a single workflow in Oban Web (yet)
            -  
              Additional load on the database 
              -  at Surfer we are running a couple million jobs daily without any hiccups so far
          
            ---
            
            ### Room for improvement
            
            - We haven't yet removed the db entities that store intermediate step results, and we could, as these can be safely
            stored in Oban jobs
            - 
            In theory, it'd be possible to ditch RabbitMQ completely, but this would require us to implement custom Oban (ergo
            Elixir) facade apps for each of our services

---

### That's it, thanks! 
            #### This deck 👇🏻
            <img src="qrcode-deck.png" width="300" height="300">