Apache-2.0 · self-hosted

PySpark and Scala notebooks
for your team.

Self-hosted, multi-user. Every user gets their own isolated environment. One Helm install. Apache-2.0.

Built for

Built for teams that need Spark — not just notebooks.

Three teams who pick SparkLabX over wiring JupyterHub, Spark, and an IAM layer together themselves.

Data teams · 5–50 people

Spark notebooks for everyone — without the babysitting.

The pain: Everyone needs PySpark but nobody wants to install Spark on a laptop. A shared Jupyter VM means everyone shares one kernel — one bad df.collect() takes the whole team down.

Every analyst gets their own kernel container, on the cluster you already run. Open source under Apache-2.0.
Regulated industries

Data that legally can't leave your network.

The pain: Compliance, residency rules, or DPO concerns rule out SaaS notebooks. PII and PHI cannot leave your VPC, full stop.

Runs entirely inside your VPC or on-prem k8s. Storage, kernels, auth — all yours.
Internal data platforms

One notebook surface for every BU, properly isolated.

The pain: Each business unit wants notebooks but you can't trust them not to read each other's data. App-layer ACLs are leaky.

Per-user S3 prefix enforced at the IAM layer. Deploy once with Helm, onboard via OAuth.
Try it

Apache Spark in your browser

A faithful preview of the real notebook UI. Click Run on any cell.

scala · In [1]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("hello-sparklabx")
  .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0")
  .getOrCreate()

// Brings the $"column" syntax used in the next cells into scope
import spark.implicits._

println(s"Spark ${spark.version}")
scala · In [2]:
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://workspace/public/students.csv")

df.groupBy("department")
  .agg(avg($"gpa").as("avg_gpa"))
  .orderBy($"avg_gpa".desc)
  .show()
scala · In [3]:
df.filter($"gpa" > 3.5).show(5)

Pre-installed libraries

  • org.apache.spark:spark-core_2.12:3.5.1
  • org.apache.spark:spark-sql_2.12:3.5.1
  • org.apache.spark:spark-streaming_2.12:3.5.1
  • io.delta:delta-spark_2.12:3.0.0
  • org.apache.hadoop:hadoop-aws:3.3.4
  • com.amazonaws:aws-java-sdk-bundle:1.12.x
  • sh.almond:scala-kernel_2.12:0.14.0
  • org.scala-lang:scala-library:2.12.18
Add more with :require (Scala) or %pip install (PySpark).
This is a UI preview — the production notebook runs against a real Spark cluster you control.
Under the hood

Six things you don't have to build yourself.

Most "self-hosted notebook" setups stop at docker run jupyter. SparkLabX ships the boring infrastructure — the bits between "it runs on my laptop" and "it survives Monday morning."

User A literally can't read User B's data.

Each user gets a MinIO IAM account scoped to their own S3 prefix. A misclick or a typo in a notebook path returns AccessDenied from the storage layer — your app code never sees the request.

One greedy notebook can't take down the room.

Every user gets their own kernel container/pod. Someone runs df.collect() on a 100 GB DataFrame? Only their pod OOMs — everyone else keeps coding.
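
To make the "only their pod OOMs" claim concrete, here is a minimal sketch of the kind of per-user pod manifest that produces that behavior on Kubernetes. It is an illustration under assumptions, not SparkLabX's actual spec: the function name, the image name, the labels, and the default limits are all hypothetical.

```python
def kernel_pod_manifest(username, memory_limit="4Gi", cpu_limit="2"):
    """Build a Kubernetes Pod manifest (as a dict) for one user's kernel.

    Hypothetical helper: names, image, and defaults are illustrative only.
    A real implementation would also sanitize `username` into a valid
    DNS label before using it in the pod name.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"kernel-{username}",
            "labels": {"app": "kernel", "user": username},
        },
        "spec": {
            # Do not let the kubelet restart a dead kernel in place.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "kernel",
                    "image": "example/spark-kernel:latest",  # hypothetical image
                    "resources": {
                        # The memory limit is what confines an oversized
                        # df.collect() to this container: exceeding it gets
                        # this container OOM-killed, not its neighbors.
                        "requests": {"memory": memory_limit, "cpu": cpu_limit},
                        "limits": {"memory": memory_limit, "cpu": cpu_limit},
                    },
                }
            ],
        },
    }


manifest = kernel_pod_manifest("alice", memory_limit="2Gi")
print(manifest["spec"]["containers"][0]["resources"]["limits"])
```

Setting requests equal to limits is one possible design choice here; it gives the pod the Guaranteed QoS class, so the scheduler reserves the full amount up front.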

Login through Google or Microsoft. No new passwords to manage.

OAuth is wired in. Restrict access by domain (@yourcompany.com) or by individual email. Addresses that are not on the allowlist are rejected before any DB row gets created.
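
The allowlist check itself is simple enough to sketch. The version below is an assumption about how such a check could look, not the project's code; `is_login_allowed` and its parameters are hypothetical names, and both allowlists are expected to be lowercase.

```python
def is_login_allowed(email, allowed_domains, allowed_emails):
    """Return True if the email matches an allowed domain or exact address.

    Hypothetical helper. `allowed_domains` and `allowed_emails` are sets of
    lowercase strings, e.g. {"yourcompany.com"} and {"contractor@gmail.com"}.
    """
    normalized = email.strip().lower()
    # Reject anything that is not exactly local@domain.
    if normalized.count("@") != 1:
        return False
    local, domain = normalized.split("@")
    if not local or not domain:
        return False
    # Exact domain match only: "yourcompany.com.evil.com" must not pass.
    return normalized in allowed_emails or domain in allowed_domains


domains = {"yourcompany.com"}
emails = {"contractor@gmail.com"}
print(is_login_allowed("Ana@YourCompany.com", domains, emails))
print(is_login_allowed("eve@yourcompany.com.evil.com", domains, emails))
```

Running this check right after the OAuth callback, before touching the database, is what keeps rejected addresses from ever producing a user row.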

One bucket for shared datasets, one shelf for everyone's private notebooks.

Drop reference data into s3a://workspace/public/ — every user can read it from PySpark or Scala. Their own work lives at s3a://workspace/users/<you>/, invisible to everyone else.

Dead kernels heal themselves before users notice.

Container exited overnight? Pod hit CrashLoopBackOff? The next "Connect" detects the corpse, deletes it, and spawns a fresh one. Nobody has to page the on-call engineer at 7am.
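
The detect, delete, respawn flow described above can be sketched in a few lines. This is a toy model under assumptions: `FakeRuntime` is an in-memory stand-in for Docker or Kubernetes, and `connect`, the `kernel-<user>` naming, and the set of dead states are hypothetical rather than taken from the codebase.

```python
from dataclasses import dataclass, field

# States treated as "dead" in this sketch (an assumption, not an exhaustive list).
DEAD_STATES = {"exited", "CrashLoopBackOff"}


@dataclass
class FakeRuntime:
    """In-memory stand-in for a container runtime (hypothetical)."""

    states: dict = field(default_factory=dict)

    def status(self, name):
        # None means the kernel does not exist at all.
        return self.states.get(name)

    def delete(self, name):
        self.states.pop(name, None)

    def spawn(self, name):
        self.states[name] = "running"


def connect(runtime, user):
    """Return 'reused' or 'spawned' after making sure the user has a kernel."""
    name = f"kernel-{user}"
    state = runtime.status(name)
    if state in DEAD_STATES:
        # Clean up the dead kernel so the name is free for a fresh one.
        runtime.delete(name)
        state = None
    if state is None:
        runtime.spawn(name)
        return "spawned"
    return "reused"


runtime = FakeRuntime(states={"kernel-alice": "exited"})
print(connect(runtime, "alice"))  # dead kernel is replaced
print(connect(runtime, "alice"))  # healthy kernel is kept
```

Doing the health check lazily on "Connect" means no background watcher is needed for this path; the cost is that a dead kernel is only noticed when its owner comes back.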

Restart the backend. Lose nothing.

Notebook ↔ kernel mappings, idle-reaper state, spawn progress: all of it lives in Postgres. The backend pod can crash and restart mid-execution; the running kernels keep going and reconnect transparently.
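
A rough sketch of why externalized state survives a restart follows. It deliberately swaps in SQLite from the Python standard library for Postgres so it runs anywhere; the table name, columns, and helper functions are hypothetical and say nothing about the real schema.

```python
import os
import sqlite3
import tempfile

# SQLite is a stand-in for Postgres here. Schema and names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS kernel_sessions (
    notebook_id TEXT PRIMARY KEY,
    kernel_name TEXT NOT NULL,
    last_active REAL NOT NULL
)
"""


def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_mapping(conn, notebook_id, kernel_name, now):
    # SQLite upsert syntax; Postgres would use INSERT ... ON CONFLICT instead.
    conn.execute(
        "INSERT OR REPLACE INTO kernel_sessions VALUES (?, ?, ?)",
        (notebook_id, kernel_name, now),
    )
    conn.commit()


def lookup_kernel(conn, notebook_id):
    row = conn.execute(
        "SELECT kernel_name FROM kernel_sessions WHERE notebook_id = ?",
        (notebook_id,),
    ).fetchone()
    return row[0] if row else None


def idle_kernels(conn, now, timeout_seconds):
    """Kernels whose last activity is older than the timeout (reaper input)."""
    rows = conn.execute(
        "SELECT kernel_name FROM kernel_sessions "
        "WHERE ? - last_active > ? ORDER BY kernel_name",
        (now, timeout_seconds),
    ).fetchall()
    return [r[0] for r in rows]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.db")
        first = open_store(path)
        record_mapping(first, "nb-1", "kernel-alice", now=1000.0)
        first.close()  # simulate the backend process going away
        second = open_store(path)  # simulate the restarted backend
        try:
            return (
                lookup_kernel(second, "nb-1"),
                idle_kernels(second, now=5000.0, timeout_seconds=3600),
            )
        finally:
            second.close()


print(demo())
```

The point of the sketch is only that nothing the backend needs lives in process memory: a fresh process can look up which kernel belongs to which notebook and which kernels are due for reaping.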

Deployment

Pick your isolation level.

Same product, three deployment shapes: shared, docker_per_user, and k8s_per_user. Start with shared for a demo; move to k8s_per_user when you need real boundaries.

demo

shared

One Jupyter container for everyone. Quick demos and single-user dev. Cross-user Spark reads are possible — don't ship this to production.

  • Single shared kernel
  • No isolation
  • Lowest RAM
production

k8s_per_user

One pod per user, backend runs in-cluster. MinIO IAM plus Kubernetes NetworkPolicy and ResourceQuota. The mode you actually run for paying users.

  • Pod isolation
  • NetworkPolicy / quotas
  • RBAC-scoped backend
30-second quickstart

Clone, run, log in.

The script generates strong random secrets, pulls public Docker images, and starts the stack with sensible defaults. The admin password is printed when it's done.

1
Clone the repo

Apache-2.0 licensed. Fork, customize, ship.

2
Run quickstart.sh

Generates a fresh .env with random secrets and boots the full stack.

3
Open http://localhost:3000

Log in as admin with the printed password. OAuth is optional.

~/projects · zsh
# Clone
$ git clone https://github.com/sparklabx/sparklabx.git
$ cd sparklabx

# One command, full stack
$ ./quickstart.sh

▶ Generating .env with random secrets...
✓ Secrets generated. (Stored in .env — gitignored.)
▶ Pulling images...
▶ Starting stack...
✓ Backend is up.

────────────────────────────────────────
SparkLabX Notebook is running.

  URL:        http://localhost:3000
  Admin:      admin
  Password:   ••••••••••••••••
  Kernel:     docker_per_user
────────────────────────────────────────
Vs alternatives

Where it lands.

An honest comparison with the OSS notebooks you'd otherwise reach for. SparkLabX is essentially what you'd end up wiring on top of JupyterHub if you went down that road — bundled into one Helm chart.

Feature                       | Plain Jupyter | JupyterHub      | SparkLabX
PySpark + Scala kernels       | DIY install   | DIY install     | Bundled, configured
Multi-user                    |               |                 |
Per-user kernel container     |               |                 |
Per-user storage isolation    |               | Filesystem only | IAM-enforced at the S3 layer
OAuth + email allowlist       | Plugin        | Plugin          | Built in
Idle kernel reaper            |               | Configurable    | Built in
Auto-respawn on kernel crash  |               |                 |
Deploy                        | pip install   | Helm chart      | Helm chart
In development · early access

Next: SparkLabX Pro
built on the core you just saw.

Same notebook engine, with the pieces you can't get out of any other Spark platform: run a classroom of 200 students on a real cluster, time-boxed exams with sealed kernels, and Spark workers that burst across AWS, GCP, or Azure depending on what's cheapest right now.

FAQ

The questions everyone asks first.

Can't I just use JupyterHub + Spark?
Yes — and if you want full control over the stack, that's a perfectly valid path. SparkLabX is essentially what you'd end up wiring on top of JupyterHub anyway: PySpark and Scala kernels bundled, per-user S3 isolation enforced by MinIO IAM, an idle reaper, and auto-respawn when a kernel container dies. If you don't need to build that stack from scratch, this saves you the integration work.
Is it production-ready?
The core (notebook engine, kernel isolation, S3 IAM, OAuth, Helm chart) is what you see — Apache-2.0, deployed via Helm, runs on standard k8s. The classroom, exam, and multi-cloud layers are still in development under Pro. If your use case is "give my data team isolated Spark notebooks," you're production-ready today.
How big a team does this make sense for?
The sweet spot is roughly 5 to 50 active users. Below that, vanilla Jupyter on a VM is probably enough. Above that, you'll start wanting features Pro is being built for (auto-scaling, cohort isolation, billing-grade audit). The architecture should scale further; we just haven't pressure-tested it past ~50 users yet.
Can I run this on my own k8s cluster?
Yes. The Helm chart lives in chart/. It works on EKS, GKE, AKS, k3s, and bare metal as long as you have a default StorageClass and an ingress controller. Three deployment modes (shared, docker_per_user, k8s_per_user) cover everything from a laptop demo to multi-tenant production.
What's on the roadmap?
The open-source core stays as-is — bug fixes and stability. The new development effort is going into Pro: classroom mode, exam isolation, multi-cloud Spark bursting, and a hosted control plane for teams that don't want to self-host. Email for early access or watch the GitHub repo for announcements.
How are storage isolation guarantees actually enforced?
On first login, the backend provisions a MinIO IAM account per user with a policy scoped to users/<username>/* plus the shared public/ prefix. The kernel container is launched with that user's credentials baked in. A PySpark line reading s3a://workspace/users/someone-else/... gets AccessDenied from MinIO before it ever hits any app code. That's the design difference vs every "isolated notebook" project that filters paths in the application layer.
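
For readers who want to picture the policy, here is a minimal sketch of an S3-style policy document scoped the way the answer above describes. It is an assumption about the shape, not the policy SparkLabX actually provisions: `build_user_policy` is a hypothetical helper, the bucket name `workspace` is taken from the s3a://workspace/ paths on this page, and treating public/ as read-only is a guess based on "every user can read it".

```python
import json


def build_user_policy(bucket, username):
    """Return an S3-style policy dict: own prefix read/write, public/ read-only.

    Hypothetical helper; the exact actions and conditions are assumptions.
    """
    own_prefix = f"users/{username}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is limited to the user's own prefix and public/.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {
                    "StringLike": {"s3:prefix": [f"{own_prefix}*", "public/*"]}
                },
            },
            {
                # Full object access inside the user's own prefix only.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{own_prefix}*"],
            },
            {
                # Read-only access to the shared datasets (assumed).
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/public/*"],
            },
        ],
    }


print(json.dumps(build_user_policy("workspace", "alice"), indent=2))
```

Because nothing in the document grants access to users/someone-else/, a request for that prefix has no matching Allow statement and the storage layer denies it by default.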