In [1] (Scala):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder
  .appName("hello-sparklabx")
  .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0")
  .getOrCreate()

// Enables the $"column" syntax used in the cells below
import spark.implicits._

println(s"Spark ${spark.version}")

In [1] (Python):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hello-sparklabx") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0") \
    .getOrCreate()

print(f"Spark {spark.version}")

In [2] (Scala):

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3a://workspace/public/students.csv")

df.groupBy("department")
  .agg(avg($"gpa").as("avg_gpa"))
  .orderBy($"avg_gpa".desc)
  .show()

In [2] (Python):

df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("s3a://workspace/public/students.csv")

df.groupBy("department") \
    .agg({"gpa": "avg"}) \
    .orderBy("avg(gpa)", ascending=False) \
    .show()

In [3] (Scala):

df.filter($"gpa" > 3.5).show(5)

In [3] (Python):

df.filter("gpa > 3.5").show(5)

Six things you don't have to build yourself.

Most "self-hosted notebook" setups stop at docker run jupyter. SparkLabX ships the boring infrastructure — the bits between "it runs on my laptop" and "it survives Monday morning."

User A literally can't read User B's data.

Each user gets a MinIO IAM account scoped to their own S3 prefix. A misclick or a typo in a notebook path returns AccessDenied from the storage layer — your app code never sees the request.

One greedy notebook can't take down the room.

Every user gets their own kernel container/pod. Someone runs df.collect() on a 100 GB DataFrame? Only their pod OOMs — everyone else keeps coding.

Login through Google or Microsoft. No new passwords to manage.

OAuth is wired in. Restrict access by domain (@yourcompany.com) or by individual email. Addresses that are not on the allowlist are rejected before any database row gets created.
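The allowlist rule above can be sketched in a few lines. This is a minimal illustration, not SparkLabX's actual code: the function name is_allowed and both parameters are hypothetical, and it assumes the check runs on the email address returned by the OAuth provider.

```python
from __future__ import annotations


def is_allowed(email: str, allowed_domains: set[str], allowed_emails: set[str]) -> bool:
    """Return True if the address matches an allowed domain or an exact allowed email.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    normalized = email.strip().lower()
    # Reject anything that is not of the form local@domain.
    if normalized.count("@") != 1:
        return False
    local, domain = normalized.split("@")
    if not local or not domain:
        return False
    return normalized in allowed_emails or domain in allowed_domains
```

Running a check like this before creating the user record is what keeps rejected logins from leaving anything behind in the database.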

One bucket for shared datasets, one shelf for everyone's private notebooks.

Drop reference data into s3a://workspace/public/ — every user can read it from PySpark or Scala. Their own work lives at s3a://workspace/users/<you>/, invisible to everyone else.
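As a small sketch of that layout, the two helpers below build paths under the shared and private prefixes. The helper names are hypothetical and the bucket name is taken from the paths shown above; nothing here is part of a SparkLabX API.

```python
BUCKET = "workspace"  # bucket name as it appears in the example paths


def public_path(relative: str) -> str:
    """Path to a shared dataset under the public/ prefix."""
    return f"s3a://{BUCKET}/public/{relative.lstrip('/')}"


def user_path(username: str, relative: str) -> str:
    """Path to a file under one user's private prefix."""
    return f"s3a://{BUCKET}/users/{username}/{relative.lstrip('/')}"
```

In a notebook, spark.read.csv(public_path("students.csv")) would read the shared file, and a write to user_path(...) with your own username would land in your private area.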

Dead kernels heal themselves before users notice.

Container exited overnight? Pod hit CrashLoopBackOff? The next "Connect" detects the corpse, deletes it, and spawns a fresh one. Nobody has to page the on-call engineer at 7am.
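The detect, delete, respawn flow can be sketched roughly like this. It is an illustration under assumptions: the Kernel type, the DEAD_STATES values, and the lookup/delete/spawn callbacks are all hypothetical stand-ins for whatever the backend does against Docker or Kubernetes.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical status values; the real ones depend on the container runtime.
DEAD_STATES = {"exited", "dead", "CrashLoopBackOff"}


@dataclass
class Kernel:
    kernel_id: str
    status: str


def connect(
    user: str,
    lookup: Callable[[str], Optional[Kernel]],
    delete: Callable[[Kernel], None],
    spawn: Callable[[str], Kernel],
) -> Kernel:
    """Return a usable kernel for the user, replacing a dead one if needed."""
    existing = lookup(user)
    if existing is not None and existing.status not in DEAD_STATES:
        return existing  # healthy kernel: reuse it
    if existing is not None:
        delete(existing)  # clean up the dead container or pod first
    return spawn(user)
```

Doing the health check lazily at connect time means no separate watchdog is needed for this case; the cost is that a dead kernel is only noticed when its owner comes back.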

Restart the backend. Lose nothing.

Notebook ↔ kernel mappings, idle-reaper state, spawn progress — all in Postgres. Backend pod can crash and restart mid-execution; the running kernels keep going and reconnect transparently.

Pick your isolation level.

Same product, three deployment shapes. Start with shared for a demo; move to k8s_per_user when you need real boundaries.

demo

shared

One Jupyter container for everyone. Quick demos and single-user dev. Cross-user Spark reads are possible — don't ship this to production.

Single shared kernel
No isolation
Lowest RAM

trusted hosts

docker_per_user

One container per user on the host's Docker daemon. True MinIO IAM isolation per kernel. Great for local dev with prod parity.

Per-user IAM creds
Idle reaper
Requires docker.sock access

production

k8s_per_user

One pod per user, backend runs in-cluster. MinIO IAM plus Kubernetes NetworkPolicy and ResourceQuota. The mode you actually run for paying users.

Pod isolation
NetworkPolicy / quotas
RBAC-scoped backend

Clone, run, log in.

The script generates strong random secrets, pulls public Docker images, and starts the stack with sensible defaults. The admin password is printed when it's done.

Clone the repo

Apache-2.0 licensed. Fork, customize, ship.

Run quickstart.sh

Generates a fresh .env with random secrets and boots the full stack.

Open http://localhost:3000

# Clone
$ git clone https://github.com/sparklabx/sparklabx.git
$ cd sparklabx

# One command, full stack
$ ./quickstart.sh
▶ Generating .env with random secrets...
✓ Secrets generated. (Stored in .env — gitignored.)
▶ Pulling images...
▶ Starting stack...
✓ Backend is up.
────────────────────────────────────────
SparkLabX Notebook is running.
URL: http://localhost:3000
Admin: admin
Password: ••••••••••••••••
Kernel: docker_per_user
────────────────────────────────────────

Where it lands.

An honest comparison with the OSS notebooks you'd otherwise reach for. SparkLabX is essentially what you'd end up wiring on top of JupyterHub if you went down that road — bundled into one Helm chart.

Feature	Plain Jupyter	JupyterHub	SparkLabX
PySpark + Scala kernels	DIY install	DIY install	Bundled, configured
Multi-user	✗	✓	✓
Per-user kernel container	✗	✓	✓
Per-user storage isolation	✗	Filesystem only	IAM-enforced at the S3 layer
OAuth + email allowlist	Plugin	Plugin	Built in
Idle kernel reaper	✗	Configurable	Built in
Auto-respawn on kernel crash	✗	✗	✓
Deploy	`pip install`	Helm chart	Helm chart

The questions everyone asks first.

Can't I just use JupyterHub + Spark?

Yes — and if you want full control over the stack, that's a perfectly valid path. SparkLabX is essentially what you'd end up wiring on top of JupyterHub anyway: PySpark and Scala kernels bundled, per-user S3 isolation enforced by MinIO IAM, an idle reaper, and auto-respawn when a kernel container dies. If you don't need to build that stack from scratch, this saves you the integration work.

Is it production-ready?

The open-source core (notebook engine, kernel isolation, S3 IAM, OAuth, Helm chart) is what you see here: Apache-2.0 licensed, deployed via Helm, and running on standard Kubernetes. The wider platform layers (ingestion, workflows, query, governance, security) ship under Pro. If your use case is "give my data team isolated Spark notebooks," the core is ready for production today.

How big a team does this make sense for?

The sweet spot is roughly 5 to 50 active users. Below that, vanilla Jupyter on a VM is probably enough. Above that, you'll start wanting the platform pieces Pro adds (ingestion, orchestrated workflows, governance, and a security center). The architecture should scale further; we just haven't pressure-tested past ~50 users yet.

Can I run this on my own k8s cluster?

Yes. The Helm chart lives in chart/. It works on EKS, GKE, AKS, k3s, and bare metal as long as you have a default StorageClass and an ingress controller. The three deployment modes (shared, docker_per_user, k8s_per_user) cover everything from a laptop demo to multi-tenant production.

What's on the roadmap?

The open-source core stays as-is — bug fixes and stability. The new development effort is going into Pro: the full self-hosted data platform — ingestion, orchestrated workflows (medallion pipelines), a query engine, data catalog & governance, and a security center — customized to your company. Email for early access or watch the GitHub repo for announcements.

How are storage isolation guarantees actually enforced?

On first login, the backend provisions a MinIO IAM account per user with a policy scoped to users/<username>/* plus the shared public/ prefix. The kernel container is launched with that user's credentials baked in. A PySpark line reading s3a://workspace/users/someone-else/... gets AccessDenied from MinIO before it ever hits any app code. That's the design difference vs every "isolated notebook" project that filters paths in the application layer.
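To make the scoping concrete, here is a sketch of what such a per-user policy document could look like in the S3-style JSON policy format that MinIO accepts. It is an assumption-laden illustration, not the policy SparkLabX actually generates: the function name build_user_policy is hypothetical, and treating public/ as read-only is a guess based on "every user can read it".

```python
import json


def build_user_policy(username: str, bucket: str = "workspace") -> dict:
    """Build an S3-style policy scoped to users/<username>/* plus public/.

    Hypothetical sketch; the action lists and read-only public/ are assumptions.
    """
    user_prefix = f"users/{username}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is limited to the user's own prefix and public/.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {
                    "StringLike": {"s3:prefix": [f"{user_prefix}*", "public/*"]}
                },
            },
            {
                # Full object access inside the user's own prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{user_prefix}*"],
            },
            {
                # Read-only access to the shared datasets (assumed).
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/public/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_user_policy("ana"), indent=2))
```

Because no statement mentions any other user's prefix, a request for users/someone-else/... matches no Allow rule and the storage layer denies it by default.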
