Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri Chimeh, NVIDIA

Channel: aiDotEngineer

Published at: 2026-04-10

YouTube video id: c5-kx2bwoCk

Source: https://www.youtube.com/watch?v=c5-kx2bwoCk

Hello everyone. I'm Mozhgan Kabiri Chimeh, developer relations manager at NVIDIA, where I work closely with developers building and deploying AI systems. Today, we're looking at running LLMs locally: practical LLM performance on the DGX Spark.
This isn't a theoretical talk. It's a
data-backed journey through the
trade-offs of modern AI infrastructure.
Findings are based on hands-on
experiments with the goal of
understanding what's actually practical
on a single system.
The evolution in AI puts greater demand
on developer systems, creating two main
challenges.
You either run out of memory or you do not have access to the right software stack. Then you end up pushing everything to the cloud or a data center.
As models move from experiments to
production, concerns like cost
predictability,
data residency, and deterministic
latency take center stage.
And iteration speed often depends on access to shared infrastructure. Since your work will be scheduled against other competing workloads, that causes delays in development work as well.
So, the question becomes, can we bring
some of that workflow closer to where
development actually happens?
To maximize developer productivity,
local solutions to these challenges are
required.
The DGX Spark is designed from the ground up to build and run AI and can be used as a standalone system. It's powered by the GB10 Grace Blackwell superchip, combining CPU and GPU with a unified memory architecture. With 128 GB of unified memory and NVFP4 support, it enables developers to work with models of up to around 200 billion parameters locally, on a system that fits on top of your desk.
It runs the same Nvidia AI software
stack used in production environments,
meaning workflows can move from desktop
to data center or cloud with minimal
changes.
The key idea here is not replacing the
cloud, but bringing
powerful AI development closer to the
developer.
This is my setup. Everything runs locally and is reproducible. To serve the models, I used vLLM with a set of quantized models across different sizes and precision formats. I ran this inside an NVIDIA-optimized container, which ensures that the environment is identical to what you would deploy in a data center.
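As a rough illustration of that serving step, here's a minimal sketch, assuming vLLM's standard OpenAI-compatible server launched from Python. The model ID and port are placeholders, not necessarily what was used in the talk:

```python
# Hypothetical sketch of the serving step: vLLM's OpenAI-compatible
# server, launched via its standard CLI. In the talk this ran inside an
# NVIDIA-optimized container; the model ID below is a placeholder.
import subprocess

server = subprocess.Popen([
    "vllm", "serve",
    "Qwen/Qwen2.5-14B-Instruct",  # placeholder model ID
    "--port", "8000",
])
# Once up, the server exposes OpenAI-compatible endpoints such as
# http://localhost:8000/v1/completions, which the scripts below target.
```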
And instead of just showing raw results, I want to show you the how.
I built an automated benchmarking harness. Every model run, from 1.5 billion to 14 billion parameters, follows the same strict protocol: environment isolation via Docker, three mandatory warm-up runs, and background GPU metrics logging at 1-second intervals.
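A minimal sketch of that protocol, under stated assumptions: the function names, endpoint URL, and file paths are illustrative, not the speaker's actual harness. Warm-up hits a local vLLM endpoint; GPU metrics come from polling nvidia-smi once per second on a background thread:

```python
# Illustrative harness sketch: warm-up runs plus 1-second GPU metric logging.
import subprocess
import threading
import time

import requests

VLLM_URL = "http://localhost:8000/v1/completions"  # assumed local endpoint

def warm_up(model_id: str, runs: int = 3) -> None:
    """Three mandatory warm-up runs so kernels and caches are primed."""
    for _ in range(runs):
        requests.post(VLLM_URL, timeout=300, json={
            "model": model_id, "prompt": "Warm-up.", "max_tokens": 16,
        })

def log_gpu_metrics(csv_path: str, stop: threading.Event) -> None:
    """Append GPU utilization and memory samples at 1-second intervals."""
    with open(csv_path, "a") as f:
        while not stop.is_set():
            f.write(subprocess.check_output(
                ["nvidia-smi",
                 "--query-gpu=timestamp,utilization.gpu,memory.used",
                 "--format=csv,noheader"], text=True))
            f.flush()
            time.sleep(1)

# Usage: warm up, start the logger, run the benchmark, then stop the logger.
# stop = threading.Event()
# t = threading.Thread(target=log_gpu_metrics, args=("gpu.csv", stop))
# warm_up("my-model"); t.start(); ...run benchmark...; stop.set(); t.join()
```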
On the left, you're looking at the
orchestrator script.
For every execution, the script
automatically generates a unique
directory using a precise timestamp and
a sanitized model ID. For every model run, it captures the full response from the model endpoint along with the necessary metrics.
On the right, you see the result, a
clean, versioned artifact of the run.
It contains everything to verify the
findings, the metadata, and the text
results from the benchmarks.
On the lower right, you can see an
example command for getting things
started.
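The run-directory logic might look something like the following minimal sketch. The function names and directory layout are assumptions for illustration, not the actual orchestrator:

```python
# Illustrative sketch: one unique directory per execution, named from a
# precise timestamp plus a sanitized model ID, holding the run artifacts.
import json
import re
from datetime import datetime
from pathlib import Path

def make_run_dir(model_id: str, root: str = "results") -> Path:
    """Create e.g. results/20250101-120000-123456_meta-llama_Llama-3.1-8B/."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    safe_id = re.sub(r"[^A-Za-z0-9._-]+", "_", model_id)  # sanitize model ID
    run_dir = Path(root) / f"{stamp}_{safe_id}"
    run_dir.mkdir(parents=True, exist_ok=False)  # fail rather than overwrite
    return run_dir

def save_artifacts(run_dir: Path, metadata: dict, response: dict) -> None:
    """Persist metadata and the full endpoint response for later verification."""
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    (run_dir / "response.json").write_text(json.dumps(response, indent=2))
```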
Now, let's look at the actual
measurement logic.
In an AI application, end-to-end latency
is important,
but time to first
token is the metric that defines the
user's perceived performance. If the
first token arrives instantly, the
application feels responsive. This script shows how we timestamp the very first chunk of a streaming response.
In this script, we're not just calling
an API and waiting for a result. We're
explicitly handling the streaming
response from the vLLM server.
If you look at the highlighted block in
the stream_once
function, you'll see the timestamping
logic.
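Here's a sketch of that timestamping idea in a stream_once-style function: record t0 just before issuing the request, then capture the elapsed time at the very first streamed chunk. The payload and SSE handling assume vLLM's OpenAI-compatible server; the speaker's actual script may differ in detail:

```python
# Illustrative TTFT measurement against a local vLLM streaming endpoint.
import time

import requests

def stream_once(model_id: str, prompt: str,
                url: str = "http://localhost:8000/v1/completions"):
    payload = {"model": model_id, "prompt": prompt,
               "max_tokens": 256, "stream": True}
    t0 = time.perf_counter()
    ttft = None
    chunks = 0
    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip keep-alive blank lines
            if line == b"data: [DONE]":
                break  # end-of-stream sentinel from the server
            if ttft is None:
                # Timestamp the very first chunk of the streaming response.
                ttft = time.perf_counter() - t0
            chunks += 1  # each chunk typically carries one completion token
    total = time.perf_counter() - t0
    return ttft, chunks / total  # TTFT (s), approx. completion tokens/s
```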
To explore this in practice, I ran a series of experiments using vLLM on the DGX Spark.
I tested different models, from a
smaller instruction model to larger
optimized variants, all under the same
setup.
Everything was served locally.
The goal here wasn't to push theoretical
limits, but to understand realistic
behavior in a developer workflow.
Let's dive into the raw performance data
I captured.
This bar chart represents the completion tokens per second across the test set.
At the far left, you see the 1.5
billion instruct model
delivering a massive 61.73
tokens per second.
But the most interesting data point for us is the 14 billion NVFP4 model. Despite being nearly 10 times larger than the 1.5 billion model, it still achieves 20.19 tokens per second.
I would say this is a critical engineering sweet spot. By leveraging NVIDIA's NVFP4 4-bit floating-point quantization, we are able to maintain a sophisticated, highly intelligent model at a throughput that is still faster than the average human reading speed (roughly 4 to 7 tokens per second for typical English text).
What becomes very clear here is how aggressively throughput drops as models scale, and how much quantization helps.
Have a look at the 14 billion base
model.
It drops to just 8.40 tokens per second.
This proves that on Blackwell hardware, the choice of quantization format is just as important as the hardware itself.
It's what allows the DGX Spark to bridge the gap between research work and production prototyping. The Spark allowed me to experiment with different precision formats locally and understand the trade-offs in real time.
While throughput is a measure of raw
power, time to first token is the metric
that defines user experience.
It determines whether an application
feels instant or broken.
In other words, it reflects how quickly the model starts responding.
As we see in the results, the DGX Spark delivers exceptional responsiveness.
The increase for larger models is
expected. More parameters mean more
computation before the first token is
generated.
Out of this chart, the most interesting comparison is between the two 14 billion parameter models: the base model and the NVFP4 model. As we can see, the 14 billion NVFP4 is 3.4 times faster to first token than the unoptimized 14 billion base model.
A key takeaway from this data is that memory capacity is not the same as memory bandwidth. While the DGX Spark's 128 GB of unified memory allows us to fit massive models of up to 200 billion parameters, our throughput is still governed by how efficiently we can move data.
This is why NVFP4 is the hero here. It effectively increases our intelligence per byte, allowing a 14 billion model to feel as responsive as a much smaller one.
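As a rough sanity check of that argument, here's a back-of-envelope sketch. It assumes decode is memory-bandwidth-bound (the full weights are read once per generated token), uses the published DGX Spark figure of roughly 273 GB/s, and approximates NVFP4 at about half a byte per parameter, ignoring block-scale overhead; all numbers are estimates, not measurements from the talk:

```python
# Back-of-envelope ceiling for decode throughput of a memory-bound LLM:
# tokens/s <= memory_bandwidth / model_weight_bytes.
BANDWIDTH_GBPS = 273  # published DGX Spark unified memory bandwidth, GB/s
PARAMS = 14e9         # 14B-parameter model

for name, bytes_per_param in [("FP16", 2.0), ("NVFP4", 0.5)]:
    model_gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: ~{model_gb:.0f} GB weights, "
          f"upper bound ~{BANDWIDTH_GBPS / model_gb:.1f} tokens/s")
# FP16: ~28 GB -> ~9.8 tokens/s; NVFP4: ~7 GB -> ~39 tokens/s.
# The measured 8.40 vs 20.19 tokens/s sit below these ceilings, consistent
# with decode being bandwidth-bound rather than capacity-bound.
```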
After running all of these experiments, two things stood out to me. The first is when to use the DGX Spark: for steady-state workloads, privacy-sensitive data, and rapid prototyping. It allows you to build and fine-tune locally with the exact same software stack used in the cloud.
Visit build.nvidia.com/spark
to access the playbooks and software
stack I used for these benchmarks.
And the second: run locally, iterate quickly, and when ready, scale to the data center or cloud. That's the workflow the DGX Spark enables.