Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

Channel: aiDotEngineer

Published at: 2025-07-26

YouTube video id: Ahtaha9fEM0

Source: https://www.youtube.com/watch?v=Ahtaha9fEM0

Hey everyone, we're going to go ahead and get started here. We've got a nice close group here today, and that's, I think, to everyone's benefit. This workshop is really for you. You know, I love the sound of my own voice, I love talking, that's why I'm a developer advocate. But the purpose of this workshop is to help you get comfortable with SGLang. So if you have questions, if you have ideas, if you have bugs, ask Yineng or me. We're definitely going to be able to tailor this workshop to you, your interests, and what you're working on.

So, the title of this workshop is "An Introduction to LLM Serving with SGLang," and we're going to start with a quick introduction. My co-speaker Yineng Zhang is a core maintainer of SGLang and has been involved with LMSYS Org for quite a while now. He's the inference lead on the project, previously worked at Baidu and some other places, and is also an author of a few papers, including FlashInfer. And I'm Philip, and I got a B+ in linear algebra. So whether you're coming in here super cracked or you're brand new to SGLang, we're going to have something for you. Whatever your skill level, this is the place to be.

So what are we going to do today? We're going to introduce SGLang and get set up a little bit. We're going to talk about the history of SGLang, talk about deploying your first model, and then a bunch of things you can do to optimize performance after that. And then we're also going to talk a little bit about the SGLang community, how you can get involved, and even do a little bit of a tour of the codebase in case you want to start making open source contributions.
So, by way of introduction, what is SGLang? SGLang is an open-source fast serving framework for large language models and large vision models. Generally you'd use SGLang in a sentence alongside vLLM or TensorRT-LLM; it's one of the multiple options for serving models in production.

So the question is: why SGLang? Why should we invest in learning and building with this library? First off, it's very performant. SGLang offers excellent performance on a wide variety of GPUs. It's production ready out of the box. It's got day-zero support for new model releases from labs like Qwen and DeepSeek. And it's got a great community with a strong open source ethos, which means that if something is broken in SGLang, or if you don't like something, you can fix it, which is a pretty huge advantage.

So who uses SGLang? Well, we do at Baseten. We use it as part of our inference stack for a variety of different models that we run. We also see SGLang being used very heavily by xAI for their Grok models, as well as by a wide variety of inference providers, cloud providers, research labs, universities, and even product companies like Cursor.
So, a quick history of SGLang. It's honestly really impressive to me how quickly this project has come up and gotten big. The arXiv paper was released in December 2023; that's 18 months ago. So in just 18 months, this project has gone from a paper to almost 15,000 GitHub stars. You should all go star it so we can get a little closer. And it's supporting all of those logos, all those companies we saw on the last slide. It's got a growing and vibrant community and international adoption. So yeah, incredibly impressive what the team has done in that time. And I'm going to turn it over to Yineng now to talk a little bit more about that history and also how he got involved in the project.

Okay. Hello, I'm Yineng. I'm a core developer of the SGLang project and I'm also a software engineer at Baseten. Before I joined Baseten I worked at Meituan, and at that time I worked on the internal click-through-rate ranking model optimization and inference optimization. Then the creator of SGLang, Lianmin, reached out, and we had a Google Meet. So at that time I left Meituan and joined the project. I've worked closely with Lianmin and Ying on SGLang. Also, SGLang uses FlashInfer heavily, because we use FlashInfer as the attention kernel library and the sampling kernel library, so I also worked with Zihao on the FlashInfer project. Currently I'm a co-maintainer of the project and I'm also a team member at LMSYS Org. And that's a little point of trivia: that's the same LMSYS Org that just got $100 million from a16z to build Chatbot Arena. I learned that while I was putting together the slides for this talk.

So, if you were here early, you were able to scan this QR code and get everything set up for the workshop. If not, definitely grab that right now. You've got the QR code, and you've got the URL that takes you to the same place. Does anyone still need the QR code? Okay, I've got a couple people still.
All right.
Anyone
still need the QR code?
Going once. Going twice. Yep.
For folks watching at home, you've got this great button on YouTube called the fast forward button, so you can just skip this part.

All right, we're looking good. If you need this again, just let me know and I'll throw it back up there. So, we're going to talk about how to deploy your first model on SGLang. So if you go over to the GitHub.
Yes.
So in this step we're just going to get familiar with the basic mechanics of SGLang. SGLang is basically just a server command that you're going to run in your Docker container. There's a little bit of a difference between using it the way we're going to use it in the workshop right now versus how you might use it if you're working directly on a GPU. The difference is that here you're using something called Truss to package it. Basically, you're putting your SGLang dependencies and your command into a YAML file, you're bundling it, and you're shipping it up to a GPU. The reason we're using Truss is because that's how you get onto Baseten, and the reason we're using Baseten is because that's the only company on earth that will give me free GPUs, because I work there.

So we're going to be working through all of these examples on L4 GPUs, because they're cheap and abundant and they also support FP8. But the same thing works on H100 and H200, and Blackwell is coming soon. Yeah, coming soon. It's going to be basically the same principles. If you go through the configuration here, you can actually change the hardware type in your Truss config to H100 if you want, in the accelerator line right there.
But yeah, so what is the actual SGLang launch server command that we're running here? It's basically just a bunch of flags. That's the thing to understand about using SGLang: it's all about knowing what flags are available, knowing what configuration options are available, knowing the support matrix that exists for them, and knowing how they interact with each other. If you, say, turn on a major speculation algorithm and then also jack your batch size way up, that's probably not going to go so well for you. But if you want to do, say, quantization along with some of these other optimizations, those play nice.
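As a rough illustration (a minimal sketch, not the exact command from the workshop repo), launching an SGLang server is just one command plus flags; here it's wrapped in Python so the pieces are easy to see. The model path and port are placeholders.

```python
import subprocess

# Minimal sketch of launching an SGLang server. Flag names follow the
# SGLang docs; the model path and port here are just placeholders.
cmd = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",  # target model
    "--host", "0.0.0.0",
    "--port", "8000",
]
subprocess.run(cmd, check=True)  # blocks while the server runs
```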
So what we're going to do (this is the fun part of leading a workshop, the part where we just stand around watching you type) is give everyone about five minutes to work through this first example. We're going to circulate the room if you have any questions, and then we're going to come back together after running the first example. Sound good? All right, let's do it. Can you cut the mics for five minutes?
Pause. Skip. It's great, these buttons, they're magical.

Is anyone having issues where you're stuck trying to get into Baseten, like you're in a waiting room and it won't let you out? If you are, flag me. And if anyone is having issues where you're getting an error in your code, please don't show me, show him.
And a check on progress. Has anyone
managed to get the first model deployed
and running?
It's deploying. Awesome. Let's hope it's deploying really fast. Let me take a look here. All right, sounds good. Can you take a look at the logs for me real quick? Wow, our Wi-Fi is just amazing here. I promise Baseten is usually faster than this.
Oh, okay. Well, it looks like it came up. So you can use the sample code in call.py or call.ipynb, or you can just use an ordinary OpenAI client. What you need to call it, if you go back to your Baseten workspace with the model, is the model ID; scroll back up a little bit for me. That model ID is what's going to unlock your calling code. Love it. Yeah, paste it in right there. You'll need to run an actual Jupyter notebook to run that.
All right, we've had our first successful deploy. If you want to call it using the OpenAI SDK, use the call.ipynb notebook. This thing up here is going to be different for everyone: within the UI it's your model ID, which you use to set up the URL.
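For reference, calling the deployed model with the standard OpenAI Python client looks roughly like this; the base URL and model ID below are placeholders you'd copy from your Baseten workspace (the exact values are in the workshop's call.py / call.ipynb).

```python
from openai import OpenAI

# Sketch of calling the deployed model via an OpenAI-compatible API.
# BASE_URL and the model ID are placeholders: copy the real values from
# your Baseten workspace / the workshop notebook.
client = OpenAI(
    api_key="YOUR_BASETEN_API_KEY",
    base_url="https://<your-model-endpoint>/v1",
)

response = client.chat.completions.create(
    model="<model-id>",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```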
Hey everyone, we're going to come back together here. It's about 9:45, so we're going to move on to the next stage of the workshop, where Yineng is going to do some really awesome demos. If you're still getting everything set up, no worries; all of this is going to stay live on GitHub. The repository with the workshop information will stay up so you can keep following along, and this is all going to be published, so it's going to be easy to go back if you have any issues.

Anyway, the next thing we're going to look at, now that we have a basic idea that SGLang is just running a model server, is how we actually make it fast. So Yineng is going to show one demo, which is the CUDA graph max batch size flag (--cuda-graph-max-bs) and how to set it to improve performance. And then we're also going to take a look at EAGLE-3, which is a new speculative decoding algorithm that can also improve performance. So take it away, Yineng.
Yeah. Can you see my screen? Yes. Good. Let me zoom in a little bit.

We're using RunPod here because on Baseten you don't get SSH access into your GPUs, because of security or something, I guess. I don't know. Okay. So here I'll use the L4 GPU. Yeah, this is the L4 GPU, and I've already installed SGLang; you can just use pip install or install from source. And here is the command line for launching the server: we launch the server with the Llama 3 8B Instruct model, and the attention backend is FA3, which is the default. Okay, it started loading the weights.
So, just to give everyone a little bit of context: the top window you're seeing here is the L4 that's actually running the SGLang server. The bottom window is lm-eval, a sort of industry-standard benchmarking tool that we're just going to use to throw a bunch of traffic at the running server. Yeah, for sure. And we can see the log from the server. It shows the CUDA graph batch sizes we capture. I think CUDA graph is turned on by default, but the CUDA graph max batch size on L4 for this model is eight, so it only captures batch sizes 1, 2, 4, and 8. Okay, the server is ready to roll, and we can use lm-eval to send requests.
Yeah, we can see that from the log. Here is the prefill batch and here is the decode batch. And we can see that in the decode batch, when the number of running requests is 10, it means there are 10 running requests and the CUDA graph flag is false, because the running request count of 10 is larger than the max CUDA graph size of eight. That's why this flag is false. And when this is false, we get about 155 generation tokens per second. We can divide that by 10, so I think per user it's nearly 15 tokens per second.

Okay, we can kill the client and we can also kill the server.
So yeah, we can use this command as a base and set the CUDA graph max batch size; for example, we can just set 32.

You've got a typo in... Oh, sorry. The network is not good. Everyone here is learning a very important lesson in the value of latency. Okay. Yeah, it's loading. Wait.
Yeah, and we can see that after we set the max CUDA graph batch size, among the captured CUDA graphs the max is now 32, which is larger than eight, and the server is ready to roll. We use lm-eval to send requests again.

Okay. So first is the prefill batch, and then here is the decode batch. We can wait for a moment. For example, here in the decode batch there are 13 running requests and the CUDA graph flag is true, and here is the generation throughput. I think per user it should be about 12, and we can compare with before. It's not easy to compare directly, yeah. We have a recording of this, and you can also see the CUDA graph behavior there; we'll upload this cuda-graph-max-bs demo.

We want the CUDA graph flag to be true during decode, because I think this is very important for decoding performance. But the default max size is eight on L4, and when we used lm-eval to send requests, we found that the actual batch size was larger than eight. That's why we want to adjust the parameter: when we set it to 32, we can handle the realistic batch sizes during the benchmark.
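As a hedged sketch, the only change from the baseline launch command is the --cuda-graph-max-bs flag; the value here is just the one used in the demo, and the right value depends on your realistic batch sizes and GPU memory.

```python
import subprocess

# Same launch as before, but raising the CUDA graph capture limit so
# decode batches up to 32 still run under CUDA graphs (demo value).
cmd = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--port", "8000",
    "--cuda-graph-max-bs", "32",  # default on L4 / TP1 was 8 in the demo
]
subprocess.run(cmd, check=True)
```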
Do you have any questions?
What are the commands to...

Oh, okay, the lm-eval command. Yeah. I think lm-eval is the evaluation tool, and we need to specify the model, and here is the model name. Here is the URL: because I'm just using RunPod to run this and it's on the same node, the URL is localhost, and we specify the port, 8000, that's why we use 8000, and we use the OpenAI-compatible server. And here the number of concurrent requests, the batch size, is 128. We set the max generation tokens, and we just use GSM8K; I think it's a classic evaluation dataset. And because we use the chat completions API interface, we need to apply the chat template, and I just use 8 few-shot examples. The limit is there because, you know, GSM8K has about 1,319 prompts, and when we use a limit of 0.15, I think it's nearly 200 prompts. I can also share this command line in the repo. Yeah, maybe I can add it.
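Here's an approximate reconstruction of that benchmark call: lm-evaluation-harness pointed at the local OpenAI-compatible chat endpoint, GSM8K, 8-shot, 128 concurrent requests, roughly 15% of the dataset. Treat the exact flags as a sketch and check the command he shares in the repo.

```python
import subprocess

# Rough reconstruction of the lm-eval invocation described above.
# Model name, URL, and concurrency are assumptions; verify against the
# command shared in the workshop repo.
cmd = [
    "lm_eval",
    "--model", "local-chat-completions",
    "--model_args",
    "model=meta-llama/Meta-Llama-3-8B-Instruct,"
    "base_url=http://127.0.0.1:8000/v1/chat/completions,"
    "num_concurrent=128",
    "--tasks", "gsm8k",
    "--num_fewshot", "8",
    "--apply_chat_template",
    "--limit", "0.15",  # ~200 of GSM8K's ~1,319 prompts
]
subprocess.run(cmd, check=True)
```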
Oh, sorry.
Yeah. So, just to be clear, this command is running on the actual GPU itself. This is for when you have SSH access into the GPU. On the service we're all using, the Baseten GPUs, you can't SSH in. But if you do have access to a GPU where you can get SSH access, then you would use this lm-eval tool to simulate that traffic. If you're using a more standard HTTP connection to a remote GPU, then you would use a different benchmarking tool that's request-based.
Yeah.
Okay. And do you have any other questions about CUDA graph?

Why is the default eight?

Yeah, I think the default of eight is because of the L4 GPU memory; we have some default configuration. When you don't set the CUDA graph max batch size, the default value is None, and when the default value is None, we set it internally for the specific hardware and the specific model. For example, this is TP1 and it's on L4, so the default is just eight.

So what if someone by mistake sets a higher one?

Yeah, you can just try that, because when you launch the server you can see the startup parameters. And then, well, you have a workload, right? You use lm-eval to benchmark, for example, and you can analyze the server log, and you find that during decoding the CUDA graph is disabled when we actually want to enable it. That's why we increase the max CUDA graph batch size.
Okay, awesome. So let's see, do you want to show the EAGLE stuff or the codebase stuff? Yeah.

Okay, I think the next very important thing is the EAGLE stuff. Yeah. So EAGLE-3 is a speculative decoding framework. It came out very recently, right? The paper was released a few months ago. And SGLang supports EAGLE-3. With it you can configure a wide variety of parameters around how many tokens you're speculating, how deep you're speculating, that kind of stuff. And EAGLE-3 can have a much higher token acceptance rate. Obviously, when you're speculating, the higher your token acceptance rate, the better performance you're going to get. So we can take a quick look at some of those parameters that you showed, and then maybe the benchmark script you were showing me the other day.
Yeah. I think for EAGLE-3, yeah, we also provide an example; we can just change directory to this directory and then use truss push. It's very easy. I just want to explain some details. For example, we need to specify the speculative decoding algorithm, here EAGLE, like this one. Yeah, we need to specify the speculative decoding algorithm EAGLE, and we also need to specify the draft model path, because this one, the model path, is the target model, and here is the draft model. Sorry, here is the draft model for EAGLE-3.
Yeah, Llama 3 8B. So, one thing that's different about EAGLE, all the different EAGLE algorithms, is that instead of a standard draft-target setup where you might use, say, Llama 1B and Llama 8B together, EAGLE works by pulling in multiple layers of the target model and using them to build a draft model. So the draft model is essentially derived directly from the target model, versus being just a smaller model that you're also running. Yeah.
And you also need to specify these parameters: the number of steps, the EAGLE top-k, and the number of draft tokens to verify. For example, if the drafting depth is three and the top-k is one, I think the maximum number of draft tokens should not be more than four; that's why we set four here. And you can see more details about this configuration in the official SGLang documentation.
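Put together, an EAGLE-3 launch command looks roughly like the sketch below. The draft model path is a placeholder (use EAGLE-3 draft weights that match your target model), and the numbers are the example values just discussed; treat it as a sketch and confirm against the SGLang docs.

```python
import subprocess

# Sketch of an EAGLE-3 launch. The draft model path is a placeholder for
# EAGLE-3 draft weights matching the target model; the speculative-*
# values are the example settings from above (steps=3, topk=1, tokens=4).
cmd = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",   # target model
    "--speculative-algorithm", "EAGLE3",
    "--speculative-draft-model-path", "<eagle3-draft-for-llama-3-8b>",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
    "--port", "8000",
]
subprocess.run(cmd, check=True)
```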
And I'll also show you something about how to tune these parameters. You know, we have these parameters; I think the model path is fixed, but how about the number of steps and the number of draft tokens? We can tune those, and I'll show you how. So in the SGLang repo we have a script, in the playground, a bench speculative decoding script.
Okay. So we can just use this script to tune these three parameters. For example, on a single GPU, this is the target model, Llama 2 7B, and this is the draft model, and here are some default parameters. The batch size ranges over 1, 2, 4, 8, and 16, the steps are listed here, the top-k is here, and this is the number of draft tokens. What does that mean? I think it's very easy to understand: we have different combinations of these parameters, and this script will run all of the combinations, and you'll get a result. From the result you'll learn that, for example, this combination is best: maybe at batch size eight, with three steps, a top-k of one, and four draft tokens. You'll get results for the speed and the accept rate, and then you can use those parameters for your online serving, for your production serving. Yeah.
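The idea of that script, reduced to a sketch: sweep the combinations, benchmark each one, and keep the setting with the best speed and accept rate for your workload. This is an illustrative loop with made-up parameter grids and a stubbed benchmark function, not the actual SGLang script.

```python
from itertools import product

# Illustrative sketch of the tuning idea (not the actual SGLang script):
# sweep speculative-decoding settings, benchmark each combination on your
# own prompts, and keep the best one for production serving.
batch_sizes = [1, 2, 4, 8, 16]
num_steps = [1, 3, 5, 7]          # example values
topks = [1, 2, 4]                 # example values
num_draft_tokens = [2, 4, 8]      # example values

def run_benchmark(bs, steps, topk, tokens):
    # Placeholder: in practice, launch the server with these settings,
    # send your representative prompts, and measure speed + accept rate.
    return (0.0, 0.0)  # (tokens_per_second, accept_rate)

results = {}
for bs, steps, topk, tokens in product(batch_sizes, num_steps, topks, num_draft_tokens):
    results[(bs, steps, topk, tokens)] = run_benchmark(bs, steps, topk, tokens)

best = max(results, key=lambda k: results[k][0])  # pick the fastest config
print("best config:", best, "->", results[best])
```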
Yeah. And when you're running this benchmark, do be sure to set the prompts to things that are representative of your actual workload, because speculation, in any form including EAGLE, is all about guessing future tokens. If you're benchmarking on data that isn't representative of the actual inputs and outputs you're seeing live in production, then you're probably going to end up with the wrong parameters. Speculation is a very topic- and content-dependent optimization.

Yeah, I think so. You can also update these prompts here in the bench speculative decoding Python script; we have some prompts and you can update them according to your needs. Yep.

Okay, so let's take a look at some of the stuff around the community and getting involved. Yeah. Also, I think SGLang has become very popular, and if you want to participate in this community and contribute some code... I'll show the slides real quick. Okay.
Yeah. So, you know, SGLang does have a really great community, and here are some quick ways to get involved. You can star it on GitHub, and file issues and bug reports as you build. They have a great tagging system for issues to get involved with, which Yineng is going to show in a second. But the number one thing you can do is follow the LMSYS Org account on Twitter, and then join the Slack to keep an eye out for online and in-person meetups. So this is a link to the community Slack; you can scan it real quick if you want to get involved with SGLang. These slides are also all in the repo that you got from the workshop, so you can access this link and everything later. It's also just slack.sglang.ai, a pretty simple link.

So if you are going to get involved and you do want to start contributing to the codebase, we can show you some of it. At a high level, the codebase has the SGLang runtime, a domain-specific front-end language, and a set of optimized kernels. You can actually go to this DeepWiki page and get a really good code tour of the codebase, as well as a tour from this other repository that we have linked, which is also by one of the SGLang people, with some diagrams about exactly how this stuff works. And then Yineng is just going to show a quick overview of the codebase on GitHub, in case you're interested in getting involved and contributing.
Yeah, I think the best way to get involved in this project is first to use it; then you'll find some issue, or you'll find some feature missing in the repo, and the first thing you can do is raise a new issue here. It's loading. Yeah, you can just create a new issue, a feature request, something like this. And also, we have labels like "good first issue" and "help wanted." You can see there are nearly 26 of them. So if you're interested in one of these issues, for example if you're interested in supporting or serving a new VLM model, you can just start with the good first issue and help wanted issues. Yeah, we welcome contributions. And here is the development roadmap. So if some feature is missing, or there's some feature you care about, you can find it in the roadmap, and you can join us for that feature's development, or you can also raise a new issue about it. And the last one is about the overall walkthrough. Okay, so yeah,
in the SGLang repo we have several components. This one is sgl-kernel; it's the SGLang kernel library. We implement attention, normalization, activation, GEMM, all of them in this kernel library. If you're familiar with CUDA kernels and you're interested in kernel programming, you can contribute to this part. And here is sgl-router. Last year we published it and we supported cache-aware routing; if you're interested in that part, you can work on sgl-router.

Currently we use SGLang as an LLM inference runtime, so I think the Python part, SRT, is the core part. We support PD (prefill-decode) disaggregation, we support constrained decoding, we support function calling, we support an OpenAI-compatible server, and we also support a lot of models. If you want to support a custom model, you can just take these as a reference; for example, you can take Llama as a reference. I think for popular open source models the architectures are very, very similar, so if the model you're interested in hasn't been implemented in SGLang yet, you can check that reference, do some modifications, and then we welcome contributions. Yeah, that's all.
Awesome. So, if we get the slides back up here... Yeah, so to wrap it up: first off, thank you so much for coming out, thank you for bearing with us, and thank you for waiting for web pages to load on this wonderful internet connection we all have. To wrap things up, I do want to issue a couple of invitations to everyone in this room today.

Number one, we're having a really cool happy hour with the folks from Oxen AI. Oxen AI is a fine-tuning company. Their CEO published a really cool demo a couple of weeks ago where he took GPT-4.1, had it do a SQL generation benchmark, took the score, and said, "Okay, I think I can do better than this." He took Qwen 0.6B (0.6B, yes, you heard me right, less than a billion parameters), fine-tuned it on some SQL generation data, and actually beat GPT-4.1 with a model that you can run on a three-year-old iPhone. So yeah, we're going to be at this happy hour talking about fine-tuning and stuff. It's going to be a great time.

The second invitation I want to extend to everyone in this room: if you think this stuff is cool, if you were seeing all the stuff that Yineng was talking about around contributing to the codebase and you're like, "Yeah, I love CUDA programming," just come work at Baseten. If you're bored in your job, you won't be bored here. We've got a lot of open roles for both infrastructure and model performance. If you're at all interested, just come talk to me. I'm going to be here all three days.
So yeah, that's pretty much our workshop for today. Thank you so much for coming through, and I'm happy to take any questions in the remaining time we have. Yes.

What are the main reasons you use SGLang?

Yeah, that's a great question. You know, at Baseten we use all sorts of different runtimes, model to model; sometimes you just want to use whichever one is best for your use case. But in general, I think the reason we've been really attracted to SGLang is how configurable and extensible it is. Out of the box, with basic parameters, you're going to get more or less the same performance from any of them. But if you have a really deeply and well-documented codebase like SGLang, where you're able to really understand all the different options you have, that can get you a long way. And then, as we were just talking about, it's super easy to contribute, so we're constantly making fixes and contributing them back. That means that if you're using a different library you might be blocked waiting for the core developers to implement support for a model or something, whereas with SGLang you can unblock yourself.
Yes.
When there are multiple vendors and different kinds of applications around the endpoint, or within the subnet you're defining, how would you define your cybersecurity or security protocols? How would you enhance your protocols?

Yeah, I mean, that's a great question. I don't really think your choice of runtime engine affects that too much, because you're just packaging it up in a container. Within Baseten we've thought a lot about this in a runtime-agnostic way, where we're thinking about, of course, least privilege, and about making sure there's a good deal of isolation built into the system. But from a runtime perspective, I don't think there's anything special we have to do for security with SGLang compared to, say, vLLM or anything else.
Thank you. So, I'm from a department of defense... Awesome. ...with extensive experience in financial applications. To do some product development in house: do you think I can do the entire product development in house, within a subnet, without having to go back and forth to OpenAI? For example, just throwing an example: for one of those CMMC cybersecurity certifications, I have to go through the endpoint controls, define the endpoint controls, and then connect to ChatGPT.

Gotcha. Yeah, so in that case this would actually help you out a lot. Instead of relying on that remote server, you can just spin up a cluster within the same VPC, or within the same physical data center, as the workload that's relying on the AI model. You can clone SGLang, take a release, fully inspect the code because it's open source, and then pin to that release so there's nothing changing under the hood. And with that you'd be able to run the models directly on the GPU, as you saw in Yineng's demo when he was doing the CUDA graph stuff; you can call it even on a localhost basis and run inference. So yeah, it gives you all the tools you need if you're trying to build even a sort of air-gapped type of system with all of these open source runtimes. You can pull that code in, inspect it, lock it, and then build off of it.

Very impressive. And also, I'm currently working on... I'm also a PhD student, yeah, so I'm working on blockchain-based quantum computing and some kind of AI deliverables. So how do you handle that within your product? Blockchain is a completely different community-based code development, so can we integrate different community-based protocols, or a combination, a hybrid community-based protocol? Because blockchain is a decentralized network, whereas this one is kind of...

Yeah, to be perfectly honest, I haven't really experimented with anything like that; pretty much all of the use cases I've run with SGLang are traditional client-server applications.
Any other questions?
Yeah.
You shared something...
Yeah, great. So at Baseten, what we do is what we call the Baseten inference stack, where we take all of these different providers, vLLM, SGLang, and TensorRT-LLM, which we actually probably use the most heavily of the three, and take them in and customize them, doing all that stuff I'm supposed to say for marketing purposes. But we are customizing them quite a bit. Anyway, where we generally pick vLLM (sorry, I'm talking about them during your SGLang talk) is oftentimes for compatibility. For example, I know our Gemma models that we have up in the library are using vLLM, because that's what was supported when the model dropped. So yeah, in my mind the best use case for vLLM is super broad compatibility.
Any other questions?
Awesome. Well, like I said, we're going to be around all day, and I'm going to be at the Baseten booth for the next three days. So if you have any questions about SGLang, model serving, or model inference in general, or if you want one of them jobs I was talking about (we are hiring very aggressively), definitely stop by the booth, hang out, grab one of these shirts. And yeah, thank you so much for coming.