Z.ai GLM 4.6: What We Learned From 100 Million Open Source Downloads — Yuxuan Zhang, Z.ai
Channel: aiDotEngineer
Published at: 2025-11-22
YouTube video id: m6MF1OR_9kM
Source: https://www.youtube.com/watch?v=m6MF1OR_9kM
Hello everyone. I'm Yuxuan from Z.ai, and I'm very happy to be here to talk about our latest model series, the GLM-4.6 series. Let's jump right in.

First, I will introduce the GLM series of models. GLM-4.6 is far from our first open-source model: since 2022, starting from the very first GLM-130B, we have been quite serious about open-sourcing our work. Over the years we have released a whole family of models, such as ChatGLM-6B for language, CogVLM for vision understanding, CogView for image generation, CogVideo for video generation, and many more across different domains. On this slide you can see a map of our open-source models so far, color-coded by domain: white for the language models (the GLM series), pink for multimodal understanding (CogVLM, now called GLM-V), green for image generation, and yellow for video generation.

2025 is our year of open source, and in this year we added even more models, including the GLM-4-0414 dense models (9B and 32B) and the GLM-4.5/GLM-4.6 MoE series, which is actually our first MoE model family. Up to now we have released over 65 models in total, and across platforms like Hugging Face, ModelScope, and others we have already passed 100 million downloads. If you search for GLM or CogVideo on GitHub, you will find over 1,500 community projects built on top of them; it is very much a community-driven ecosystem.

Now let's move on to GLM-4.6. GLM-4.6 is our latest flagship model. On many public benchmarks, especially in math and coding, GLM-4.6 shows a clear gain over GLM-4.5. It also outperforms open-source models released in the same period, like DeepSeek-V3.2, and even beats commercial models such as Claude Sonnet 4 on several benchmarks. Of course, if we compare with Claude Sonnet 4.5, there is still a noticeable gap. So we are not winning at everything, but we are getting closer and closer.
But what makes us especially happy is the arena. This benchmark is closer to real user preference, and on LMArena GLM-4.6 is tied for number one together with GPT-5 and Claude Sonnet 4.5, and it is the only open-source model there. I really appreciate this, and I want to thank the developers who tried our model and voted for it.

Now let's move to CC-Bench. Besides the public benchmarks, we also built our own dataset, called CC-Bench. Here we want to test agent-style coding in the real world, not just isolated problems. So we built an agentic coding test platform based on Claude Code, and on top of that we created CC-Bench v1.1. Compared with version 1.0, the new version adds 22 hard coding tasks, and we systematically evaluated Claude Sonnet 4, GLM-4.5, Kimi K2, and DeepSeek-V3.1-Terminus. In total, CC-Bench has 74 tasks covering front-end development, internal tool development, data analysis, and algorithm implementation. For every model we record the full agent trajectory: the query, the planning steps, the tool calls, the code edits, and the execution results. We fully open-sourced this benchmark, so you can find the link on Hugging Face. GLM-4.6 made a clear jump over GLM-4.5 and performs roughly on par with Claude Sonnet 4, with about a 48.6% win rate, while being significantly better than the other open-source baselines.

So where does the performance come from? Let's talk about GLM-4.6 training, starting from the data and training design. The first part is general pre-training. We start with about 15 trillion tokens of general-purpose data, including web pages, books, Wikipedia, and multilingual content. This stage is about building a strong all-rounder base model; the context length here is 4K tokens. The next step is called reasoning continued training.
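The trajectory logging described above can be sketched as a simple record type. This is a hypothetical illustration, not the actual CC-Bench schema; all field and tool names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    """One step of an agent trajectory: what the model planned,
    which tool it called, and what came back."""
    plan: str         # the model's stated plan for this step
    tool: str         # e.g. "read_file", "edit_file", "run_tests"
    tool_args: dict   # arguments passed to the tool
    observation: str  # tool output fed back to the model

@dataclass
class Trajectory:
    """Full record for one CC-Bench-style task run."""
    task_id: str
    query: str                         # the original task query
    steps: list = field(default_factory=list)
    final_diff: str = ""               # accumulated code edits
    passed: bool = False               # did the task's checks succeed?

# Example: log a two-step run of a hypothetical front-end task.
traj = Trajectory(task_id="fe-001", query="Add a dark mode toggle")
traj.steps.append(AgentStep("Locate the theme config", "read_file",
                            {"path": "src/theme.ts"}, "export const theme = ..."))
traj.steps.append(AgentStep("Patch the toggle component", "edit_file",
                            {"path": "src/Toggle.tsx"}, "ok"))
traj.passed = True
```

Recording the full trajectory, not just the final answer, is what lets the benchmark compare planning behavior across models, not only pass rates.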
On top of that base, we continue training with about 7 trillion tokens of extra code and reasoning data. Part of it comes from high-quality open-source repositories, and another part is math, science, and competitive programming with full step-by-step reasoning.

Then we come to mid-training. Here we move to repo-level code: multiple files, issues, pull requests, and diffs from the same project, all packed into one long context. The goal is to teach the model to follow cross-file references, understand change histories, and read a real project structure end to end. At this stage we extend the context to 32K tokens, so the model can basically see the key files of a medium-sized repository in one shot.

Next is synthetic reasoning data. We added about 500 billion tokens of synthetic reasoning data covering math, science, and algorithms, with explicit thinking traces. This lays the groundwork for future agent behavior like breaking a task down, reflecting on mistakes, and doing long-chain reasoning.

The final step is long-context and agent data: about 100 billion tokens. Here the context length is pushed further, to 128K for GLM-4.5 and 200K for GLM-4.6, so the model can handle long documents, a whole codebase, and very long chats. At the same time we feed in lots of agent trajectories, including multi-step tool calls, search, code execution, and so on. This stage improves both the model's long-context capability and its agentic capability.

On these slides we also introduce slime, our reinforcement-learning framework built on the SGLang inference stack. It is the training framework we designed in-house, and we also open-sourced it. We found that different tasks need very different system designs.
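The mid-training step above, packing files, issues, and diffs from one repository into a single long training document, can be sketched roughly like this. The separator strings, the ordering, and the character-based token budget are all made up for illustration; they are not GLM's actual data format.

```python
def pack_repo_sample(files, issues, diffs, max_chars=32_000 * 4):
    """Pack one repository's artifacts into a single long training
    document. `files` is {path: source}; `issues` and `diffs` are
    lists of text. Assumes ~4 chars per token as a crude budget
    for a 32K-token context."""
    parts = []
    for path, src in files.items():
        parts.append(f"<|file|> {path}\n{src}")
    for issue in issues:
        parts.append(f"<|issue|>\n{issue}")
    for diff in diffs:
        parts.append(f"<|diff|>\n{diff}")
    doc = "\n".join(parts)
    return doc[:max_chars]  # crude truncation to the context budget

sample = pack_repo_sample(
    files={"app/main.py": "print('hi')", "app/util.py": "def f(): ..."},
    issues=["Bug: crash on empty input"],
    diffs=["- print('hi')\n+ print('hello')"],
)
```

The point of packing is that the model sees code, the discussion about it, and the resulting change in one sequence, so it can learn the relationship between them rather than seeing each artifact in isolation.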
For short reasoning tasks like math or code completion, the best setup is colocated and synchronous: we train and run inference on the same GPUs, so after one batch updates the weights, the next batch immediately samples from the latest policy. This squeezes the most out of GPU memory and compute. But agentic tasks, for example real software engineering, usually involve many steps: opening a browser, hitting a backend API, waiting for external responses, and so on. If we force every worker to stay in the same synchronous step, fast workers get dragged down by the slowest services and GPUs sit idle. So in slime we designed a hybrid architecture to support both modes.

If you look at the diagram, the blue part is the Megatron-based training engine, the green part is the high-throughput SGLang inference cluster with a router that dispatches requests, and in the middle the data buffer acts like a shared nervous system: one side connects to training, the other side to the various agent environments. For regular reinforcement-learning tasks, we keep training and inference on the same GPU pool in synchronous mode, with dynamic sampling and instant weight updates for maximum throughput. Once we switch to complex agent tasks, we move to a decoupled, asynchronous mode: the rollout side talks directly to real environments, keeps generating trajectories, and writes them into the buffer, while the training side consumes the data at its own pace, updates the model, and periodically pushes new weights. The nice thing is that even if some tasks are super slow, they don't block the whole training pipeline.

On top of that we have done a bunch of efficiency optimizations. For example, the main training loop still runs in BF16 for stability, but after each policy update we do blockwise FP8 quantization on the latest weights and send the FP8 version to the rollout workers. So the most expensive part, data generation, runs in FP8 with much higher throughput, while training keeps BF16 precision.
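The blockwise quantization trick can be illustrated with a toy quantizer. This is a numpy sketch, not slime's implementation: per block of weights we store one scale plus low-precision values, and the rollout side dequantizes on load. Int8 stands in for FP8 (E4M3) here purely to keep the sketch dependency-free; the structure (one absmax scale per block) is the same idea.

```python
import numpy as np

def quantize_blockwise(w, block=128):
    """Blockwise absmax quantization of a 1-D weight vector.
    One scale is stored per block of `block` values."""
    pad = (-len(w)) % block
    wp = np.pad(w, (0, pad)).reshape(-1, block)
    scales = np.abs(wp).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero on all-zero blocks
    q = np.clip(np.round(wp / scales), -127, 127).astype(np.int8)
    return q, scales, len(w)

def dequantize_blockwise(q, scales, n):
    """Rollout-side reconstruction: scale back up and drop padding."""
    return (q.astype(np.float32) * scales).reshape(-1)[:n]

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # stand-in for BF16 weights
q, s, n = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s, n)
err = float(np.abs(w - w_hat).max())           # small per-weight error
```

Because each block gets its own scale, one outlier weight only degrades the precision of its own block, which is what makes low-bit rollout weights tolerable while the optimizer state stays in BF16.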
So in practice we get the benefit of both accuracy and speed in this framework. Now let's zoom in on the reasoning RL, with some plots on this slide.

The first one is about the two-stage curriculum we use. We don't train on one fixed dataset from start to finish; instead, we use a two-stage difficulty curriculum. In stage one we use medium-difficulty problems: in each batch some rollouts succeed and some fail, so the rewards have variance and the gradients are meaningful. Once the model gets stronger, we switch to extremely hard problems in stage two, where even with 512 samples per problem you only occasionally get a correct solution. You can see on the plot that the blue curve, our method, keeps going up after switching to the hard problems, while staying on medium difficulty the whole way, the red curve, does not.

The next plot is about single-stage reinforcement learning at 64K tokens. Some previous works use multi-stage RL on output length, for example 16K, then 32K, then 48K, and finally 64K. But we found that for a model that has already been SFT-trained at 64K tokens, those shorter intermediate stages actually make it forget its long-output ability: the average output length collapses, and the final 64K stage can't fully recover the loss. The red curve here is our approach: we start directly at 64K tokens and train in one single stage, which clearly outperforms the blue multi-stage curve.

The plot below is about code. On the bottom left we compare two ways of computing the loss for code RL. The blue one is the classic sequence-mean loss, where each sequence gets one loss value, and the red one is our token-weighted mean loss, which averages over tokens instead of sequences. The token-weighted version converges faster and more steadily, and it reduces the chance of generating very short template answers just to grab the reward. On the right you can see the data.
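The difference between the two loss reductions can be made concrete in a few lines. This is a generic numpy sketch of the two reductions, not GLM's training code: with the sequence mean, a short sequence contributes as much as a long one, which rewards degenerate short answers; with the token mean, every token counts equally.

```python
import numpy as np

def sequence_mean_loss(token_losses):
    """Average within each sequence first, then across sequences:
    every sequence contributes equally regardless of length."""
    return float(np.mean([np.mean(t) for t in token_losses]))

def token_mean_loss(token_losses):
    """Pool all tokens and average once:
    every token contributes equally, so long sequences weigh more."""
    return float(np.mean(np.concatenate(token_losses)))

# A long, careful answer (200 tokens, loss 1.0 each) next to a short
# templated answer (2 tokens, loss 0.1 each).
losses = [np.full(200, 1.0), np.full(2, 0.1)]

seq = sequence_mean_loss(losses)  # (1.0 + 0.1) / 2 = 0.55
tok = token_mean_loss(losses)     # (200*1.0 + 2*0.1) / 202 ~ 0.99
```

Under the sequence mean, emitting one short low-loss template halves the batch loss, so the policy is pulled toward short answers; under the token mean, those two cheap tokens barely move the average, which matches the stability the talk reports.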
We also did science reinforcement learning on GPQA-Diamond, and the message is almost the opposite of "more data is better." The red curve is trained on only a small set of expert-verified, high-quality multiple-choice questions, while the blue curve uses mixed-quality data. The result is that the small but clean dataset gives much better performance. So for scientific reasoning, data quality really matters more than raw size.

After talking about the GLM-4.6 language model, let's move to multimodal. GLM-4.5V supports both image and video understanding. It is our latest visual understanding model, and on grounding and image-understanding benchmarks it shows strong performance and clear advantages over other open-source models released around the same time. Architecturally there are three main parts: a vision transformer encoder, an MLP projector, and finally the GLM-4.5 base model as the language decoder. We try hard to keep the visual input as original as possible, so the model sees each image at its native resolution and aspect ratio instead of forcing everything into a fixed square. This matters a lot for screenshots, long vertical images, and PowerPoint slides. For video, we also insert a time-index token after each frame, basically telling the model which frame this is and at which second it appears. That helps it understand temporal order and references, which is crucial for action understanding and step-by-step procedures. We also build on our earlier research, CogAgent: GUI-agent capability is supported in GLM-4.5V as well, so it can help you control a computer or navigate a website, using mouse, keyboard, or touch actions to interact with a browser, desktop, or mobile environment.

So how do you use the GLM-4.6 or GLM-4.5V models? The first way is using the open-source weights.
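The time-index idea for video can be sketched as simple sequence construction. The token names, frame rate, and patch counts below are made up for illustration; the real GLM-4.5V tokenizer and sampling scheme differ.

```python
def build_video_sequence(num_frames, fps=2.0, patches_per_frame=4):
    """Interleave placeholder frame-patch tokens with a time-index
    token after each frame, so the model can tie visual content to
    frame order and wall-clock time. Token names are illustrative."""
    seq = []
    for i in range(num_frames):
        seq.extend(f"<patch_{i}_{j}>" for j in range(patches_per_frame))
        seq.append(f"<time:{i / fps:.1f}s>")  # e.g. <time:0.5s>
    return seq

seq = build_video_sequence(num_frames=3, fps=2.0, patches_per_frame=2)
# each frame's patches are followed by its timestamp token
```

Without explicit timestamps, a model only sees an unordered bag of frame features once positions are long and sparse; the interleaved time tokens give it an anchor for questions like "what happened before X" or for following a multi-step procedure.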
As you know, both of these models are open source, so you can use SGLang, vLLM, or another framework to run inference. Along with the weights, on release day we already had SGLang and vLLM integrations ready, and we also work with many third-party open-source frameworks like LLaMA-Factory and ms-swift. Thanks to the community, you can choose whichever framework you want to try our models. But GLM-4.6 is a large model, with more than 355 billion parameters, so if you don't have H100s or similar GPUs, there is an easier way to use our models. On this slide we show the deployment commands for SGLang and vLLM.

On the next slide: you can use GLM on Z.ai. This is a website where you can try the model directly; you can use it for writing code, generating PowerPoints, and so on. In this demo, we use a single command to build a Google-search-style page, and you can just keep chatting with it. GLM is also famous for its coding capability, so we also provide the GLM Coding Plan, which connects GLM with tools and plugins such as Claude Code and other coding developer tools to provide a very strong coding-assistant experience. We also have a short demo video showing how to replace the default model in Claude Code with GLM-4.6; you can watch it on YouTube.

Then there is our community activity beyond today's talk. We regularly host events, both online and offline.
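Once a server is up (both vLLM and SGLang expose an OpenAI-compatible endpoint), calling the model is an ordinary chat-completions request. The model id, port, and sampling settings below are assumptions for a local deployment; to keep the sketch runnable without a server, we only construct the JSON payload rather than send it.

```python
import json

def chat_payload(prompt, model="zai-org/GLM-4.6", temperature=0.6):
    """Build an OpenAI-compatible /v1/chat/completions request body.
    The model id and temperature are illustrative assumptions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

payload = chat_payload("Write a Python function that reverses a string.")
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions (or wherever
# your vLLM/SGLang server listens) with any HTTP client.
```

Because the endpoint follows the OpenAI wire format, existing OpenAI SDKs work against a self-hosted GLM deployment by just overriding the base URL.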
Whenever we release a new model, we usually run several community sessions afterwards. The first is an AMA on Reddit, and we also hold offline, on-site technical sharing sessions, so please join us. The final slide collects some important links: the website I mentioned before, Z.ai, where you can try the GLM models, and also our API; the GLM-4.6 technical blog and the GLM-4.5 technical report, which you can check out; and if you want to join our community, the Discord link is there, along with the GitHub link to the open-source models, including a README on how to deploy them with open-source tooling. That's all for today. Thank you very much.