office-hours

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset (2025 March 06)

🔗 tinyurl.com/2wv52rc7

Event Details

Event sign up
πŸ—“οΈ : March 06, 2025 Thursday
⏰ : 9 am PST / 11 am CST / 12 pm EST / 5pm GMT
Duration: 1 hour

Agenda

Event recording will be available soon

Check resources: code, presentation slides, etc.

Q & A section


Session: Introducing GneissWeb - a state-of-the-art LLM pre-training dataset

At IBM, responsible AI implies transparency in training data: Introducing GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset with ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction!

In this session, we will go over how we created GneissWeb and discuss the tools and techniques used. We will provide code examples that you can try at your leisure.

👉 > 2% average improvement in benchmark performance over FineWeb
👉 Hugging Face page
👉 Data prep kit detailed recipe
👉 Data prep kit bloom filter for quick reproduction
👉 Recipe models for reproduction
👉 Announcement
👉 Paper

Session Type

Presentation

Audience

LLM app developers, data scientists, data engineers

Technical Level

Beginner - Intermediate

Prerequisites

None

Resources

Presentation slides

Speaker: Shahrokh Daijavad

Research Scientist @ IBM Almaden Research Center

Shahrokh Daijavad, a distinguished Research Scientist in the Watsonx Data Engineering group at IBM Almaden Research Center, has a rich background in Edge Computing and Data Engineering. He earned his B.Eng. and Ph.D. in electrical engineering from McMaster University and spent years at IBM T. J. Watson Research Center. His recent research focuses on AI@Edge and Data Engineering for IBM Watsonx AI offerings.

LinkedIn


Q & A

Please review the session recording