Posts
Asyncio By Example: Implementing the Producer-Consumer Pattern
python coroutine tech

The Most Basic Case

With coroutines, we can define a producer and a consumer without any need for threads. This simplifies our code and makes it more efficient.

```python
import asyncio

async def producer():
    for i in range(6):
        await asyncio.sleep(0.2)
        yield i

async def consumer():
    async for i in producer():
        print(i)

async def main():
    await asyncio.gather(consumer())

asyncio.run(main())
```

Output:

```
0
1
2
3
4
5
```

Work with Heavy IO

When working with heavy IO operations, we need to be careful not to block the event loop.
Emacs Lisp Introduction for Python Programmers
emacs lisp tech python

This is a brief introduction to Emacs Lisp for Python programmers (although I am not an Elisp expert; in fact, I am more familiar with Python than Elisp). The two languages have quite different syntaxes, so it is interesting to see how Python code can be implemented in Lisp. The content follows the structure of Learn X in Y Minutes Where X is Python, and we will touch on all of its topics.

Primitive Datatypes and Operators

Numbers

Python:

```python
# Integer
1
# Float
3.
```
Reduce kernel in CUDA
cuda basics tech

Question definition

Given an array of \(n\) integers, the goal is to compute the sum of all elements within the array.

Solutions

The implementations for all kernel versions can be found at 2-reduce.cu on GitHub.

Naive Version with atomicAdd

The simplest approach involves utilizing each thread to perform an atomicAdd operation on the output variable. Here’s how the kernel is defined:

```cuda
__global__ void reduce_naive_atomic(int* g_idata, int* g_odata, unsigned int n) {
    unsigned int idx = blockIdx.
```
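The excerpt cuts off before the kernel body; as a rough illustration of the idea (my own sketch, not necessarily the code in 2-reduce.cu), a naive atomicAdd-based reduction can look like this:

```cuda
// Naive reduction sketch: every thread atomically adds its element to the
// single output word. Correct, but the atomic serializes the updates.
// Assumes *g_odata was zero-initialized on the host before the launch.
__global__ void reduce_naive_atomic_sketch(const int* g_idata, int* g_odata,
                                           unsigned int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicAdd(g_odata, g_idata[idx]);
    }
}
```

Launched with, say, 256 threads per block and (n + 255) / 256 blocks, this produces a correct sum but leaves much of the memory bandwidth unused, which is what the later kernel versions improve on.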
Count the parameters in LLaMA V1 model
LLM tech

Let’s load the model:

```python
from transformers import LlamaModel, LlamaConfig

model = LlamaModel.from_pretrained("llama-7b-hf-path")

def count_params(model, is_human: bool = False):
    params: int = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{params / 1e6:.2f}M" if is_human else params

print(model)
print("Total # of params:", count_params(model, is_human=True))
```

Print out the layers:

```
LlamaModel(
  (embed_tokens): Embedding(32000, 4096, padding_idx=0)
  (layers): ModuleList(
    (0-31): 32 x LlamaDecoderLayer(
      (self_attn): LlamaSdpaAttention(
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (mlp): LlamaMLP(
        (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
        (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
        (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): LlamaRMSNorm()
      (post_attention_layernorm): LlamaRMSNorm()
    )
  )
  (norm): LlamaRMSNorm()
)
Total # of params: 6607.
```
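The printed total is cut off in the excerpt, but the layer shapes above are enough to check the magnitude by hand (my own back-of-the-envelope arithmetic, not part of the original post):

\[
\underbrace{32000 \times 4096}_{\text{embed\_tokens}}
+ 32 \times \Bigl(\underbrace{4 \times 4096^{2}}_{\text{q,k,v,o}}
+ \underbrace{3 \times 4096 \times 11008}_{\text{gate,up,down}}
+ \underbrace{2 \times 4096}_{\text{RMSNorm}}\Bigr)
+ \underbrace{4096}_{\text{final norm}}
\approx 6.61 \times 10^{9},
\]

i.e. roughly 6607M trainable parameters. Note that LlamaModel does not include the lm_head projection; adding its 4096 x 32000 weights is what gives the commonly quoted ~6.7B figure for LLaMA-7B.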
Get GPU Properties
gpu basics tech

In `cuda_runtime.h`, there are several APIs for retrieving properties of the installed GPU:

- cudaDeviceGetAttribute(int* value, cudaDeviceAttr attr, int device): a C API
- cudaGetDeviceProperties(cudaDeviceProp* prop, int device): a C++ API

Here is the code of the example. On an NVIDIA RTX 3080 GPU, the properties are as below:

```
Device 0 properties:
  Max block dimensions: 1024 x 1024 x 64
  Max grid dimensions: 2147483647 x 65535 x 65535
  Shared memory bank size: 4 bytes
  Max shared memory per block: 49152 bytes
  Max registers per block: 65536
  Warp size: 32
  Multiprocessor count: 68
  Max resident threads per multiprocessor: 1536 = 48 warps
  L2 cache size: 5242880 bytes
  Global L1 cache supported: yes
  Total global memory: 9 GB
  Processor clock: 1 MHZ
  Memory clock: 9 MHZ
```
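A minimal sketch of how such a report can be produced with the C++-style API (the field names below come from cudaDeviceProp; this is illustrative, not the exact example code referenced above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, /*device=*/0);

    printf("Device 0 properties:\n");
    printf("  Max block dimensions: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("  Max grid dimensions: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("  Max shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("  Warp size: %d\n", prop.warpSize);
    printf("  Multiprocessor count: %d\n", prop.multiProcessorCount);
    printf("  L2 cache size: %d bytes\n", prop.l2CacheSize);
    printf("  Total global memory: %zu bytes\n", prop.totalGlobalMem);

    // The C-style API queries one attribute at a time:
    int max_threads = 0;
    cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, 0);
    printf("  Max threads per block: %d\n", max_threads);
    return 0;
}
```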
Notes on LLM technologies (keep updating)
LLM tech

Brief notes on LLM technologies.

Models

GPT2

Model structure

The GPT model employs a repeated structure of Transformer Blocks, each containing two sub-layers: a Masked Multi-Head Attention (MMHA) layer and a Position-wise Feed-Forward Network. The MMHA is a central component of the model. It operates by splitting the input into multiple ‘heads’, each of which learns to attend to different positions within the input sequence, allowing the model to focus on different aspects of the input simultaneously.
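For reference, each head computes the standard masked scaled dot-product attention (the textbook formulation, not a quotation from these notes):

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,
\]

where \(M\) adds \(-\infty\) to positions after the current token (the causal mask), and the per-head outputs are concatenated and linearly projected back to the model dimension.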
Memory coalescing in CUDA (2) – Matrix Transpose
cuda basics tech

Background

In the VecAdd page, we introduced memory coalescing in global memory access. This post continues the topic with another interesting application: matrix transposition. The following content will briefly touch on these topics:

- Tiles in a matrix, which are the basis of optimizing matrix computation
- A simple trick to avoid bank conflicts in shared memory access

Kernels

The code for all the kernels is located in 1-matrix-transpose-coalesce.cu.

Read coalesced

```cuda
template <typename T>
__global__ void transpose_read_coalesce(
    const T* __restrict__ input, T* __restrict__ output, int n, int m) {
    int i = blockIdx.
```
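The excerpt cuts off at the start of the kernel. As a rough sketch of the shared-memory trick mentioned above (my own illustration, not necessarily the kernel in 1-matrix-transpose-coalesce.cu): stage a tile in shared memory, pad the tile width by one element so that column accesses during write-out fall into different banks, and store the transposed tile with coalesced writes.

```cuda
// Illustrative tiled transpose of an n x m row-major matrix into an
// m x n output. The +1 padding on the shared-memory tile avoids bank
// conflicts when a warp reads a column of the tile.
constexpr int TILE = 32;

template <typename T>
__global__ void transpose_tiled_sketch(const T* __restrict__ in,
                                       T* __restrict__ out, int n, int m) {
    __shared__ T tile[TILE][TILE + 1];  // padded: column reads hit distinct banks

    int col = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int row = blockIdx.y * TILE + threadIdx.y;  // row in the input
    if (row < n && col < m)
        tile[threadIdx.y][threadIdx.x] = in[row * m + col];  // coalesced load

    __syncthreads();

    // Swap the block coordinates so the store is coalesced in the output.
    int out_col = blockIdx.y * TILE + threadIdx.x;
    int out_row = blockIdx.x * TILE + threadIdx.y;
    if (out_row < m && out_col < n)
        out[out_row * n + out_col] = tile[threadIdx.x][threadIdx.y];
}
```

Launched with a (TILE, TILE) block and a grid of ceil(m / TILE) x ceil(n / TILE) blocks, both the load and the store are coalesced, and the padding removes the bank conflicts a plain TILE x TILE tile would cause.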
Memory coalescing in CUDA (1) – VecAdd
cuda basics tech

Background

Memory coalescing is a crucial optimization technique in CUDA programming that allows optimal usage of the global memory bandwidth. When threads in the same warp, running the same instruction, access consecutive locations in global memory, the hardware can coalesce these accesses into a single transaction, significantly improving performance. Coalescing memory access is vital for achieving high performance: besides PCIe memory traffic, global memory access tends to be the largest bottleneck in the GPU's memory hierarchy.
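As a minimal sketch of what this looks like in practice (illustrative, not the exact kernel from the post): in the element-wise vector addition below, consecutive threads of a warp touch consecutive addresses, so each warp's loads and stores collapse into a few wide memory transactions.

```cuda
// Thread k of a warp accesses element base + k, so the warp's 32 accesses
// fall into contiguous segments that the hardware can coalesce.
__global__ void vec_add(const float* __restrict__ a,
                        const float* __restrict__ b,
                        float* __restrict__ c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = a[idx] + b[idx];
}
```

Indexing with a large per-thread stride (for example c[idx * stride]) breaks this pattern and forces each access into its own transaction, which is exactly what coalescing-aware kernels try to avoid.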
LLVM Utilities (keep updating)
llvm cpp tech

There are many handy functions and data structures in the LLVM project that are widely used by other projects relying on LLVM. On this page, I will introduce some common utilities that are worth using in your own project, or that appear frequently in LLVM code and are worth being familiar with.

Basic data type

llvm::StringRef

It is a lightweight, non-owning reference to a sequence of characters, similar to the std::string_view introduced in C++17.
Notes on Tinkering with an Apple TV
life

Because of our limited floor space, we never installed a TV in the living room. Instead, we use a fairly large LG 4K monitor as the entertainment center, paired with two huge Fyne floor-standing speakers, and the result is quite good. What we watch most is Netflix; in the evenings the family gathers around the computer to watch together, which is pleasant enough. However, using a monitor as a TV has one problem: everything has to go through the operating system. Since I am a programmer by profession, as soon as I touch the computer I cannot help but start typing, and I easily slip into work mode, which somewhat gets in the way of relaxing.
Best Practices for Python Programming (Continuously Updated)
python tech

When delving into the codebases of some successful large Python projects such as PyTorch, I am consistently impressed by their code – whether it is clean yet precise, or leverages lesser-known built-in or third-party packages to significantly enhance functionality.
High-quality code snippets, handy packages, and modules have greatly facilitated my work. In this blog, I’ll be sharing noteworthy findings and insights learned from the open-source codebase.
A Brief Overview of the OpenAI/Triton Migration to MLIR
triton system tech

After several months of sustained effort, OpenAI Triton has completed its migration/refactoring onto the MLIR infrastructure and merged the latest MLIR-based code into the main branch. This work was carried out over the past few months in close collaboration between OpenAI and the relevant NVIDIA teams, and I was fortunate to be part of it. In this post, I will share some technical notes and record a few takeaways and reflections.

Although Triton's open-source development is currently moving very fast, this article focuses mainly on the first version of the code refactored on top of the MLIR infrastructure (which should be from about two or three months ago).
Emacs Essentials
emacs tech

Mastering Emacs Lisp has a steep learning curve. From my experience, there are mainly two issues:

1. the Lisp syntax and functional programming
2. the fragmented methods and libraries

For the first issue, it is easy to master the syntax after writing several programs and getting used to it; for the second, one needs to take notes or memorize things. In this blog, I focus on the second point and keep updating notes on methods and libraries that I think are essential for writing Emacs Lisp packages.
About me
about-me

Thank you for your interest. I am a Deep Learning system architect at NVIDIA, and my current focus is high-performance AI compilers (on GPU). Before this, I was a senior engineer at Baidu, working as an architect on PaddlePaddle (one of the most popular open-source deep learning frameworks in the Chinese market). I was the creator, primary author, and tech lead of the following projects in the PaddlePaddle ecosystem (before 2022-6)