Posts

flash-attention Usage: a Worknote for LLM inference

llm tech

Background

The flash-attention project ships the flash_attn package in Python, which exposes multiple APIs in its interface. These APIs involve many LLM optimization concepts, such as the paged KV-cache and variable-length (continuous-batching) attention. This post aggregates the information related to these concepts and focuses on inference only: we will not cover the modules defined for training, only the basic functional APIs used in inference.
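As a quick taste of the interface, a minimal call to flash_attn_func (one of those functional APIs) might look like the sketch below; the tensor shapes and arguments here are illustrative assumptions, not taken from the post:

import torch
from flash_attn import flash_attn_func

# q, k, v are laid out as (batch, seqlen, nheads, headdim), fp16/bf16, on the GPU
q = torch.randn(2, 1024, 32, 128, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 32, 128, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 32, 128, dtype=torch.float16, device="cuda")

# causal=True applies the autoregressive mask used in decoder-only LLMs
out = flash_attn_func(q, k, v, causal=True)  # -> (2, 1024, 32, 128)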

Enable Jupyter in Doom Emacs

tech emacs

A few adjustments to the default installation are needed when using the jupyter package in Emacs. Here’s a step-by-step guide to configuring it properly with Doom Emacs.

Step 1: Install the jupyter package.

Add this line to packages.el:

(package! jupyter)

Step 2: Enable builtin Jupyter Support in Org Mode

To enable Jupyter support in Org mode, make the following modifications in your init.el file:

  1. Uncomment the ein line; emacs-ipython-notebook is a dependency of the jupyter package.
  2. Add +jupyter to the Org settings. For more details, refer to :lang org:
(org +jupyter)               ; organize your plain life in plain text

Step 3: Patch for Runtime Errors with ZeroMQ

To address a runtime error related to ZeroMQ (as discussed in this issue), append the following code to your config.el or any other configuration file:

Asyncio By Example: Implementing the Producer-Consumer Pattern

python coroutine tech

The Most Basic Case

With coroutines, we can define a producer and a consumer without any need for threads. This simplifies our code and makes it more efficient.

import asyncio

async def producer():
    for i in range(6):
        await asyncio.sleep(0.2)
        yield i

async def consumer():
    async for i in producer():
        print(i)

async def main():
    await asyncio.gather(consumer())

asyncio.run(main())
0
1
2
3
4
5

Work with Heavy IO

When working with heavy IO, we need to be careful not to block the event loop: a blocking IO call inside a coroutine stalls the loop and delays the scheduling of all other coroutines.
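For example, a blocking call can be pushed onto a worker thread with asyncio.to_thread so that the loop keeps scheduling other coroutines. The snippet below is a minimal sketch; blocking_read is a made-up stand-in for a real blocking IO call:

import asyncio
import time

def blocking_read(path):
    time.sleep(1)          # stands in for a slow, blocking IO call
    return f"contents of {path}"

async def main():
    # each blocking call runs in a worker thread, so the event loop stays responsive
    results = await asyncio.gather(
        asyncio.to_thread(blocking_read, "a.txt"),
        asyncio.to_thread(blocking_read, "b.txt"),
    )
    print(results)

asyncio.run(main())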

Emacs Lisp Introduction for Python Programmers

emacs lisp tech python

This is a brief introduction to Emacs Lisp for Python programmers (although I am not an Elisp expert, and I am actually more familiar with Python than Elisp). The two languages have quite different syntaxes, so it is interesting to see how Python code can be implemented in Lisp.

The content follows the structure of Learn X in Y Minutes where X is Python, and we will touch on all of its topics.

Primitive Datatypes and Operators

Numbers

Python

# Integer
1
# Float
3.14
# Math is what you would expect
1 + 1   # => 2
8 - 1   # => 7
10 * 2  # => 20
35 / 5  # => 7.0

# Integer division rounds down for both positive and negative numbers.
5 // 3       # => 1
-5 // 3      # => -2
5.0 // 3.0   # => 1.0  # works on floats too
-5.0 // 3.0  # => -2.0

# The result of division is always a float
10.0 / 3  # => 3.3333333333333335

# Modulo operation
7 % 3   # => 1
# i % j has the same sign as j, unlike C
-7 % 3  # => 2

# Exponentiation (x**y, x to the yth power)
2**3  # => 8

# Enforce precedence with parentheses
1 + 3 * 2    # => 7
(1 + 3) * 2  # => 8

Elisp

;; Integer
1
;; Float
3.14
;; Math is what you would expect
(+ 1 1)   ; => 2
(- 8 1)   ; => 7
(* 10 2)  ; => 20
(/ 35 5)  ; => 7

;; Elisp's integer division truncates toward zero, so use `floor' (with a
;; divisor) to round down like Python's //.
(floor 5 3)            ; => 1
(floor -5 3)           ; => -2
(ffloor (/ 5.0 3.0))   ; => 1.0  ; `ffloor' keeps floats as floats
(ffloor (/ -5.0 3.0))  ; => -2.0

;; The result of division is a float if either operand is a float
(/ 10.0 3)  ; => 3.3333333333333335

;; Modulo operation
(% 7 3)   ; => 1
;; unlike Python, the result takes the sign of the dividend
(% -7 3)  ; => -1

;; Exponentiation
(expt 2 3)  ; => 8

;; Enforce precedence with parentheses
(+ 1 (* 3 2))    ; => 7
(* (+ 1 3) 2)  ; => 8

Bools and comparison

In Emacs Lisp, booleans are represented by the symbols t for true and nil for false.
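A few illustrative expressions (my own examples, not from the original post):

(not nil)            ; => t
(not t)              ; => nil
;; any non-nil value counts as true
(if 0 "yes" "no")    ; => "yes"  (0 is non-nil, unlike Python)
;; comparisons
(= 1 1.0)            ; => t  numeric equality
(equal "ab" "ab")    ; => t  structural equality
(eq 'foo 'foo)       ; => t  symbol identity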

Reduce kernel in CUDA

cuda basics tech

Question definition

Given an array of \(n\) integers, the goal is to compute the sum of all elements within the array.

Solutions

The implementations for all kernel versions can be found at 2-reduce.cu on GitHub.

Naive Version with atomicAdd

The simplest approach involves utilizing each thread to perform an atomicAdd operation on the output variable. Here’s how the kernel is defined:

__global__ void reduce_naive_atomic(int* g_idata, int* g_odata, unsigned int n)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;

    int sum = 0;
    for (unsigned int i = idx; i < n; i += gridSize)
    {
        sum += g_idata[i];
    }

    atomicAdd(g_odata, sum);
}

And the kernel launcher is straightforward, invoking the kernel a single time.
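A minimal sketch of such a launcher is shown below; the block size, grid sizing, and synchronization are illustrative assumptions rather than the exact code in 2-reduce.cu:

void launch_reduce_naive_atomic(int* d_idata, int* d_odata, unsigned int n)
{
    // the accumulator must start at zero because the kernel atomically adds into it
    cudaMemset(d_odata, 0, sizeof(int));

    const unsigned int block = 256;
    const unsigned int grid = (n + block - 1) / block;
    reduce_naive_atomic<<<grid, block>>>(d_idata, d_odata, n);
    cudaDeviceSynchronize();
}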

Count the parameters in LLaMA V1 model

LLM tech

Let’s load the model

from transformers import LlamaModel, LlamaConfig
model = LlamaModel.from_pretrained("llama-7b-hf-path")

def count_params(model, is_human: bool = False):
    params: int = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{params / 1e6:.2f}M" if is_human else params

print(model)
print("Total # of params:", count_params(model, is_human=True))

Print out the layers:

LlamaModel(
  (embed_tokens): Embedding(32000, 4096, padding_idx=0)
  (layers): ModuleList(
    (0-31): 32 x LlamaDecoderLayer(
      (self_attn): LlamaSdpaAttention(
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (mlp): LlamaMLP(
        (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
        (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
        (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): LlamaRMSNorm()
      (post_attention_layernorm): LlamaRMSNorm()
    )
  )
  (norm): LlamaRMSNorm()
)
Total # of params: 6607.34M

The printout shows 6607.34M parameters; stored in float16 (2 bytes each), that is roughly 13 GB, which matches the actual size of the weights.
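As a sanity check, the same number can be reproduced from the layer shapes in the printout above (a quick back-of-the-envelope script; the dimensions are read off the module list, not queried from the config):

hidden, inter, vocab, layers = 4096, 11008, 32000, 32

embed = vocab * hidden        # embed_tokens
attn  = 4 * hidden * hidden   # q/k/v/o projections
mlp   = 3 * hidden * inter    # gate/up/down projections
norms = 2 * hidden            # the two RMSNorm weights per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
print(f"{total / 1e6:.2f}M")  # => 6607.34M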

Get GPU Properties

gpu basics tech

In `cuda_runtime.h`, there are several APIs for retrieving properties for the installed GPU.

The core of the example is a call to cudaGetDeviceProperties from the CUDA runtime API.
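A minimal sketch of such a query is below; the field selection and formatting are illustrative and not necessarily the post's exact code:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    printf("Device 0 properties:\n");
    printf("  Max block dimensions: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("  Max grid dimensions: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("  Max shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("  Max registers per block: %d\n", prop.regsPerBlock);
    printf("  Warp size: %d\n", prop.warpSize);
    printf("  Multiprocessor count: %d\n", prop.multiProcessorCount);
    printf("  Max resident threads per multiprocessor: %d\n",
           prop.maxThreadsPerMultiProcessor);
    printf("  L2 cache size: %d bytes\n", prop.l2CacheSize);
    printf("  Total global memory: %zu GB\n", prop.totalGlobalMem >> 30);
    return 0;
}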

On an NVIDIA RTX 3080 GPU, the properties are as follows:

Device 0 properties:
  Max block dimensions: 1024 x 1024 x 64
  Max grid dimensions: 2147483647 x 65535 x 65535
  Shared memory bank size: 4 bytes
  Max shared memory per block: 49152 bytes
  Max registers per block: 65536
  Warp size: 32
  Multiprocessor count: 68
  Max resident threads per multiprocessor: 1536 = 48 warps
  L2 cache size: 5242880 bytes
  Global L1 cache supported: yes
  Total global memory: 9 GB
  Processor clock: 1 MHZ
  Memory clock: 9 MHZ

Notes on LLM technologies (keep updating)

LLM tech

Brief notes on LLM technologies.

Models

GPT2

Model structure

The GPT model employs a repeated structure of Transformer Blocks, each containing two sub-layers: a Masked Multi-Head Attention (MMHA) layer and a Position-wise Feed-Forward Network.

The MMHA is a central component of the model. It operates by splitting the input into multiple ‘heads’, each of which learns to attend to different positions within the input sequence, allowing the model to focus on different aspects of the input simultaneously. The output of these heads is then concatenated and linearly transformed to produce the final output.
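In symbols, the standard multi-head attention from the Transformer paper is the following, where \(W_i^Q\), \(W_i^K\), \(W_i^V\), \(W^O\) are learned projections and \(d_k\) is the per-head dimension:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]

\[
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
\]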

Memory coalescing in CUDA (2) – Matrix Transpose

cuda basics tech

Background

In the VecAdd post, we introduced memory coalescing in global memory access. This post follows up on that topic with another interesting application: matrix transposition.

The following content will briefly touch on these topics:

Kernels

The code for all the kernels is located in 1-matrix-transpose-coalesce.cu.

Read coalesced

template <typename T>
__global__ void transpose_read_coalesce(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // the contiguous tid
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < n && j < m) {
    output[i * m + j] = input[j * n + i];
  }
}

Write coalesced

template <typename T>
__global__ void transpose_write_coalesce(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // the contiguous tid
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < n && j < m) {
    output[j * n + i] = input[i * m + j];
  }
}

Both read and write coalesced by tiling with shared memory

The tiling method is a common methodology for optimizing matrix operations. It divides the matrix into smaller, manageable blocks or “tiles” that can fit into shared memory.
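A sketch of what such a tiled kernel can look like follows; the tile size and padding are illustrative assumptions (using the same n/m index convention as the kernels above), not necessarily the exact code in 1-matrix-transpose-coalesce.cu:

#define TILE 32

template <typename T>
__global__ void transpose_tiled(
    const T* __restrict__ input,
    T* __restrict__ output,
    int n,
    int m) {
  __shared__ T tile[TILE][TILE + 1];  // +1 padding avoids shared-memory bank conflicts

  int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
  int y = blockIdx.y * TILE + threadIdx.y;  // row in the input
  if (x < n && y < m) {
    tile[threadIdx.y][threadIdx.x] = input[y * n + x];  // coalesced read
  }
  __syncthreads();

  // swap the block coordinates so the write is also coalesced
  x = blockIdx.y * TILE + threadIdx.x;  // column in the output
  y = blockIdx.x * TILE + threadIdx.y;  // row in the output
  if (x < m && y < n) {
    output[y * m + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
  }
}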

Memory coalescing in CUDA (1) – VecAdd

cuda basics tech

Background

Memory coalescing is a crucial optimization technique in CUDA programming that allows optimal usage of the global memory bandwidth. When threads in the same warp, running the same instruction, access consecutive locations in global memory, the hardware can coalesce these accesses into a single transaction, significantly improving performance.

Coalescing memory access is vital for achieving high performance. Besides PCIe traffic, global memory access tends to be the largest bottleneck in the GPU's memory hierarchy, and non-coalesced access leads to underutilization of memory bandwidth.
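As a minimal illustration of the idea (not the post's exact kernel), a vector-add kernel is naturally coalesced when each thread handles the element at its own global index:

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
  // consecutive threads in a warp touch consecutive addresses,
  // so the loads and the store coalesce into a few wide transactions
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) {
    c[tid] = a[tid] + b[tid];
  }
}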

LLVM Utilities (keep updating)

llvm cpp tech

There are many handy functions and data structures in the LLVM project that are widely used by other projects relying on LLVM. In this page, I will introduce some common utilities that are worth using in your own project, or that appear so frequently in LLVM code that you should be familiar with them.

Basic data type

llvm::StringRef

It is a lightweight, non-owning reference to a sequence of characters. It is similar to std::string_view introduced in C++17.
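A small illustrative usage (my own example, not from the post):

#include "llvm/ADT/StringRef.h"
#include <string>

void demo() {
  std::string owned = "hello world";
  llvm::StringRef ref(owned);                // non-owning view, no copy
  llvm::StringRef word = ref.take_front(5);  // "hello", still no allocation
  bool hasWorld = ref.contains("world");     // true
  std::string copy = word.str();             // materialize an owning std::string
  (void)hasWorld;
  (void)copy;
}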

Apple TV Tinkering Notes

life

Since our apartment is small, we never installed a TV in the living room. Instead, we use a fairly large LG 4K monitor as the entertainment center, paired with two big Fyne floor-standing speakers, and the result is quite good. We mostly watch Netflix; in the evening the family gathers around the computer to watch together, which works reasonably well. But using a monitor as a TV has one problem: everything has to be operated through the operating system. Since I am a programmer by trade, once I touch the computer I cannot help typing on the keyboard and easily slip into work mode, which gets in the way of relaxing.

Best Practices for Python Programming (Continuously Updated)

python tech

When delving into the codebases of some successful large Python projects such as PyTorch, I am consistently impressed by their code, whether because it is clean yet precise or because it leverages lesser-known built-in or third-party packages to significantly enhance functionality.

High-quality code snippets, handy packages, and modules have greatly facilitated my work. In this blog, I’ll be sharing noteworthy findings and insights learned from the open-source codebase.

An Overview of the OpenAI/Triton MLIR Migration

triton system tech

After several months of sustained effort, OpenAI Triton has completed its migration/refactoring onto the MLIR infrastructure and merged the latest MLIR-based code into the main branch. The work was done over the past few months through close collaboration between the OpenAI and NVIDIA teams, and I was fortunate to take part in it. In this post I will share some technical summaries and record what I learned and thought along the way.

Although Triton's open-source development is moving very fast, this post focuses mainly on the first version of the code refactored onto the MLIR infrastructure (which is roughly two or three months old by now).

Emacs Essentials

emacs tech

Mastering Emacs Lisp involves a steep learning curve; in my experience there are mainly two difficulties:

  1. the lisp syntax and functional programming
  2. the fragmented methods and libraries

For the 1st issue, the syntax is easy to master after writing several programs and getting used to it; for the 2nd, one needs to take notes or memorize things.

In this blog, I focus on the 2nd point and keep updating notes on methods and libraries that I think are essential for writing Emacs Lisp packages.

About me

about-me


Thank you for your interest.

I am a Deep Learning system architect at NVIDIA, and my current focus is high-performance AI compilers (on GPUs).

Before this, I was a senior engineer at Baidu, working as an architect on PaddlePaddle (one of the most popular open-source deep learning frameworks in the Chinese market).

I was the creator, primary author, and tech lead of the following projects in the PaddlePaddle ecosystem (before 2022-06)