
Short introduction to the Mojo programming language

Introduction

Mojo is a new programming language designed to solve a variety of AI development challenges that no other language can, because Mojo is the first programming language built from the ground up on MLIR (a compiler infrastructure that's ideal for heterogeneous hardware, from CPUs and GPUs to various AI ASICs). The syntax of the language is Python-like, but the performance is that of a systems-level programming language like Rust or C. This is a brief introduction which explains the "Hello, world!" equivalent of GPU programming: vector addition. I will briefly demonstrate how to set up the development environment, analyse the code while drawing similarities to CUDA, and end with a demonstration of how to use the compiler.

Setup

Download and install the magic tool

curl -ssL https://magic.modular.com/e6f579fc-5f0d-4223-a91f-e031163b1bc2 | bash

If you use VSCode as your IDE, you may install the Mojo extension for LSP support.

Initialise a project like so:

magic init gpu-intro --format mojoproject

We are now ready to develop inside the gpu-intro folder.

Simple example

# ===----------------------------------------------------------------------=== #
# Copyright (c) 2025, Modular Inc. All rights reserved.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions:
# https://llvm.org/LICENSE.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===----------------------------------------------------------------------=== #
from math import ceildiv
from sys import has_accelerator

from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor

# Vector data type and size
alias float_dtype = DType.float32
alias vector_size = 1000
alias layout = Layout.row_major(vector_size)

# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
alias block_size = 256
alias num_blocks = ceildiv(vector_size, block_size)


fn vector_addition(
    lhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    rhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    out_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
):
    """Calculate the element-wise sum of two vectors on the GPU."""

    # Calculate the index of the vector element for the thread to process
    var tid = block_idx.x * block_dim.x + thread_idx.x

    # Don't process out of bounds elements
    if tid < vector_size:
        out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]


def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        # Get the context for the attached GPU
        ctx = DeviceContext()

        # Create HostBuffers for input vectors
        lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        ctx.synchronize()

        # Initialize the input vectors
        for i in range(vector_size):
            lhs_host_buffer[i] = Float32(i)
            rhs_host_buffer[i] = Float32(i * 0.5)

        print("LHS buffer: ", lhs_host_buffer)
        print("RHS buffer: ", rhs_host_buffer)

        # Create DeviceBuffers for the input vectors
        lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)
        rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)

        # Copy the input vectors from the HostBuffers to the DeviceBuffers
        ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer)
        ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer)

        # Create a DeviceBuffer for the result vector
        result_device_buffer = ctx.enqueue_create_buffer[float_dtype](
            vector_size
        )

        # Wrap the DeviceBuffers in LayoutTensors
        lhs_tensor = LayoutTensor[float_dtype, layout](lhs_device_buffer)
        rhs_tensor = LayoutTensor[float_dtype, layout](rhs_device_buffer)
        result_tensor = LayoutTensor[float_dtype, layout](result_device_buffer)

        # Compile and enqueue the kernel
        ctx.enqueue_function[vector_addition](
            lhs_tensor,
            rhs_tensor,
            result_tensor,
            grid_dim=num_blocks,
            block_dim=block_size,
        )

        # Create a HostBuffer for the result vector
        result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )

        # Copy the result vector from the DeviceBuffer to the HostBuffer
        ctx.enqueue_copy(
            dst_buf=result_host_buffer, src_buf=result_device_buffer
        )

        # Finally, synchronize the DeviceContext to run all enqueued operations
        ctx.synchronize()

        print("Result vector:", result_host_buffer)

We will now explain this code step by step.

Kernel

fn vector_addition(
    lhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    rhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    out_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
):
    """Calculate the element-wise sum of two vectors on the GPU."""

    # Calculate the index of the vector element for the thread to process
    var tid = block_idx.x * block_dim.x + thread_idx.x

    # Don't process out of bounds elements
    if tid < vector_size:
        out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]

The kernel is what we would expect from a vector_addition kernel if we are familiar with CUDA.

It takes three LayoutTensors: the two input vectors and the output vector. According to the Mojo documentation, LayoutTensor provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. The properties of the LayoutTensor are defined as aliases:

# Vector data type and size
alias float_dtype = DType.float32
alias vector_size = 1000
alias layout = Layout.row_major(vector_size)

Inside the kernel we then compute the global thread ID and, after a bounds check, use it to index into the LayoutTensors and add the corresponding elements.
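For readers coming from CUDA, a rough equivalent of this kernel might look like the sketch below. This is purely for comparison; the pointer-based signature and names are my own and not part of the Mojo example.

// Illustrative CUDA sketch of the same kernel: each thread computes its
// global index and, after a bounds check, adds one pair of elements.
__global__ void vector_addition(const float* lhs, const float* rhs,
                                float* out, int vector_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < vector_size) {
        out[tid] = lhs[tid] + rhs[tid];
    }
}

The main difference is that the Mojo kernel receives LayoutTensors, which carry their layout in the type, while the CUDA version works with raw pointers and has to pass the size explicitly.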

Host Code

We first check whether a GPU is available:

@parameter
if not has_accelerator():
    print("No compatible GPU found")
else:
Similar to what we do in CUDA, we reserve memory on the host for our arrays:

        # Get the context for the attached GPU
        ctx = DeviceContext()

        # Create HostBuffers for input vectors
        lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        ctx.synchronize()

According to the Mojo documentation, ctx.synchronize() blocks until all asynchronous calls on the stream associated with this device context have completed.
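In CUDA the closest analogue to these HostBuffers would be pinned host memory, e.g. allocated with cudaMallocHost (a sketch for comparison, with names of my own choosing):

// Allocate pinned (page-locked) host memory for the two input vectors
float *lhs_host = nullptr, *rhs_host = nullptr;
cudaMallocHost((void**)&lhs_host, vector_size * sizeof(float));
cudaMallocHost((void**)&rhs_host, vector_size * sizeof(float));

Unlike the enqueued Mojo calls, cudaMallocHost is synchronous, so no extra synchronisation is needed at this point.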

Next we initialise the host arrays and print them out. We then reserve memory on the device and copy the contents of the host arrays over, just as we would in CUDA. Finally, we reserve memory for result_device_buffer; we don't need to initialise it for our task. (A CUDA sketch of the same steps follows the code below.)

        # Initialize the input vectors
        for i in range(vector_size):
            lhs_host_buffer[i] = Float32(i)
            rhs_host_buffer[i] = Float32(i * 0.5)

        print("LHS buffer: ", lhs_host_buffer)
        print("RHS buffer: ", rhs_host_buffer)

        # Create DeviceBuffers for the input vectors
        lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)
        rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)

        # Copy the input vectors from the HostBuffers to the DeviceBuffers
        ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer)
        ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer)

        # Create a DeviceBuffer for the result vector
        result_device_buffer = ctx.enqueue_create_buffer[float_dtype](
            vector_size
        )
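
For comparison, the CUDA counterpart of this step would be cudaMalloc plus host-to-device copies, e.g. cudaMemcpyAsync on a stream (a sketch; the stream is assumed to have been created earlier with cudaStreamCreate):

// Allocate device memory for the inputs and the result
float *lhs_dev = nullptr, *rhs_dev = nullptr, *result_dev = nullptr;
cudaMalloc((void**)&lhs_dev, vector_size * sizeof(float));
cudaMalloc((void**)&rhs_dev, vector_size * sizeof(float));
cudaMalloc((void**)&result_dev, vector_size * sizeof(float));

// Copy the input vectors from host to device on the stream
cudaMemcpyAsync(lhs_dev, lhs_host, vector_size * sizeof(float),
                cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(rhs_dev, rhs_host, vector_size * sizeof(float),
                cudaMemcpyHostToDevice, stream);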

We create the LayoutTensors (note that we could also use UnsafePointer here; see the GPU Puzzles for how to do that) and then call the kernel. Note that, just as in CUDA, we need to provide a grid and block configuration; a CUDA launch is sketched for comparison after the code below.

alias block_size = 256
alias num_blocks = ceildiv(vector_size, block_size)

        # Wrap the DeviceBuffers in LayoutTensors
        lhs_tensor = LayoutTensor[float_dtype, layout](lhs_device_buffer)
        rhs_tensor = LayoutTensor[float_dtype, layout](rhs_device_buffer)
        result_tensor = LayoutTensor[float_dtype, layout](result_device_buffer)

        # Compile and enqueue the kernel
        ctx.enqueue_function[vector_addition](
            lhs_tensor,
            rhs_tensor,
            result_tensor,
            grid_dim=num_blocks,
            block_dim=block_size,
        )
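
In CUDA the grid and block configuration would go into the triple-chevron launch syntax instead of keyword arguments (a sketch, matching the pointer-based kernel from above):

// Launch one thread per element, rounding the grid size up
int block_size = 256;
int num_blocks = (vector_size + block_size - 1) / block_size;  // ceildiv
vector_addition<<<num_blocks, block_size, 0, stream>>>(
    lhs_dev, rhs_dev, result_dev, vector_size);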

After the kernel call we proceed as we would in CUDA: we allocate memory for the result array on the host side and copy the result over from the device. We can then print our result, and that's already it. The corresponding CUDA steps are sketched after the code below.

        # Create a HostBuffer for the result vector
        result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )

        # Copy the result vector from the DeviceBuffer to the HostBuffer
        ctx.enqueue_copy(
            dst_buf=result_host_buffer, src_buf=result_device_buffer
        )

        # Finally, synchronize the DeviceContext to run all enqueued operations
        ctx.synchronize()

        print("Result vector:", result_host_buffer)

Compile and run the kernel

We can compile and run the program in one step by executing mojo vector_addition.mojo. Note that before doing so we need to execute magic shell to activate the Mojo environment.

After executing mojo vector_addition.mojo we get the expected output:

LHS buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer:  HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
Result vector: HostBuffer([0.0, 1.5, 3.0, ..., 1495.5, 1497.0, 1498.5])

Note that we might want more control over the compilation step. That can be done by using mojo build.

mojo build vector_addition.mojo will emit a simple executable, which can then be run via ./vector_addition and produces the same output as above. Using mojo build vector_addition.mojo --emit llvm will create a file vector_addition.ll with the LLVM IR. Using mojo build vector_addition.mojo --emit asm will create a file vector_addition.s with the target assembly. This is interesting if we want to make low-level optimisations. For further compile options, please see the mojo build documentation.

Conclusion

I hope this can serve as a quick starting guide to Mojo and make understanding it easier. In the future I plan to implement more advanced examples using Mojo. If you are looking for further information, I recommend checking out the Mojo docs. Recently, about 450K lines of Mojo code for GPU programming were also open-sourced, and the code is very readable in my opinion. Please check out the GitHub repo to dive deeper into the code.