CUDA Tile Programming Now Available for BASIC!
CUDA 13.1 introduced CUDA Tile, a next generation tile-based GPU programming paradigm designed to make fine-grained parallelism more accessible and flexible. One of its key strengths is language openness: any programming language can target CUDA Tile, enabling developers to bring tile-based GPU acceleration into a wide range of ecosystems.
In response to overwhelming demand from seasoned developers everywhere, we’re releasing cuTile BASIC for GPUs, bringing CUDA Tile programming to this long-overlooked language.
What is cuTile BASIC?
cuTile BASIC is an expression of the CUDA Tile programming model in BASIC, built on top of the CUDA Tile IR specification. It enables you to write tile kernels in BASIC using a tile-based model, a natural fit for a language like BASIC, which predates multi-threaded programming.
cuTile BASIC is the perfect marriage of the power of GPUs with the anachronistic charm and syntactic simplicity of the BASIC programming language – an elegant language, from a more pixelated era. Manually numbering your lines of code never looked so good or ran so fast!
Who is cuTile BASIC for?
BASIC is one of the oldest programming languages around and, as such, is revered by a whole generation of developers who fondly remember the sound of a 300 baud dial-up modem handshaking. For many such developers, BASIC was their first introduction to computer programming.
Now, developers with BASIC still burned into their brains can bring legacy applications to NVIDIA GPU-accelerated computing for the first time. This unlocks performance and functionality the BASIC programming language could never have previously imagined – allowing your Lunar Lander to zip around the moon’s surface faster than an Artemis mission.
Relive the glory of writing games for your graphing calculator during math class while running on the world’s most powerful GPUs.
Get set up
First, install cuTile BASIC with pip:
pip install git+https://github.com/nvidia/cuda-tile.git@basic-experimental
The full hardware and software requirements for running cuTile BASIC are listed at the end of this post (64k of RAM or more recommended).
cuTile BASIC example
If you’ve learned CUDA C++, you’ve probably encountered the canonical vector addition kernel, which takes two vectors and adds them together elementwise to produce a third vector.
This is one of the simplest CUDA kernels one can write:
__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    /* calculate my thread index */
    int workIndex = threadIdx.x + blockIdx.x * blockDim.x;

    if (workIndex < vectorLength) {
        /* perform the vector addition */
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}
In this kernel, each thread’s work is explicitly specified, and the programmer, when launching this kernel, will specify the number of blocks and threads to be launched.
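The launch configuration arithmetic is the usual ceiling division over the vector length. A minimal sketch of that calculation (the vector length and block size here are illustrative values, not taken from any cuTile BASIC API):

```python
import math

vector_length = 1024      # number of elements to add
threads_per_block = 128   # threads launched per block

# Round up so a partial final block still covers the tail of the vector;
# the in-kernel bounds check (workIndex < vectorLength) guards the overhang.
blocks = math.ceil(vector_length / threads_per_block)
print(blocks)  # -> 8
```

With 1,024 elements and 128 threads per block, the kernel would be launched across 8 blocks.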
Now let’s look at the equivalent code written in cuTile BASIC. We don’t need to specify what each thread does. We only have to break the data into tiles and specify what mathematical operations should happen to these tiles. Everything else is handled for us.
The cuTile BASIC vector add kernel is shown below:
10 REM Vector Add: C = A + B
20 INPUT N, A(), B()
30 DIM A(N), B(N), C(N)
40 TILE A(128), B(128), C(128)
50 LET C(BID) = A(BID) + B(BID)
60 OUTPUT C
70 END

This example is very basic and uses standard BASIC, with three additional things to point out:
- Indexing an array returns a tile, which is a subset of the array.
- BID is a built-in variable that specifies the tile block index.
- TILE specifies what size tiles the arrays should be partitioned into.
Notice we didn’t have to specify anything other than the addition operation. Everything else is handled by cuTile BASIC.
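To make the tile semantics concrete, here is a plain-Python CPU analogy of what the runtime does on our behalf. The slicing loop below is purely illustrative (and sequential); it is not the cuTile BASIC implementation, where each BID would execute in parallel on the GPU:

```python
TILE = 128                 # TILE A(128), B(128), C(128)
N = 1024
A = [float(i) for i in range(N)]
B = [2.0 * x for x in A]
C = [0.0] * N

# One iteration per tile block; on the GPU, each BID runs concurrently.
for bid in range(N // TILE):
    lo, hi = bid * TILE, (bid + 1) * TILE
    # LET C(BID) = A(BID) + B(BID): elementwise add over the bid-th tile
    C[lo:hi] = [a + b for a, b in zip(A[lo:hi], B[lo:hi])]
```

Each iteration touches only one 128-element tile of each array, mirroring how line 50 of the BASIC kernel operates on whole tiles at a time.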
Putting it all together
Now we’ll show how to run this vector add kernel in BASIC. The straightforward workflow is that first the BASIC function is compiled to a cubin, and then it’s launched on the GPU. For brevity, we’ve omitted the boring Python host and wrapper code, but you can find it in our GitHub repository.
If you have the proper versions of CUDA Toolkit and Python installed and have downloaded the cuTile BASIC repo from GitHub, you can execute the following command:
$ python examples/vector_add.py
[1/2] Compiling to cubin ...
Arrays: ['A', 'B', 'C'], tile_shapes={'A': [128], 'B': [128], 'C': [128]}, grid_size=8
[2/2] Launching kernel on GPU ...
Results (showing 5 samples of 1024):
C[   0] = 0.0 (expected 0.0)
C[   1] = 3.0 (expected 3.0)
C[ 511] = 1533.0 (expected 1533.0)
C[ 512] = 1536.0 (expected 1536.0)
C[1023] = 3069.0 (expected 3069.0)
VERIFICATION PASSED (max_diff=0.000000, 1024 elements)
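The max_diff figure in the verification line is simply the largest absolute elementwise error between the GPU result and a CPU reference. A hedged sketch of such a check (the function name and inputs are illustrative, not part of the cuTile BASIC package):

```python
def max_abs_diff(got, expected):
    """Largest elementwise absolute error between two equal-length sequences."""
    return max(abs(g - e) for g, e in zip(got, expected))

# Illustrative values only; the real example compares 1,024 elements.
got = [0.0, 3.0, 6.0]
expected = [0.0, 3.0, 6.0]
print(max_abs_diff(got, expected))  # -> 0.0
```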
If your output looks the same, congratulations, you just ran your very first cuTile BASIC program, and quite possibly your very first program of any sort written in BASIC! Max Headroom would be proud.
A BASIC matrix multiplication
BASIC is a simple language in which very few lines of code can express common algorithms. Consider a matrix multiply (GEMM) kernel in BASIC, shown below:
10 REM GEMM: C(M,N) = A(M,K) * B(K,N)
15 INPUT M, N, K, A(), B()
20 DIM A(M, K), B(K, N), C(M, N)
30 TILE A(128, 32), B(32, 128), C(128, 128), ACC(128, 128)
40 LET TILEM = INT(BID / INT(N / 128))
50 LET TILEN = BID MOD INT(N / 128)
60 LET ACC = 0.0
70 FOR KI = 0 TO INT(K / 32) - 1
80 LET ACC = MMA(A(TILEM, KI), B(KI, TILEN), ACC)
90 NEXT KI
100 LET C(TILEM, TILEN) = ACC
110 OUTPUT C
120 END

In this kernel, in addition to standard BASIC syntax, TILE specifies how A, B, and C should be tiled, as well as the size of the accumulator tile, ACC. MMA is the function call for matrix multiply and accumulate. Notice how simple this code is: you specify how your data should be subdivided into tiles, describe your algorithm at a high level, and under the covers, CUDA Tile handles everything else.
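The BID-to-(TILEM, TILEN) mapping and the K-loop accumulation can be mirrored on the CPU. The plain-Python sketch below uses the same index arithmetic with deliberately tiny, hypothetical tile sizes passed as parameters (so it runs quickly without hardware MMA units); it is an analogy, not the cuTile runtime:

```python
def matmul_tiled(A, B, M, N, K, TM, TN, TK):
    """Tile-by-tile GEMM mirroring the BASIC kernel's index math (CPU sketch).

    A is M x K, B is K x N; TM/TN/TK are tile sizes assumed to divide M/N/K.
    """
    C = [[0.0] * N for _ in range(M)]
    tiles_n = N // TN
    for bid in range((M // TM) * tiles_n):  # one iteration per output tile
        tm = bid // tiles_n                 # LET TILEM = INT(BID / INT(N / TN))
        tn = bid % tiles_n                  # LET TILEN = BID MOD INT(N / TN)
        for ki in range(K // TK):           # FOR KI = 0 TO INT(K / TK) - 1
            # The MMA accumulate: ACC += A-tile * B-tile
            for i in range(TM):
                for j in range(TN):
                    s = 0.0
                    for k in range(TK):
                        s += A[tm * TM + i][ki * TK + k] * B[ki * TK + k][tn * TN + j]
                    C[tm * TM + i][tn * TN + j] += s
    return C
```

On the GPU, every output tile (each BID) would be computed concurrently and the inner triple loop would be a single hardware MMA instruction per tile pair.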
This example is also available in the GitHub repo’s examples folder. Running it produces the following output:
$ python examples/gemm.py
[1/2] Compiling to cubin ...
M=512, N=512, K=512, tile_shapes={'A': [128, 32], 'B': [32, 128], 'C': [128, 128]}, grid_size=16
[2/2] Launching kernel on GPU ...
Results (showing 5 samples of 512x512 = 262144 elements):
C[0,0] = -0.1199 (expected -0.1199)
C[0,1] = -14.4456 (expected -14.4456)
C[256,0] = -15.8891 (expected -15.8891)
C[256,1] = -2.8646 (expected -2.8646)
C[511,511] = 11.4724 (expected 11.4724)
VERIFICATION PASSED (max_diff=0.000012, tol=0.005120)
Matrix multiplication, such as that shown above, is at the heart of artificial intelligence tools like large language models. With cuTile BASIC, developers can now explore the frontiers of artificial intelligence in models with trillions of parameters from a language that could barely imagine a whole megabyte of system memory.
How developers can get cuTile
To run cuTile BASIC programs, you need the following:
- A GPU with compute capability 8.x, 10.x, 11.x, or 12.x (support for additional GPU architectures will be added in future CUDA releases)
- NVIDIA Driver R580 or later (R590 is required for tile-specific developer tools support)
- CUDA Toolkit 13.1 or later
- Python version 3.10 or higher
- The cuTile BASIC package
Get started
Once you’ve got all the software, check out the full cuTile BASIC documentation, try all the sample programs found on GitHub, and start programming in cuTile BASIC today. Relish in the opportunity to port your modern AI or scientific computing code base to a historically pivotal language while retaining the ability to run on the most powerful hardware available! Just don’t tell your Commodore 64.
CUDA Tile in any language
While BASIC may not be the first language developers think of for high-performance parallel computing, it is an instructive demonstration that, thanks to the design of the CUDA software stack, CUDA Tile could be used from nearly any programming language. By compiling to the CUDA Tile IR format, CUDA Tile can be brought to just about any language … even BASIC!
Editor’s Note: In retrospect, developers asking for broad support of the CUDA Tile programming model should perhaps have been a bit more specific. Look for cuTile COBOL coming April 1, 2027.