Metadata-Version: 2.4
Name: cuda_fps
Version: 0.1.0
Summary: CUDA-accelerated Farthest Point Sampling for PyTorch
Home-page: https://github.com/yourusername/cuda_fps
Author: Shun Iwase
Author-email: shun.iwase@tri.global
Keywords: pytorch cuda farthest-point-sampling point-cloud
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: torch>=1.9.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CUDA FPS: Fast Farthest Point Sampling for PyTorch

A high-performance implementation of Farthest Point Sampling (FPS) for PyTorch. This library provides efficient point cloud sampling with support for batched operations and segmented point clouds, with both CUDA-accelerated GPU and multi-threaded CPU implementations.

## Features

- **Fast CUDA Implementation**: Optimized CUDA kernels for maximum GPU performance
- **Multi-threaded CPU Implementation**: OpenMP-accelerated CPU fallback for systems without GPU
- **Automatic Device Detection**: Seamlessly dispatches to GPU or CPU based on tensor device
- **Flexible API**: Support for both segmented and batched point clouds

## Installation

### Requirements

- Python >= 3.7
- PyTorch >= 1.9.0
- CUDA Toolkit (with compute capability >= 7.0) - **Optional, for GPU support**
- A compatible GPU (Volta/Turing/Ampere/Ada/Hopper architecture) - **Optional, for GPU support**
- OpenMP support in your compiler (usually available by default on Linux/macOS)

### Install from source

```bash
git clone git@github.com:TRI-ML/cuda_fps.git
cd cuda_fps
pip install .
```

For development:
```bash
pip install -e .
```

## Usage

### Basic Example (GPU)

```python
import torch
import cuda_fps

# Create a random point cloud (1000 points, 3D)
points = torch.randn(1000, 3, device='cuda')

# Define segment boundaries (single segment in this case)
offsets = torch.tensor([0, 1000], dtype=torch.int32, device='cuda')

# Sample 100 points using FPS
sampled = cuda_fps.fps(points, offsets, num_sampled_points=100)
print(sampled.shape)  # torch.Size([100, 3])
```

### Basic Example (CPU)

```python
import torch
import cuda_fps

# Create a random point cloud on CPU (1000 points, 3D)
points = torch.randn(1000, 3)

# Define segment boundaries (single segment in this case)
offsets = torch.tensor([0, 1000], dtype=torch.int32)

# Sample 100 points using FPS (automatically uses CPU implementation)
sampled = cuda_fps.fps(points, offsets, num_sampled_points=100)
print(sampled.shape)  # torch.Size([100, 3])
print(sampled.device)  # cpu
```

### Batched Operations

```python
import torch
import cuda_fps

# Create batched point clouds (32 batches, 1024 points each, 3D)
points = torch.randn(32, 1024, 3, device='cuda')

# Sample 256 points from each batch
sampled = cuda_fps.fps_batched(points, num_sampled_points=256)
print(sampled.shape)  # torch.Size([32, 256, 3])
```

### Multiple Segments

```python
import torch
import cuda_fps

# Create point cloud with multiple segments
# Segment 1: 300 points, Segment 2: 400 points, Segment 3: 300 points
points = torch.randn(1000, 3, device='cuda')
offsets = torch.tensor([0, 300, 700, 1000], dtype=torch.int32, device='cuda')

# Sample 100 points from each segment
sampled = cuda_fps.fps(points, offsets, num_sampled_points=100)
print(sampled.shape)  # torch.Size([300, 3]) - 100 points * 3 segments
```

### Custom Starting Points

```python
import torch
import cuda_fps

# Create point cloud
points = torch.randn(1000, 3, device='cuda')
offsets = torch.tensor([0, 1000], dtype=torch.int32, device='cuda')

# Start sampling from point at index 500
start_idx = torch.tensor([500], dtype=torch.int32, device='cuda')

sampled = cuda_fps.fps(points, offsets, num_sampled_points=100, start_idx=start_idx)
print(sampled[0])  # First sampled point will be points[500]
```

### High-Dimensional Points

```python
import torch
import cuda_fps

# Works with any dimensionality
points_64d = torch.randn(1000, 64, device='cuda')  # 64-dimensional points
offsets = torch.tensor([0, 1000], dtype=torch.int32, device='cuda')

sampled = cuda_fps.fps(points_64d, offsets, num_sampled_points=100)
print(sampled.shape)  # torch.Size([100, 64])
```

## API Reference

### `fps(points, offsets, num_sampled_points, start_idx=None, distance_dim=None)`

Farthest Point Sampling for segmented point clouds.

**Parameters:**
- `points` (torch.Tensor): Input point cloud of shape `(N, D)` where N is total number of points and D is dimensionality. Can be on CPU or CUDA device with dtype `float32`.
- `offsets` (torch.Tensor): Segment boundaries of shape `(K+1,)` where K is the number of segments. Must be on the same device as `points` with dtype `int32`. `offsets[i]` indicates the starting index of segment i.
- `num_sampled_points` (int): Number of points to sample from each segment.
- `start_idx` (torch.Tensor, optional): Starting point index for each segment, shape `(K,)`. Must be on the same device as `points` with dtype `int32`. If `None`, starts with the first point (index 0) in each segment.
- `distance_dim` (int, optional): Number of dimensions to use for distance computation. If `None`, uses all D dimensions. Useful for computing distances on a subset of features.

**Returns:**
- `torch.Tensor`: Sampled points of shape `(num_sampled_points * K, D)` on the same device as input. If a segment has fewer than `num_sampled_points`, the remaining slots are zero-padded.

### `fps_batched(points, num_sampled_points, start_idx=None, distance_dim=None)`

Batched Farthest Point Sampling for batched point clouds.

**Parameters:**
- `points` (torch.Tensor): Batched point cloud of shape `(B, M, D)` where B is batch size, M is points per batch, and D is dimensionality. Can be on CPU or CUDA device with dtype `float32`.
- `num_sampled_points` (int): Number of points to sample from each batch.
- `start_idx` (torch.Tensor, optional): Starting point index for each batch element, shape `(B,)`. Must be on the same device as `points` with dtype `int32`. If `None`, starts with the first point (index 0) in each batch.
- `distance_dim` (int, optional): Number of dimensions to use for distance computation. If `None`, uses all D dimensions.

**Returns:**
- `torch.Tensor`: Sampled points of shape `(B, num_sampled_points, D)` on the same device as input. If M < `num_sampled_points`, the remaining slots are zero-padded.

## Testing

Run the test suite:

```bash
# Install pytest if not already installed
pip install pytest

# Run tests
pytest tests/test_fps.py -v
```

## Performance Notes

- **GPU (CUDA)**: Optimized for CUDA-capable GPUs with compute capability >= 7.0 (Volta or newer). Best performance on moderate to large point clouds (N > 10k).
- **CPU**: Multi-threaded OpenMP implementation. Best for smaller point clouds (N < 10k) or when GPU is not available.
- Input tensors must be contiguous in memory
- For very large point clouds or extreme dimensions, consider processing in batches

The library automatically dispatches to the appropriate implementation based on the tensor device.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
