Following our previous exploration of porting Mediapipe's LLM Inference, this article delves into migrating the Object Detection Demo. By reading this, you'll learn:
First, let's examine the original Object Detection project. You'll notice that the Android portion includes both an android.view.View
implementation and a Jetpack Compose version. Following our previous approach, we'll directly port the Jetpack Compose version to KMP.
Upon closer inspection, you'll find that this app is more complex. In the LLM Inference example, the SDK only provided text inference interfaces, which we could easily wrap in Kotlin to abstract platform-specific details (though we ultimately used an alternative due to some cinterop
support issues). The UI was entirely reusable. However, Object Detection is based on real-time image processing and includes three modes in the demo: real-time camera detection, local video detection, and local image detection. Camera preview typically relies heavily on platform-specific implementations, and video players rarely use custom rendering (i.e., they rely on native platform solutions).
In summary, from the outset we need to set aside the parts that are challenging to implement with Compose Multiplatform (CMP) (such as the CameraView
shown above) and abstract them into separate expect
Composable functions for each platform to implement individually. To simplify the learning process and reduce the demo's scale, we'll focus solely on implementing the CameraView
component, leaving the Gallery (Video + Image) sections for you to explore. In essence, once you grasp how to embed a Camera Preview, you can implement the other two parts similarly, including Compose and UIKit interactions and iOS permission requests.
By cross-referencing the iOS version of the demo, we've broken down the UI layers related to CameraView
into four components, as illustrated above:
- ResultOverlay, which draws bounding boxes and other results, could be implemented in the common layer. However, due to complexities like matching the overlay with the Camera Preview (since the preview size can vary depending on the camera's aspect ratio) and coordinate transformations, we'll delegate this component to platform-specific implementations in this demo.
- Scaffold and Inference Time Label will be implemented in the common layer.

Building on the foundation from the previous section, we'll add a new folder named objectdetection
to the Mediapiper
project. With our prior experience, we realize that much of the UI content isn't overly complex—except for the focal point of this section, the camera preview interface. Therefore, we can proceed by migrating all files except for camera
and gallery
:
The necessary modifications can be categorized into two areas:
- Data and Logic: Drop direct use of the ObjectDetectionResult declarations from the original SDKs and create a common version as a data class (a minimal sketch appears below). This allows both SDKs to convert their results into the common version, facilitating the display of inference time, unified logging, and even paving the way to potentially move ResultOverlay to the common layer in the future.
- UI Components: Replace R references with Res, adopt the unified theme, adjust some import packages, and switch between the two pages with an if...else... at the top level.

At this point, we can run an application that doesn't include camera functionality. The images below demonstrate how this CMP code runs on iOS across the two pages.
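To make the Data and Logic point concrete, here is a minimal sketch of what the common ObjectDetectionResult data class could look like. The type and field names below are illustrative assumptions rather than the demo's exact code:

// Hypothetical shape of the shared result types (names are assumptions, not the demo's exact code)
data class ObjectDetectionResult(
    val detections: List<Detection>,
)

data class Detection(
    val categories: List<Category>,   // label candidates for one detected object
    val boundingBox: Rect,            // bounding box in input-image coordinates
)

data class Category(
    val label: String,
    val score: Float,
)

data class Rect(
    val left: Float,
    val top: Float,
    val right: Float,
    val bottom: Float,
)

Both the Android and iOS detectors can then map their SDK-specific results onto types like these before invoking the shared callbacks.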
As previously analyzed, we need to extract the CameraView
component for native implementation. Therefore, in the common CameraView
, we'll use two expect
Composable functions: CameraPermissionControl
and CameraPreview
:
@Composable
fun CameraView(
    threshold: Float,
    maxResults: Int,
    delegate: Int,
    mlModel: Int,
    setInferenceTime: (newInferenceTime: Int) -> Unit,
) {
    CameraPermissionControl {
        CameraPreview(
            threshold,
            maxResults,
            delegate,
            mlModel,
            setInferenceTime,
            onDetectionResultUpdate = { detectionResults ->
                // ...
            }
        )
    }
}

@Composable
expect fun CameraPermissionControl(PermissionGrantedContent: @Composable @UiComposable () -> Unit)

@Composable
expect fun CameraPreview(
    threshold: Float,
    maxResults: Int,
    delegate: Int,
    mlModel: Int,
    setInferenceTime: (newInferenceTime: Int) -> Unit,
    onDetectionResultUpdate: (result: ObjectDetectionResult) -> Unit
)
The Android side is straightforward—we can directly copy the original Jetpack Compose code:
// Android implementation
@OptIn(ExperimentalPermissionsApi::class)
@Composable
actual fun CameraPermissionControl(
    PermissionGrantedContent: @Composable @UiComposable () -> Unit
) {
    val storagePermissionState: PermissionState =
        rememberPermissionState(Manifest.permission.CAMERA)

    LaunchedEffect(key1 = Unit) {
        if (!storagePermissionState.hasPermission) {
            storagePermissionState.launchPermissionRequest()
        }
    }

    if (!storagePermissionState.hasPermission) {
        Text(text = "No Storage Permission!")
    } else {
        PermissionGrantedContent()
    }
}

@Composable
actual fun CameraPreview(...) {
    // Define properties
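    // (Elided in the article: remembered state used below, such as `active`, `results`,
    // `frameWidth`, and `frameHeight`, plus `cameraProviderFuture` and `lifecycleOwner`
    // obtained from the local context and lifecycle.)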
    DisposableEffect(Unit) {
        onDispose {
            active = false
            cameraProviderFuture.get().unbindAll()
        }
    }

    // Describe the UI of this camera view
    BoxWithConstraints {
        val cameraPreviewSize = getFittedBoxSize(
            containerSize = Size(
                width = this.maxWidth.value,
                height = this.maxHeight.value,
            ),
            boxSize = Size(
                width = frameWidth.toFloat(),
                height = frameHeight.toFloat()
            )
        )
        Box(
            Modifier
                .width(cameraPreviewSize.width.dp)
                .height(cameraPreviewSize.height.dp)
        ) {
            // Use AndroidView to integrate CameraX, as there's no prebuilt composable in Jetpack Compose
            AndroidView(
                factory = { ctx ->
                    val previewView = PreviewView(ctx)
                    val executor = ContextCompat.getMainExecutor(ctx)
                    cameraProviderFuture.addListener({
                        val cameraProvider = cameraProviderFuture.get()
                        val preview = Preview.Builder().build().also {
                            it.setSurfaceProvider(previewView.surfaceProvider)
                        }
                        val cameraSelector = CameraSelector.Builder()
                            .requireLensFacing(CameraSelector.LENS_FACING_BACK)
                            .build()
                        // Instantiate an image analyzer for frame transformations before object detection
                        val imageAnalyzer = ImageAnalysis.Builder()
                            .setTargetAspectRatio(AspectRatio.RATIO_4_3)
                            .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
                            .setOutputImageFormat(ImageAnalysis.OUTPUT_IMAGE_FORMAT_RGBA_8888)
                            .build()
                        // Execute object detection in a new thread
                        val backgroundExecutor = Executors.newSingleThreadExecutor()
                        backgroundExecutor.execute {
                            // Use ObjectDetectorHelper to abstract Mediapipe specifics
                            val objectDetectorHelper = AndroidObjectDetector(
                                context = ctx,
                                threshold = threshold,
                                currentDelegate = delegate,
                                currentModel = mlModel,
                                maxResults = maxResults,
                                objectDetectorListener = ObjectDetectorListener(
                                    onErrorCallback = { _, _ -> },
                                    onResultsCallback = {
                                        // Set frame dimensions upon receiving results
                                        frameHeight = it.inputImageHeight
                                        frameWidth = it.inputImageWidth
                                        // Update results and inference time if the view is active
                                        if (active) {
                                            results = it.results.first()
                                            setInferenceTime(it.inferenceTime.toInt())
                                        }
                                    }
                                ),
                                runningMode = RunningMode.LIVE_STREAM
                            )
                            // Set the analyzer and start detecting objects from the live stream
                            imageAnalyzer.setAnalyzer(
                                backgroundExecutor,
                                objectDetectorHelper::detectLivestreamFrame
                            )
                        }
                        // Unbind any currently open camera and bind our own
                        cameraProvider.unbindAll()
                        cameraProvider.bindToLifecycle(
                            lifecycleOwner,
                            cameraSelector,
                            imageAnalyzer,
                            preview
                        )
                    }, executor)
                    // Return the preview view from the AndroidView factory
                    previewView
                },
                modifier = Modifier.fillMaxSize()
            )
            // Display the results overlay if there are current results
            results?.let {
                ResultsOverlay(
                    results = it,
                    frameWidth = frameWidth,
                    frameHeight = frameHeight
                )
            }
        }
    }
}
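The onResultsCallback above still hands MediaPipe's native result to the Android ResultsOverlay, while the shared onDetectionResultUpdate callback expects the common ObjectDetectionResult. The demo's own conversion isn't shown in the article, so here is a hedged sketch of what such a mapping could look like, assuming the hypothetical common types sketched earlier and the MediaPipe Tasks Java accessors:

// Hypothetical mapping from MediaPipe's Android result into the common data class sketched
// earlier. The accessors detections(), boundingBox(), categories(), categoryName() and
// score() come from the MediaPipe Tasks Java API; adjust to the SDK version actually in use.
fun com.google.mediapipe.tasks.vision.objectdetector.ObjectDetectorResult.toCommon(): ObjectDetectionResult =
    ObjectDetectionResult(
        detections = detections().map { detection ->
            Detection(
                categories = detection.categories().map { category ->
                    Category(label = category.categoryName(), score = category.score())
                },
                boundingBox = detection.boundingBox().let { box ->
                    Rect(left = box.left, top = box.top, right = box.right, bottom = box.bottom)
                }
            )
        }
    )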
The iOS side requires a bit more effort. For camera permission control, we'll directly call iOS's platform.AVFoundation
APIs within this Composable function, asynchronously request permission, and display the appropriate content for each of the three outcomes: a loading message while the request is pending, an error message if access is denied, and the camera preview once it's granted.
import platform.AVFoundation.*
import kotlin.coroutines.resume
import kotlin.coroutines.suspendCoroutine

@Composable
actual fun CameraPermissionControl(PermissionGrantedContent: @Composable @UiComposable () -> Unit) {
    var hasCameraPermission by remember { mutableStateOf<Boolean?>(null) }

    LaunchedEffect(Unit) {
        hasCameraPermission = requestCameraAccess()
    }

    when (hasCameraPermission) {
        true -> {
            PermissionGrantedContent()
        }
        false -> {
            Text("Camera permission denied. Please grant access from settings.")
        }
        null -> {
            Text("Requesting camera permission...")
        }
    }
}

private suspend fun requestCameraAccess(): Boolean = suspendCoroutine { continuation ->
    val authorizationStatus = AVCaptureDevice.authorizationStatusForMediaType(AVMediaTypeVideo)
    when (authorizationStatus) {
        AVAuthorizationStatusNotDetermined -> {
            AVCaptureDevice.requestAccessForMediaType(AVMediaTypeVideo) { granted ->
                continuation.resume(granted)
            }
        }
        AVAuthorizationStatusRestricted, AVAuthorizationStatusDenied -> {
            continuation.resume(false)
        }
        AVAuthorizationStatusAuthorized -> {
            continuation.resume(true)
        }
        else -> {
            continuation.resume(false)
        }
    }
}
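One practical note not shown in the code: the iOS app's Info.plist must include an NSCameraUsageDescription entry, since the system terminates any app that accesses the camera without it.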
Now, let's tackle the core camera preview functionality. According to the CMP documentation, we can embed a UIKit view within a Composable function using UIKitView
:
// Example 1
UIKitView(
    factory = { MKMapView() },
    modifier = Modifier.size(300.dp),
)

// Example 2
@OptIn(ExperimentalForeignApi::class)
@Composable
fun UseUITextField(modifier: Modifier = Modifier) {
    var message by remember { mutableStateOf("Hello, World!") }
    UIKitView(
        factory = {
            val textField = object : UITextField(CGRectMake(0.0, 0.0, 0.0, 0.0)) {
                @ObjCAction
                fun editingChanged() {
                    message = text ?: ""
                }
            }
            textField.addTarget(
                target = textField,
                action = NSSelectorFromString(textField::editingChanged.name),
                forControlEvents = UIControlEventEditingChanged
            )
            textField
        },
        modifier = modifier.fillMaxWidth().height(30.dp),
        update = { textField ->
            textField.text = message
        }
    )
}
These examples use standard iOS components rather than custom ones. The corresponding headers have already been converted to Kotlin bindings by JetBrains, such as platform.UIKit.UITextField
, which can be directly imported into the KMP project's iOS target.
In our case, we want to reuse a custom CameraPreview
view with recognition capabilities. However, the app.framework
produced by KMP is a shared layer upon which the iOS native code depends. Due to the dependency hierarchy, we can't directly call CameraPreview
defined in the iOS app's source code. There are generally two solutions:
- Build the camera preview into a separate cameraview.framework, which the KMP app module can depend on.
- When initializing app.framework in the iOS app, pass a lambda that initializes and returns a UIView to app.

We'll opt for the second solution, defining IOSCameraPreviewCreator as the protocol for interaction between the two sides.
// Definition
typealias IOSCameraPreviewCreator = (
    threshold: Float,
    maxResults: Int,
    delegate: Int,
    mlModel: Int,
    setInferenceTime: (newInferenceTime: Int) -> Unit,
    callback: IOSCameraPreviewCallback
) -> UIView

typealias IOSCameraPreviewCallback = (result: ObjectDetectionResult) -> Unit

// Inject the implementation from the iOS side during startup and add it to Koin's definitions
fun onStartup(iosCameraPreviewCreator: IOSCameraPreviewCreator) {
    Startup.run { koinApp ->
        koinApp.apply {
            modules(module {
                single { LLMOperatorFactory() }
                single<IOSCameraPreviewCreator> { iosCameraPreviewCreator }
            })
        }
    }
}
// In the implementation of CameraPreview,
// we inject and invoke this function to obtain a UIView instance
import androidx.compose.ui.viewinterop.UIKitView
import platform.UIKit.UIView

@Composable
actual fun CameraPreview(
    threshold: Float,
    maxResults: Int,
    delegate: Int,
    mlModel: Int,
    setInferenceTime: (newInferenceTime: Int) -> Unit,
    onDetectionResultUpdate: (result: ObjectDetectionResult) -> Unit,
) {
    val iOSCameraPreviewCreator = koinInject<IOSCameraPreviewCreator>()

    // Similar to how Android integrates the native Camera View
    UIKitView(
        factory = {
            val iosCameraPreview: UIView = iOSCameraPreviewCreator(
                threshold,
                maxResults,
                delegate,
                mlModel,
                setInferenceTime,
                onDetectionResultUpdate
            )
            iosCameraPreview
        },
        modifier = Modifier.fillMaxSize(),
        update = { _ -> }
    )
}
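One design note on this implementation: update is left as a no-op, so the native view is created once with the initial parameter values, and later changes to threshold or maxResults won't reach the existing UIView. This matches the simplified parameter-change handling described below.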
The above code leverages Koin for dependency injection, simplifying the interaction process. Now, let's follow the injection of startup parameters to check the iOS side.
MainKt.onStartup(iosCameraPreviewCreator: { threshold, maxResults, delegate, mlModel, onInferenceTimeUpdate, resultCallback in
    return IOSCameraView(
        frame: CGRectMake(0, 0, 0, 0),
        modelName: Int(truncating: mlModel) == 0 ? "EfficientDet-Lite0" : "EfficientDet-Lite2",
        maxResults: Int(truncating: maxResults),
        scoreThreshold: Float(truncating: threshold),
        onInferenceTimeUpdate: onInferenceTimeUpdate,
        resultCallback: resultCallback
    )
})
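A small interop detail: the numeric parameters of the Kotlin lambda arrive in Swift as boxed NSNumber-like values (KotlinInt/KotlinFloat), which is why the code above unwraps them with Int(truncating:) and Float(truncating:).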
The IOSCameraView
here is essentially the CameraViewController
from the original iOS demo. We've modified some initialization and lifecycle content and simplified parameter change listeners to highlight the core migration aspects:
Lifecycle Handling: ViewController
uses methods like viewDidLoad
, while UIView
uses didMoveToWindow
to handle logic when the view is added or removed. ViewController
initializes through its lifecycle, whereas UIView
provides custom initializers to pass in models and detection parameters.
Subview Setup: ViewController
uses @IBOutlet
and Interface Builder, while UIView
directly creates and adds subviews via a setupView
method, manually setting constraints with AutoLayout and handling tap events.
Callbacks and Delegates: ViewController
uses delegates, while UIView
adds closure callbacks like onInferenceTimeUpdate
and resultCallback
. These are passed during initialization and set up for type conversion, facilitating callbacks to the KMP layer.
We also retain OverlayView
, CameraFeedService
, ObjectDetectorService
, and parts of DefaultConstants
, without modifying their code. Notably, ObjectDetectorService
encapsulates the Object Detection SDK. If you examine its API calls, you'll find it's closely coupled with iOS's Camera APIs (like CMSampleBuffer
), indicating the difficulty of abstracting it into the common layer.
With this, we can run the camera preview with Object Detection on iOS.
The above GIF showcases the performance of EfficientDet-Lite0 running in CPU mode on an iPhone 13 mini. In official tests using the Pixel 6's CPU/GPU, switching to GPU execution can slightly improve performance. It's evident that the real-time performance is sufficient for production environments, and the accuracy is acceptable.
The demo comes with two optional models: EfficientDet-Lite0 and EfficientDet-Lite2. Both models are trained on a dataset containing 1.5 million instances and 80 object labels.
Earlier, we noted the possibility of implementing ResultOverlay in the common layer, and the iOS side already sets up result callbacks to the KMP layer. However, on iOS, we still used native views for the implementation due to the cost of migration. In real-world scenarios, we can further consider:
- Implementing ResultOverlay at the Compose layer.
- Scenarios where a shared overlay, such as a StickerOverlay, would be highly valuable. In that case, regardless of the camera preview size, the cost of adaptation becomes acceptable. Moreover, there's potential for optimizing the calculations within StickerOverlay, such as using sampled calculations and interpolated animations.