Augmented Reality: ARKit and SceneKit

One of the most exciting new frameworks announced at WWDC 2017 was ARKit. Virtual reality has been fairly hot for the last few years, but in my opinion augmented reality has a better future. Both are garnering a lot of attention, so I am writing this post about what augmented reality is and the easiest way to integrate it natively into your applications.

What is Augmented Reality?

Virtual reality is the creation of a totally immersive experience. Think of the holodeck from Star Trek. The characters enter a room and the computer creates a completely immersive simulation. The characters can explore cities and planets or go back in time. Like the TARDIS, the holodeck seems bigger on the inside. It theoretically is indistinguishable from the real world.

Stepping back in time virtually.

Augmented reality is different. Augmented reality doesn’t seek to immerse you in another world, but rather exists to add to the world around you. The best known example of this is Pokemon Go. Pokemon Go overlays Pokemon characters within the real world to emulate the experience of being a Pokemon trainer and going out to capture Pokemon. This allows players to experience what it would be like to live within the game.

This distinction is one reason I feel AR has a brighter future than VR. Part of what drew people to Harry Potter is the idea that magic is a hidden aspect of the real world: that somewhere in London you can find a hidden neighborhood where you can buy magic wands, or that you could receive an acceptance letter to Hogwarts. It’s the difference between total escapism and the idea that there is magic hidden within the real world.

Augmented reality was possible on the iPhone before ARKit was introduced. I found a book on how to implement AR going back to iOS 5. Pokemon Go predates ARKit. What ARKit does is wrap a lot of complexity within a few easy abstracted methods and classes.

Augmented reality requires three different functionalities:

  • Tracking
  • Registration
  • Visualization

Tracking is the ability of the application to detect surfaces and know what objects exist in space. In Pokemon Go, tracking is used to ensure a Pokemon appears on a road or sidewalk as opposed to inside of a tree.

Registration catalogs and keeps track of areas and objects of interest. In the Pokemon Go example, registration is used to keep track of which Pokemon exist in the area and where those Pokemon are supposed to appear. If you move your phone away from the Pokemon and then come back to it, the Pokemon should still be there, unless someone else captured it.

Visualization encompasses everything that you see in the scene. This is where the Pokemon’s mesh is loaded, shaded, and rendered. It’s the part of the process people are most aware of because it’s what they see directly.

What Does ARKit Do For You?

ARKit is not a very large framework. It has only a handful of classes, and those classes do not have many methods. The reason for this is that ARKit handles only two of the three requirements for augmented reality: tracking and registration.

Tracking and registration are pretty straightforward conceptually and don’t require a lot of special modification. Nothing about these tasks is going to change fundamentally between applications. An AR measuring tape functions the same way, with respect to these tasks, as a game or a portal to another dimension does. That makes them prime candidates for a Cocoa framework: ARKit wraps a lot of boilerplate code in some friendly and approachable Swift.

This small framework is deceptive in how much it actually does. The amount of machine vision work being done under the hood is astonishing. If you were responsible for setting up the real-time video analysis and image filtering for object detection on your own, it could take a research team years.

What ARKit Doesn’t Do For You

ARKit does not handle the most visible step in the process, which is visualization. Just knowing the ten or so classes in ARKit doesn’t really get you a working augmented reality application. This is similar to learning Core ML. Both of these frameworks wrap a bunch of boilerplate code around a complex but relatively similar pipeline so that you don’t have to worry about the ins and outs of how the camera captures frames. You just have to worry about telling it what to do once a frame is captured.

ARKit in a Nutshell

I mentioned earlier in this post that ARKit isn’t a very large framework. There are only three broad categories of object you need to be familiar with, and each category contains about three classes, so roughly nine classes in total.

ARKit Foundational Structures

The first category of object is what I call ARKit’s foundational structures. These are objects that manage ARKit in your application and talk to other things on your behalf.

These classes are:

  • ARSession
  • ARConfiguration
  • ARSCNView/ARSKView

If you’ve done any work with AVFoundation, you should be familiar with sessions. ARSession is a shared object that manages the camera and motion processing needed for augmented reality experiences. You can’t have an AR application without an ARSession. It’s the project manager of the AR functionality.

ARConfiguration is an abstract base class with a few different subclasses, depending upon what you need from your application. ARWorldTrackingConfiguration is the most comprehensive and highest quality tracking configuration. It allows plane detection and hit testing. Not all devices support world tracking; for those devices you need to fall back to AROrientationTrackingConfiguration, a more limited configuration. Finally, if you’re only working with facial data, you would use ARFaceTrackingConfiguration.
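Here is a minimal sketch of choosing a configuration at run time based on device support. The plane detection option and the fallback logic are my own additions rather than part of the template, and sceneView is assumed to be the ARSCNView discussed next:

let configuration: ARConfiguration
if ARWorldTrackingConfiguration.isSupported {
	// World tracking needs newer hardware; enable horizontal plane detection.
	let worldTracking = ARWorldTrackingConfiguration()
	worldTracking.planeDetection = .horizontal
	configuration = worldTracking
} else {
	// Fall back to rotation-only tracking on older devices.
	configuration = AROrientationTrackingConfiguration()
}
sceneView.session.run(configuration)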

The final set of classes are the views. ARSCNView and ARSKView inherit from the views of their respective 3D and 2D frameworks (SCNView and SKView) and add support for ARKit events like hit testing. You will need to declare an ARSCNView property at the top of your class:

var sceneView: ARSCNView!

If you create your application from the Augmented Reality template it will automatically configure the session for you:

override func viewWillAppear(_ animated: Bool) {
	super.viewWillAppear(animated)
		
	// Create a session configuration
	let configuration = ARWorldTrackingConfiguration()

	// Run the view's session
	sceneView.session.run(configuration)
}

The session has to be both run and paused. Pause the session when the view disappears:

override func viewWillDisappear(_ animated: Bool) {
	super.viewWillDisappear(animated)
		
	// Pause the view's session
	sceneView.session.pause()
}

Now that a session is in place, you need to add some ARKit objects.

ARKit Real World Representations

In order to place virtual objects in the real world, you need a way to understand where in three dimensional space the object should be rendered. Our eyes and brains recognize objects and can remember where books and pictures are when we leave the room. The program has to be able to do the same thing. There are three main objects ARKit uses to analyze the world and persist objects:

  • ARAnchor
  • ARPlaneAnchor
  • ARHitTestResult

ARAnchor instances represent real points in space. They are used for placing objects in the augmented reality scene. If you wanted to place a virtual pug in the middle of a rug, you would create an anchor in space to tell the renderer where to draw the pug. ARPlaneAnchor is similar but deals exclusively with flat surfaces.
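As a rough sketch of how anchors work (this isn’t part of the template project), you could add an anchor half a meter in front of the camera and let the ARSCNViewDelegate attach geometry when the session reports the new anchor. This assumes your view controller is the sceneView’s delegate, as the template sets up:

// Somewhere in your view controller: place an anchor 0.5 meters in front of the camera.
if let currentFrame = sceneView.session.currentFrame {
	var translation = matrix_identity_float4x4
	translation.columns.3.z = -0.5
	let transform = simd_mul(currentFrame.camera.transform, translation)
	sceneView.session.add(anchor: ARAnchor(transform: transform))
}

// ARSCNViewDelegate callback: attach geometry to any anchor the session adds.
func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
	node.addChildNode(SCNNode(geometry: SCNSphere(radius: 0.05)))
}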

ARHitTestResult instances contain information about a real-world surface found by examining a point in the device’s camera view of an AR session. The results come in several types, including:

  • Feature Points
  • Estimated Horizontal Plane
  • Existing Plane

If you want more information about these types, see Apple’s documentation for ARHitTestResult.ResultType.
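For example, a tap handler might hit test the touch location against the scene. The gesture wiring and the placeObject(at:) helper here are hypothetical stand-ins for whatever your app does with the result:

@objc func handleTap(_ gesture: UITapGestureRecognizer) {
	let location = gesture.location(in: sceneView)
	// Prefer an existing detected plane, but fall back to feature points.
	let results = sceneView.hitTest(location, types: [.existingPlaneUsingExtent, .featurePoint])
	if let result = results.first {
		// The last column of worldTransform holds the hit position in world space.
		let hit = result.worldTransform.columns.3
		placeObject(at: SCNVector3(hit.x, hit.y, hit.z))
	}
}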

Camera and AVFoundation Wrappers

One difficult thing in iOS development is pulling frames from the camera and doing something with them. If you’ve ever mucked around with AVFoundation, you know what a pain in the neck it is: a lot of the classes are incredibly long and have very similar names. ARKit abstracts these tasks away so you don’t have to worry about them. These classes are:

  • ARFrame
  • ARCamera
  • ARLightEstimate

An ARFrame is a video image plus position-tracking information captured as part of an AR session. Properties you can work with on this object include the depth data and the timestamp. The captured image is a CVPixelBuffer, which lets you process it as you would a frame of video (because that’s exactly what it is).

The ARCamera contains information about the camera position and imaging characteristics for a captured video frame in an AR session. This includes the tracking state and the orientation of the camera. You also have access to the camera’s transform and projection matrices, which allow you to perform matrix transforms between coordinate spaces.

Finally, the ARLightEstimate provides estimated scene-lighting information for a captured video frame in an AR session, primarily the ambient intensity and color temperature. You can feed these into your lighting setup or shaders to get virtual objects to match the ambient lighting in the scene.
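Here is a small sketch of pulling information off the current frame. Dividing by 1000 is an assumption on my part about how you might apply the estimate, since ARKit treats an ambient intensity of roughly 1000 lumens as neutral lighting:

if let frame = sceneView.session.currentFrame {
	// The captured image is a CVPixelBuffer; process it like any frame of video.
	let pixelBuffer = frame.capturedImage
	print("Frame at \(frame.timestamp): \(CVPixelBufferGetWidth(pixelBuffer)) pixels wide")

	// The camera carries the tracking state and a world-space transform.
	print("Tracking state: \(frame.camera.trackingState)")

	// Feed the light estimate into the scene so virtual objects match the ambient light.
	if let lightEstimate = frame.lightEstimate {
		sceneView.scene.lightingEnvironment.intensity = lightEstimate.ambientIntensity / 1000.0
	}
}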

By this point, you’ve basically covered everything you need to understand about ARKit, but you still don’t see anything on the screen. From this point, you will need to work with a renderer.

ARKit with SceneKit as the Renderer

There are several options for both native and non-native renderers for ARKit. Both Unity and Unreal plug into ARKit, allowing you to harness the power of those game engines to do some truly astonishing things. Alternatively, you can use the native SpriteKit and SceneKit frameworks to do some cool stuff as well. SpriteKit as a renderer is somewhat limited, so for the rest of this post I will be creating the simplest SceneKit integration with ARKit that I can: a box anchored in space.

I used Mohammad Azam’s Udemy ARKit course as a basis for this sample project. I highly recommend the rest of his course if you want to get more in depth with ARKit.

Augmented Reality has its own template. It isn’t under Games.

The following code is the only code I had to add to this project; everything else was already set up by the template. The template provides a scene already, but I want to render a box rather than the built-in ship asset:

// Create a new scene
let scene = SCNScene()
		
let box = SCNBox(width: 0.2, height: 0.2, length: 0.2, chamferRadius: 0)

SceneKit has a large library of built in primitive objects. It has a surprising amount of built in functionality that allows you to make a lot of progress fairly quickly.
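For a sense of what’s available, here are a few of the other built-in primitives (the dimensions are arbitrary):

let sphere = SCNSphere(radius: 0.1)
let cylinder = SCNCylinder(radius: 0.05, height: 0.2)
let torus = SCNTorus(ringRadius: 0.1, pipeRadius: 0.02)
let text = SCNText(string: "Hello ARKit", extrusionDepth: 1)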

The box requires some surface customization. Its surface texture and color are determined by its material property. A material’s contents can be an image mapped onto the surface or a simple color, and a material can also carry properties like shininess. There is a lot of customization you can do to an object’s material, but for our purposes we’re just making it purple for now:

let material = SCNMaterial()
material.diffuse.contents = UIColor.purple

We have created a box and a material, but these two are not yet associated with one another. Both become properties of a SceneKit object called an SCNNode. SceneKit, like SpriteKit, is built entirely around a graph of nodes: if you want something to appear on the screen, you set properties on a node and add that node to the scene graph:

let node = SCNNode()
node.geometry = box
node.geometry?.materials = [material]
node.position = SCNVector3(0, 0.1, -0.5)		
scene.rootNode.addChildNode(node)

Finally, you need to commit the scene to the view:

// Set the scene to the view
sceneView.scene = scene

This isn’t much, but it’s pretty impressive for fewer than twenty lines of code. Having once spent a hundred lines of code trying to render unlit triangles to the screen, I find this to be magical voodoo.

Current ARKit Limitations

ARKit impressed me with how much it simplifies the boilerplate operations around capturing and analyzing frames. Like most Apple frameworks in their initial release, though, ARKit has a few weak points that should improve over time. The main ones are:

  • Bad Lighting
  • Excessive Motion
  • Limited Features

Without good lighting, ARKit has a hard time picking out features in the scene. Similarly, it’s difficult for it to keep track of points in space if you’re waving the camera around a lot; the machine vision algorithms it relies on don’t cope well with motion blur. Additionally, as your ARKit application runs, it keeps analyzing the scene, and the detected points may change over time.
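You can watch for these conditions through the camera’s tracking state. Here is a hedged sketch (statusLabel is a hypothetical UILabel, not part of the template) using the ARSessionObserver callback that ARSCNViewDelegate inherits:

// Surface tracking problems to the user instead of letting the scene silently drift.
func session(_ session: ARSession, cameraDidChangeTrackingState camera: ARCamera) {
	switch camera.trackingState {
	case .normal:
		statusLabel.text = "Tracking normally"
	case .notAvailable:
		statusLabel.text = "Tracking unavailable"
	case .limited(.excessiveMotion):
		statusLabel.text = "Slow down: too much camera movement"
	case .limited(.insufficientFeatures):
		statusLabel.text = "Not enough detail: try better lighting"
	case .limited:
		statusLabel.text = "Tracking limited"
	}
}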

Currently, the only surfaces ARKit detects well are flat, horizontal planes. The framework isn’t equipped to recognize three-dimensional objects. You can’t create a virtual cat that will recognize a chair in a scene and hop onto it; it would recognize many flat surfaces and might wind up on a table or on top of the TV. This may be closer to how cats behave in real life, but you probably want a little more control over your virtual cats.

Final Thoughts

You can access the sample code for this blog post here. I intend to add to this repository over time.

If you’re serious about working with augmented reality, I strongly suggest learning a rendering engine like SceneKit or Unity.

For me, the limiting factor in ARKit is the same one I run into with SpriteKit and SceneKit. Programmatically you can do a lot of cool stuff, but unless you have access to diverse and well-made graphical assets, what you can do on your own is pretty limited. SceneKit compensates with its primitive library, but if I want to create a game about pugs, I need to either find an artist to create those assets for me or learn how to create them myself.

A lot of people are interested in AR for real-world applications, like medical imaging. Personally, I am interested in using it to create immersive real-world experiences that you can’t get any other way. I feel it could be used as a tool for engagement: rather than staring at our phones while blocking out the world around us, we can use them as a way to more deeply explore the world outside our devices.