Deploying on the Edge: Bringing “AI” out of the clouds and back down to Earth
The driving pressure behind the advancement of modern Machine Learning has been the idea of “scale”: that by scaling our models up in size and training them on ever more data, they will start to perform miracles. People have become so captivated, so starry-eyed, by this seemingly modern thought that they seem to be overlooking a basic fact.
We’ve known for decades that Deep Neural Networks improve in performance with scale and data. Generative models especially, like LLM chatbots and image generators, will obviously be able to produce outputs of greater variety and complexity as we increase the amount of data they are trained on. And if you are going to train on more data (and you want to capture all that new variety and complexity), you’ll need a larger model. This makes headline stories about the slowdown in LLM improvement on “benchmark” tests, and anecdotal reports that newer models seem to offer little advancement, somewhat inevitable.
Just a few more data centers and we’ll be set….
Nonetheless, the pioneering companies in this space insist that the future still belongs to the gigantic, monolithic models they have spent hundreds of millions of dollars developing. While there will certainly be room for such a design paradigm, a real concern is how this idea of training large monolithic models has bled into other domains and started to establish itself as “the way to do things”.
It’s just the way to do things…
For example, in the field of ecological monitoring the SpeciesNet project by Google and MegaDetector by Microsoft seek to detect and identify any and all species. They are still a while away from this ultimate goal and, while there is some utility in such a model, I can’t help but think they are pushing the development of this technology down a narrow road.
An obvious issue with such monolithic models is that while they’re usable in a wide variety of applications, they’re overkill for every application. This can be seen clearly in ecological monitoring, where we wish to use species detectors and identifiers to automate the processing of collected data. An individual study usually focuses on a particular region or a particular family of species. You don’t expect, or care, to be able to detect every species that exists anywhere on the planet. What this means is that a very large portion of the model’s capacity is wasted on the potential to detect species that will never be present, or that we don’t even care about.
Definitely not a Red Deer…
What this "unutilized capacity” translates to, in a practical sense, is a whole lot of calculations being performed by the model to try and detect a species that will never be present. As discussed in my earlier article “Why AI”, our models don’t output a single definitive label for an image. They assign probabilities to each class they’ve been trained on, reflecting how likely it is that the given input contains that category. For our species detection model, for every input image we give it, it will provide proposals for where it thinks an animal is. It also provides us a likelihood for every species it was trained on that THAT species is the animal detected. For a theoretical “Australian Bird study” our monolithic species detector would always provide an estimate on how likely the animal in our image is a Red Deer. What is more of an issue is that for our model to KNOW if it COULD be a Red Deer, it must dedicate a part of its computation to working this out!
Accepting this unused, unnecessary capacity might be worth it, rather than trying to train a smaller dedicated model for every application. This is especially true if inference (model deployment) is performed on ever-flowing cloud compute infrastructure. But for some applications it restricts us to using ONLY that infrastructure, and for alternative methods of deploying our trained models it creates obvious, unavoidable limitations.
The edge of data collection
So-called “Edge Compute” refers to processing data at, or close to, the point of collection. This could be done in “real time”, where data is processed as it is collected, or periodically, with data only being processed once a large batch has accumulated or at set times of day (a minimal sketch of both modes follows the list below). There are many obvious benefits to using this “edge compute” method of processing data:
Results from data processing can be collected faster.
No need for a high-bandwidth internet connection to transmit data to the cloud.
Raw data isn’t transmitted over potentially insecure networks.
No need to pay ongoing cloud processing costs.
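As a rough illustration of the two processing modes mentioned above, here is a minimal Python sketch; `capture_image`, `run_detector` and `send_results` are hypothetical placeholders for whatever camera, model runtime and uplink a real deployment would use.

```python
# A minimal sketch of the two edge-processing modes: per-image "real time"
# processing vs. buffering on-device and processing in periodic batches.
import time

BATCH_INTERVAL_S = 6 * 60 * 60  # e.g. process a batch every 6 hours

def real_time_loop(capture_image, run_detector, send_results):
    """Process every image as soon as it is captured."""
    while True:
        image = capture_image()
        send_results(run_detector(image))

def periodic_loop(capture_image, run_detector, send_results):
    """Buffer images on-device and process them at set intervals."""
    buffer = []
    last_run = time.monotonic()
    while True:
        buffer.append(capture_image())
        if time.monotonic() - last_run >= BATCH_INTERVAL_S:
            send_results([run_detector(img) for img in buffer])
            buffer.clear()
            last_run = time.monotonic()
```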
Our modern “AI” algorithms that use Deep Neural Networks (DNNs) pose a challenge to those who wish to deploy their systems “on the edge”. Modern DNNs are computationally… “heavy”; that is, they require a whole lot of calculations to process a single input. One property of DNNs that we can exploit to our advantage is their “parallelizability”: lots of the calculations a DNN performs are independent of each other, so we can carry out one calculation at the same time as many others without waiting on their results. To exploit this structure, we often use Graphics Processing Units (GPUs), first designed to render 3D graphics onto a 2D screen (a process that also requires a lot of “parallel calculations”).
An Nvidia GeForce GPU. The colorful lights make it even faster…
Typically these GPUs are used both to “train” our DNNs on data and to “run” them once trained. One important thing to note is that we don’t NEED to use a GPU; our trusty CPU is more than capable of running our DNN. However, for your average DNN, your average GPU can run it at least 10-100x faster than your average CPU, thanks to the fact that the GPU can take advantage of the DNN’s parallelizability.
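If you want to see this for yourself, a rough PyTorch sketch like the one below compares CPU and GPU inference time on a small off-the-shelf network; the actual speed-up depends entirely on the model and the hardware, so treat the 10-100x figure as a ballpark rather than a guarantee.

```python
# A rough sketch comparing CPU and GPU inference speed for a small
# convolutional network using PyTorch. Numbers will vary wildly with hardware.
import time
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
batch = torch.randn(8, 3, 224, 224)

def time_inference(device, n_runs=20):
    m, x = model.to(device), batch.to(device)
    with torch.no_grad():
        m(x)                              # warm-up run
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            m(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

print(f"CPU: {time_inference('cpu'):.4f} s/batch")
if torch.cuda.is_available():
    print(f"GPU: {time_inference('cuda'):.4f} s/batch")
```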
You’ve likely heard of GPUs in the context of modern “AI”. Nvidia has become the world’s most valuable company by producing the most advanced GPUs for DNN training and deployment. You’d therefore be excused for thinking GPUs were the only option. But the last decade has seen the rise of another technology: a class of devices designed to run DNNs as fast and efficiently as possible.
Neural Processing hardware
The “Neural Processing Unit” (NPU) refers to a device designed to do one thing and one thing only: run a trained DNN. Now, the actual way these devices do this varies WILDLY from company to company. They even go by all sorts of names: “Brain Processing Unit”, “Neuromorphic Processing Unit”, “Neural Accelerator”, “Deep Learning Processing Unit”, “AI Accelerator”. And while there is some difference in the type of Neural Network they can run, the general intent behind them is the same: “Once you have trained a DNN, use something simple and efficient to run it”.
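In practice, most NPU toolchains start from a trained network exported to a portable format such as ONNX, which the vendor’s own compiler then converts into an efficient on-device representation. A minimal sketch of that first step, using a stand-in torchvision model rather than a real trained detector:

```python
# A minimal sketch of the usual first step before targeting an NPU: export the
# trained network to a portable format (here ONNX). The vendor-specific
# compiler (e.g. Hailo's) then quantises and maps the graph onto the chip.
# The model here is a stand-in; swap in your own trained detector.
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # one example input, fixed shape

torch.onnx.export(
    model,
    dummy_input,
    "detector.onnx",
    input_names=["image"],
    output_names=["scores"],
    opset_version=17,
)
```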
This technology has opened the door to deploying powerful DNNs out into the world. Instead of sending our collection of images to the cloud for processing, we can now process them at the point of collection. They also allow for something more: a shift in the way we use and think about DNNs.
The Hailo-8 AI accelerator module
A typical use case for a DNN-based animal species identifier is as a way to filter and process images collected by traditional methods such as trail cameras. With this method the DNN is used to identify the species captured by the trail camera, as well as to filter out the many, many images that don’t contain anything. A trail camera itself uses a “Passive Infrared Sensor” (PIR) to detect moving warm-bodied objects (many camera-based security systems also use PIRs). What would be ideal is if we could cut out the “middle-man” and use the DNN itself to both detect and identify the species directly from the camera sensor. We would then not be limited to what the PIR can detect, but could instead detect anything that can be picked up by the camera.
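A hedged sketch of what cutting out the PIR “middle-man” could look like in code: grab frames straight from the camera and let the DNN decide what is worth keeping. `run_detector` and `on_detection` are hypothetical placeholders for the model/NPU runtime and the downstream step (saving, transmitting, logging); OpenCV is used here only to read frames.

```python
# Read frames directly from the camera sensor and let the DNN, rather than a
# PIR trigger, decide which frames contain something of interest.
import cv2

def monitor(run_detector, on_detection, confidence_threshold=0.5):
    cap = cv2.VideoCapture(0)               # camera index 0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                continue
            detections = run_detector(frame)  # e.g. list of (label, score, box)
            keepers = [d for d in detections if d[1] >= confidence_threshold]
            if keepers:
                on_detection(frame, keepers)  # save, transmit, or log
    finally:
        cap.release()
```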
The Anything detector
Once you build a detection system around a DNN running on an NPU, you can treat the whole “package” as a single device. It’s no longer a camera, computer, NPU and DNN; it’s an anything sensor. The same device can be deployed with a different DNN to detect and identify whatever that model was trained on. We can process images at whatever resolution the NPU can handle and run it as fast as it will go, without worrying about 4G/5G bandwidth or satellite internet costs.
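One way to picture this “anything sensor” is as a single piece of hardware plus a small configuration: swap the compiled model file and label list and the same device becomes a different sensor. The field names below are illustrative only, and the `.hef` files stand in for whatever compiled format your accelerator uses (Hailo’s, for example).

```python
# A minimal sketch: the same camera + computer + NPU, reconfigured into a
# different sensor just by pointing it at a different compiled DNN.
from dataclasses import dataclass

@dataclass
class SensorConfig:
    model_path: str                 # compiled DNN for the NPU
    labels: list[str]               # what this deployment should report
    input_resolution: tuple[int, int]
    min_confidence: float = 0.5

bird_survey = SensorConfig(
    model_path="models/australian_birds.hef",
    labels=["Laughing Kookaburra", "Sulphur-crested Cockatoo", "Emu"],
    input_resolution=(640, 640),
)

pollinator_survey = SensorConfig(
    model_path="models/insect_pollinators.hef",
    labels=["Honey Bee", "Hoverfly", "Native Bee"],
    input_resolution=(1280, 1280),
)
# Same hardware; a different DNN makes it a different sensor.
```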
Our self-contained “AutoEcology” Edge Compute device running a DNN-based Object Detection model to detect and identify animal species.
To complete the picture of a world of fast, efficient and specialised edge compute sensors, we need to train fast, efficient and specialised DNNs. While many NPUs do have the compute power to run the monolithic models, those models introduce inefficiencies that take away from the potential of edge computing. Large models mean slower inference and higher power requirements; slower inference means less flexibility, and higher power requirements mean larger batteries and larger devices.
Because of this, I believe we need to turn our focus to creating systems that have the ease of use that large monolithic models provide, while also being small and efficient.
A change in thinking…
As of this year, 2025, the field of DNNs has greatly matured. There are now established ways to use and train DNNs for specific applications. If you want to train a DNN-based object detector, you just need the data; the method, the code and the know-how already exist. Instead of training larger and larger monolithic models, I propose we further develop the automation of this process: the process of training and fine-tuning small, task-specific models ready for edge deployment.
This may look like simply automating the steps currently used to manually set up the training/fine-tuning of models on existing labeled data. We could even use single monolithic models to automate the labeling and curating of new data, which can then be used to automatically train new, smaller models. Versions of these methods do already exist in some form, but there is far less investment and interest in this area compared to its monolithic-model counterpart.
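Sketched at a very high level, such an automated pipeline might look like the following; every function and method here is a hypothetical placeholder rather than an existing library call.

```python
# A speculative, high-level sketch of the proposed automation: a large
# general-purpose model labels the raw images, the labels are filtered down to
# the species this study cares about, and a small task-specific model is
# fine-tuned and compiled for edge deployment.

def build_edge_model(raw_images, target_species, big_model, small_model):
    # 1. Use the monolithic model to propose labels for the new data.
    proposed = [(img, big_model.detect(img)) for img in raw_images]

    # 2. Keep only detections relevant to this study, above a confidence bar.
    dataset = [
        (img, [d for d in dets if d.label in target_species and d.score > 0.8])
        for img, dets in proposed
    ]

    # 3. Fine-tune a small detector on the curated, task-specific dataset.
    small_model.fine_tune(dataset, epochs=20)

    # 4. Export/compile the small model for the NPU in the field device.
    return small_model.export_for_npu()
```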
But big model is better!
There is a good argument for why training large monolithic models is favorable. Ignoring the fact that being able to advertise that your model can “identify every species of bird in the world” is attention-grabbing, models that are trained on more data perform better. A model trained on 10 bird species will likely identify those 10 species worse than a model trained on those 10 plus an extra 100. The extra variety introduced by the additional data helps the model learn what a “bird” is, and what is not. However, it should be possible to develop new methodologies that have the best of both worlds.
A MoE model routing/gating the input to only part of the model.
“Mixture of Experts” (MoE) models are already established in the world of LLMs as a way to get the “best of both worlds”. This method trains a model that is broken into several parts, and the model is encouraged to “compartmentalise” its abilities into these separate parts. For a given input, it also “predicts” which parts of the model are going to be needed to produce an accurate output. We then only pipe the information into those parts of the model, reducing the overall number of calculations needed.
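To make the routing idea concrete, here is a toy Mixture-of-Experts layer in PyTorch, with a small gating network that scores the experts and only runs the top-k of them for each input. It is an illustration of the general technique, not any particular LLM’s architecture.

```python
# A toy Mixture-of-Experts layer: the gate picks the top-k experts per input,
# and only those experts are evaluated.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, dim)
        scores = self.gate(x)                      # (batch, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # normalise over chosen experts
        outputs = []
        for sample in range(x.shape[0]):           # simple loop, kept readable
            mixed = sum(
                weights[sample, k] * self.experts[int(idx[sample, k])](x[sample])
                for k in range(self.top_k)
            )
            outputs.append(mixed)
        return torch.stack(outputs)

moe = TinyMoE()
y = moe(torch.randn(4, 64))   # only 2 of the 8 experts run for each input
```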
It’s hard to know how many of the largest LLMs use this method, as the companies that develop them like to keep their system architectures secret. There is also some evidence that the method is going out of fashion, and MoE models would still require us to deploy the whole model even if only a part of it is used for most inputs.
However, the general idea of compartmentalizing a model’s abilities could lead to new methods of quickly creating ultra-efficient, task-specific DNNs. Instead of creating a standard, single monolithic model, we could train a model with a novel architecture that allows us to identify and extract the parts of it that are useful for a specific application, and deploy only that part of the model.
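Purely as a speculative sketch of that idea, reusing the toy `TinyMoE` from the example above: if a model’s abilities were cleanly compartmentalised into experts, a deployment tool might keep only the experts a given study ever routes to, and compile just that sub-network for the NPU. None of this is an existing API.

```python
# Speculative sketch only: extract the experts a specific application needs
# (e.g. the ones the gate selects when run over a sample of that application's
# data) and discard the rest before exporting for the edge device.
import torch.nn as nn

def extract_subnetwork(moe: "TinyMoE", needed_expert_ids: list[int]) -> nn.Module:
    sub = nn.ModuleList([moe.experts[i] for i in needed_expert_ids])
    # In practice the gate would also need to be remapped/retrained for the
    # reduced expert set before compiling this smaller model for the NPU.
    return sub

# e.g. for a hypothetical "Australian bird" deployment, keep just experts 1 and 5:
# small_model = extract_subnetwork(moe, [1, 5])
```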
Deploying a DNN on an edge-compute device
The best of both worlds?
Even with the apparent stagnation of the monolithic LLMs, there are still many potential avenues for development in the world of Deep Neural Networks. We must continue to think outside the box with the paths we explore and not get stuck following the leader.
The combination of cameras with DNNs to create a single, stand-alone edge compute device that can be configured to detect anything will be revolutionary. It is a future that I am confident will arrive soon. With the right combination of efficient edge compute hardware and configurable DNNs, we can create an ecosystem of on-the-ground sensors that detect whatever we want. We just need to change the way we think about developing and deploying DNNs: we need to get our “AI” out of the clouds.
If you want to know more about deploying DNNs for ecological monitoring, contact us today!