Why is AI-generated synthetic data all the rage lately? In this article, I’ll explain it my favourite way: with cats!
Let’s say I want to train a cat-not-cat classifier from scratch, but I only have one photo to work with:
(Everything that follows is an analogy for what people do with tabular data and text data, so it applies beyond image data.)
Ideally, I’m going to need a dataset consisting of thousands of cat and not-cat photos. If I have a camera and plentiful access to cats, I can take a bunch of photos like the one I already have, ensuring that I get exactly the dataset I designed:
But what if I don’t have a camera and I live catless on the moon? I could get the photos I need from a vendor, though I’d have to be careful, since inherited data is more dangerous than primary data.
But what if there’s no vendor who’ll sell me some cat photos? (Yes, running out of cat photos on the internet is a scenario that’s more sci-fi than living on the moon, but bear with me.)
Well, if I can’t collect them and I can’t buy them, then I’ll have to make them myself. Behold, my creation:
No good? Yeah, drawing was never my strong suit. Another way to make fake data is to copy existing datapoints, except this isn’t going to be much use for providing instructional variety.
It’ll be like teaching a human student by giving them the same example over and over, so all they learn is that one thing. If my dataset is 30,000 copies of this Huxley photo…