My wife and I both have a tendency to leave the garage door open. You’re in and out, grabbing garden tools or supplies, and at the end of the day you enter the house through the back door and forget to check the garage. Luckily, we live in rural Canada, surrounded by wonderful people, where the door could sit open for days without anything “disappearing”. But it still makes me feel nervous to discover it’s been forgotten, if only because it is a waste of heat in the winter (not to mention the chance of blowing full of snow!). I tried a couple ways to solve the problem:

  • Gogogate2 wireless sensor can detect whether the door is open based on an accelerometer inside the device. My mistake was to order one without realizing that you also need the main Gogogate2 wireless module. The Gogogate2 can open and close the door remotely in addition to detecting whether it is open, but it was quite a bit outside the budget I had planned.
  • Particle Internet Button caught my eye as an interesting way to get into the world of IoT. I figured I could code the accelerometer to ping me when the door was open. I tend to get bored, edgy, or burnt when I use a soldering iron, so the Button was a good way to get everything I needed pre-installed. So I ordered one, played with it a bit, and managed to brick it enough that the reset sequence didn’t even work. I sent it back. I wasn’t sure how I was going to power it anyway; it would burn through a lot of batteries.

Ultimately, I settled on using Machine Learning (computer vision) to solve my problem. I have been studying the iconic Coursera Machine Learning course and thought it would be neat to apply some of what I knew. I have found the course extremely interesting and lucid, by the way, and definitely recommend it. I now have a basic understanding of neural networks and support vector machines, among other concepts, and figured I’d try to apply it to the garage door problem.

The first thing I needed was an in-budget camera. My first instinct was to find a cheap Android phone on Kijiji and use one of the many Android security camera apps to digitize the images. But, surprisingly, I couldn’t find any good deals. There were old-school flip-phones for $20 and flagship phones for $300, but no $70 Android phones a few years old. Not even with a cracked screen! So I went to the web and did a lot of research on cheap security cams. My main requirements were that it be wireless and that images could be accessed remotely using a script. The latter was surprisingly hard to find; most low-cost cameras seem to have mobile apps, but no direct access over IP.

Ultimately, I bought a cheap Nettoly security camera. For $50, I’m delighted with it. It will take a snapshot every X minutes and upload it to an FTP server that I set up on my local network. As a bonus, the camera features infrared imagery, so the nighttime images look very similar to daytime images (especially in grayscale), rather than being a useless, completely black image.

Motivation for this article

I wanted to write about my experience because I was disturbed by how unapproachable most of the existing tutorials on machine learning and computer vision are. They seem to be written for people who have a math or science background or who are already familiar with machine learning. This article targets the opposite audience: people who know how to code in Python, but maybe haven’t used much of the scientific Python stack before.

Further, a lot of those tutorials have a disturbing lack of sound software engineering principles. It is said that data scientists are not software engineers. The argument goes that they are academics who need their scripts to work once, and not be maintained indefinitely, so they get sloppy and often never even learn proper development principles.

This stereotype is unfair, as stereotypes always are. I’m a good software engineer, but I found myself falling into the same habits that I was seeing in other examples while I was implementing my project. I had scripts all over the place with copy-pasted code and some pretty awful idioms. One reason for this was that I was too excited to get to the next step once I solved the first one. But I think the main reason is that the nature of data science is pipeline oriented. Once you have solved one step in the pipeline, you dump all your data into a folder. After that, you never need to think about that step again. You just start on the next stage and if you need some old code, the easiest thing to do is copy-paste it.

So I admit it, I wrote my code like that. It was very relaxing, actually, to just code with no regard for maintainability! But I cleaned it up before committing, because I want my tutorial to be comprehensible. And I want my code to make sense to me if I need to adapt the algorithm this winter: I’m not certain that images of an open door onto a snowy white driveway will be classified as accurately as those pointing at black pavement!

Three lines of code

It took me a long time to get there, but it turns out that computer vision is really easy.

Ok, that’s a grand, sweeping statement. Let me rephrase: Computer vision is really easy for this particular problem if you know what you’re doing.

Which I didn’t. But let’s start with the spoiler:

from sklearn import svm

classifier = svm.SVC(C=5)
classifier.fit(features, expected)

There you have it. That’s all that was needed in order to train a machine learning model on a bunch of images of my garage door. And it has had 100% accuracy in the weeks since I first spun it up. I was surprised and a bit disappointed with how simple it turned out to be.

Some other lines of code: the overview

Of course, the devil is in the details: svm stands for Support Vector Machine, a type of machine learning algorithm that, if you’re extremely interested, I’ll leave Andrew Ng to describe to you in week 7. Or you can take his (and my) word for it that it works. Apparently SVMs were really popular in the last decade, but neural networks excite the machine learning crowd more these days. However, support vector machines are not inherently “inferior” to neural networks, and they have the advantage of being much faster, in addition to only needing three lines of code. The sklearn library (pip install scikit-learn, which also pulls in numpy; pandas needs its own pip install pandas) already has a full implementation of the algorithm, so all I had to do was call into it with appropriately formatted data.

Getting that appropriately formatted data wasn’t easy. There are only three parameters: C, features, and expected, but they took a fair bit of effort to figure out.


expected is the easiest to understand; it’s just a list of expected results for each of the images I manually classified in order to train the algorithm. I chose 1 if the image represented an open door, 0 if it was closed. The result is stored in a numpy vector (basically a list of integers, in this case 1s and 0s). Machine learning typically has to crunch a lot of numbers, and numpy is ideal for crunching a lot of numbers in Python.


C is a constant real number that defines how “sensitive” the algorithm is. The “correct” way to derive C is to run the algorithm with a bunch of different values for C and then pick the one that gives the best result for some validation data (Tip: your validation data should be completely separate from your training data). In my case, I tried it at the sklearn default (C=1.0), and it was pretty inaccurate against my test data. Then I tried it at 5, and it was 100% accurate. I picked 5 arbitrarily, so I guess I just got lucky.
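The "try a bunch of values" approach can be sketched roughly like this. It is a minimal illustration on synthetic data, not the project's code: the `pick_best_c` helper, the candidate values, and the random stand-in features are all assumptions made for the example.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split


def pick_best_c(features, expected, candidates=(0.1, 1, 5, 10)):
    """Return (best_c, best_accuracy) measured on a held-out validation split."""
    train_x, val_x, train_y, val_y = train_test_split(
        features, expected, test_size=0.25, random_state=0, stratify=expected
    )
    best_c, best_accuracy = None, -1.0
    for c in candidates:
        classifier = svm.SVC(C=c)
        classifier.fit(train_x, train_y)
        # score() reports the fraction of validation samples classified correctly
        accuracy = classifier.score(val_x, val_y)
        if accuracy > best_accuracy:
            best_c, best_accuracy = c, accuracy
    return best_c, best_accuracy


# Synthetic stand-in data (hypothetical): 80 samples with 20 features each.
rng = np.random.default_rng(0)
closed = rng.normal(0.3, 0.1, size=(40, 20))
opened = rng.normal(0.7, 0.1, size=(40, 20))
features = np.vstack([closed, opened])
expected = np.array([0] * 40 + [1] * 40)

best_c, best_accuracy = pick_best_c(features, expected)
print(best_c, best_accuracy)
```

The validation rows never take part in `fit`, which is the point of the tip about keeping the two sets separate.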


features is the most complicated parameter. It’s a matrix of numbers where each row contains data about one of the images I had previously classified (machine learning professors would call this the “training set”).

For those who don’t know numpy or linear algebra: A matrix is like a box of data. In this case it’s a two dimensional box, although numpy can also support a three dimensional box. Or more than three dimensions, but I try not to think about that because it hurts my brain to try and fail to visualize four dimensions.

Each column in the 2D table represents a separate “feature” for a training sample. In this case, each separate feature represents a single pixel in one of my garage door images, after being scaled down to 640x480, converted to grayscale, and divided by 255 (the maximum grayscale value at any one pixel position) so each feature is a value between 0 and 1.

I did these preprocessing steps because they are pretty common across all the other vision examples I’ve seen, both in the course and the tutorials I had tried to read. However, the course actually suggests that you should try running the algorithm on the original data before doing any sort of preprocessing. The most common reason to process the data in this way is to reduce the time it takes for the learning algorithm to do its calculations.

Converting the image from RGB to grayscale reduces the amount of data to one third of its original size. Scaling the image from over 3000 pixels to 640 pixels wide similarly reduces the data size. Converting the features to a range of 0 to 1 probably wasn’t necessary. If you are running SVM on something other than images (for example, thousands of square feet vs a handful of ‘number of rooms’ for a house classification problem), the learning algorithm will perform better if you normalize them all to the same range. But since my pixels all take on a similar range of values, it probably didn’t matter if it was between 0 and 1 or 0 and 255.
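For the house example, normalizing to a common range might look something like the sketch below. The `min_max_scale` helper and the house numbers are made up for illustration; min-max scaling is only one of several common ways to do this.

```python
import numpy as np


def min_max_scale(features):
    """Rescale each column of a 2D array to the 0-1 range."""
    features = np.asarray(features, dtype=float)
    low = features.min(axis=0)
    span = features.max(axis=0) - low
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return (features - low) / span


# Made-up houses: [square feet, number of rooms]
houses = np.array([[1200, 3], [2400, 5], [3600, 7]])
scaled = min_max_scale(houses)
print(scaled)
```

After scaling, both columns run from 0 to 1, so square footage no longer dwarfs the room count.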

Each row in the matrix represents one “sample” of data; in this case one image from my camera. The rows are in the same order as the expected vector. So row 10 in features represents an “open” image if the tenth element of expected is a 1.
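To make the shapes concrete, here is a tiny sketch with three fake 2x2 "images" instead of real 640x480 frames. The pixel values and labels are invented for the example.

```python
import numpy as np

# Three fake 2x2 grayscale "images", already scaled to 0-1 (made-up values).
images = [
    np.array([[0.1, 0.2], [0.1, 0.2]]),  # closed
    np.array([[0.9, 0.8], [0.9, 0.7]]),  # open
    np.array([[0.2, 0.1], [0.2, 0.1]]),  # closed
]
labels = [0, 1, 0]  # same order as images

# One row per image, one column per pixel
features = np.array([image.reshape(-1) for image in images])
expected = np.array(labels)

print(features.shape)  # (3, 4): 3 samples, 4 pixel features each
print(expected.shape)  # (3,)
```

Row 1 of `features` holds the pixels of the image whose label is `expected[1]`, and so on down the list.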

Getting and viewing data

Let’s rewind to the beginning and look at some code (the good part!). I want to explain the steps in the process in a more linear fashion, tutorial style. Feel free to follow along at home if you’ve got a security camera in your garage.

I’m going to skip over the part where I configured the camera to copy a snapshot of the garage doors into an FTP folder every 10 minutes. It’ll be different for every camera, and even if you use the same camera, it came with instructions. It took me a couple tries to get vsftpd configured the way I wanted it on my NUC, but a web search will probably do you more good than my personal configuration decisions.

Assuming you have both the ftp server and camera configured correctly, the following Python code is all that’s needed to retrieve the images from FTP and store them locally:

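from ftplib import FTP

def get_ftp_files(host, username, password, source_dir, remove_older_than=2):
    # remove_older_than is only used by the cleanup step in the production version
    with FTP(host, username, password) as ftp:
        # Each entry under pub/ is a directory named after a date
        dates = [d.partition("/")[2] for d in ftp.nlst("pub")]

        for d in dates:
            files = ftp.nlst(f"pub/{d}/images")

            for file in files:
                dest_file = file.rpartition("/")[2]
                # Close each local file as soon as its download finishes
                with (source_dir / dest_file).open("wb") as local_file:
                    ftp.retrbinary(f"RETR {file}", local_file.write)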


You might have to change this code a bit depending on how your camera names and structures the directory tree in your FTP server. In my case, each image is in a directory named pub/<date>/images, so I iterate over the files in that directory and download each one. Then I write it to the local source_dir. That variable, like all the filesystem paths in my project, is an instance of pathlib.Path.

You will also have to completely replace this code with something else if your camera is not providing your images over FTP. You could make an http request using the requests library, or just copy the files off a hard drive using shutil, for example.
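As a rough idea of the shutil variant, the sketch below copies any image that hasn't been copied yet from a camera directory into the local source directory. The `copy_new_images` name, its parameters, and the assumption that the images are `.jpg` files somewhere under one directory are all hypothetical.

```python
import shutil
from pathlib import Path


def copy_new_images(camera_dir, source_dir, suffix=".jpg"):
    """Copy images not yet present in source_dir; return the copied names."""
    camera_dir, source_dir = Path(camera_dir), Path(source_dir)
    source_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # rglob walks every subdirectory, whatever layout the camera uses
    for path in sorted(camera_dir.rglob(f"*{suffix}")):
        dest = source_dir / path.name
        if dest.exists():
            continue  # already retrieved on an earlier run
        shutil.copy2(path, dest)
        copied.append(path.name)
    return copied
```

Calling it twice in a row copies nothing the second time, so it is safe to run from a polling loop.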

Note that in my production code, I have extended this method a bit to remove images older than a few days so my ftp server doesn’t run out of disk space.

I also wrote (copy-pasted from stack overflow and lightly edited; sorry I can’t find the source now) a fairly simple function to create a video of all my downloaded images. This was mostly to sanity check that the downloader was working, but it’s fun to watch the video:

def make_video_of_files(fps=5):
    output_path = config.options.output
    out = None
    first_image = True
    for _, image in iter_source_images():
        if first_image:
            # Size the video writer from the first frame
            height, width, channels = image.shape
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            out = cv2.VideoWriter(str(output_path), fourcc, fps, (width, height))
            first_image = False

        out.write(image)  # append this frame to the video file

        cv2.imshow("video", image)

        if (cv2.waitKey(1) & 0xFF) == ord("q"):
            break  # Q stops the preview early

    if out is not None:
        out.release()
    cv2.destroyAllWindows()


This uses the OpenCV (pip install opencv-python) library to load each image and feed it into a video file. OpenCV takes care of all the heavy lifting; you’ll be seeing a fair bit of openCV before you finish reading this. You’ll also see quite a bit of the iter_source_images generator, which looks like this:

def iter_source_image_paths(filter=None):
    for path in sorted((config.options.data_dir / "source").iterdir()):
        if path.suffix != ".jpg":
            continue
        # Assumed reconstruction: only keep files whose names start with "P"
        if not path.name.startswith("P"):
            continue
        if filter and not filter(path):
            continue
        yield path

def iter_source_images(filter=None):
    """
    Loop over all images in the source directory and yield the image
    path and a numpy array containing the image data.
    """
    for path in iter_source_image_paths(filter):
        yield ImageData(path, cv2.imread(str(path)))

Variations of these two functions, by the way, were originally copy-pasted between several different scripts. I refactored them out before showing them to you, the general public, but I wanted to point out one place where the “don’t repeat yourself” principle just didn’t seem as necessary as it does in traditional software engineering. But I’m glad the code is cleaner now, and I thank you for giving me the incentive to improve it.

Manually classifying training data

There are two general classes of machine learning algorithms, or at least two that I’ve been taught about: supervised and unsupervised learning. In supervised learning, you train the algorithm by manually assigning examples to specific classes (like “open” and “closed”), and then giving them to the algorithm so it can learn how to classify new samples. Unsupervised learning just takes a pile of data and tries to make various inferences about it without any previously known results.

Support vector machines are a type of supervised learning, which means the first thing I needed to do was create some data to train it. This is a manual process, no matter how you cut it; if it could be automated, you wouldn’t need machine learning (or you’d already have it…)! So I wanted to write a script that allowed me to view each image and quickly categorize it into one of two classes: OPEN or CLOSED. Once I had that, I would dump rows with the image name and the class to a CSV file for later processing.

The key to this project is efficiency. I want to spend as little time doing a tedious classification task as possible. I arranged it to leverage my aged, but trusty Kinesis Advantage keyboard. If you look closely, you can see that the keyboard has separate keys under the thumbs. I arranged my program so that it would display an image, and I could hit either “left thumb” (backspace) for OPEN or “right thumb” (space) for CLOSED. If you don’t have a similar keyboard, you’ll probably want to find different keys, but that’s up to your taste.

The meat of my classifier reads like this:

    for image in iter_source_image_paths():
        if image.name in previous_files:
            continue  # already classified on an earlier run

        frame = cv2.imread(str(image))

        cv2.imshow("image", frame)
        key = cv2.waitKey(0) & 0xFF

        if key == ord("q"):  # Q means quit
            break
        elif key == 32:  # SPACE means CLOSED
            classifications.loc[len(classifications)] = [image.name, "CLOSED"]
            print(image.name, "CLOSED")
        elif key == 8:  # BACKSPACE means OPEN
            classifications.loc[len(classifications)] = [image.name, "OPEN"]
            print(image.name, "OPEN")

    classifications.to_csv(classification_file, index=False, header=False)
    cv2.destroyAllWindows()

It asks openCV to display an image on the screen, and then reads in a keycode for any keys that were pressed while that image was displayed. Depending on which keycode it received, it classifies the image and updates a pandas dataframe with the new row of data. After it’s done all the images, it writes out the csv file and cleans up the openCV windows.

You can look at my production code to get the full version of this method. That version does stuff like:

  • Download the images from the ftp server using the same get_ftp_files method you saw earlier.
  • Load the existing CSV file into a pandas DataFrame for easier manipulation. I’m not an expert with pandas, so I had to do a little experimenting with the header and names arguments to figure out how to get it into the right format:
        classification_rows = pandas.read_csv(
            classification_file, header=None, names=["name", "class"]
        )
  • Back up the old CSV file. If I mess up, the easiest thing to do is restore the backup and start over.
  • Put all the existing image names in a set so I don’t have to classify stuff that was downloaded before:
        for c in classification_rows.itertuples():
            classifications[c[1].strip()] = 1 if c[2] == "OPEN" else 0
  • Interpret a 0 keypress as marking the next 10 frames as closed. This was useful for classifying during the night, when I knew the door hadn’t opened.
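The "skip what was already classified" idea can be sketched with nothing but the standard library. This is an illustration under assumptions, not the production code: `load_previous_names` and `unclassified` are hypothetical helpers, and the CSV is assumed to have the image name in its first column with no header row.

```python
import csv
from pathlib import Path


def load_previous_names(classification_file):
    """Return the set of image names already present in the CSV file."""
    classification_file = Path(classification_file)
    if not classification_file.exists():
        return set()  # nothing has been classified yet
    with classification_file.open(newline="") as handle:
        return {row[0].strip() for row in csv.reader(handle) if row}


def unclassified(paths, previous_names):
    """Keep only the paths whose file name has not been classified yet."""
    return [path for path in paths if Path(path).name not in previous_names]
```

A set makes the membership check cheap, which matters a little once a few weeks of snapshots have piled up.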

With my camera taking a snapshot every 10 minutes, I collect 144 frames per day. With this program, I can easily classify those 144 images in about a minute. It’s actually quite a zen process: right thumb, right thumb, right thumb, left thumb, right thumb left thumb, zero, right thumb, right thumb…


The next step, once I had all these images stored in a directory, and a CSV file showing which class each image fell into, was to run the SVM algorithm to train the data. This required two steps, as I briefly described earlier: preprocessing the data to get it into the format the classifier desired, and then running the classifier.


The preprocessing step is probably the hardest part to grok, so I’ll break it down a bit:

    classification_rows = pandas.read_csv(
        classification_file, header=None, names=["name", "class"]
    )

    classifications = {}

    for c in classification_rows.itertuples():
        classifications[c[1].strip()] = 1 if c[2] == "OPEN" else 0

This first loads all the rows into a pandas data frame, setting header to None (meaning there is no header row in the CSV file) and passing names for the columns. As I mentioned earlier, it took some research to figure this out. In addition, pandas requires named columns and would otherwise take them from the first row, so I passed those in explicitly.

After it’s loaded the file, it creates a dictionary mapping image file names to integer classes, where 1 means the picture includes an open door, and 0 means it is an image of a closed door. My garage actually has two doors, but I don’t care which door is open, so I only created these two classes. If either or both doors is open, it gets a 1.

    features = []
    expected = []

    for image, frame in iter_source_images():
        classification = classifications.get(image.name)
        if classification is None:
            print(f"No classification for {image.name}")
            continue

        expected.append(classification)

After initializing two lists that will eventually get passed into the classifier, I iterate over each image path name. I check whether the door is open or closed (if it’s not in the CSV file, I just skip that image), and then add that classification to the expected list. It is important to note that features and expected need to be in the same order; that is, the row of pixels for an image in features must sit at the same index as that image’s class in expected.

This next part is still inside the for loop, so it’s operating on each image:

        small = cv2.resize(frame, (640, 480))
        gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
        normalized = gray / 255  # Convert to 0-1.0 range

        sample_vector = numpy.reshape(normalized, (normalized.size, 1))

        features.append(sample_vector[:, 0])

One thing that took me a very long time to figure out is how to convert the image loaded by OpenCV into a numpy array suitable for sklearn. Anyone who knows the opencv-python library is probably laughing quite heartily right now: cv2.imread (which was called inside the iter_source_images generator) returns a numpy array by default. I spent at least two hours of research to learn that I didn’t have to do anything.

So the frame variable above is a numpy array. Each number in the array represents a red, green, or blue part of a pixel, with a value between 0 and 255, depending how much of that colour is visible at that location. It’s therefore a three dimensional array; each row and column represents one pixel in the source image, while the third dimension (depth?) represents one of those three colours.

The first thing I do with this array is scale it down to a smaller frame using the cv2.resize function. It’s still a three dimensional array, but it now has fewer rows and columns.

Then I convert it to grayscale, which coalesces the three colour channels into a single number. That removes the third dimension, and there are now 1/3 as many numbers in the array.

Next, I divide the array by 255, which is the same as dividing each of the pixels by 255. The result is to convert each pixel, which might have a value between 0 (all black) and 255 (all white), into a value between 0 and 1. As I mentioned before, I have a feeling this step wasn’t necessary, but the algorithm was working so well I didn’t want to mess with it.

Next comes a tricky bit. The numpy array we currently have is a 640 x 480 two dimensional matrix. Each of those pixels can represent a single ‘feature’ to the sklearn classifier, but the classifier needs a one dimensional vector. The numpy.reshape function converts the matrix of 640 x 480 values into a one dimensional vector containing 307,200 values. They are the exact same values, but you can think of each row being stacked up into a single column. Or maybe it’s each column being stacked up. I knew when I wrote it what the reshape call was outputting, but I’ve forgotten now. So let’s just call it, an array suitable for input into a classifier.
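For the record, numpy's reshape uses row-major order by default, so it is the rows that get laid end to end. A tiny stand-in array shows this (the 2x3 values are made up; a real frame resized to 640x480 would have shape (480, 640), since numpy lists rows first):

```python
import numpy as np

# A tiny 2-row by 3-column "image" standing in for the real frame.
tiny = np.array([[1, 2, 3],
                 [4, 5, 6]])

column = np.reshape(tiny, (tiny.size, 1))  # shape (6, 1), same call as above
flat = column[:, 0]                        # shape (6,)

print(flat)  # rows laid end to end: [1 2 3 4 5 6]
```

Calling `tiny.reshape(-1)` gives the same one dimensional result in a single step.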

After all that processing, the vector representing the original image is added to the features list at the same index as the class value indicating whether that image contained an open or closed door.


The next step is to feed all that data into the classifier which, as we saw earlier, takes three lines of code, plus a few extra:

    features = numpy.array(features)
    expected = numpy.array(expected)

    classifier = svm.SVC(C=5)
    classifier.fit(features, expected)

    joblib.dump(classifier, str(training_parameters_file))

The loop I described in detail above appended the features and expected values into two lists, but the classifier expects them to be in numpy arrays for efficient processing. My understanding from reading numpy documentation is that it is more efficient to create Python lists and then convert to an array than to concatenate individual rows into a single array as you go along. In the latter case, numpy would copy all the data into a new location in memory, which gets expensive. Using the list method, there only needs to be one copy after the loop is finished.

Then I construct a classifier object, as I described in the spoiler, and call the fit function to generate the learned parameters for this particular training set. This is obviously doing a ton of work on our behalf. In essence, the classifier has taken all the numbers representing the input images and done some (quite complicated) math on them to come up with even more numbers. Those resulting numbers can be combined with the extracted feature vector for future images in order to spit out a 0 or a 1 to determine whether the door is open or closed.

The last line really annoys me. The classifier stores these parameters as an instance variable when you call fit. The joblib module, which is installed alongside sklearn, is essentially a wrapper around Python’s notorious pickle module. As far as I can tell, it’s the only way to store the parameters the classifier has calculated. But that means I could run into issues loading the pickle into a different interpreter, and I obviously can’t share the pickled data with anyone else because it might contain malicious code. I can’t believe there isn’t a built-in serializer to a safer data format, but the only one I found had to do a lot of manual processing. At any rate, the parameters are now in a file that can be loaded in the future to predict the state of future images.


Now that those fancy parameters have been stored in a file, it’s pretty easy to write code for sklearn to do the guessing step:

def predict(image_path):
    cv2_image = cv2.imread(str(image_path))
    small = cv2.resize(cv2_image, (640, 480))
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)

    normalized = gray / 255  # Convert to 0-1.0 range

    sample_vector = numpy.reshape(normalized, (normalized.size, 1))

    features = [sample_vector[:, 0]]
    features = numpy.array(features)

    new_classifier = joblib.load(str(training_parameters_file))

    predictions_val = new_classifier.predict(features)[0]

    return "OPEN" if predictions_val else "CLOSED"

First, it loads the image and performs the same scale / convert to gray / divide by 255 / convert to vector steps that we did for each image in the training step (I just realized this duplicate code should be pulled out to a separate function, I’ll do that in the git repository, but I’ll leave it inline here since it’s a bit easier to understand).

Then it loads the parameters that were previously dumped out by the training step using joblib.load, the inverse of the dump call from the trainer.

Last, it runs the classifier’s predict function and converts the resulting 1 or 0 into a human readable string indicating whether the door is open.

To my shock, every one of the sample images I have tested since has come up with an accurate description. No false positives or negatives at all.

Bonus material


The whole point of this project was to remind me if the door has been left open. I wrote a fairly simple service that periodically checks whether the door is open. Any time the status changes, it pings me using If This Then That webhooks (in my case I have it connected to Facebook Messenger, but you can hook IFTTT up to lots of things).

Further, if the door has been left open for more than an hour, it pings the Twilio API to send me a text message. I chose Twilio for two reasons. The first is that IFTTT webhooks don’t seem to be reliable. I was afraid I’d leave the door open and IFTTT either wouldn’t notify me, or wouldn’t notify me until quite late. The second is that I often have Facebook messenger muted, so I wouldn’t get the notification. But I have SMS messages unmuted, partially because nobody ever texts me enough to get annoyed with the notifications, and partially because, historically, most of my oncall roles use text messages to wake me.

The IFTTT and Twilio APIs have their own documentation, and the code is straightforward. But I’ll quickly go over the code that decides whether to call them:

    last_image = latest_image()
    last_status = predict(last_image)
    last_change = image_date(last_image)

    print(f"Current status: {last_status} at {last_change:%Y-%m-%d %H:%M}")

    while True:
        try:
            get_ftp_files(**config.ftp, source_dir=source_dir)
            last_image = latest_image()
            current_status = predict(last_image)
            if (
                last_status == "OPEN"
                # total_seconds() keeps counting past 24 hours; .seconds wraps
                and (datetime.now() - last_change).total_seconds()
                > config.open_warning_seconds
            ):
                print(f"Warning: Door has been open since {last_change:%Y-%m-%d %H:%M}")
                # (the SMS warning is sent here)
                last_change = datetime.now()

            if current_status != last_status:
                last_change = image_date(last_image)
                print(f"Status changed to {current_status}")
                # (the IFTTT message is sent here)
            last_status = current_status

            time.sleep(293)
        except Exception as ex:
            print(f"Error: {ex}")  # log the failure and keep the service alive

First I load the latest image (a simple max() call in my library) and ask the predict code we just wrote whether the door is currently open or closed.

The last_change variable contains the date that the door status changed, either from closed to open or open to closed.

Then I start the infinite loop. First I get any new files from ftp. Then I predict the status on whatever the latest image is. If the door is currently open and has been open for more than an hour, it sends me a warning message (using SMS).

Then I check whether or not the status has changed. If so, I store the date and current status and send an IFTTT message telling me about the change. I was planning to delete this code eventually because it would be annoying to get pinged every time the door opens or closes. However, the notification only happens if the camera catches the door in an ‘open’ state. Since it only snapshots every ten minutes, and the normal usage of the door is to open it and then immediately close it, it is actually quite unlikely that I get notified unless it’s been open for several minutes.
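The two decisions made on each pass through the loop can be pulled into a small pure function, which makes them easy to try out without a camera. This is only a sketch: `decide` is a hypothetical helper, and the one-hour default mirrors the open_warning_seconds setting.

```python
from datetime import datetime, timedelta


def decide(last_status, current_status, last_change, now, open_warning_seconds=3600):
    """Return (send_warning, send_change_notice) for one polling iteration."""
    open_for = (now - last_change).total_seconds()
    send_warning = last_status == "OPEN" and open_for > open_warning_seconds
    send_change_notice = current_status != last_status
    return send_warning, send_change_notice


opened_at = datetime(2020, 1, 1, 12, 0)
print(decide("OPEN", "OPEN", opened_at, opened_at + timedelta(minutes=61)))
print(decide("OPEN", "CLOSED", opened_at, opened_at + timedelta(minutes=10)))

# timedelta.seconds ignores whole days; total_seconds() does not.
long_gap = timedelta(days=1, minutes=5)
print(long_gap.seconds, long_gap.total_seconds())
```

The last line is the reason to prefer `total_seconds()` for this kind of check: a door open for a day and five minutes would otherwise look like it had been open for only five minutes.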

Finally, I store the status that I just checked so I can compare it again in the next iteration, and I go to sleep for 293 seconds. Why 293? It’s the highest prime number lower than five minutes. My camera updates the image about every 10 minutes. If I check the door every five minutes, it will take at most fifteen minutes from the time the door opens until I get notified. But ten minutes is divisible by five minutes, so I worry things might get into some kind of lock-step. I usually try to use a prime number for things like this so that there is no chance they’ll ever converge into some strange scenario. You’ll see 293 in a lot of my code because five minutes is a nice round number, and 293 seconds is the closest number that isn’t! It’s obviously not going to matter in this case, just a habit that’s followed me around from my “working at scale” days.
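The lock-step worry is easy to check with a little arithmetic. The sketch below counts how many different points in the camera's ten-minute cycle a poller lands on; `distinct_offsets` is a made-up helper, and it assumes the camera fires exactly every 600 seconds.

```python
def distinct_offsets(poll_seconds, camera_seconds=600, polls=600):
    """Count distinct poll offsets within the camera's snapshot cycle."""
    return len({(n * poll_seconds) % camera_seconds for n in range(polls)})


print(distinct_offsets(300))  # 2
print(distinct_offsets(293))  # 600
```

Polling every 300 seconds only ever hits two points in the cycle, while 293 shares no factor with 600 and eventually visits every second of it.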

The whole loop body is wrapped in a generic try/except clause so that it won’t just disappear if one of the operations fails.

Command line script

I won’t go into it since it’s just simple argparse stuff, but I wrote a command-line script to perform all the operations discussed in this article. Here are the various commands:

    make_video          Make a video of all available source images
    manual_classify     Manually classify source images that have not
                        previously been classified
    train               Train SVM on classified images
    predict             Predict whether image is OPEN or CLOSED
    message_service     Start service that sends an IFTTT message when door
                        status changes


Not committed to the git repo is the config file (the config module that the code imports). If you want to use my code, you’ll need to set up an ftp server and sign up for IFTTT and Twilio accounts. The config file looks like this:

ftp = {"host": "<host_ip>", "username": "<user>", "password": "<secret>"}
ifttt_key = "<key>"
open_warning_seconds = 3600 # one hour
twilio = {
    "account": "<account number>",
    "token": "<secret>",
    "from": "+<my twilio line>",
    "to": "+<my personal phone number>",
}

In addition, a few of the scripts require command-line arguments (Use --help to see them). The CLI script puts all these in an options attribute in the config module.
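For readers who haven't used argparse subcommands, a stripped-down version of such a script might look like the following. The command names come from the list above; everything else, including the `image` argument for predict and the `build_parser` name, is a guess made for the sake of the example.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per operation."""
    parser = argparse.ArgumentParser(description="Garage door classifier")
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("make_video", help="Make a video of all available source images")
    commands.add_parser("manual_classify", help="Manually classify new source images")
    commands.add_parser("train", help="Train SVM on classified images")
    predict = commands.add_parser("predict", help="Predict whether image is OPEN or CLOSED")
    # Hypothetical argument: the real script's options may differ
    predict.add_argument("image", help="path to the image to classify")
    commands.add_parser("message_service", help="Send a message when door status changes")
    return parser


options = build_parser().parse_args(["predict", "snapshot.jpg"])
print(options.command, options.image)
```

The parsed namespace is the kind of object that could be stored as an options attribute for the rest of the code to read.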

And that’s it. Hopefully, if you were a bit intimidated by machine learning, as I was, this introduction has removed some of the mystery. I have really enjoyed my foray into artificial intelligence. It’s been a long time since I’ve studied a new field, and I am finding this one both more approachable and more interesting than I expected.