Wednesday, April 30, 2008

Google App Engine: [A Better] Many-to-many JOIN

GAE(This is a follow-up to my original post GAE: Many-to-many JOIN. It probably wouldn't hurt to read that first, since this post sort of assumes you have.)

After getting some feedback on my original post, a simpler, more SQL analogous way to obtain the many-to-many behavior was pointed out to me.

I've created another sample (download it here), and will go over it below. Afterwards, I'll talk about why you shouldn't model your data this way, and instead should denormalize your data for optimization in the Datastore.

Here are the new data Models. (The full code listing is here.)
class Libraries(db.Model):
notes = db.StringProperty()

class Books(db.Model):
notes = db.StringProperty()

class Library(db.Model):
name = db.StringProperty()
address = db.StringProperty()
city = db.StringProperty()
libscol = db.ReferenceProperty(Libraries,

def books(self):
return ( for x in self.librarybook_set)

class Book(db.Model):
title = db.StringProperty()
author = db.StringProperty()
bookscol = db.ReferenceProperty(Books,

def libraries(self):
return (x.library for x in self.librarybook_set)

class LibraryBook(db.Model):
library = db.ReferenceProperty(Library)
book = db.ReferenceProperty(Book)
I still have the Books and Libraries models, as you can see. These are needed to collect the Library and Book entities so I can easily iterate over them and output. The Book model contains a reference to Books, via Book.bookscol, and Library to Libraries, via Library.libscol.

The LibraryBook model just contains references to the Library and Book models. This creates our "join". After we add libraries and books to the Datastore, we will link them to each other using LibraryBook entities.

When the page loads, we first create and store our data entities.
# Library collection
libs = Libraries()

# Book collection
books = Books()

# Setup libraries
lib1 = Library(name='lib1', address='street a',
city='city1', libscol=libs)
lib2 = Library(name='lib2', address='street b',
city='city2', libscol=libs)

book1 = Book(title='book1', author='author one',
book2 = Book(title='book2', author='author one',
book3 = Book(title='book1', author='author two',
book4 = Book(title='book2', author='author two',
book5 = Book(title='book3', author='author two',

l1 = LibraryBook(library=lib1, book=book1)
l2 = LibraryBook(library=lib1, book=book2)
l3 = LibraryBook(library=lib1, book=book4)
l4 = LibraryBook(library=lib2, book=book4)
l5 = LibraryBook(library=lib2, book=book5)
l6 = LibraryBook(library=lib2, book=book3)
l7 = LibraryBook(library=lib2, book=book1)
First, we create our Libraries and Books entities, libs and books. These will be passed into each Library and Book entity we create.

After we create our books and libraries, we generate a lot of LibraryBook entities, assigning a library and a book to each one. Each LibraryBook entity now links one library with one book. As you may have noticed, some books are assigned to both libraries, some are not.

Library contains a method called books(). It returns every book in the librarybook_set as an iterable data structure. Because LibraryBook holds a reference to Library, any Library entity (say, lib1), is given a back-reference to the collection of LibraryBook entities. If you do not define a collection_name, GAE automatically creates one by appending "_set" to the model name. This is where librarybook_set came from, in case you were wondering.

Given a library entity like lib1, the books() method allows us to easily return all the books at that library by simply assigning or iterating over lib1.books(). The Book model contains a method called libraries() which does just the opposite: allows you to get all the libraries where a given book resides.

Our data has been created and linked. Now we pass it in to the template.
template_values= {
'books_at_lib': lib1.books(),
'forbook': book1.title,
'libs_by_book': book1.libraries(),
'libs_books': libs.libscol.order('name'),
'books_libs': books.bookscol.order('-author').order('title')
In this example, we not only display all libraries and all books (via libs_books and books_libs) the way we did in the previous post, but also output all books at a library (books_at_lib), and all libraries that contain a given book (libs_by_book).

Here's the template, if you want to take a look at it.

Denormalize your data

As I stated before, the GAE Datastore is not a relational database. Databases were designed for compactness and efficiency, and normalization is used, in part, as a way to minimize the size of your data on disk.

The Datastore has been built, first and foremost, with scalability in mind. Scalability means, in essence, "add more servers as needed, without re-writing your code". Specifically to the GAE Datastore, it means "disk space is cheap, stop worrying about it, and scale".

Consider modifying our LibraryBook model above to look like
class LibraryBook(db.Model):
library = db.ReferenceProperty(Library)
book = db.ReferenceProperty(Book)
booktitle = db.StringProperty()
libraryname = db.StringProperty()
Now, we are not only storing each book's title in the LibraryBook entity, but we are also storing it in the title property of the referenced Book entity. While this is obviously not space efficient, and certainly not the elegant, normalized way of storing relational data our brains are used to, it scales well and is fast.

It scales because the Datastore runs on who knows how many commodity computers in the background (without the knowledge of our application), and it's fast because we have the most commonly needed fields available immediately. If you need to poke further into the data, like to get the street address of the library, you would use the referenced models, and our JOIN then comes into play.

(Thanks, Ben the Indefatigable for illuminating this.)

No comments:

Post a Comment