Saturday, April 26, 2008

Google App Engine: One-to-many JOIN

GAE By now, no doubt, most developers have heard about the Google App Engine (GAE). And even if you didn't get one of the 10K free accounts, you might still have downloaded and started messing around with the SDK.

Google touts the platform's ease of development, and stepping through the samples reinforce that it is, in fact, quite easy. However, it doesn't take long to discover what will probably be the biggest hurdle for developers entrenched in the relational database paradigm: The Google Datastore. It's not a relational database, and it's not an OOP wrapper to a relational database. It's a web-specialized data storage mechanism, accessed through classes called Models, and objects called Entities.

I'm willing to bet that most of the developers playing with the SDK will first really "get" this when they move past the simple "one table" queries in the samples, and try to do a basic JOIN query. Although there is a SQLlike syntax called Gql -- as stated in the Docs -- there is no JOIN.

To get this functionality, you have to use db.ReferenceProperty to link one object to another. Here's a short demonstration of how it's done. I figure this is much needed, since there seems to be no good examples for it in the Google documentation. (The best information I could find was in the GAE discussion group.)

Below, I've listed example.py in its entirety (don't worry, it's short), and I'll refer to each pertinent section by the line numbers. (You can download the entire sample here. Put it under the SDK folder, and run it like any of the GAE samples.)
1  import os
2 import cgi
3 import wsgiref.handlers
4
5 from google.appengine.ext import webapp
6 from google.appengine.ext import db
7 from google.appengine.ext.webapp import template
8
9 class MainPage(webapp.RequestHandler):
10 def get(self):
11
12 url = EnteredUrl(url="http://domain.com/page.html")
13 url.put()
14
15 match1 = AffinityUrl(
16 url="http://domain.com/dir/page1.html",
17 affinity = .83,
18 entered_url=url
19 )
20 match1.put()
21
22 match2 = AffinityUrl(
23 url="http://domain.com/dir/page2.html",
24 affinity = .8301,
25 entered_url=url
26 )
27 match2.put()
28
29 matched_urls=url.matched_urls.order('-affinity')
30
31 aff_entries = AffinityUrl.all().order('url')
32
33 template_values = {
34 'url' : url.url,
35 'matched_urls': matched_urls,
36 'aff_entries': aff_entries
37 }
38
39 path = os.path.join(os.path.dirname(__file__), 'index.html')
40 self.response.out.write(template.render(path, template_values))
41
42 class EnteredUrl(db.Model):
43 url = db.StringProperty(required=True)
44
45 class AffinityUrl(db.Model):
46 url = db.StringProperty(required=True)
47 affinity = db.FloatProperty(required=True)
48 entered_url = db.ReferenceProperty(EnteredUrl,
49 required=True, collection_name='matched_urls')
50
51 def main():
52 application = webapp.WSGIApplication(
53 [('/', MainPage)],
54 debug=True)
55 wsgiref.handlers.CGIHandler().run(application)
56
57 if __name__ == "__main__":
58 main()
The above stores a URL someone has entered, and then stores other URLs that match it by some degree (the "affinity"). The affinity is a numeric score. This is a simple one-to-many relationship, and to get at the data using standard SQL, we'd write something like:
SELECT
    entered_url.url,
    affinity_url.url,
    affinity_url.affinity
FROM
    entered_url
JOIN
    affinity_url
ON
    entered_url.KEY=affinity_url.FOREIGN_KEY
Here are the steps using the GAE Datastore.

Lines 42-49.
First, let's define the data Model. EnteredUrl defines a single string property, url, for the obvious reason. AffinityUrl defines a string property for url, as well as a float affinity property, for storing the score.

Lines 48-49.
Also, AffinityUrl defines a db.ReferenceProperty named entered_url, which refers to an EnteredUrl object. This is the link between our two data objects, and how we effectively do a JOIN. The collection_name, matched_urls, is used to refer to the collection of AffinityUrl objects that will be linked.

Lines 12-13.
When the page is loaded in the browser we create an EnteredUrl entity named url, setting its url property to a string value.

Lines 15-27.
We setup two AffinityUrl objects, and assign them both a url and a numeric score. Additionally, we point entered_url to our EnteredUrl object, url. We have just linked one object (url) to many (match1, and match2).

Line 29.
This line queries the data in the one-to-many way, and stores it in an object, matched_urls, which I pass through to the template for iteration and output. This is where the collection name we defined in the db.ReferenceProperty attributes is used. Note that the collection name, matched_urls, is called like a method from url, since url is the object being referenced.

Line 31.
Additionally, for illustration, I query the AffinityUrl object data and save it in aff_entries. Just as in SQL, where you can JOIN tables, or query them individually, the App Engine allows you to do both. (Hopefully, you've realized by now that although they look and are accessed differently, these linked entities are behaving quite a lot like relational database tables.)

In the template, I output the data from matched_urls by getting each AffinityUrl object in the collection, and displaying that URL. Note that because of the .order('-affinity') call, we are displaying the URLs with the closest affinity at the top (descending order).
<table>
{% for affurl in matched_urls %}
<tr><td>{{ affurl.url }}</td></tr>
{% endfor %}
</table>
Load this up in your browser, and refresh a few times, and this is what you get:



You may have noticed from the code that I also pass all the data stored in the AffinityUrl model (line 31) to the template as well. This is output in the second table, above.

Because I've refreshed the page several times, I've generated and stored the match1 and match2 objects multiple times to the Datastore. This highlights something strikingly different about the Datastore and a SQL table. SQL statements like the one I give will display all the entries that match between EnteredUrl and AffinityUrl, even if entries in AffinityUrl are duplicated. As you can see, even though we have duplicate AffinityUrl entities stored, the reference from the EnteredUrl entity is smart enough to realize that they are duplicates, and only displays the ones that are unique. Update: please see the comments for a correction of the previous statements. The Datastore is creating new entities each time with a unique ID...

The Datastore takes a little getting used to, especially for those experienced in the standard relational data models. (Good ol' paradigm shift.) The GAE documentation feels unfinished or at least rushed, which is unfortunate. I personally think they should have concentrated more on giving good examples that demonstrate mapping relational concepts to Datastore concepts, since the majority of developers looking at the GAE will be old hands at the relational stuff.

I'm sure they'll get there eventually. In the meantime, I hope you found this tutorial useful.

2 comments:

  1. It is worth explaining further that in line 49, as another critical contribution to the 'join-equivalent' GAE functionality, the GAE framework automatically creates a Collection named 'matched_urls' on each EnteredUrl instance AND the framework automatically adds a reference to the given AffinityUrl into that matched_urls collection of the given EnteredUrl instance.

    ReplyDelete