HBase Datastores
DataNucleus supports persisting objects to, and retrieving objects from, HBase datastores using the
datanucleus-hbase plugin, which makes use of the
HBase/Hadoop jars. Specify your connection URL as follows:
datanucleus.ConnectionURL=hbase[:{server}:{port}]
datanucleus.ConnectionUserName=
datanucleus.ConnectionPassword=
If you specify the URL as just hbase then a local HBase datastore is used; otherwise
DataNucleus tries to connect to the datastore at {server}:{port}. Alternatively, put "hbase"
as the URL and set the ZooKeeper details in "hbase-site.xml" as normal.
You then create your PMF/EMF and use JDO/JPA as normal.
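The connection settings above can also be assembled in code. The following is a minimal sketch using only the JDK; the class HBaseConnectionProps and the host/port values are hypothetical and are not part of DataNucleus.

```java
import java.util.Properties;

/** Hypothetical helper (not part of DataNucleus) that assembles the connection properties. */
final class HBaseConnectionProps {

    /** Returns "hbase" for a local datastore, or "hbase:{server}:{port}" otherwise. */
    static String url(String server, int port) {
        if (server == null || server.isEmpty()) {
            return "hbase";
        }
        return "hbase:" + server + ":" + port;
    }

    /** Builds the three connection properties shown above; user name and password stay empty. */
    static Properties build(String server, int port) {
        Properties props = new Properties();
        props.setProperty("datanucleus.ConnectionURL", url(server, port));
        props.setProperty("datanucleus.ConnectionUserName", "");
        props.setProperty("datanucleus.ConnectionPassword", "");
        return props;
    }

    public static void main(String[] args) {
        // "example-host" and 2181 are placeholder values, not defaults taken from the documentation.
        Properties props = build("example-host", 2181);
        System.out.println(props.getProperty("datanucleus.ConnectionURL"));
        // With the JDO API and DataNucleus jars on the classpath, a Properties object like this
        // could be handed to javax.jdo.JDOHelper.getPersistenceManagerFactory(props).
    }
}
```

Keeping the URL construction in one place makes it easy to switch between a local datastore (no server given) and a remote one.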
The jars required to use DataNucleus HBase persistence are datanucleus-core,
datanucleus-api-jdo or datanucleus-api-jpa, datanucleus-hbase,
hbase, hadoop-core and zookeeper.
There are tutorials available for the use of DataNucleus with HBase, for JDO and for JPA.
Things to bear in mind with HBase usage:
- Creation of a PMF/EMF will create an internal HBaseConnectionPool.
- Creation of a PM/EM will create/use an HConnection.
- Querying can be performed using JDOQL or JPQL.
Some components of a filter are handled in the datastore, and the remainder in-memory.
Currently any expression involving a field (in the same table) or a literal is handled in-datastore,
as are the operators &&, ||, >, >=, <, <=, ==, and !=.
- The "row key" will be the PK field(s) when using "application-identity", and the generated id
when using "datastore-identity".
Field/Column Naming
By default each field is mapped to a single column in the datastore, with the family name being
the name of the table, and the column name based on the name of the field (but following
JDO/JPA naming strategies for the precise column name). You can override this as follows:
@Column(name="{familyName}:{qualifierName}")
String myField;
replacing {familyName} with the family name you want to use, and {qualifierName} with the
column name (qualifier name in HBase terminology) you want to use.
Alternatively, if you don't want to override the default family name (the table name), omit
the "{familyName}:" part and specify only the column name.
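The naming rule above can be pictured with a small sketch. The class HBaseColumnName below is hypothetical and only illustrates how a "{familyName}:{qualifierName}" value splits, with the table name as the fallback family; it is not the plugin's own code.

```java
/** Hypothetical illustration of the family/qualifier naming rule; not the plugin's actual code. */
final class HBaseColumnName {
    final String family;
    final String qualifier;

    private HBaseColumnName(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    /**
     * Splits a column name of the form "family:qualifier".
     * When no "family:" prefix is present, the table name is used as the family.
     */
    static HBaseColumnName parse(String columnName, String tableName) {
        int sep = columnName.indexOf(':');
        if (sep < 0) {
            return new HBaseColumnName(tableName, columnName);
        }
        return new HBaseColumnName(columnName.substring(0, sep), columnName.substring(sep + 1));
    }
}
```

For instance, parsing "meta:firstName" for a table named MyClass gives family "meta" and qualifier "firstName", while parsing "blob" gives family "MyClass" and qualifier "blob".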
MetaData Extensions
Some metadata extensions (@Extension) have been added to DataNucleus to support some of HBase's
table creation options. The attributes supported for a column family at table creation are:
- bloomFilter : Bloom filters are an advanced HBase feature that can improve lookup times
when you have a specific access pattern. Default is NONE. Possible values are:
ROW -> use the row key for the filter, ROWKEY -> use the row key and column key (family+qualifier)
for the filter.
- inMemory : The in-memory flag defaults to false. Setting it to true does not guarantee that
all blocks of a family are loaded into memory, nor that they stay there. It gives them an elevated
priority, so that they are kept in memory once loaded during a normal retrieval operation, until
the pressure on the heap (the memory available to the Java-based server processes) is too high, at
which point they have to be discarded.
- maxVersions : Per family, you can specify how many versions of each value you want to
keep. The default value is 3, but you may reduce it to 1, for example, if you know
for sure that you will never want to look at older values.
- keepDeletedCells : Column families can optionally keep deleted cells. That means deleted
cells can still be retrieved with Get or Scan operations, as long as these operations have a time range
specified that ends before the timestamp of any delete that would affect the cells. This allows
for point-in-time queries even in the presence of deletes. Deleted cells are still subject to TTL
and there will never be more than "maximum number of versions" deleted cells. A "raw"
scan option returns all deleted rows and the delete markers.
- compression : HBase has pluggable compression algorithms. The default value is NONE.
Possible values are GZ, LZO and SNAPPY.
- blockCacheEnabled : As HBase reads entire blocks of data for efficient I/O usage,
it retains these blocks in an in-memory cache so that subsequent reads do not need any disk
operation. The default of true enables the block cache for every read operation. If
your use case only ever has sequential reads on a particular column family, it is advisable
to set this to false so that those reads do not pollute the block cache.
- timeToLive : HBase supports predicate deletions based on the number of versions kept for each
value, and also on specific times. The time-to-live (or TTL) sets a threshold based on the
timestamp of a value, and the internal housekeeping automatically checks whether a
value exceeds its TTL. If it does, the value is dropped during major compactions.
To express these options, a format similar to a properties file is used such as:
hbase.columnFamily.[family name to apply property on].[attribute] = {value}
where:
- attribute: One of the above defined attributes (inMemory, bloomFilter,...)
- family name to apply property on: The column family affected.
- value: Associated value for this attribute.
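The key format can be sketched as a small helper that composes and decomposes extension keys. The class HBaseFamilyOption is hypothetical and exists only to illustrate the "hbase.columnFamily.[family].[attribute]" layout; the plugin's own parsing may differ.

```java
/** Hypothetical illustration of the extension key layout; not the plugin's actual code. */
final class HBaseFamilyOption {
    static final String PREFIX = "hbase.columnFamily.";

    /** Composes a key such as "hbase.columnFamily.meta.bloomFilter". */
    static String key(String family, String attribute) {
        return PREFIX + family + "." + attribute;
    }

    /**
     * Splits a key into {family, attribute}, taking the text after the last dot as the attribute.
     * Returns null when the key does not follow the expected layout.
     */
    static String[] parse(String key) {
        if (!key.startsWith(PREFIX)) {
            return null;
        }
        String rest = key.substring(PREFIX.length());
        int dot = rest.lastIndexOf('.');
        if (dot <= 0 || dot == rest.length() - 1) {
            return null;
        }
        return new String[] { rest.substring(0, dot), rest.substring(dot + 1) };
    }
}
```

A key built this way is the string that goes into the key attribute of an @Extension annotation, with the option's value in its value attribute.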
For example, the following sets the bloom filter option to ROWKEY and the in-memory flag
to true for the "meta" column family:
import javax.jdo.annotations.Column;
import javax.jdo.annotations.Extension;
import javax.jdo.annotations.Extensions;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.PrimaryKey;

@PersistenceCapable
@Extensions({
    @Extension(vendorName = "datanucleus", key = "hbase.columnFamily.meta.bloomFilter", value = "ROWKEY"),
    @Extension(vendorName = "datanucleus", key = "hbase.columnFamily.meta.inMemory", value = "true")
})
public class MyClass
{
    @PrimaryKey
    private long id;

    // column family data, name of attribute blob
    @Column(name = "data:blob")
    private String blob;

    // column family meta, name of attribute firstName
    @Column(name = "meta:firstName")
    private String firstName;

    // column family meta, name of attribute lastName
    @Column(name = "meta:lastName")
    private String lastName;

    // ... getters and setters ...
}