tsoubiran/UCD

Unicode Character Databases (UCD)

This repo contains most of the Unicode Character Database (UCD) version 13.0.0 files converted to R data.frames, contributory data files from other databases, and the code used to download and convert the files. It is intended to serve as a companion site to this blog series on the Unicode Standard.

The UCD provides machine-readable data files listing the character properties needed to implement the Unicode Standard, and is documented in Unicode Standard Annex #44.

Additional information on this repo can be found here and in other entries of the series.

Data Layout

Most files consist of Unicode character integer code points (cp), code point ranges (cp_lo, cp_hi), or sequences, together with one or more corresponding property values or mappings to other code points. Most files also contain one or more additional informative fields. Since these may vary in length, meaning and format, they're usually stored in a single column named comments located at the end of the data.frame. Note that these are different from the line comments, which contain information about code points or code point ranges (see the Metadata section below).
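As a rough illustration of the range layout, the sketch below looks up the row whose code point range contains a given code point. The blocks data.frame is a small hand-made stand-in holding the first three Unicode blocks, and the column names cp_lo and cp_hi as well as the helper find_range are assumptions made for the example; the repository's data.frames may be organised differently.

```r
# Toy stand-in for a range-based data.frame (first three Unicode blocks).
# Column names cp_lo / cp_hi are assumed for this example.
blocks <- data.frame(
  cp_lo = c(0x0000L, 0x0080L, 0x0100L),
  cp_hi = c(0x007FL, 0x00FFL, 0x017FL),
  block = c("Basic Latin", "Latin-1 Supplement", "Latin Extended-A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows whose [cp_lo, cp_hi] range contains `cp`.
find_range <- function(df, cp) {
  df[df$cp_lo <= cp & cp <= df$cp_hi, , drop = FALSE]
}

cp <- 0xE9L                      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
find_range(blocks, cp)$block     # block containing that code point
```

A code point that falls outside every range simply yields a zero-row data.frame, which is easy to test for with nrow().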

In addition, some data.frames have the variable.labels attribute set with a short description of each column. Use attr(<data.frame>, "variable.labels") to see them.
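The following sketch shows how such an attribute can be set and read back. The data.frame and its labels are made up for the example, and it assumes the labels are stored as a named character vector, which may not match the repository's actual objects.

```r
# Toy data.frame; the labels below are invented for illustration.
df <- data.frame(cp = c(0x41L, 0x42L), gc = c("Lu", "Lu"),
                 stringsAsFactors = FALSE)
attr(df, "variable.labels") <- c(cp = "code point", gc = "General_Category")

# Retrieve the labels; attr() returns NULL when the attribute is not set.
labels <- attr(df, "variable.labels")
labels[["gc"]]
```

Checking for NULL first is worthwhile, since only some of the data.frames carry this attribute.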

Also, some R data.frame names are abbreviated to avoid too much typing because of the added prefix. The mapping between the original file names and the corresponding data.frames is documented in the README file of the ucd sub-directory, along with a link to each file's description. A listing of each file's header and first six lines is also included.

Contributory data files for the Unicode Collation Algorithm are located in the uca sub-directory, and Ideographic Variation Sequences files are located in the ivs sub-directory.

The Unicode Consortium also provides data files for Unicode Security Mechanisms, which are located in the security sub-directory of the repository:

File Name            data.frame Name
confusables          ucs.confusables
IdentifierStatus     ucs.idStat
IdentifierType       ucs.idType
intentional          ucs.intentional
confusablesSummary   ucs.confusablesSummary

Please refer to UTS#39: Unicode Security Mechanisms for further details.

If you don't want to download the entire repository, you can load individual files from R like this:

 # Open a connection to the file, load its objects, then close the connection
 co <- url('https://github.com/tsoubiran/UCD/blob/master/13.0.0/ucd/Rdata/UnicodeData.Rdata?raw=true')
 load(co)
 close(co)

Note that in order to do this, you need to change the domain name. Otherwise, using https://github.com/tsoubiran/UCD/blob/master/13.0.0/ucd/Rdata/UnicodeData.Rdata won't work, because github.com will then redirect to raw.githubusercontent.com, and url() does not seem to handle redirection.
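One way to sidestep the redirection is to build the raw.githubusercontent.com address directly and download the file to a temporary location before loading it. The sketch below is only an illustration: ucd_raw_url and load_ucd are hypothetical helpers, not functions from this repository, and the path layout is assumed from the single URL shown above.

```r
# Hypothetical helper: raw.githubusercontent.com URL of an .Rdata file in the
# ucd sub-directory (path layout assumed from the URL shown above).
ucd_raw_url <- function(name) {
  sprintf(
    "https://raw.githubusercontent.com/tsoubiran/UCD/master/13.0.0/ucd/Rdata/%s.Rdata",
    name
  )
}

# Hypothetical helper: download to a temporary file, then load into `envir`.
# Returns the names of the loaded objects.
load_ucd <- function(name, envir = globalenv()) {
  tmp <- tempfile(fileext = ".Rdata")
  on.exit(unlink(tmp))                       # remove the temporary file
  download.file(ucd_raw_url(name), tmp, mode = "wb", quiet = TRUE)
  load(tmp, envir = envir)
}

# load_ucd("UnicodeData")   # needs network access
```

Loading into a dedicated environment, for example load_ucd("UnicodeData", envir = new.env()), keeps the downloaded objects out of the global workspace.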

You can also use the getUCDRdata function as demonstrated here.

Code

The src sub-directory contains (very hackish) R code used to download the original files and convert them to R data.frames, along with some utilities for dealing with the files' metadata. To use this code, you'll need the stringi package installed for parsing the files, as well as the RCurl and XML packages for the download script.

Please refer to this blog entry for instructions on how to use this code.
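Before running the scripts, it may help to check that the three packages are available. This is a minimal sketch; missing_pkgs is a hypothetical helper written for this example and is not part of the repository's code.

```r
# Packages named above: stringi for parsing, RCurl and XML for downloading.
required <- c("stringi", "RCurl", "XML")

# Hypothetical helper: which of `pkgs` cannot be loaded?
# Note that requireNamespace() loads each namespace it finds as a side effect.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install from CRAN
}
```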

Metadata

Each data.frame also stores the original commented lines, if any, in an attribute named "htxt". For example, you can retrieve the header of an original UCD file, which describes the file's format and points to the relevant Technical Report, like this:

ucd.hdr(ucd.propValal)

or extract every comment block:

## without expansion
comments <- ucd.comments(ucd.propValal)
## with expansion
comments <- ucd.comments(ucd.propValal, xpd=TRUE)

ucd.comments returns a list of all comments. Using xpd=TRUE returns a list of the same length as the original data.frame, with comments located at their original position in the file. This can be useful for extracting relevant information and merging it back with the data.frame.
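The stored lines can presumably also be read straight from the attribute with base R. The sketch below uses a toy data.frame with placeholder comment lines and assumes that "htxt" holds a plain character vector; the actual structure used by the repository may differ, in which case ucd.hdr and ucd.comments remain the safer route.

```r
# Toy data.frame with placeholder comment lines (not real UCD content).
# Assumption: "htxt" is a character vector of the original commented lines.
df <- data.frame(cp = c(0x41L, 0x42L))
attr(df, "htxt") <- c("# Placeholder header line", "# Another placeholder line")

# Read the stored lines directly; NULL means no commented lines were kept.
htxt <- attr(df, "htxt", exact = TRUE)
if (!is.null(htxt)) writeLines(head(htxt, 1))
```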

License

UCD files are distributed under the Unicode® Copyright and Terms of Use. Please refer to this page for more information.