
XML Length Restrictions


This week I spent some time in Stockholm attending one of the SDL Roadshows.  As usual it was a great event, and we have more to come.  In fact this year I get to attend a fair few so if you’re attending Copenhagen, Milan or Paris in May then I’ll look forward to seeing you there!

But I’m not writing about the roadshows.  I also enjoyed a day before the roadshow with some of our very technical customers in a small workshop and as usual they had lots of interesting questions to tax our software and my brain!  But this time I had reinforcements in the shape of Iulia who is a QA Engineer from our Cluj office.  The team in Cluj never cease to amaze me with their dedication to making the products better and in supporting our customers, in addition to their knowledge of our products.  But the reason I want to mention Iulia in particular is because these technical sessions always involve questions around how we handle XML in Studio.  This time was no exception and one question in particular had me dreaming up all kinds of workarounds… they were interesting I think, but unnecessary because Studio has some clever features here I’d never looked at before, but Iulia had.  Of course I don’t know why I’d expect anything less from a team that QA our products, but I thought it would be good to share.

The question was this: “When will there be an easy way to check the number of characters for a translatable string in an XML file?”

Studio of course can check any segment against a fixed character count by using this option in the QA Checker:

[Screenshot]

But if you have an XML file that uses attribute values that vary for different elements used in the file then this check is next to useless.  Take a file like this simple example with one translatable element called segment and a length attribute called length!

[Screenshot]
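A minimal example of such a file (my own invented content, with an invented root element) might look like this:

<catalog>
  <segment length="10">Start page</segment>
  <segment length="20">Choose your language</segment>
  <segment length="50">A description of up to fifty characters.</segment>
</catalog>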

 

Workaround #1

Create a custom XML filetype using the Studio SDK (Software Development Kit) that could use a length attribute value as a QA check by default.  So it could read the value of whatever attribute you defined and apply it during interactive translation or when you ran a QA check.

Actually I kinda like this workaround and if it was implemented I think it would be a very smart solution.  But it’s not out of the box so I move on…

Workaround #2

Build the attribute value into a stylesheet for a visual check against the segment count shown in the Studio Editor:

[Screenshot]
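As a sketch of the idea (my own minimal example, not the stylesheet from the screenshot, and assuming the segment element and length attribute from the file above), the stylesheet could use string-length() to print each segment with its count against the permitted maximum:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- List every segment with its current length against its permitted maximum -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:for-each select="//segment">
          <p><xsl:value-of select="."/> (<xsl:value-of select="string-length(.)"/> of <xsl:value-of select="@length"/> characters)</p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>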

I quite like this too because the visual is always nice to see and stylesheets are not that hard to create.  But the lack of any automated checking is an obvious drawback here.  So the next solution I can’t call a workaround because it’s using out of the box XML niftiness in Studio as suggested by Iulia.

Solution #3

The idea here is to give each element a different structure based on the attribute value and then use a built-in check for maximum or minimum lengths in the Advanced Tag Settings… brilliant!  I look at these settings quite a bit but had never used these, or paid attention to what they can do, before.  It proves the value of spending a little time to review the options in this product from time to time and just see what’s possible.

Studio has this concept of Document Structure Information which can be used to provide the translator with additional information about the segment they are translating.  So if you open a Word file in Studio for example and then look at the right-hand column you might see something like this.  You can click on these abbreviations if you like and you’ll be presented with more detailed information, but this is telling you that you have:
- a Heading style
- a Paragraph style
- a List Item
- a Table Cell (the + symbol means there is more structure in here)
- a Text Box

This is very handy because the context of the segments you are translating can prove useful.  Localisation Engineers will often define their own Document Structure when creating XML filetypes so they can improve the experience and also provide more context for the translator.  It’s this structure we’re interested in here.

In my example I have three different values in the length attributes (10, 20 & 50), so I need to create three parser rules to allow me to set a different length check for each one.  I do this by using a little XPath expression like this:

//segment[@length="50"]

This just means extract the contents of the segment element when the value of the length attribute is 50.  I repeat this three times so I have a rule for 50, 20 and 10 characters.  I then add my Structure to these rules by clicking on the Edit… button and then here:

[Screenshot]

Now I can add a new Structure property to the element by clicking on the Add… button here:

[Screenshot]

I can call this whatever I like, but to keep it simple I have used the same name as the number of characters so I have some consistent rules around this process (at least this seems logical in my head!):

[Screenshot]

I then click OK twice and stop at the Edit Rule window because now I want to go to my Advanced… settings to add the length check.  I can also see the name of my new Structure property in the field now:

[Screenshot]

In the Advanced… settings pane I can add the value I wish to check for, which in this case is 50 characters:

[Screenshot]

After repeating this a further two times with the appropriate values for each length attribute value I now have four parser rules for my simple example:

[Screenshot]

All I need to do now is go back to my verification settings and choose this option that checks if the target segments are within file specific limits (I don’t know how I missed this in the past because I have been asked this question before and didn’t have a good answer!):

[Screenshot]

And now when I translate the file I can see several things.  First of all I get a warning interactively with the little yellow triangle.  Hovering over this triangle tells me exactly what the problem is:

[Screenshot]

In addition to immediately knowing there is a length restriction on this segment, simply because of the pinkish LN abbreviation in the right-hand column, I can also read the Document Structure Information by clicking on the coloured LN+ abbreviation on the right.  I only added the “10” as a structure here, but Studio added the LN for the length restriction imposed in the QA settings:

[Screenshot]

 

Finally I also have plenty of detail in the verification message details panel so I know exactly what to try and achieve in order to satisfy the requirement to translate this segment in 10 characters or fewer:

[Screenshot]

This is a really great, sophisticated capability that I’m glad Iulia was able to share.  Combined with the stylesheet it’s a really cool solution… and of course my initial workaround is still an option for anyone with a developer, which would bring all the benefits of this out-of-the-box solution into a maintenance-free one for a localisation engineer.  Studio truly is a versatile and capable localisation tool!



Working with Studio Alignment


The new alignment tool in Studio SP1 has certainly attracted a lot of attention, some good, some not so good… and some where learning a few little tricks might go a long way towards improving the experience of working with it.  As with all software releases, the features around this tool will be continually enhanced and I expect to see more improvements later this year.  But I thought it would be useful to step back a bit because I don’t think it’s that bad!

When Studio 2009 was first launched one of the first things that many users asked for was a replacement alignment tool for WinAlign.  WinAlign has been around since I don’t know when, but it no longer supports the modern file formats that are supported in Studio so it has been overdue for an update for a long time.

It wasn’t until SDL released Studio 2014 in the third quarter of 2013 that a new alignment tool was released.  The new tool was based on the premise that most of the time aligning files is a waste of your time!  Many users find themselves being provided with a bunch of files, maybe even hundreds, with matching translations where, to make matters worse, one is often a PDF while the other is a DOC file, so the alignment effort and the value of the resultant Translation Memory are out of proportion to the work involved.  Many translators have told me they have spent days aligning files (normally over the weekend!) to create translation memories they may never get any value from again, and got very little value from in translating the project they aligned for in the first place!

So the idea behind the original Studio 2014 alignment tool was to allow you to very quickly create a usable Translation Memory based on a sliding scale of alignment quality.  You threw your hundred document pairs into the alignment tool and made a decision on what sort of quality you wanted (which in practice could be a little tricky, and it paid to do a couple of trial runs with some smaller documents to make sure you had this right), and then with almost one click your Translation Memory was magically created and you could concentrate on the real work of translating.

[Screenshot]

This desire to make things easy for the translator was a worthy one, and most of the time it produced a pretty good alignment, and quickly.  But it didn’t allow for these:

  • Really poor quality files that needed some sort of manual touches to ensure a decent alignment and useful Translation Memory.
  • Alignment Projects!  It’s not uncommon for a translator, or a company, to be tasked with creating the best possible Translation Memory from all the monolingual documents available in both source and target languages.

So, when Studio SP1 was released SDL added an alignment editor to allow both of these things to be catered for.  The SP1 release is the first one, and there will be continual improvements to the editor, but the first incarnation does a reasonable job; especially if you know a few simple tricks and ways to work with it.

So, I have created a video, around 17 minutes long, where I aligned a couple of files and explained some of these tricks as I went along.  Hopefully by the time you have watched it to the end you will have a better idea of how to get the most from the existing version and can use it happily while waiting for future enhancements that will improve it even more.  I put the video at the end of this post because first of all I thought it would be helpful to just note a few important things that are useful to know when working with the alignment tool.

  1. You can still use the quick alignment and just throw 500 document pairs into the tool and have the Translation Memory created without any effort on your part at all.  All aligned segments will have a quality value added as a Field Attribute so you can further refine how you work with this Translation Memory on your Project as well as recording the filename of the source and target files used in the alignment process:
    [Screenshot]
  2. You can have two kinds of alignment projects.  One for working with multiple files and have all the alignment files created and saved ready to open and work with, or you can do a quick file pair alignment that opens immediately in the editor after you select the files and a Translation Memory:
    [Screenshot]
  3. You can change the segmentation rules and other language resources for the source and/or the target file by selecting the appropriate language in the Translation Memory settings:
    [Screenshot]
  4. The filetype settings for the files being aligned are based on the active Project in Studio.  This is because the assumption is that you will always be aligning for a specific Project and so this will be active before you start.  However, we know from the reasons above that this might not always be the case, so a useful tip might be to create a Project just for use in Alignment Projects and then you can change the filetype settings in your dummy Project as needed, and make it the active Project (select it and press the Enter key so it goes bold), whenever you carry out any alignment work:
    [Screenshot]
  5. If you start the alignment and find that the segments are not automatically aligned very well at all because of differences between the source and target files, then use the Realign function.  This is a very powerful way to improve the alignment quite quickly by doing the following:
    [Screenshot]
    1. Disconnect all or some of your segments.
    2. Connect some of the segments in your alignment projects around the worst affected areas and then click Realign.
    3. This will take advantage of your “help” and vastly improve the alignment process, reducing your effort and making it easier to work through the file.
  6. Alignment can be carried out using the icons in the ribbon, with the mouse, with keyboard shortcuts or using the Alignment Edit mode (described in the video).
  7. The finished alignment can be imported into a Translation Memory, or saved as an SDLXLIFF.  If you use the latter you can then use Studio to perform quality assurance checks and do any further refined editing you consider necessary if you are preparing a high quality Translation Memory for yourself, or your client.  The Quick Import will just import everything into your Translation Memory that you have confirmed… so only quality values of 100.  The Advanced Import will allow you to import based on the quality values of the aligned pairs; so you use the slider to set the value and everything above that value will be imported:
    [Screenshot]
  8. When you do the alignment the coloured lines have a meaning.  Solid green lines are confirmed and have a quality value of 100.  Dotted lines are unconfirmed.  Red lines indicate a poor quality value, and the greener they get the better the software believes the alignment to be.
    [Screenshot]
  9. There is a limit to how many segments you can select to merge.  In both Alignment Edit Mode and normal aligning mode you cannot select more than three segments at a time.  In Edit mode the Connect n:n command greys out, and in normal mode you will see a small no entry symbol displayed when you try and select the fourth one:
    [Screenshot]
  10. You cannot split segments.  If you need to split segments because you want two Translation Units then this could be carried out by saving the file as an SDLXLIFF and making the changes in there.
  11. You cannot delete or insert segments.  So in the example video where I have added new segments to one of the files on purpose you would simply align around them.  You could not insert segments to provide a source translation for them, nor could you delete them to avoid the misalignment.
  12. Alignment penalty… Studio adds a 1% penalty to all Translation Memory results that come from an alignment by default.  So you will only get a 99% match even though you confirmed the alignment and it now has a quality value of 100.  You can change this to zero and turn your 99% matches into 100% matches here:
    [Screenshot]

I have tried to explain anything else I thought was important in the video.  17 minutes is longer than anything I would normally expect you to sit through, but I hoped it would be useful to work through a complete file and tackle the sort of things you are likely to come across in the process, and the 17 minutes were over almost as soon as I started… or at least it felt like that!


Why do we need custom XML filetypes?


My son, Cameron László, asked me how my day had gone and before I could answer he said in a slightly mocking tone “blah blah blah… XML… blah… XML … blah blah”.  Clearly I spend too much time outside of work talking about work, and clearly his perception of what I do is tainted towards the more technical aspects I like the most!  Aside from the note to self “stop talking about this stuff after I leave the office!” it got me thinking about why I probably think about XML as much as I apparently do and how I could help others avoid the very same compulsion!  I’ve written articles in the past about how to use regular expressions in Studio, and an article on using XPath, and I’ve probably touched on handling XML files from time to time in various articles.  But I don’t think I’ve ever explained how to create an XML filetype in the first place, or why you would want to… after all Studio has default filetypes for XML and this is just another filetype that the CAT tool should be able to handle… right?

Wrong!  Well partly wrong anyway.  If the XML is simple then the default filetypes will probably handle the file perfectly well. But what makes XML unique compared to most other filetypes is that the translatable text could be hidden in user defined locations, and Studio (or any CAT tool for that matter) does not necessarily know where it is, or that it should be translated, without you providing some additional information.

But before I dive in I think it might be helpful to understand a little of the terminology here, not everything you’ll ever need to know about XML as this is a pretty big subject and I’m still learning myself, but rather a few simple things that are relevant to knowing how to extract translatable text.  Take this short XML file as an example:

[Screenshot]
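Reconstructed as plain text (the screenshot shows it colour coded, and I’ve had to guess the attribute name), the file reads something like this:

<grandparent>
  <parent>I’m the parent and I have two children
    <child gender="male">My son is called George.</child>
    <child gender="female">My daughter is called Sally.</child>
  </parent>
</grandparent>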

The first red line <grandparent> is called an element.  In this case it’s an element I decided to call “grandparent”.  In fact this element is a special one because it’s also the first and last element in the file.  So this is also called the “root element” and it’s important to note this because we can use the root element as one way to automatically decide which filetype to use when opening an XML file in Studio.

There are two more elements in my file; “parent” and “child”.  I deliberately used these names because this nesting of the elements inside one another is important.  Here we have one element called “parent” which is a child element of the “grandparent” and it contains the translatable text “I’m the parent and I have two children”.  Inside this “parent” element I have two “child” elements and they also contain an attribute.  The translatable text is in black, “My son is called George.”, and “My daughter is called Sally.”  The attributes provide more detailed information about the children, so in this case defining whether they are male or female.

As a general rule, the information I have just provided is how we would like things to be in an ideal world.  But that would be too easy and in reality there are no real rules to say when you should use an element and when you should use an attribute!  In practice this is just how we would like to see them in the translation industry and if they come like this then the default XML filetype, called AnyXML in Studio, will often suffice.  But we want to look at the real world!

Why use custom XML filetypes?

So let’s take a look at a couple of files, starting with an XML file that looks like this and following the simplistic logic I explained above:

[Screenshot]

This is quite straightforward, every piece of text that looks as though it’s translatable is inside an element.  If I open this in Studio without creating a custom XML filetype it looks like this – I’m using the TagID mode to display tags so the tags are numbered and the orange tab at the top is displaying the name of the filetype that is being used to open the file for translation:

[Screenshot]

So here, the AnyXML filetype (the default in Studio) does a pretty good job and even manages to determine which tags are likely to be inline versus external, so the text flows quite nicely making it easy for the translator.  For this file you could even copy source to target for things like segment #2 which is not really translatable text.  So pretty simple and definitely usable… But now let’s take an XML file like this which contains exactly the same content but has been prepared in another, equally valid, XML way:

[Screenshot]
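A hypothetical reconstruction of such a file (the element and attribute names match the parser rules discussed later in this article; the content itself is invented) might be:

<rootelement>
  <product productname="Super Widget">
    SW-1234
    <productcatalogentry><![CDATA[<div class="mat">The Super Widget &amp; all of its accessories.</div>]]></productcatalogentry>
  </product>
</rootelement>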

This one is trickier because some of the translatable text is now in an attribute rather than an element (the productname attribute for example), and the default AnyXML filetype does not extract text from an attribute.  We might not consider it good practice to use an attribute for this, but in practice we all know it’s pretty commonplace and there can be good reasons for the file to be constructed this way.

In addition to this, much of the text is provided in the file by using a CDATA section.  This is normally used for a part of the file that should not be parsed at all by an XML parser, often including characters that would be illegal in XML.  This can be a complete html file that is embedded into the XML, or part of one as I have done here, or even something else based on a custom script written by a developer.  So there is no single way to handle all embedded content.  If I open this file with the default AnyXML filetype then it looks like this, where we see html entities, opening and closing tags (<> for example) and the names of these tags (div class=”mat” for example), all of which you would not want to have to try and translate around:

[Screenshot]

Yuck!  Not very nice at all, because not only is it parsing the html code inside the CDATA section as translatable text without any kind of tag protection at all, it’s also missing the product name at the start because in this file it was stored in an attribute as opposed to an element.  So what we really need is a custom XML filetype that can deal with the specific nuances of this particular XML file.

The release of Studio 2014 SP1 provides a neat way of dealing with the CDATA or any other form of embedded code, but the basic principle of how to create a custom XML filetype in Studio is the same now as it has been since the release of Studio 2009.  At the risk of this being a ridiculously long post let’s take a look at how this is done using the second, and more complicated, XML example… at least we’ll look at how I normally tackle it as there are other ways.

A general note though.  I won’t be covering every single thing about XML file types in Studio either.  So if you have a question I didn’t address in this post please refer to the online help which is pretty useful, or post your question into the comments below so we can build a useful reference article for anyone else.

First Steps

My first step in creating a custom XML filetype in Studio is to import the XML file so I have all the elements and attributes in the XML available to me for selection as I create the custom rules.  To do this you go to the File Types node in your options (not forgetting there are differences between the general Options and Project Settings as usual) and click on New which will bring up the Select Type dialog box:

[Screenshot]

If you’ve done this before in any version of Studio prior to Studio 2014 SP1 you will note that there are now two options for XML.

  1. XML (Legacy Embedded Content), and
  2. XML (Embedded Content)

I mention them in this order because the XML (Legacy Embedded Content) filetype is the same as it was in previous versions.  XML (Embedded Content) is the approach new to Studio 2014 SP1 and I’ll cover this after discussing the old one briefly as I go through the steps.  The next steps will be the same irrespective of which of the two XML types you choose.

File Type Information

First you select the XML filetype you want and click on OK.  This brings up the File Type Information dialogue box where the only two things I normally change are the File type name and the File type identifier:

[Screenshot]

I change these because it makes it easier for me when I’m working to see the name of the filetype I created in the list of available filetypes, and also when I open the file and change the tag display to TagID mode the orange tab at the start and end of each file will display the name of the filetype too.  Because I generally create XML filetypes to help other users I find it useful to easily distinguish the names in this way.  Then I click on Next >.

XML Settings Import

This takes me to the XML Settings Import dialogue where I typically select my XML file so that all the elements and attributes are added to my filetype to make it easier for me to create the rules:

[Screenshot]

I browse to my XML file, or one that is representative of a batch of XML files, and after it’s selected as shown above click on Next >.

Parser Rules

I now see all the elements that have been identified in the file, and they are listed like this based on whatever defaults Studio believes the tags should be:

[Screenshot]

At this point I would normally just click on Next > and would address the rules in detail after the filetype was created, but to keep this simple and ensure the article flows logically I’ll make the changes to the parser rules now.  But it’s worth noting that if you miss something it doesn’t matter as it’s simple to make changes later on.

If you look at the XML example above, the one with the CDATA as this is the XML we are addressing here, you can see that what we want to extract for translation is the content of the productname attribute and the productcatalogentry element.  The rest I don’t want.  So first I remove, or disable, the rootelement and the product.  Then I add in two rules… you’ll see the Add…, Edit… etc. toolbar becomes active with more options when you select a rule.  My Parser Rules now look like this:

[Screenshot]
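Reconstructed in plain text, and in order, the rule list amounts to this:

//*                       Not translatable
productcatalogentry       Always translatable
//product/@productname    Always translatable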

I created the //* by selecting the XPath option in the Add Rule dialog box (read more on XPath in Studio).  This is basically a wildcard where the star simply means select everything, and I made that Not translatable.  I did this because the first thing I want is to make sure I get nothing at all parsed into my file, and then I can bring in the information I specifically ask for, which in this case is the two rules above that.  This is not essential, but I was shown this when I first learned from the Master, Patrik Mazanek, and the habit stuck!

The productcatalogentry element was already there, I just changed the Translate property to Always Translate by editing the rule.  I did this as a matter of course because the default is Translatable (except in protected content) and I want to be sure that the content of this element will always be extracted even if its parent element is not.  Plus of course I wanted to explain this concept, which could be the reason for text not being parsed if you set a parent element to Not translatable.

I could have created the //product/@productname rule using XPath too, but because I imported the file into Studio earlier on as part of these steps it’s easier to let Studio do this for me.  So I just Add… a new rule and select the Rule type Attribute, then select the element containing the attribute I want to narrow it down (a large XML file could contain a huge list, and sometimes with overlapping attribute values):

[Screenshot]

I set this to Always Translatable as well and then click Next >.

File Detection

I’m now brought into the File Detection dialog that provides me with a number of different ways to recognise the XML file I am opening.  This is very important because as you create more and more filetypes it’s quite easy for Studio to use the wrong one for a particular file, and if you didn’t notice (also remember why I always change the identifiers at the start) you may find your translated file coming back partially translated, or parsing information that should not be at risk of change at all.  This is particularly so when handling XML files from the same customer as they may well use the same root element but have different schemas for example.  In this case, let’s keep it simple and just use the root element as the criterion for recognising my file:

[Screenshot]

My root element was actually called rootelement and as you can see in the image it is already populated because I imported the XML into my filetype at the start.  So all pretty straightforward… and at this point I can click on Finish… and that’s it.  My custom filetype is complete… almost!

[Screenshot]

My attribute is being parsed this time so segment #1 contains the translatable text from the attribute value (note that this is also annotated as a TAG in the document structure column on the right because it is an attribute and not an element), and I no longer get the product code which was extracted with the default AnyXML.  So I’m nearly there.  All I have to do now is tackle this embedded HTML in the CDATA section.

There are two ways to do this depending on which XML filetype you created, but I’m mostly interested in the new way with Studio 2014 SP1.  However, I’ll take a brief look at the Legacy Embedded Content first.

XML (Legacy Embedded Content)

In both the Legacy and the new method for handling Embedded Content you have to first enable it.  So for this legacy filetype I would do this and check the box:

[Screenshot]

When I do this I can now add the Document Structure I want to be handled with the Embedded content processor.  At this point you may well be asking what I mean by this?  Well, take the file we have so far.  The right-hand column, the one that appears when you open the file, contains this information, and you can expand it by clicking on it:

[Screenshot]

The Code you see here is the code you need to add into the list for any text that contains embedded content that you wish to treat with the embedded content processor.  In this example TAG is the code, but I don’t want to use that one as there is no embedded content in this segment.  It’s also worth noting that if there were, I could not handle embedded content inside an attribute anyway… hopefully most users will never come across anything so poorly written as that!  There is embedded content in the next segment however, as this is the CDATA Section.  Now, because these two types of Document Structure are also Studio codes, and not custom codes that I created myself, I can use the Location (Tag Content in the example above) to identify it in the list.  So I actually want the CDATA Section which I can select like this:

[Screenshot]

Once I’ve done that I need to create my Tag definition rules.  Now this will be a similar process to the way you handle embedded content in a Microsoft Excel file which I wrote about in “Handling taggy Excel files in Studio…“.  So I won’t write a lot more on this process for Legacy XML filetypes.  Suffice it to say the finished rules for my filetype might look something like this:

[Screenshot]
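I won’t reproduce them all, but hypothetically the rules pair regular expressions along these lines (my own invented patterns, not the exact ones from my filetype):

Tag pair:     opening <div[^>]*>    closing </div>
Tag pair:     opening <p[^>]*>      closing </p>
Placeholder:  <br ?/?>
Placeholder:  &[a-z]+;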

These take a while to create, are pretty rough, and the finished article, whilst better than the version produced by the AnyXML filetype, does still leave a bit to be desired.  I could spend time working on this to make it more user friendly, but even after all of this it would only take a file to be provided that contained different markup and I might have to start changing the rules again.

Using these rules I get this, which protects things I wanted protected but also doesn’t really make for simple translating, because everything is a tag, including the entities, and the translator will have no idea about the context of the text because the embedded content rules with this method cannot hold Document Structure Information of their own.  I would not say this has no place however, because there are some files where the ability to use regular expressions to protect tags, and text you don’t want to be translated, is a real plus.  But there is a better way!

[Screenshot]

Ziad Chama also recorded an excellent webinar that is freely available called “How to create an XML File Type in SDL Trados Studio 2014” which goes through the process in detail.  I’d thoroughly recommend you watch this if you have any interest in creating XML filetypes in Studio as it is very informative and Ziad is a real expert.  It covers Studio 2014 prior to the release of SP1 which introduced a new method, so that’s what I’ll cover next.

You can also find a handy knowledgebase article here that is straight to the point!

XML (Embedded Content)

But now let’s take a look at the new method in Studio 2014 SP1.  The first steps are exactly the same, but when you get to the embedded content section this is where you’ll notice the difference.  It looks like this:

[Screenshot]

So two new things:

  1. There is a drop down box that seems to refer to completely different filetypes, and
  2. You can decide whether to apply the embedded content processing to CDATA Sections or any other named Document Structure Information (as before with the legacy filetype).
Selecting the embedded content processor to use

Let’s tackle point one first.  This is a drop down box that refers to completely different filetypes.  So you will probably already see that the concept here is to use a filetype within a filetype rather than have to create the regular expressions to handle the content as we did with the legacy embedded content processor:

[Screenshot]

The defaults are the regular expression filetype and the two HTML filetypes that come with Studio out of the box.  But you can add your own, which makes it possible to configure one of these filetypes so it does not use the defaults, and then have different embedded content processors depending on the content of the work you are doing.  So if I collapse my navigation menu I now see this in my options:

[Screenshot]

Expanding this allows me to take a copy of one of the three defaults and then configure it as I see fit.  I don’t really have to do this for the fairly simple example I have used for this article, but this is how you would do it!  You click on the Embedded Content Processor node and you’ll see the three available filetypes.  Select the one you are interested in; in my case I picked the HTML 5 filetype, and then click on Copy…:

[Screenshot]

You get a small dialogue box where you can change the name of the File type and the File type identifier as before.  Pay attention to the identifier, because you cannot use the same one as you did for the main filetype; duplicate file type IDs are never allowed.  Then click on OK and close the Options.  You need to close them because if you don’t the list won’t be refreshed (a little issue I’m sure will get resolved in a future release!).  When you open the options again and go to the Embedded Content node of the new XML filetype you created, you will be able to select your new filetype as an embedded content processor like this:

[Screenshot]

Identifying where the embedded content is found

This brings us onto my second point, which is that you can decide whether to apply the embedded content processing to CDATA Sections or any other named Document Structure Information (as before with the legacy filetype).  If the embedded content is in a CDATA section, which is probably the most common use case, then you do nothing more than check the CDATA sections checkbox as shown in the introduction to this part of the article.  I can then open the file and see this without having to do any additional work at all:

[Screenshot]

Much better… and easier!  Because I’m also still in TagID mode you can see the names of the filetypes being used in the orange tabs, and the embedded content processor displays the correct Document Structure Information for this filetype, which adds additional context for the translator.  You’ll also note that the entity values are correctly transposed so I don’t have to deal with them as tags.

If the embedded content was in another type of Document Structure then it works in exactly the same way.  You select the appropriate code and that’s it.  No need to add a bunch of regex rules in here.

Sharing custom filetypes with others

I can’t leave this section without mentioning how you share your custom filetypes with others.  This is done by exporting your settings, but now with Studio 2014 SP1 you have two lots of settings to share.

*.sdlftsettings

This is the settings file for the custom XML filetype you created.  To export/import these files you click on the File Types node in your options and then select the specific filetype you wish to export from the list that now appears on the right.  This will activate the Import/Export Settings… buttons:

[Screenshot]

*.sdlecsettings

This is the settings file for the custom Embedded Content Processor you created.  To export/import these files you click on the Embedded Content Processors node in your options and then select the specific filetype you wish to export from the list that now appears on the right.  This will activate the Import/Export Settings… buttons exactly as for above.

This is actually another good reason to always create a copy of your default Embedded Content Processor: if you are sharing custom XML filetypes with a colleague they may get unexpected results if you used the default HTML processor and the settings were different when your colleague used it, because they had a customised HTML filetype for example.

Checking your work!

At this point I think I’ve covered enough for you to get started and have a play.  But seeing as I’ve written all of this I just wanted to mention Pseudotranslate and how this can help you to make sure your filetype is extracting everything you want, or possibly too much.  Once you have completed your filetype to the best of your knowledge, it’s worth opening it quickly with the Translate Single Document approach and without a Translation Memory.  Now run the Pseudotranslate batch task with these options:

[Screenshot]

When complete you will see that the target column of your file in Studio is now full of question marks, so these stick out like a sore thumb!  Save the target file and inspect the result with a text editor:

[Screenshot]

If you missed anything out that should have been translatable text it’s much easier to spot it in here and you can refine your filetype until it’s ready for production.  But this looks good to go, as the only recognisable text that is between elements is the product code in the product element, and I deliberately excluded this with my custom filetype!

THE END!! 


The JSON files…


The JSON files… not really related to Jason Voorhees of course, but for some users who have received these file types for translation the problem of how to handle them and extract the appropriate text may well seem like an episode of Friday the 13th!  I’ve seen a few threads in the last couple of weeks sharing various methods for handling these files, ranging from opening them in MSWord and applying a hidden style to the parts you don’t want, to asking vendors to create variations on javascript filetypes.  But I think Studio offers a much simpler mechanism for handling them out of the box.

So what are these file types and how can you handle them with Studio 2014, or even 2009/2011?  In this article I’m going to look at the regex filetype as this is very well suited to files like this, but before we get into that detail let’s take a look at what they are.

JavaScript Object Notation

This filetype is a simple text-based format that was introduced around 2001, so it’s nothing new, and it’s used as a method of sharing data between applications irrespective of the programming language used.  For those of you who are interested in this stuff, it was derived from the ECMAScript programming language and you can find the full specification for it on the JSON website.

I like to read this stuff to an extent, but really I stop at the point where I can figure out how to get at the translatable text.  The format of these JSON files is based on a simple structure which I have taken straight from the JSON specification:

[Screenshot]

The four components on the left represent the structure components and the image on the right is an example file coloured to show you which components are which.  Now, the reason I did this is because these files can contain any kind of data, and the important part is for you to know which parts are translatable.  If you translate something that should not be translated then it’s likely that the file won’t be fit for purpose when you give it back to your client.

How do you know what should be translated and what should not?  You ask your client!!

Handling the file in Studio

Once you know what’s translatable the next step is to create a filetype in Studio to handle this.  It’s actually quite straightforward using the regex filetype.  The steps are like this:

Create new filetype

Go to File -> Options -> Filetypes and then select New…  Select the “Regular Expression Delimited Text” type and click on OK.

[Screenshot]

Once you’ve done this you give the new filetype a little bit of information:

[Screenshot]

  1. Filetype name: you can call this whatever you like.  I called it JSON
  2. File type icon: this is completely optional, but as I have never actually done this before in Studio I thought I’d try it!  If you want the icon file I used you can download it from here.
  3. File dialog wildcard expression: this is just the file extension written like this *.json so that Studio knows to use this filetype when you open a JSON file.

Click on OK and that’s it.  Your filetype is created!  Not too hard, was it?  And you can now open a JSON file and translate it in Studio.  However, using my example file, which you can find here if you would like to play with it (and you don’t scare easily), the result isn’t too clever because the default will just extract everything in the file like this:

[Screenshot]

So the next thing you have to do is tell the filetype what you actually want to see in the editor for translation.  This is why it’s important to speak to your client so you understand the requirements of the job.

The files I have seen so far all seem to follow the same principle for the translatable text.  You have a String at the start followed by a Structural colon and then another String or Number, finally ending in a Structural comma like this:

"FileId": "45b1b4b4-32ae-4a33-b2bb-b35b6940d348",
"FileName": "thejsonfiles.docx",
"Language": "French (France)",
"Film": "Friday the 13th",
"Year": 1980,
"Director": "Sean S. Cunningham",
"Writer": "Victor Miller",
"Producer": "Sean S. Cunningham",
"Synopsis": "Friday the 13th is a 1980 American slasher film etc.",
"UtcDateTime": "2015-03-16T22:52:36.1622185Z"
"Recorded": true,

The first string represents an identifier of some sort, similar to an element name in an XML file.  The second string contains the translatable text.  So all you have to do is extract the contents of the second string.  We do this using our old friend the regular expression.  However, you still need to know if all of them are translatable or only some, and then once you do you can create your expression to suit.  The expressions go in here:

[Screenshot]

You need two, an opening pattern and a closing pattern.  The translatable text will be the text that is in between these patterns.  So in a line of text that contains code you don’t wish to translate, you can move the text found by the opening pattern into the hidden part of the editor so the translator doesn’t have to deal with it; similarly for the closing pattern.  So using RegexBuddy (my preferred tool for this stuff) let’s look at a couple of examples and what they would extract.  If you don’t understand how to use regular expressions I’d really recommend you learn a few basics, they are incredibly useful.  You can find four articles here that I have written in the past on how they can be used in Studio… starting with simple explanations and leading up to slightly more complex examples.

Extract all the second strings
".*": "
",$

The first line is the opening pattern and the second line the closing pattern.  The first line basically means look for a quote, then look for anything and keep looking until you find the next quote followed by a colon, then a space and then another quote.  So this opening pattern should select the following segments only and make the coloured parts structural, ie. hidden in the editor:

[Screenshot]

The closing pattern is just going to find the last quote and comma and move that into the structure so it’s also hidden in the editor.  So when you add these rules into the filetype you see this on opening my test JSON file:

[Screenshot]

This is much better because now all the JSON structural elements are gone and I’m only getting the second string extracted for translation.  However, some of these don’t need to be translated at all so I can further refine my filetype by using a different rule.

Extract only named strings
"(Film|Director|Writer|Producer|Synopsis)".*?"
",$

This time I am saying look for a quote and then find any of the words between the pipe symbols followed by a quote, and then anything at all up to the very next quote.  So in effect I extract this:

[Screenshot]

This is because I don’t think I need to translate any of the other strings at all.  In reality I guess I would only translate Film and Synopsis from this file, but this is just an example!  So have a play and you’ll see how simple this is to work with.  However, if the file contains many different translatable strings then the list of identifiers is going to get longer and longer.  In this case it might be easier to specify what you don’t want instead!

Extract everything apart from named strings
"(?!FileId|FileName|Language|UtcDateTime)\w+".*?"
",$

With this expression we are using something called a negative lookahead… wonderful names, but quite sensible.  This means take a look ahead of you and see what’s coming; if it doesn’t match the following text then it’s what we want.  So it’s the opposite of a positive lookahead, where it would match what it found.  Maybe it takes a little getting your head around, but have a play!

So the expression says look for a quote and then look ahead to see if any of the following words between the pipe symbols match.  If they do then don’t use this segment, but if they don’t then look for any word character, one or more, followed by a quote, and then anything at all up to the very next quote.  So in effect extracting exactly the same as before.  But this time I used a rule to specify what I didn’t want rather than what I wanted!
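To illustrate, this is how the opening pattern treats two lines from the example file:

"FileId": "45b1b4b4-32ae-4a33-b2bb-b35b6940d348",   (the lookahead matches FileId, so nothing is extracted)
"Film": "Friday the 13th",                          (the lookahead fails, so "Friday the 13th" is extracted)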

Phew… makes your head go giddy!  But in Studio I now see this:

[Screenshot]

Exactly what I wanted.  The beauty in this of course is that the simplicity behind the JSON concept translates nicely into the simplicity of the regex filetype!

A final note here on SDL Passolo after Daniel reminded me!  If you want to have full native functionality with JSON files out of the box then you should really create your translation projects in SDL Passolo in the first place.  Here you have full control out of the box over all aspects of this filetype, including developers’ comments etc.  You can read a little about this here.  You will need the full version of Passolo to create the projects in the first place as the free Translator Edition will not allow this.  But if you are serious about working with these filetypes then it’s worth the effort.  So this article provides, I hope, a good workaround for anyone sent JSON files who doesn’t have the full version of SDL Passolo.  Perfect for the occasional job but perhaps lacking if you are going to make a habit of it and need to accommodate more variations in the content than I have shown here.


Comments… chapter and verse!


The ability to work with comments in a managed localisation process is an important part of communication between translator, reviewer, project manager and the end client… and not necessarily in that order!  Comments are used to clarify misunderstandings in the source text, questioning completed translations you’ve been told to ignore that just don’t look right, suggesting improved terminology, explaining why you translated something in a particular way, clarifying why you changed a translation in review, providing additional context from the client, adding notes to the target file for an in country review, they could even be comments that are just there to be translated, or ignored… the list of reasons could be pretty long and so could the comments.  So it’s very important to be able to keep them linked to the context so it’s easy to deal with the referred text, and also to be able to get to the comments quickly when they might only relate to a couple of segments in three files that are part of a five hundred file project nested within a complex folder structure.

So this post is going to deal with two things… first of all the places where comments can be used in Studio out of the box and secondly a very neat OpenExchange plugin that I reckon many project managers, and translators, have wished for and didn’t know it was there already!

Working with Comments

First of all, comments can be added to the SDLXLIFF (Studio bilingual file) for any source file type.  They can’t always be transferred into the target file however, because not all filetypes will support this.  We’ll look at source file types further down, but first let’s just consider how you work with them in an SDLXLIFF within the Studio Editor.

Comments in Studio can be added to the entire segment, to individual words/phrases or to the entire file.  You do this either by selecting the text and then adding the comment via the ribbon, in the Review tab, or just by using a customisable keyboard shortcut… the default is Ctrl+Shift+N:

[Screenshot]

There’s also a severity level you can choose for each comment, and this can refer to it being informational, a warning or an error.  The colours used are customisable by going to File -> Options -> Colors where you’ll see the option to select the colours used under “Comment colors:”.  The default settings look like the screenshot below, where segment #1 is informational, segment #2 is a warning and #3 is an error.  You can’t see the comment that was added to the file as a whole:

[Screenshot]
Screenshot: en-da machine translation… apologies for the quality if it’s poor.

In addition to the visual stimulus, you can also work with the “comments” window which is shown here above the editor.  This is pretty useful for several reasons:

  1. You can see all the comments in the file in one go,
  2. You can identify comments by their severity, location (segment number and source or target indicator), which file they relate to, the scope (file, segment, range) and the date the comment was created.

You might also find it’s helpful to move this comment window to the left or right-hand side of the screen where you can see much more of the list and use it to navigate through the comments.  For one or two comments it’s not worth it, but if you have a lot it’s also easier on the eyes… mine anyway!

Importing files with comments

Probably the first thing to clarify here is which filetypes have supporting features in the settings for comments, and then we can look at what they are.  In a nutshell they are:

  • DOC, DOCX
  • PPT, PPTX
  • XLS, XLSX
  • RTF
  • PDF
  • CSV
  • Tab Delimited TXT
  • Java Properties

I’m going to ignore XLIFF for this article because this is probably an article all on its own dedicated to the extensibility of XLIFF and how different tools manage the information within.  I’ll ignore TTX too!  Comments for both of these filetypes are taken care of in the software without the user having any control over how they are used at all… so less interesting for me this time around!  If they are in the file, then they will be in Studio as comments to help with the context of the translation.

I’m also going to ignore XML and HTML as these are also “special”.  XML is user defined and the best way to handle comments, if you just want to read them, is usually to create a stylesheet… see Translate with style… for more information on how to do this.  HTML has its own tag for comments but these are not supported in Studio at all.  I would be quite interested to learn if translating the HTML comment is ever a requirement?

[Screenshot]
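For reference, this is what an HTML comment looks like:

<!-- This is an HTML comment and is not displayed on the page -->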

The file types in my list above however are different.  These all have special comment options in the filetype settings and they all allow you to import comments and decide whether you wish to translate them or not.  But they have different ways of making them available to you.

DOC, PPT, PPTX, XLS, XLSX, RTF

These all have a single option to extract comments for translation or not.  It looks like this and is found under the Common node in the filetype settings of each filetype.  The actual description varies a little from filetype to filetype but it’s not hard to figure out which one it is… the main point is that there is just one option, which is to extract the comments for translation or not:

[Screenshot]

If you select this option then comments in the file are extracted and added to the translation in the Editor.  You can tell that they are comments because the Document Structure Information (DSI) column on the right-hand side when the file is open in the Editor View provides this level of granularity by using a code COM:

[Screenshot]

Here we see COM and COM+.  The plus symbol just tells you that there is at least one more field relevant to this segment in addition to it being a comment.  So the RTF for example also tells you that this is a paragraph of text… maybe not too interesting this time, but sometimes this can be very helpful depending on the context of the text, so maybe an index entry, or a list item, for example.  You can also use this DSI column in interesting ways with the SDLXLIFF Toolkit from the OpenExchange.  This application can extract segments based on the DSI information alone.  So you can select all the comments in your Project in one go and lock them for example… very handy if the comments are extracted but should not be translated, or if they should be locked to ensure they are not counted in an analysis.

CSV, Tab Delimited TXT

These file types work in a different way and don’t allow for comments to be translated at all… at least not if the comments are in a different column. These are really bilingual filetypes and the way they work is by providing a mechanism for the user to specify the source language column, the target language column and a column for comments. Like this:

[Screenshot]
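In plain text, a file set up this way might look like the following (invented content; source in column 1, an empty target in column 2, comments in column 3):

"Welcome to our homepage","","Greeting displayed on the start page"
"Log in","","Button label - keep it short"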

The image is based on the CSV filetype but it’s exactly the same for the tab delimited. In this example the source is in column 1, the target will be placed into column 2 (and read from column 2 if there are any in the file), and the comments are in column 3. So in both of these filetypes this is only for providing a mechanism to do two things:

  1. Determining whether or not to extract the comments at all and displaying them in the DSI information box,
  2. Excluding text from being parsed for translation at all if there is a comment in the comment column

In the previous file types the DSI box was purely information to tell you that the extracted text in the Editor was a comment.  In this case the text of the comment itself is held in the DSI box.  So you can’t extract it for translation, but you can extract it for information.  I think it’s debatable how useful this is and I would prefer to see some preview capability here that allowed you to see the comments as you work, without having to click on the DSI column to see what it said… maybe in a future release!  For the time being this is what it looks like:

[Screenshot]

So you can read the comment in the Additional Information column. As I’m writing this it occurred to me that it would be an improvement if it was at least possible to specify a keyboard shortcut for activating the DSI window as opposed to having to use the mouse… but a preview would be better!

The second option on the filetype is quite useful as a way to exclude specific text from being translated. You can add anything at all into the comments column using Microsoft Excel and then use this option to exclude text that doesn’t have a comment next to it.

If you need to translate the comments then you have two options really (based on the example so far):

  1. Copy the comments into column 4 and then process the file using column 4 as the source and column 3 as the target. Delete column 4 when you’re finished… or
  2. Copy the comments to the source column, translate with the rest of the file and copy them over the original comments in the target file when the job is complete

If you try the first way and are using column headings in the file then just make sure you give column 4 a different name to column 3, otherwise Studio will present you with an error and won’t process the file.

Java Properties

This filetype works in similar ways to all the filetypes mentioned so far.  You can extract the text for translation and the same text that is extracted is also displayed in the DSI information box… although this is only useful when you don’t choose to extract the comments for translation.  The option in the filetype settings allows you to extract comments for translation or not.  You also have to make sure that the comment identifier is listed in the box.  There are two there by default, the hash and the exclamation mark, but if the specific Java resource file you are translating used something different then you can add it in here:

08
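
For reference, a small .properties file using both of the default comment identifiers might look something like this (the keys and values are invented):

# Login dialog strings
! Reviewer note: do not translate the product name
login.title=Sign in to AcmeApp
login.button=Sign in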

If you choose not to translate the comments then they can be seen in exactly the same way as the CSV and Tab Delimited filetypes… by clicking on the DSI column:

10

If you choose to extract the comments for translation then they are presented as a normal segment in the editor with the TAG attribute (as opposed to COM) showing in the DSI column.  The order of text is as if the comment line is a translatable line, so checking this option just makes it translatable.

11

DOCX, PDF

DOCX and PDF files both have the same option, and then two that separate them from the rest:

12

Both of these filetypes state that they can not only extract comments from a file but they can also convert Studio comments into usable comments in the filetype itself, and convert filetype comments into Studio comments.  These last two options set them apart from all other filetypes.  But there are some interesting things to note when considering this capability.

  1. When you open a PDF file it becomes a DOCX and you are actually translating a DOCX converted from a PDF.  So the target file is a DOCX.  Only significant in that it’s clear why the options are the same for saving.
  2. PDF comments are what?

I’m asking this second question for myself too because my own reaction was to wonder how you get comments into a PDF file in the first place!  I created a small file with one segment and then added two comments to it.  One using InFix and then one with Adobe Reader.  Irrespective of whether I check the box to extract “As Studio comments” or not the comments are extracted as translatable text.  So I’m not convinced this is working functionality in the current version of the filetype for PDF… it’s also interesting, but a red herring, that the timestamp is some 2-hrs adrift from reality.  Must be set to GMT by default as it’s two hours behind me!  If it is working then I wonder how you put the comments in the PDF?

13

For DOCX files it’s more straightforward.  The options to extract as translatable text or as Studio comments work like this… the image shows the comments in the Word source file and then the result of opening the file in Studio with either option:

14

This is pretty neat, and it’s the only filetype that behaves this way.  This is most likely because of the way Word files are used: commenting like this is very common and is something everyone knows how to use.  Similarly it works the other way too… so if I add some comments into Studio and save my target file the final formatted target DOCX can include my Studio comments (the final option to “Retain Studio target comments in target file“).  I wrote a separate article on this feature here if you are interested, so will try not to repeat myself below.

In the screenshot below you can see the two original source file comments in the message box, also the two new comments I added in Studio into the target text which I also left untranslated (just copied source to target) and what this looks like in the target Word file:

15

The important point to note is that whilst the Studio comments are now in the target file, the original comments are gone.  So you need to be very careful with comments in a DOCX file, not only in terms of deciding how you wish them to be handled for translation but also in terms of whether it’s important to retain the original comments in the translated file or not.

If it’s important for you to retain the original comments in the translated file, and perhaps even add new comments as I did above, then the options to use would be “Extract comments as translatable text” and “Retain Studio target comments in target file“.  This way both sets of comments are retained.  I don’t know why Word formats the first pair the way it does but I can click on the “2 comments” and this expands to show me both comments.  It might be related to the selection I made being seen at file level (segment zero) and also causing the problem I noted in the Export for External Review below.  The second two are clear to see:

16

So even though I added my Studio comments to exactly the same word as the source comments this is no problem.  All comments are retained.

Perhaps a useful tip at this point is how to lock all source comments extracted as translatable text so they are not touched if you use the option above.  I have mentioned a few times how the Document Structure Column contains the code COM whenever a comment is extracted for translation.  This is no different with DOCX.  In Studio out of the box this is purely informational, but if you use the SDLXLIFF Toolkit you can generate the DSI in the files in your project, select the comment tag as shown below and then lock all these segments before you start work.  Very handy!

17

Using the Export for External Review

Let’s come back to Studio comments for all filetypes now.  There is a feature in Studio called “Export for External Review” which I have covered in the past, but not in relation to comments.  You can find the two articles I wrote, which I hope are useful, here:

I’m bringing this up again here because this feature allows you to add comments in Studio to help with clarification during the translation/review process, for any filetype at all.  This is because the exported file is always Word, so the comments are rendered in the familiar Word style:

18

Similarly any comments made in the Word review document will be imported back into Studio as Studio comments.  This feature is not for translatable comments at all.

Perhaps one final point on this feature is about source comments.  If the comments were extracted as Studio comments and you export for external review then the source comments will be retained in the Word review file and will survive the round trip.  I also noted an additional problem to be aware of when using the Export for External Review.  If you have a comment in the first segment that encompasses the entire segment then the Export for External Review will fail… at least this is the experience I have found testing this a few times as I wrote this article (I’ll add this to the list in the appropriate article above about troubleshooting problems with this feature).

Comment View Plugin

Now that we see how comments can be used in various places in Studio the last thing I wanted to do was quickly expand on where I started which was making use of comments that are Studio comments as opposed to translatable text.  If you have a large project, with many files, and you’re sharing the work between many translators you may end up with a project full of comments requesting information, explaining where there were problems etc.  But how do you know where these comments are?

The easiest method out of the box is probably to open all the files up in one go and read the comments window that I showed at the start.  This works until you have a project with really large files, or just a project with so many files that the time taken to do this can be a little wasteful.  This is where the comments view plugin can really help.  Once installed and with the view in place all you have to do is select all the files in your project (Ctrl+A with all files showing in the files view) and the plugin tells you two things.  First of all which files contain comments and how many there are:

19

And secondly you can navigate through the comments themselves and see the source and target text for the associated segment at the same time:

20

If you need to action any of the comments you can do it quickly without even having to open the files, and if you do need to open a file you just double click the comment in this window and the file opens up at the correct location.  Very handy!

The Comment View Plugin was developed by Junya Takaichi; in fact Junya is quite a prolific developer for the OpenExchange and I’d recommend you take a look at his other apps:

I think his combination of translator, localization engineer and developer gives him a unique insight and ability to be able to use the OpenExchange to create the tools he wants and I think you may find them useful too.


Bilingual Excel… and stuff!


Copyright Rudall30 | Dreamstime.com

I’ve written about how to handle bilingual excel files, csv files and tab delimited files in the past.  In fact one of the most popular articles I have ever written was this one “Creating a TM from a Termbase, or Glossary, in SDL Trados Studio” in July 2012, over three years ago.  Despite writing it I’m still struggling a little with why this would be useful other than if you have been given a glossary to translate or proofread perhaps… but nonetheless it doesn’t really matter what I think because clearly it was useful!

So, why am I bringing this up three years later?  Well, the recent launch of Studio 2015 introduced a new filetype that seems worthy of some discussion.  It’s a Bilingual Excel filetype that allows you to handle excel files with bilingual content in a similar fashion to the way it used to be possible in the previous article.  There are some interesting differences though, and notably the first would be that you won’t lose any formatting in the excel file which is something that happened if you had to handle files like these as CSV or Tab Delimited Text.  That in itself might be interesting for some users because this was the first thing I’d hear when suggesting the CSV filetype as a solution for handling files of this nature.  Most of the time I don’t think this is really an issue but for those occasions where it is this is a good point.

But this new filetype is more than just an Excel version of the old one.  So let’s just take a look at the options using this excel layout as an example:

06

So I have five columns of text, with the source and target in columns B and C, the name of the character playing the part (it’s a film script) in column A, a maximum character length for the text in column D and some notes in column E.  The text is also partially translated.

Columns

In addition to the usual source and target column I have a couple of other options.

02

I can set a maximum number of characters that are allowed in the target.  This is quite useful because sometimes, particularly with gaming scripts where the text box is a limited size, it’s important for the translator to know how many characters are allowed.  So here, if you use this option the standard QA Checker in Studio can use this and flag something like this if you go over the limit:

07

You can also check the allowable length at any time by clicking on the document structure column on the right-hand side.  If you don’t have the context information populated (see below) then the right-hand column in Studio will say LN (for Length Restriction ;-)) but if you do, as I do in this example, then it may use a different code with a plus symbol indicating there is more than one code in there.  So in my example it says ACT+:

08

The checkbox “Preserve Target Style” allows you to apply the style of the target cell in Excel to the target translation rather than overwrite with the style of the source cell.  So just giving you another option for handling formatting in the Excel file.

Exclude

In here we have another new option compared to the CSV filetype, and that’s “Translation column content“.  If you check this then any of the cells that have been translated in the Excel file already will be ignored.  So if you do check this then the options in the next part of the settings will not apply:

03

Existing Translations

These options were already available in the CSV filetype and are quite useful because they can save you having to deal with existing translations at all, and more importantly using the locking option allows you to exclude these segments from the analysis:

04

Context and Comments

We had Comments availability in the previous CSV filetype too, but there the comments were added to the document structure window.  Useful but hard to get at as you needed to click on the document structure column to see the available information and you only saw one cell at a time.

05

In this filetype the comments can be displayed as Studio comments like this which allows you to see more at a time and to read them without having to click on anything at all.  In fact if you have a lot of comments and they are needed to provide important translation context then moving them to a window on the side can be very useful and easy to use.  If you don’t know how to move windows take a look at this article:

09

The Context Information column is useful because it provides a good way of tracking string IDs, or any other information which might be useful to know as you work.  In this example I used the names of the characters in the film.  These are in column A of my spreadsheet and they are displayed in the Document Structure Column as noted above in the section on Columns.

Where is it?

Perhaps one little thing I forgot to mention and that’s where it is.  This is quite important to note because the default settings for Studio are like this with all three types of Excel filetype checked:

10

Studio uses the filetypes on a first come first served basis depending on information in the filetype settings.  So if you want to use the Bilingual Excel filetype you need to either disable the Microsoft Excel 2007-2013 filetype or just move the Bilingual Excel filetype so it sits above the others in the list.  I guess if you do a lot of these and also work with Excel then you could create project templates that allow you to simply select the appropriate one to match the filetype you’re working with and this would save you having to mess around with which one is active and taking priority in the list.

So all in all quite a useful filetype.  There is no preview with this, but in many ways it doesn’t feel as though it needs one as the layout of Studio is very similar to the sort of files you are likely to be handling with this filetype and hopefully there are enough options to include the contextual information from the file to help anyway.  But before I end I thought it might be interesting to share a little translation conundrum that was posted on ProZ a few weeks ago where Excel and this new filetype could be used to solve it; this is the stuff!

Stuff…

Excel is an interesting format for many things, so I thought I’d share a little problem that appeared on ProZ a few weeks ago.  There are many ways to handle this but I thought it might be fun to share a way to tackle it using the Bilingual Excel filetype… and I’m not trying to start a war over whose tools handle it the best… this is just some excel stuff I thought would be fun to share.  Since the original idea and reading what some of the other solutions are I’d probably handle this using regex in EditPadPro to get the text out anyway.  But I like this because it’s just Excel and Studio.

The problem was how to create a TMX translation memory from an SGML file that was formatted something like this (you can see the full text in the ProZ post and the video at the end):

<doc id='N0001'>
 <head>
  <title>What is a Fenqing ?</title>
  <corpus url='http://code.google.com/p/evbcorpus/'>EVBCorpus</corpus>
  <author attributes='stuff in here'>name</author>
  <citation>"Building a Bilingual Corpus for MT"</citation>
 </head>
 <text>
 <spair id='1'>
  <s id='en1'>What is a Fenqing ?</s>
  <s id='vn1'>Fenqing là gì ?</s>
 </spair>
 <spair id='2'>
  <s id='en2'>Fenqing is a Chinese word which literally ...</s>
  <s id='vn2'>Fenqing là một từ tiếng Hoa mà nghĩa đen...</s>
 </spair>
 </text>
</doc>

So here’s one way to do it!

Create an XML filetype for this SGML… pretty simple using just two rules (if you don’t know how to do it this article might help but you can also watch the video as I explain it in there):

//s (always translatable)
//* (Don’t translate)

So these rules extract the translatable content in the s element and nothing else.  There is no distinction between English or Vietnamese at this stage as I have ignored the language attributes altogether.  Next I just open the SGML file in Studio and save it.  Now I have an SDLXLIFF with source/target repeated in the source column only throughout the file.

Now I can use the SDLXLIFF Converter for MSOffice (installed with Studio since 2011) and convert the SDLXLIFF to Excel.  If you didn’t know this was possible take a look at this article.

The result of this operation is that I now have an excel file with an ID column, a source column (populated), a target column (empty) and an empty notes column.

Now comes the fun excel part.  I can use this formula in the target column:

=IF(ISEVEN(A3),B3,"")

The ISEVEN function in excel is a neat formula that lets you check whether numbers are odd or even.  You probably see where I’m going with this now.

This formula will look at the ID column (column A) and check if it’s an even number or not.  If it is then it will copy the contents into the active cell.  If it’s an odd number it puts nothing at all.  Once I’ve done this I can copy the formula down the spreadsheet, copy all of column C (target column) and paste it as plain text to remove the formulae.
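
To visualise this, here’s a simplified sketch of the spreadsheet using the sample above.  I’m assuming the formula sits in the row above the ID it tests, so the formula next to ID 1 reads the ID and text of the Vietnamese row below it and pulls it up alongside its English source:

ID   Column B (source)                                Column C (formula, pasted as text)
1    What is a Fenqing ?                              Fenqing là gì ?
2    Fenqing là gì ?
3    Fenqing is a Chinese word which literally ...    Fenqing là một từ tiếng Hoa mà nghĩa đen...
4    Fenqing là một từ tiếng Hoa mà nghĩa đen...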

Now I have a spreadsheet with every other row containing source on the left and target on the right. So I can filter on the target column and sort it in alphabetical order. Now I just delete all the rows with nothing in the target.

This leaves me with a simple spreadsheet I can drag into the Glossary Converter and convert to TMX which resolves the question asked by the user.  However, seeing as I am more likely to want to use this Translation Memory in Studio I won’t do that.  Instead, I just open the excel file with the Bilingual Filetype and then update it straight into a Studio Translation Memory.  Piece of cake!!

If you want to see this in real time then you can watch the video below, which is only 10 minutes long… I hope it’s useful and perhaps gives you a few ideas of how excel can be useful for data manipulation, especially since we have the new Bilingual Excel filetype:

Video is 10 minutes 8 seconds


ATA56 – SDL Trados Studio Advanced


01

I ran a beginners’ and an advanced workshop at the ATA56 pre-conference day in Miami this year.  A really fun day for me as we start the day with no specific agenda or pre-defined course and then try to shape the session to suit the needs of the attendees.  The beginners’ session tends to be a little more prescribed, to start off with at least, and the intention is to try and cover the basics of how Studio and MultiTerm work.

The advanced is a lot different… after all, what is advanced?

I had an interesting twitter discussion while waiting for the plane on the way home from the ATA on how we should have a Studio manual that has no gaps.  Of course this is a worthy ambition, but in my opinion not at all achievable because once you move away from the basics the subject matter you could cover is huge.  How advanced do you go in filling gaps?  Even the excellent manual written by Mats Linder which is some 500 pages long is full of holes once you start asking the right questions.  A manual is great… but it’s not enough, in fact it may never be enough!!  Translation technology today needs to cover a lot of areas and a tool like Studio opens a real pandora’s box of things you could put in the advanced category.  The solution is a flexible online help with live links to other resources and an opportunity to keep it dynamic.  It would be nice to be able to print out pages and topics for sure, but the old style manuals are simply inadequate today.  SDL do provide a lot of help already… summarised here and SDL are always very willing to take feedback to help this get better and more relevant.

But I digress… I wanted to talk about the advanced workshop at the ATA and what we covered, so more to the point what is advanced?  This of course depends on who you ask.  So this is what we did… we asked the question and had a series of things to start with like this (I added some useful links for each topic here just for additional information):

I don’t intend to repeat everything we covered in three hours… plus a bit… particularly as the sessions were recorded so all attendees will be able to play these back on the ATA website as a reminder of what we covered.  So I just added a few links above that might be helpful as a reminder of where to find additional information on some of the things we discussed.  But there was one thing in particular that I didn’t complete on the day that played on my mind afterwards, so I’m going to share this with you here.  If I missed something please feel free to let me know in the comments and I’ll address it.  I added most of the bigger things we discussed into the powerpoint and then didn’t save the file… doh!!

Markdown Filetype

The creation of a Markdown filetype.  For those of you who aren’t familiar with this, it’s a lightweight markup language that can be written by anyone using a text editor.  A good explanation can be found here in Wikipedia.  It’s quite popular today in online tools like github and those provided through Atlassian.  You can also find tools like MarkdownPad, which I just happened to have installed, so we knocked up a quick test file to see how to create the filetype.

The creation of the filetype itself was simple.  We created a custom filetype using the Regular Expression Delimited Text option and knocked up a simple test file using MarkdownPad something like this:

02

So, nothing complicated but on the day I did run into a short problem that needed more thought later on.  Creating the filetype itself was simple, but when I added the inline tags to show how to mimic the formatting in a wysiwyg style I ran into this error:

03

The reason of course was because I tried to use \*\* as my opening and closing tags to pick up the markdown being used for bold for example and this created overlap as we discussed.  So to resolve this I tried something else later on and used a lookaround in order to be more specific with the regular expression and not cause any overlap which was invalidating the rule.  This worked… I used these as the opening and closing tags for bold as in **sample**

(?<=\s)\*{2}(?=\w+\*)
(?<!\*)\*{2}(?=\s)

The opening tag says find two stars exactly, but only when you find a space first, and then some text followed by a star afterwards.  So a positive lookbehind at the start and a positive lookahead at the end.  The closing tag says find two stars exactly but not if there are any stars before it, and only if there is a space after it.  So a negative lookbehind at the start and a positive lookahead at the end.  These are not perfect and won’t be the solution in every single permutation of markdown for bold, but they did prove it could be done if you ensure that the expressions don’t cause any overlap between the tags.
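
If you like to test expressions like this outside Studio first, here’s a quick Python sketch of my own (nothing to do with Studio itself) that checks the two lookarounds behave as described:

import re

# The opening and closing expressions for **bold** as used above
opening = re.compile(r'(?<=\s)\*{2}(?=\w+\*)')
closing = re.compile(r'(?<!\*)\*{2}(?=\s)')

text = "This is a **sample** sentence to test with."
print(opening.search(text).span())  # finds the ** before "sample"
print(closing.search(text).span())  # finds the ** after "sample"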

So I was able to make Studio render this example text as follows which is quite nice:

04

Obviously I’ve hidden the tags to make this look nicer by using the advanced tag properties so Ctrl+Shift+H toggles the display of the formatting tags on and off.  But it’s quite nice, and I think you can handle Markdown easily, represented in a smart way through Studio.  The complete set of rules I used for this file were these… the formatting was controlled through editing the rule and using the format button… but that was the easy part :-)

05

I could of course have used \*\* as a placeholder instead of trying to make the opening and closing tags work, or perhaps this \*{1,} as a catch all for markdown using the star symbol, but that would have been too easy ;-)  I’m looking forward to any suggestions for a better solution from anyone interested in regex or anyone who might have tackled this particular filetype before.  If nobody has then I hope this helps you get started.


A little Learning is a dang’rous Thing;


01

Drink deep, or taste not the Pierian Spring:
There shallow Draughts intoxicate the Brain,
And drinking largely sobers us again.

I’m quoting Alexander Pope in 1709, rightly or wrongly, for hitting the nail on the head when it comes to the truly intoxicating mix of language and technology.  A little knowledge is indeed a dangerous thing and it’s something I know I’ve been guilty of all my life… I learn a little something new and now I’m an expert.  That is of course until I learn a bit more, and then a little more after that, and before I know it I realise I know nothing at all!  Translation technology is great for dropping us all into this trap… Trados user since Trados 5, translator for over 20-years… can handle any type of file.  Falling into this trap is pretty easy in fact, especially when the tools available for translation today take a lot of the effort out of the tasks at hand.  But not everything is what it seems and sometimes it takes a mistake or three to sober us up again!  There’s a reason why well organised and successful translation companies, dealing in all kinds of content, have Project Managers, Translators and Localization Engineers within their midst.

To explain this better I’m going to tell you a little story of how I spent a couple of evenings during an SDL Roadshow trip to Warsaw and then Budapest this week working on a process related problem with an agency in Canada.  It was actually quite an interesting time and let me play around with a few things I like to pretend I know about.  Be warned now… it’s quite a long post so if you’d rather just watch the movie scroll to the bottom!

The Problem

The problem started with the client providing an XML for translation to the Project Manager, who then passed it straight over to the translator after asking if they were ok working with XML.  Studio, like most CAT tools, can of course handle an XML file, but as we know this is just the start of the story with XML.  The translator opened the XML, translated all the English text into German and then sent the translated XML file back to the Project Manager and onto the client.  The client wasn’t happy! Why?

Let’s start with the original XML, which looked like this:

02

If you know about XML you are probably thinking, ok, it looks to me as though the German translation should be placed into the target element.  This would be a reasonable assumption because if you knew this then you’d also know that if you translate this XML using the simple default XML filetype then you’d probably end up with something looking like this:

03

Clearly wrong because the German translation has just overwritten the English source!  In order for this to work when using a monolingual XML filetype you must ensure the text to be translated is in the target element (in this example) before you start so you can correctly replace this.  If the file isn’t partially translated already then this is fairly straightforward using a regex search and replace.

So search for this (I’ll make two very important points here.  First this expression was created for this particular file… if you have a file you wish to do this with it may not be the same layout and will most likely require a different expression. Second, always back up your file before you do it!):

^(\s+<source xml:lang=".*?">)(.*?)(</source>\s+<target xml:lang=".*?">)(</target>)

And replace with this:

$1$2$3$2$4

I used Notepad++ for this making sure that the “. matches newline” option was checked in addition to “use regular expressions” because the file contained a mix of windows style linefeeds and unix style carriage returns (I’m grateful for an explanation from Jan Goyvaerts on this problem because I was messing around with \n\r and \s in the expression to make it work originally… I did manage it but the expression was messier). All we’re doing is looking for all the source and target elements and capturing them into four back references. Then replace them in the same order but adding the second back reference (containing the source text) twice so it’s also placed into the target element. This takes seconds and now I have this:

04
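
If you prefer to script this preparation step instead of doing it in an editor, a minimal Python sketch along the same lines might look like this (the filenames are hypothetical, the leading ^ from the editor version is dropped, and as always work on a backup):

import re

# Same idea as the Notepad++ search and replace above; re.S makes "."
# match newlines, mirroring the ". matches newline" option
pattern = re.compile(
    r'(\s+<source xml:lang=".*?">)(.*?)(</source>\s+<target xml:lang=".*?">)(</target>)',
    re.S)

with open('original.xml', encoding='utf-8') as f:
    xml = f.read()

# Write the captured source text (group 2) into the empty target element too
with open('prepared.xml', 'w', encoding='utf-8') as f:
    f.write(pattern.sub(r'\1\2\3\2\4', xml))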

If the translator had received the XML file prepared like this it would have been more obvious what was required, and this leads me back a step, because when I open this in Studio I see this:

05

Note how the text is duplicated and we see the source text twice. This is because the AnyXML filetype, which is the default XML filetype for handling monolingual XML files will extract the text from ALL the elements it finds. This is the correct action for the AnyXML filetype, but it’s not what we want to do here. Unfortunately the translator is probably an excellent linguist and probably knows very well how to work with the translation tool, but doesn’t know enough about what’s happening under the hood or why it’s important to handle XML with particular care. So a case of knowing just enough to get into a bit of trouble, but not enough to be able to recognise that a scenario like this needs handling in a different way.

Back to our story though. The translator sent the translated SDLXLIFF and the translated XML to the Project Manager, with neither of them knowing what was wrong. The client of course sees the problem immediately and asks the Project Manager to resolve it.  The translator believes the job is done because the file was translated and the Project Manager just knows he has a translator who wants to be paid, and a client who is unhappy and may not be prepared to push any more work his way!

The Resolution

So the start of this story, for me, was receiving the translated SDLXLIFF and the translated XML.  How do we resolve this given we also don’t have the original XML (it wasn’t included and no time to explain and then wait for it to be provided).

Not too tricky in fact… but here’s the first steps:

  1. Open the translated SDLXLIFF in Studio.
  2. Use the Advanced Save option, in the File menu to save the source file as opposed to the target file… a neat option in Studio.
  3. Open the XML in Notepad++ and make sure there are no target elements already containing translated text.  I used this simple regex to look for any that were populated:
    <target xml:lang=".+?">.+</target>
  4. Once happy there were none I could apply the regex search and replace above and populate the XML target elements with the source text.

All simple stuff so far. But now I must make sure I don’t have the text in both the source and the target elements extracted for translation, which would lead to a doubling of the wordcount as I showed before, and of course to overwriting the source and target elements with the translation. To do this I just need to create a custom filetype.  I won’t go into detail here as I’ve discussed this many times before, so if you’re interested refer to this article, although I will show the overall process in a video at the end.  But it’s pretty simple for this file, two rules:

06

This will ensure that I only extract the text in the target element for translation.  So what I get in Studio is exactly what the translator saw, but the difference being I’m now going to translate the text in the target element and not the text in the source.  I also added some context because after doing this for real the project manager noted that there was also embedded code (%s) in the midst of the thousands of lines and the translator sometimes mistakenly handled it as translatable text, or missed it out of the translation altogether, so the translation contains this sort of thing:

07

There was also stuff like this in the file that is easily mishandled, and in this case probably translated when it should not have been changed at all, so I created tags for these too which ensure they are excluded from the translation altogether:

08

There were a couple of others, and some that just look plain wrong in the source file, but I ensured they remained as provided although given more time and the relationship with the end client I think it would make sense to question some of it in case it was a mistake they need to correct.  So I had these rules:

09

Now when I open the file in Studio it looks better:

10

It’s also safer because not only will the translator see these are tags, but Studio carries out an automated check to see if they are missing.  These things don’t happen if you treat the content as text.  The other tags with the curly brackets were all external so these were moved safely out of the translation altogether ensuring a mistake cannot be made.

Unfortunately, due to the urgency of the request from the Agency, and my lack of availability late at night inbetween roadshows, I didn’t inspect the text in detail as I have done now.  I just recovered the situation so the Agency was able to generate XML target files that contain source and target translations as the client intended, based on the content of the original translated SDLXLIFF I received.  So they have had to carry out a manual check to ensure all the code that should not have been touched, was written into the target XML correctly.

But this still leaves an interesting piece of the puzzle.  How to recover the situation now that I have the correct XML file, and a customised XML filetype to work from?  To do this I first created a Translation Memory from the SDLXLIFF provided and just pretranslated the new SDLXLIFF I created.  But unfortunately this wasn’t good enough because the original contained duplicate source segments with multiple translations and these all incurred a penalty leaving me with two, or sometimes more choices over which one was the right one to select.  Like this for example:

11

I highlighted the differences in yellow, all things the out of the box QA check could have found in Studio, but also all things preventing me from making sure my file was the same as the one the translator provided but able to deliver the correctly formed XML target file.  So I adopted a different approach, and used Perfect Match to get the status of my SDLXLIFF matching the original linguistically, and then my task was complete.  This worked perfectly, and the file was matched so I had exactly the same translations segment by segment allowing me to save the target and return an XML file like this:

12

Now, in this example, where I additionally created tags to protect code in the file, the Perfect Match operation left me with fuzzy matches and not Perfect Matches.  This is because the source has changed… I’ve introduced tags.  But this is easily solved manually by creating a Translation Memory from the original SDLXLIFF and using fuzzy matching to complete the task with the tags correctly placed.

All in all I was quite pleased with the features in Studio that made it possible for me to do this, and I kind of enjoyed the challenge and the learning experience… I also thought it would make a nice case study to share as it contains lots of useful lessons that I hope benefit others too.

The Moral

Is there one?  Well apart from the improved understanding I have gained for Project Managers and Translators who find themselves in a position like this, and the obvious stress this creates between all parties, including the client, there is a moral to this, or probably several.  The first is that it’s not always enough to be an experienced translator or project manager.  Today the filetypes you could be asked to handle require a better understanding of the scope of the work than just how many words you think it is.  Even wordcount differences in less challenging filetypes can cause disagreement and confusion, but with XML you have to remember that these filetypes can contain translatable text in elements and/or attributes and they can contain conditional translation where you only translate the text under certain circumstances.  If you assume you know enough about translating in any CAT tool without getting the answers to what the scope of work is before you start then you could be heading down a long and painful path leading to excessive work without payment and/or unhappy clients.

When handling XML files always ensure the following:

  1. That the client has prepared the XML files for translation and given you clear instructions on what needs to be translated, or
  2. That you (Project Manager or Translator) understand the requirements and what must be translated before you start

If it’s the latter then my advice would be to either employ a localization engineer with the appropriate skills to prepare files for the translators if you do not know how to do this yourself, or only give work of this nature to translators with proven experience in this field.  If you aren’t convinced then here’s a little light reading on this sort of topic to explain why it’s different to handling Word files!

These are not exhaustive, but I hope they shed some light on the sort of technical skills you need somewhere in the mix of client, project manager and translator.

One last thing I’ll mention… these particular files could also be resolved another, possibly easier way and it’s a final moral to this tale.  Always make sure the person you ask to do these things is experienced enough to see the most appropriate way to handle the files from the start!  If you notice when you watch the video these files actually have an XLIFF body embedded into the XML.  So if you remove the XML declaration at the start of the files and then add an XLIFF extension (or add XML to the XLIFF filetype in Studio), then you can open and save them using the XLIFF filetype as true bilingual files.  The result looks something like this:

13

You’ll see you do get an XLIFF target state added to the files, but you could remove these afterwards if they were a problem and put the XML declaration back if needed.  You also wouldn’t be able to handle the embedded content in the way I showed it in this article, but still a simpler solution.  So whilst it was great fun to play around this way, you can see that asking me, and I’m not a localization engineer, can often lead to enjoyable but tricky workarounds when the most appropriate solution is almost under your nose!!  A little learning is indeed a dangerous thing!!

XML… the Movie

Length approx. 30 minutes



Read this and I may have to shoot you!


01

Chapter One

“Gabriela descended from the train, cautiously looking around for signs that she may have been followed. Earlier in the week she’d left arrangements to meet László at the Hannover end of Platform 7, and after three hours travelling in a crowded train to get there she was in no mood to find he hadn’t got her message. She walked up the platform and as she got closer could recognise his silhouette even though he was facing the opposite direction. It looked safe, so she continued to make her way towards him, close enough to slip a document into the open bag by his side. She whispered ‘Read this and I may have to shoot you!’ László left without even a glance in her direction, only a quick look down to make sure there was no BOM.”

László needn’t have worried, because the document was encoded, there was no BOM… so if he had attempted to read it in any of the languages provided apart from English he’d see something like this:

02

Unfortunately the end user he was delivering the file to couldn’t read it either! The problem I’m introducing with a very loose link indeed is one of encoding, but not the secret kind! Every now and again we see the subject of “character corruption” coming up on the forums, and I get the occasional email on the same topic, so I thought it would be useful to explore this a little and look at the reasons for it happening. But first let’s take a look at the original file, in this case an html file:

03a

There’s not much wrong with this (thank you Sameh, Romulus and Chun-yi for the transcreations… I know who to send Mats to if he ever gets around to writing that novel!) and it displays perfectly in my editor. So why do the characters become completely illegible for Arabic and Chinese, and partially illegible in Romanian when I open this file in my browser? More importantly how do I explain it when I barely understand it myself! Let’s start with an explanation adapted from the W3C writings on this topic.

Written characters whether they are letters, Chinese ideographs or Arabic characters are assigned a number by your computer. This number is called a code point. You’ve probably heard the phrase “bits and bytes” before, which are units of your computer’s memory; well, each code point is represented by one or more bytes. Eight bits make up a Byte, 1000 Bytes make up a Kilobyte (KB), 1000 KB make up a Megabyte (MB) etc… and I’m sure you recognise these terms, which are all just units taking up space in your computer’s memory. The encoding I mentioned earlier, properly referred to as character encoding, is a key that maps code points to bytes in the computer memory. This graphic adapted from one shown on the W3C webpages explains this quite nicely.

04

The key, or character encoding, I used here was Unicode UTF-8. The bytes are represented by a two digit hexadecimal number (sometimes a useful way to view files as we’ll see later) and you can see how the simple letter “m” is represented by a single byte, whereas the more complex ideograph “看” requires three bytes. If I used a different character encoding for this, say ISO-8859-6 you can see that only the “m” and the Arabic “ح” are displayed correctly:

05

The Romanian letter “ș” and Chinese ideograph “看” cannot be displayed and this is because the character set supported by ISO-8859-6 does not extend to these characters and they are replaced by a question mark in my editor. The same Arabic “ح” character only takes a single byte when using this character encoding. So in simple terms if you have the wrong key then you can’t decode the characters and as a result they are displayed incorrectly. This of course begs the question how many keys are there? Well, there are a lot and most are the result of legacy systems which were developed to allow a computer to correctly represent different types of fonts supporting different languages at the time. So the ASCII character set (a character set is simply a collection of more than one character) covers letters and symbols used in the English language; the ISO-8859-6 character set covers letters and symbols needed for many languages based on the Arabic script. My text editor offers 125 different character sets. All of this can be further complicated by the type of browser you use because not all browsers are equal and even versions of the same kind of browser can display things differently.
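
If you’d like to see these byte sequences for yourself, a tiny Python 3 sketch makes the point nicely:

# Print the UTF-8 bytes for a handful of characters
for ch in "mș看ح":
    print(ch, " ".join(f"{b:02x}" for b in ch.encode("utf-8")))

# m   6d         (1 byte)
# ș   c8 99      (2 bytes)
# 看  e7 9c 8b   (3 bytes)
# ح   d8 ad      (2 bytes)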

If you had to display characters on the same page that required different character sets it was a very complex scenario because you cannot mix more than one character set per page, or file. Fortunately today the recommended character set to use almost all the time is Unicode UTF-8 which I used for first example above. Unicode UTF-8 is also a W3C recommendation because it contains all the characters most people will ever use in a single character set and it’s supported well by all modern browsers.


The BOM!

So, if UTF-8 solves all the problems then why do we still see problems when UTF-8 has apparently been used? Continuing along the same espionage theme we have the BOM. Not a misspelling, although in some ways it can act like a BOMB when all the characters appear to be corrupted and you can’t read the text! The BOM (Byte Order Mark) is a series of bytes that can be placed at the beginning of a page to instruct the tool reading it (such as a browser) on two things:

1. That the content is Unicode UTF-8
2. What order the bytes should be read if the file is encoded as Unicode UTF-16

The latter, UTF-16, is something you will hopefully not come across in web content today because it dates back to around 1993, prior to UTF-8, and as interesting as it is I’m going to refer you to this page if you’re interested to learn more on that.  I also read on the W3C pages that according to the results of a Google sample of several billion pages, less than 0.01% of pages on the Web are encoded in UTF-16.  So for today’s use let’s stick to UTF-8 and take a look at how this tells the browser that the content is UTF-8 encoded. The first thing to note is that in keeping with our espionage theme the BOM is invisible in a simple editor, so if you open the file up with Notepad you won’t see anything more than the content you wish to display. This is important because if you don’t know about this then the reason for the display issues we saw at the start of this article could remain a mystery that a project manager could blame the translator for, with neither knowing enough to explain why… often this means the translation tool unfairly gets the blame.

The way to see it is to look at the file in an editor that supports hexadecimal editing, so one that allows you to see the content at the level of the raw bytes. I use EditPadPro for this, but UltraEdit has a Hex Editor and I believe NotePad++ has one too. If I open the file László received at the start and view it as hexadecimal the start of the page looks like the top image:

06

The bottom image is what it would look like with a BOM. These three bytes EF BB BF also represent a “zero width no break space” which is why you can’t see it, but today its use as a “zero width no break space” is virtually redundant so its primary use is as a BOM. The reason the BOM is so important is because its presence tells the application reading the file that it is UTF-8. So in the html file I showed you at the start, the bit you were unable to see was that it had no BOM.  If I add the BOM and then give the file to László he’d be able to read it and then I’d have to shoot him… which of course I don’t want to do!

If you need to add the BOM you’d need a decent editor such as the ones I mentioned to be able to add it, or maybe use a tool such as the File Encoding Converter from a master of useful tools and applications for localization and espionage, Patrick Hartnett.  This tool supports the changing of encoding for loads of files in one go… an essential tool for anyone regularly dealing with these kinds of issues.
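
If you’re comfortable with a little scripting you can also check for the BOM, and add it, yourself.  Here’s a minimal Python sketch, assuming a hypothetical file called page.html (and a backup taken first!):

BOM = b"\xef\xbb\xbf"

with open("page.html", "rb") as f:
    data = f.read()

# Prepend the three BOM bytes only if they aren't already there
if not data.startswith(BOM):
    with open("page.html", "wb") as f:
        f.write(BOM + data)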


Declaring the encoding

Phew… this is getting complicated, but I haven’t finished yet!!  The lack of a BOM in the file only causes potential problems if the correct character encoding is not declared in the file in the first place.  The W3C recommendation is that this declaration is always made.  If the file is HTML then I could rewrite my file like this for example including the meta statement in my header, or as I have done here having omitted the head element altogether, before the body (html is very forgiving!):

07

This time it wouldn’t matter whether I had a BOM or not, the characters would all be displayed correctly in my browser.  Now in theory, if you use a BOM, browsers since HTML5 are supposed to take this in preference to whatever the declared encoding is, but not all browsers respect this precedence, so I believe the recommendation would be to always declare, and don’t use the BOM!  The other advantage of declaring is that you can always see the intended encoding and hopefully avoid problems where it’s driven by invisible bytes.
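
For reference, the declarations themselves are one-liners… something like this for HTML5 and for XML respectively:

<meta charset="utf-8">
<?xml version="1.0" encoding="UTF-8"?>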

Now I’ve used HTML for the example here, but similar problems can occur with XHTML as well… so the W3C have published a recommended list of DOCTYPE declarations to help ensure these kinds of problems are minimised.  They also publish a summary of character encodings and their respective declarations for all XML and (X)HTML documents.  Wouldn’t it be good if all the files received for translation followed these rules ;-)

Now with everything corrected my final decrypted file looks like this:

09


But what about Studio?

I can’t finish this article without mentioning how Studio handles these sorts of problems.  The “writer” options on these filetypes have similar settings to control the BOM, and the HTML has the specific META declaration which is not used in XHTML or XML.  The options speak for themselves, but should be used in conjunction with the requirements of your client to help ensure that the final target files are always going to be created in such a way that the encoding is appropriate for their intended use:

08

I think these are settings that are often overlooked, especially if the creation of the filetype is left to the translator or project manager.  In many ways these sorts of problems fall into the category I wrote about in the previous article “A little Learning is a dang’rous Thing;” but maybe the explanation I’ve tried to provide here will at least provide a little awareness that they exist and why they exist too… and maybe even help to resolve a problem the next time you come across it.


At the start of this article I set out to try and explain the problems of encoding in a simple way, but as I scroll back through what I’ve written I’m not at all sure I’ve done this. Perhaps this is just one of those subjects that is complex and handling files with the potential for encoding adventures is something that requires knowledgeable people somewhere in the mix to ensure things are done correctly and the problem doesn’t land at the feet of the translator or project manager.  Perhaps I’d be better off writing a novel!  I read in the introduction to Mats Linder’s excellent manual on Studio that he always wanted to write a novel… I hope he does a better job than I would!


Good bugs… bad bugs!


01

What the heck is a good bug? I don’t know if there is an official definition for this so I’m going to invent one.

An unintended positive side effect as a result of computer software not working as intended.

I reckon this is a fairly regular occurrence and I have definitely seen it before.  So for example, in an earlier version of Studio you could do a search and replace in the source and actually change the source content.  This was before “Edit source” was made available… sadly it was fixed pretty quickly and you can no longer do this unless you use the SDLXLIFF Toolkit or work in the SDLXLIFF directly with a text editor.  In the gaming world it happens all the time, possibly the most famous being the original Space Invaders game where the levels got faster and faster as you killed more aliens.  This was apparently not by design but it was the result of the processor speed being limited, so as you killed the aliens the number of graphics reduced and the rendering got faster and faster… now all games behave this way!  Another interesting example in the Linux/Unix world is using a dot at the start of a filename to hide it from view.  This was apparently a bug that was so useful it was never “fixed”.

All this brings me onto my next point.  If you found that a particularly “good bug” was so useful that you built a process around it because it solved a problem you’d been unable to resolve before then you could be more than a little disappointed if it was “fixed”!  So these “good bugs” may often turn out to be opportunities for “features” rather than “bugs”.

This brings me onto the reason for this article which is that a “bad bug” was brought to my attention today, and as I was investigating it I discovered it was potentially a “good bug”. I say potentially because it needs a little work to become a “feature”, if indeed it doesn’t get “fixed” first!

So, here’s the “bad bug”.  When you create a project in Studio with multiple files and haven’t merged them during the creation process then you can still merge them with virtual merge at any time.  This is a cool feature, everyone loves it, but what you may not have noticed (and I didn’t either) is that every time you do this a temporary file is created in your local temp folder… so in here:

c:\Users\[USERNAME]\AppData\Local\Temp\

So for example, I open a project containing these files below and open them all up together with the virtual merge:

02

When I create this project I can see that temp files are being created in this Temp location, and when I open them all up together and press Ctrl+S to save this creates a much larger temp file.  So now I see something like this where the larger file is the temp file for all the files I’ve opened in a virtual merge:

03

Now this is where the “Bad bug” comes in.  If I have Auto-save in use then every time the save kicks in that large file at the top is saved again, and again, and again, and grows in size as I work through the translation.  When I close Studio many of these temp files are deleted or just become zero byte files, but not these big temp files.  They just stay there filling up my hard disk until I run some software utility that clears out the temp folders.  So the “Bad bug” that needs to be addressed is that Studio is not removing these temp files when we’re done and have closed Studio down.  We don’t need them because the Auto-save feature saves the files in the AutoSave folder over here:

c:\Users\[USERNAME]\Documents\Studio 2015\AutoSave\

So if I’ve already got the files saved in the AutoSave then the temp files are useless right?  Until now I would have thought so, but this is where the “good bug” kicks in.  The virtual merge in Studio allows you to open many files at once, but if you wanted to then share the merged file, which many people do, then you cannot.  If you try and save it you get separate SDLXLIFF files, one for each of the SDLXLIFF’s you opened in the merge.  Even the AutoSave files are separate, one for each of the individual files in your project.

If you take a copy of that big temp file that was created and add .sdlxliff to the end you can open it in Studio.  Once you open it you’ll see that it contains all the merged files just as if you’d either merged them when creating the Project or opened them all up together with the virtual merge.  So the temp files that are created are actually the very files Studio cannot create through the UI, and that many users are looking for so they can send them out for translation as a single file.  That was an interesting find indeed, although maybe it’s a deliberate behaviour that has a “bad bug” associated with it!
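
If you wanted to grab that file without hunting through the temp folder by hand, here’s a rough Python sketch of the idea.  The heuristic is entirely my own (it just assumes the virtual merge file is the largest file in the temp folder), so treat it as an illustration rather than a reliable tool:

import glob, os, shutil

temp = os.path.expandvars(r"%LOCALAPPDATA%\Temp")
files = [p for p in glob.glob(os.path.join(temp, "*")) if os.path.isfile(p)]

# Assume the big virtual-merge file is the largest one in the folder
biggest = max(files, key=os.path.getsize)
shutil.copy(biggest, biggest + ".sdlxliff")
print("Copied:", biggest + ".sdlxliff")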

This has a useful implication because as a Project Manager I could even add them to my project so they could be handled in a package and used to update the translation memories and pretranslate the original files when done.  I have been able to save the target files from these temp files so pre-translation of the originals might not even be necessary, but as this is an unintended benefit I’m wary of relying on the temp files for the final files as a matter of principle.  So if I create a folder for the temp files in my project and then drop them into the folder after adding the .sdlxliff extension I’d have something like this now:

04

So an interesting “Good bug” with an unexpected benefit which might be useful until the development team either “fix” the bug, or turn it into a feature.  In reality I think that even if the bug is fixed you might still be able to use this method to get at the temp file, but you’d have to make sure you do it as soon as you opened the files in Studio as they might not be there anymore once Studio is closed!

Just in case you got to the end of this and still don’t see what I’m getting at then here’s a short video to explain the process.  In the meantime my parting comment would be “enjoy the good bugs while you can!”

Approx. 7 minutes 25 seconds


Paying it forward with MS Publisher files


001

If you’ve never come across Microsoft Publisher before then here’s a neat explanation from wikipedia.

Microsoft Publisher is an entry-level desktop publishing application from Microsoft, differing from Microsoft Word in that the emphasis is placed on page layout and design rather than text composition and proofing.”

It’s actually quite a neat application for newbies to desktop publishing like me, but it’s a difficult tool to handle if you receive *.pub files (the format used by MS Publisher) and are asked to translate them.  And I do see requests from translators from time to time asking how they can handle them.  The file itself is a binary format and even with Office 2016 (which includes Publisher if you have the Professional version) the only export formats, PDF, XPS and HTML, cannot be imported back.  So very tricky indeed if you need to be able to provide your client with a translated version of the pub format.

In the past I would have suggested T-Windows for Clipboard which is installed with Studio 2015 and this would allow you to translate the text (if you have a copy of MS Publisher) using your Studio Translation Memories.  There has also been an application on the OpenExchange for around a year that can create an XML file from MS Pub (for the 2010 version only) and this does the job quite nicely and again requires a copy of MS Publisher to be installed.  But now there is a new application available, pub2xml, which also supports the latest versions of MS Publisher and it also provides some nice touches making it far easier to use.  It’s also free, but still requires MS Publisher to be on your machine.

But there is a catch!  The developer created the bones of this application and it seems to work really well for most files.  But it’s not 100% complete.  The things that are missing relate to the handling of internal formatting, like tags for font changes midway through a sentence, or hyperlinked text.  It is a good catch though, because the developer has created a solution to a problem faced by many users and then made the code available as open source on his Github site.  This means that any developer could fork the code and make changes, enhancements, fix bugs etc. and submit a pull request to share this back into the source for others to use.  This is something we also do with every application we have created through the SDL OpenExchange development team and you can find the source code for these here.  In fact the developer of pub2xml, Patrick Hartnett, has shared the source code for several apps and a few other things on his Github site too, so it’s great to see other developers following suit and helping to grow the developer community with shared resources… I guess it’s a sort of “paying it forward” approach and I like it.  Another Patrick, Patrick Porter, has done a similar thing with his code for machine translation plugins to Google and Microsoft and you can find them here as well as a few other things.

I’m really hoping that as the development community becomes more established in 2016 we’ll start to see more of these community initiatives with more developers “paying it forward” by investing in sharing a little of their knowledge to benefit everyone.

But I digress… back to MS Publisher!

Overview

The basic idea is that this is a standalone application which makes the content available (text, formatting and images) for localization.  The application itself is not complicated and has two screens, one for export and one for import.  A simple overview of the entire application is that it’s a drag and drop interface like this:


So you would drag and drop all the files you wish to convert into the interface of the Export tab, set your options and click “Export”… it’s as simple as that.  The export options here are very interesting:

  • Export translatable content : pulls the translatable text out of the file and inserts it into a simple XML file
  • Export pictures : pulls the images from the pub file and stores them in a separate folder where they can be localized if necessary
  • Create PDF file : creates a PDF rendition of the pub file making it easy to see the format in context without MS Publisher available
  • Create Pseudo file : creates a new pub file during export with the extension .pseudoTranslation.pub, replacing all the vowels in the translatable content with % or $ characters.  This allows the Project Manager to quickly confirm that all the translatable content was in fact exported… so a quick sanity check (see the sketch after this list)
  • Markup internal font information : this relates to part of what’s incomplete with the app.  If you select this option then bold, italic, strikethrough, superscript and subscript formatting will be honoured in the XML with appropriate tagging.  Any other type of formatting is currently ignored.
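Just to illustrate the idea behind the pseudo file option above, this is roughly what the vowel replacement amounts to.  To be clear, this is not the app’s actual code, just a sketch of the concept in Python:

def pseudo_translate(text):
    # Replace every vowel with % or $ (alternating) so any text that
    # didn't make it into the export is easy to spot in the output
    result = []
    use_percent = True
    for ch in text:
        if ch in "aeiouAEIOU":
            result.append("%" if use_percent else "$")
            use_percent = not use_percent
        else:
            result.append(ch)
    return "".join(result)

print(pseudo_translate("Microsoft Publisher"))  # M%cr$s%ft P$bl%sh$r

If your pseudo pub file still contains untouched text somewhere, that content never made it through the export.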

The import options are very similar and of course make sense as they are aligned to the export:

  • Import translatable content : pushes the translated text in the XML target file back into the MS Publisher file
  • Import pictures : pushes the exported images (which could now be different) back into the MS Publisher file
  • Create backup file : renames the original source pub file as .BAK so it can easily be recovered if needed
  • Create PDF file : creates a PDF rendition of the localized pub file making it easy to see the format in context without MS Publisher available

That’s essentially it… a very nice application and easy to use!

The XML file

Well… ok, that’s nearly it.  How about the XML file, and how do you handle that in Studio?  The format is very simple, so simple that the default AnyXML filetype would be sufficient to translate the file.  However, there are a few internal tags here and there that are extracted as translatable text, so it makes sense to create a custom XML filetype to ensure that the tags are properly protected.  The example files I have tested so far all seem very simple, and I created a filetype which is available in the zip you can download from the OpenExchange, as well as a sample pub file in case you need one to get started.  All of the translatable text sits between the <text> elements, so it’s very straightforward.

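The screenshot isn’t reproduced here, but a sketch of the structure is easy to give.  The element names below are illustrative, based on my description above, rather than the exact pub2xml schema, so treat this as an approximation only:

<?xml version="1.0" encoding="utf-8"?>
<document>
  <page id="1">
    <text>Welcome to our spring newsletter!</text>
    <text>All the latest news from the team.</text>
  </page>
</document>

A custom XML filetype then only needs a rule for the <text> element, and everything else stays protected.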

In Studio, using a bit of text with all the currently supported internal tags catered for, it looks like this:


But to avoid this being an unnecessarily long post I created a short video showing the process from end to end so you can see what it looks like in practice.


Approx. playing time: 15 mins

Developers “paying it forward”… an excellent concept for 2016!  I’m looking forward to seeing more of these, so perhaps we can review this with the last article I write at the end of this year.  I hope it’s going to be a full one!  Now just one more thing I forgot to mention… you can download the current version free from the SDL OpenExchange via this link.

Enjoy!

 


Handling PDFs… is there a best way?


We all know, I think, that translating a PDF should be the last resort.  PDF stands for Portable Document Format, and the reason for the name is that these files were intended for sharing with users on any platform, irrespective of whether they owned the software used to create the original file or not.  They were shared so they could be read; they were not intended to be editable, and in fact the format is also used to make sure that the version you are reading can’t be edited.  So how did we go from this original idea to so many translators having to find ways to translate them?

I think there are probably two or three reasons for this.  First, the PDF might have been created using a piece of software that is not supported by the available translation tool technology and with no export/import capability.  Secondly, some clients can be very cautious (that’s the best word I can find for this!) about sharing the original file, especially when it contains confidential information.  So perhaps they mistakenly believe the translator will be able to handle the file without compromising the confidentiality, or perhaps they have been told that only the PDF can be shared and they lack the paygrade to make any other decision.  A third reason is the client may not be able to get their hands on the original file used to create the PDF.

Whatever the reason, handling the PDF is tricky for a number of reasons:

  1. It might be a scanned image, in which case you need to OCR the file first to have any chance of getting at the text, and the success of this will vary considerably with the quality of the PDF and how easy it is to get at the text where it sits over images or coloured backgrounds, for example.
  2. The conversion of the PDF to allow you to translate it might be a text only extraction in which case you might have to extensively DTP the file afterwards to provide a formatted document.
  3. The conversion of the PDF might be an attempt at creating a formatted text & image extraction, probably in Word format, and the extent of DTP afterwards will range from nothing to a serious amount of work depending on the type of content and the quality of the PDF.
  4. And then the final format of the file.  What is the client expecting?  If they provide a PDF and expect InDesign files back then you have more work after the translation because you are probably going to end up with a Word file at first.  There are tools to help with this but it’s still more work afterwards.

There is probably no way around the last point without a lot of work, so I’m going to concentrate on how to return a PDF.  I know that sometimes you may have a client who actually wants a Word file back because they really did lose the source, and Studio is excellent for this because you’ll have a source and target version in Word when you’re done, but I’m going to concentrate on returning a PDF and how to get the best quality finish.  Now, some translation tools will handle a PDF for translation as we know.  Some can even do a rudimentary OCR, some do a very good OCR.  Some handle the PDF as a text file only and some will make an effort to reproduce the formatting of the PDF by converting to DOCX.  But as far as I know, none will allow you to recreate the PDF so that the formatting is as good as it is in the PDF itself.  So is there a best way?  Probably not one best way, as it will depend on the file you have been given, but I’m going to share the one I like the best so far as it has bailed me out of many tricky PDF related problems in the last year or so.

InFix PDF Editor

InFix is a PDF editor developed by Iceni Technologies, and basically it’s a tool that allows you to edit the text in a PDF… sort of an Adobe substitute you might think.  But in actual fact it’s a little more than that, because it has this very handy menu that gives it away as a tool that could be very useful for translators:


Actually this menu is from Version 7, and the XLIFF approach may have resulted from the valuable lessons they learned in working with a few people in our business.  The difference is that the Local menu item at the bottom is from Version 6… ish, and this allowed you to export the extracted text to an XML or Plain text (with markup) format.  They even provided some filters for use with “popular CAT tools”, although sadly they haven’t realised that Trados is completely redundant and hasn’t been sold for around 7 years, but they still provide an ini file!  I’d be happy to provide a suggested sdlftsettings file for Studio if anyone needs it!  (Post publication addition: After being asked in the comments I put an sdlftsettings file for the txt and the xml exports here)  The other items at the top are all Version 7 and this is far more interesting and reliable.  This version extracts the translatable text from the PDF and exports it as an XLIFF.

Now, the reason the bottom item is called Local is because the InFix application does all the work on your computer.  The XLIFF parts however are all done in the cloud using their TransPDF website.  This is quite impressive and you can use this without the InFix PDF Editor at all.  The idea is you upload a PDF, select the language pair you want, download the XLIFF, translate it in any translation tool you like, upload the translated XLIFF and the cloud miraculously returns the now translated PDF ready for further editing or handing over to your client as it is.  There is a cost associated with this and at the time of writing you get 50 PDF pages for free and then pay 50 cents a page thereafter.  So if you don’t get a lot of PDFs that need translating this could be exactly the tool you’ve been looking for.  You pay as you need it and build the cost into the price for your client… couldn’t be easier!

Also worth mentioning the cloud solution a bit more.  When you sign up you get your own account which keeps track of the projects you might be working on and also provides a flight check guide to anything you need to address, such as font changes where a different font would be better to represent the characters in the target version for example.  You can use this dashboard independently of the InFix Editor, but if you do have InFix then the process is quite well integrated, allowing you to work only from the desktop tool, connecting to the cloud when needed.


If you do get a lot of PDF files then I’d recommend you purchase the InFix PDF Editor.  This is really a wonderful tool even without the translation options.  You can almost treat a PDF as if it was just a Word file, or a Publisher file.  Not nearly as flexible of course, but amazingly good.  On price, well this is another thing that’s changed with Version 7: it’s now a subscription service and has some very good value options:

  • £5.99 per month, renewed month to month
  • £59.99 per annum for a single user
  • £1,199 per annum for up to 1000 users

If you take any of these then the TransPDF feature is free of charge, you just use it whenever you like.  So if you do more than 120 pages a year then the annual license pays for itself easily.  If you have a 12 page document to do then even the monthly license is worth it.  If you have any editing at all to do in the PDF afterwards to try and get a more polished translated version for your client then you won’t need to buy another PDF editor, you just use this.

Normally I would not go on about translating PDFs or software to help you with it, but this tool is really worth a look.  To make it easier to follow I’ve created a video with a PDF file I took from the internet (cut down to 3 pages for this demo), deliberately chosen so it’s not too easy, but also not too hard.  I did take a look at what you get with various translation tools that can handle PDFs according to their documentation… also quite enlightening, but I’m not going to discuss that in here!  That exercise did reinforce my opinion that Studio does have the best PDF converter built in.  It’s not always good for all the reasons already discussed but as you’ll see it provides an excellent attempt with this example file.  Have a look for yourselves and test it in your tools if you don’t believe me!

Video duration approx. 17 mins

That was it… if anyone asks me what’s the best way to handle a PDF my initial answer is still the same… get the original source file.  But at least now I have a pretty good second choice before resorting to the translation tools themselves.

A final word on the potential for improvements.  I would love to see Iceni use the Studio API to create a new view that did the following:

  • Drag and drop your project PDF files into the new View
  • Bulk export the XLIFFs for all the files and create a Studio project
  • Once the Project was complete run a new Batch Task that exported the translated XLIFFs to a location where they can be imported back into the PDFs
  • Download the translated PDFs for final edit and review
  • Maybe include a similar view to TransPDF inside this Studio view to complete the picture.
  • … and one more added after the original article was posted.  Support for BiDi languages (Arabic, Hebrew etc.)

That would be a very nice enhancement for project managers and translators dealing with large numbers of PDF files and probably not difficult to do from the Studio side.  Maybe for Xmas😉


Getting a filetype preview…


One of my favourite features in Studio 2017 is the filetype preview.  The time it saves when you are creating custom filetypes makes it a joy to use.  I can fill out all the rules and switch between the preview and the rules editor without having to continually close the options, open the file, see if it worked and then close the file and go back to the options again… then repeat from the start… again… and again…   I guess it’s the little things that keep us happy!

I decided to look at this using a YAML file as this seems to be coming up quite a bit recently.  YAML, pronounced to rhyme with “camel”, stands for “YAML Ain’t Markup Language” and I believe it’s a superset of the JSON format, but with the goal of making it more human readable.  The specification for YAML is here, YAML Specification, and to do a really thorough job I guess I could try and follow the rules set out.  But in practice I’ve found that creating a simple Regular Expression Delimited Text filetype based on the sample files I’ve seen has been the key to handling this format.  Looking ahead I think it would be useful to see a filetype created either as a plugin through the SDL AppStore, or within the core product, just to make it easier for users not comfortable with creating their own filetypes.  But I digress…

Example YAML file

I’ve seen a few variants on YAML already but the basic principle for our needs (translatable text extraction) is very similar to JSON in that the text is held in constructs known as “scalars”.

 blog_title: "Bridging the divide, merging segments"
 blog_keywords: merging, paragraph breaks, SDL Trados Studio
 blog_ref: 'For more information <a href=%{info_link}>[Click Here]</a>'

It’s apparently acceptable for the scalars to be contained within single quotes, double quotes or no quotes at all, and so far I have examples of each of these.  In fact the sample above uses all three variants.  Reading the specification tells me that it’s also possible to have them in a folded form denoted with >, but I have not come across an example like this yet.  So typically, supporting YAML using a custom regex based filetype to suit the examples provided by customers has been trivial.  I can get at the translatable text within the scalar using document structure rules like this (opening pattern followed by closing pattern):

^.*?:\s"
"$

Or this:

^.*?:\s
$

Or this:

^.*?:\s'
'$

But then I occasionally came across a file where both single quotes and double quotes were used in the same file… so I added a non-capturing group, “(?:)“, offering the alternatives through the use of the pipe symbol “|“:

^.*?:\s(?:'|")
(?:'|")$

I didn’t come across a file that used no quotes at all, or a combination of quoted and unquoted scalars in the same file… but here’s how I could handle that eventuality:

^.*?:\s(?:'|"|)
(?:'|"|)$

So for the sort of files I’ve come across so far this last pair of opening and closing document structure rules would do the job.  I guess I could have gone straight to this final set, but I thought it might be interesting for anyone playing around with regex for the first time to see the iterations.  It also gives you an idea of the sort of testing you might go through in getting the filetype right… it can be a lot of to-ing and fro-ing.
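If you want to sanity check rules like these outside Studio first, a few lines of Python will do it.  This is just a test harness for the final pair of patterns above, not anything Studio related:

import re

# Final opening/closing rules combined into one test pattern
pattern = re.compile(r"""^.*?:\s(?:'|"|)(.*?)(?:'|"|)$""")

samples = [
    'blog_title: "Bridging the divide, merging segments"',
    'blog_keywords: merging, paragraph breaks, SDL Trados Studio',
    "blog_ref: 'For more information <a href=%{info_link}>[Click Here]</a>'",
]

for line in samples:
    match = pattern.match(line)
    if match:
        print(match.group(1))  # the translatable text only, quotes stripped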

The other interesting thing about YAML is that it can contain complete html files, or just text marked up with html, or even marked up with script.  The last scalar in my example contains translatable text with html markup and script.  So I can handle these using Inline tags in Studio and just convert any markup to protected tags.  This is where the purpose of this article really comes in… using the new preview capability.
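Before moving on to the preview, here’s a rough illustration of what I mean by Inline tags.  Patterns along these lines would protect the markup and the placeholder in that last scalar (these are my own suggested patterns, to be adapted to your own files):

<[a-z][^>]*>   an opening html tag, e.g. <a href=…>
</[a-z]+>      a closing html tag, e.g. </a>
%\{[^}]+\}     a script placeholder, e.g. %{info_link}

With rules like these in place the translator sees protected tags instead of raw markup.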

Preview File

When I go to my File Types options in Studio now I see this at the bottom of the screen:


This little addition means I can browse for my test file, in this case multifarious.yml, and with one click see whether the rules I’m creating are extracting the correct text, and also converting inline code/markup to tags.  It replaces this sequence of events:

  1. close the options screen
  2. open the test file for translation
  3. review the content
  4. close the test file
  5. open the options again and apply changes
  6. repeat as needed

In fact the process I outlined there is not even the way many translators/engineers did this in the past.  I have seen people not familiar with the single document process creating a new project each time just to test the filetype settings.  So having this one click preview is a serious timesaver if you are responsible for creating filetypes in your organisation, or even if you do the occasional one and find it requires a lot of to-ing and fro-ing to get it right.  The preview itself is very neat and concise… in my example it looks a little like this:


An important point to note is that you can use this feature to check the effects of the settings for any file supported by Studio.  So this is not just a tool for geeky regex loving translators/project managers… it’s really good for preparing files of any kind.

Video

But I thought the best way to demonstrate this would be by a video, as this really shows the benefit, and also how to work through my fabricated YAML filetype in detail.

Video: Length is approx. 8 minutes

For me this is one of the best improvements in this release.  It’s a small thing, but as I create quite a lot of custom filetypes for various types of files it is a real timesaver.  In fact it’s also an absolute pleasure to use it!!


All that glitters is not gold…


Years ago, when I was still in the Army, there was a saying that we used to live by for routine inspections.  “If it looks right, it is right”… or perhaps more fittingly “bullshit baffles brains”.  These were really all about making sure that you knew what had to be addressed in order to satisfy an often trivial inspection, and to a large extent this approach worked as long as nobody dug a little deeper to get at the truth.  This approach is not limited to the Army however, and today it’s easy to create a polished website, make statements with plenty of smiling users, offer something for free and then share it all over social media.  But what is different today is that there is potential to reach tens of thousands of people and not all of them will dig a little deeper… so the potential for reward is high, and the potential for disappointment is similarly high.

My guess is you can probably relate to this in some form or another, but as I like to write about translation technology I wanted to use this to look at a new breed of CAT tool that claims to support around 30 filetypes (Studio supports over 70, plus any available through the API/AppStore) including *.SDLXLIFF, and also has the ability to support the Studio translation package, *.SDLPPX.  Now the title of this article is “All that glitters is not gold”, and I called it this because this particular tool, SmartCAT, is free, and if you read their rhetoric they can provide everything every other major translation vendor can.  They make their revenue by taking 10% commission on translation services ordered through their site, and also for machine translation if you use it (Google, Microsoft and SmartCAT), and they provide a free online translation environment to support the services.  Sounds perfect, and this is backed up by their often used quote “everyone wins, translators, customers and us”.  I like the model… what I don’t like are some of the claims for its technology.  The reason I don’t like them is because they have the potential to cause a lot of problems for translators and their customers when the tool doesn’t do what it says in the glossy website.  Perhaps I should just ignore the inappropriate comments they make about their solution in forums and social media… perhaps not, as I think these things shouldn’t go unanswered.  But aside from the glittery claims, the solution is also online only and for me this is a non-starter.  Perhaps it’s my age… or perhaps it’s the frustrating hours I have spent travelling with an inability to get an internet connection that prevented me from doing any work that was not possible locally.  Online is certainly the future, but I think we are a long way off being able to take this step without any sensible alternative.

Digging a little deeper…

I’m going to focus on the supply chain workflow using *.SDLXLIFF files and *.SDLPPX packages as these are the areas I am most bothered about.  The ability of this tool to handle other filetypes is quite easy to see… it can open the filetypes it mentions, but as a professional translator have a think about the options you often need to use when preparing files.  With SmartCAT there are no options (other than to tag things using regex afterwards), it simply parses all the text.  This might be passable for a part time translator looking for a simple way to handle a one off job, or where all the text being parsed is translatable, but it may not be adequate for a professional.  If you look at XML file handling for example there is nothing smart about it at all: everything is parsed if it’s in an element; no ability to handle anything other than an element and no ability to pick and choose which elements.  Even something simpler such as a DOCX does nothing more than parse the simple stuff… no field values, no tracked changes etc.  I’m sure for the simplest of files this tool might be fine, but I really doubt it’s for a professional translator who needs more flexibility than this.  This is my problem… Studio, Trados, memoQ, DVX… all tools designed for a professional translator.  SmartCAT is not in the same league, but from reading the website you could be forgiven for falling for the glitter, especially if you were new to the industry.

The supply chain workflow

When you’re working with source files directly for the end client then you take responsibility for the final job.  You have to deal with missing translations, or poorly formatted files.  But when you are dealing with translation packages or bilingual files you are most likely handing them back to a project manager who has to deal with them.  This project manager will have spent time preparing the files so that they are handled in the way they choose, and they expect to see the files back fully translated and in the same format they sent them.  Anything less is going to create a lot of work, and potentially risk non-payment for the translator.  So now, if you don’t handle the files and packages in a supporting tool you are affecting more than just yourself.  But let’s take a look at a few examples based on some very simple testing.

Lack of support for languages

SmartCAT claims support for 70 languages (including variants); Studio supports nearly 100 variants of English alone… the total number of languages supported is getting on for 600.  This is a limitation… fair enough.  If you try to add a package containing an unsupported language you get this message:


That’s also fair enough.  It does limit its ability to “support” Studio packages, but at least you are told at this point even if there is no mention of it in the glitter.  If you just load an SDLXLIFF however the story is different, and you can change the languages to be whatever you like… both the source and the target.  There is an argument to say this is ok, in fact it’s great.  But for the project manager who sends out *.SDLXLIFF files only this may not be so great.  In this case SmartCAT overwrites the language code with the one you select and the file you send back to your project manager is no longer correct.  Not so smart.

Perhaps worth mentioning at this point that if I load a partially translated SDLXLIFF into SmartCAT then it presents me with a completely empty target column, so all the work carried out by others is lost.  Not so smart support for SDLXLIFF.

Segment statuses and comments

Project Managers use comments to send information to the translator, and get comments back.  Project managers use different statuses for the segments to differentiate between the work required and to help identify which segments need to be looked at, and which don’t.  SmartCAT is a little like the infamous honey badger in this respect; he doesn’t give a f…hoot!


I aligned the Studio statuses on the right with the SmartCAT interpretation on the left.  In general SmartCAT considers “translated”, “reviewed” and “signed-off” all as “done”.  It considers “draft”, and “not translated” where source is copied to target, as “in translation”.  This provides no way for the translator to know what to work on at all, and to make matters worse, when you save the target from SmartCAT every single segment that is “done” now opens in Studio as “signed-off”.

As for the comments… what comments?  Definitely not very smart.

Support for translatable controls

What do I mean by this… basically anything that uses a non-translatable feature to protect parts of the text.  This is a very common requirement in localisation projects and SmartCAT doesn’t handle this at all well.  One quick example creating a Studio project and then opening the package in SmartCAT:


I have no idea what’s going on here but clearly the support for packages and sdlxliffs is simply not there.

Conclusion

I’m going to stop looking at this here because I think I’ve made my point.  I haven’t even written about what SmartCAT does to the return files, and I could easily write a lot more about this; nor have I looked at how SmartCAT is unable to handle other features in Studio and show how they are represented in the packages/sdlxliff files; nor have I looked at their claims to be able to support MultiTerm XML… at least I have not written about it here, but I have looked.  There is a lot I haven’t looked at, and on this basis alone it appears SmartCAT have not looked very well either.

I think the points I really want to make are these.  First, don’t believe everything that you read; the glossy websites and tweets telling you that a tool like this is a suitable substitute for your work in Studio are simply not true.  I imagine the same thing would apply to the features in other specialist translation tools as well.  Secondly, take the trials and make your own tests if you want to be sure.  Thirdly, if you’re thinking about using SmartCAT to support Studio projects and want some impartial advice (remember what I said about not believing everything you read, although I will be honest) then feel free to contact me in the comments.

Finally, I don’t want this article to sound completely as if I’m having a go at SmartCAT.  I am frustrated by their constant inappropriate marketing messaging, but that doesn’t mean they are all bad.  They do have some nice features and in an environment created around files they can work with I can imagine this is a very nice solution.  But it’s not all things to all men, and certainly not gold!


Cutie Cat?


A nice picture of a cutie cat… although I’m really looking for a cutie linguist and didn’t think it would be appropriate to share my vision for that!  More seriously, the truth isn’t as risqué… I’m really after Qt Linguist.  Now maybe you come across this more often than I do, so the solutions for dealing with files from the Qt product, often shared as *.TS files, may simply roll off your tongue.  I think the first time I saw them I just looked at the format with a text editor, saw they looked pretty simple and created a custom filetype to deal with them in Studio 2009.  Since that date I’ve only been asked a handful of times so I don’t think about this a lot… in fact the cutie cat would get more attention!  But in the last few weeks I’ve been asked four times by different people and I’ve seen a question on proZ, so I thought it may be worth looking a little deeper.

The format of the *.TS files is XML, or at least the ones I have seen so far are.  In fact the format of the files I have seen so far seems very straightforward, so I knocked one up with two strings like this:
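(The file itself was shown as an image in the original post; this is my own reconstruction of a minimal two string example, so the strings are illustrative rather than the exact file:)

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE TS>
<TS version="2.1" sourcelanguage="en" language="de">
<context>
    <name>MainWindow</name>
    <message>
        <source>Open file</source>
        <translation type="unfinished"></translation>
    </message>
    <message>
        <source>Save file</source>
        <translation type="unfinished"></translation>
    </message>
</context>
</TS>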

All I had to do to handle this was create a couple of parser rules to extract the text from the target file when the type attribute in the translation element said “unfinished”, so like this:
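(The rules were also shown as a screenshot; in a custom XML filetype they boil down to an XPath expression along these lines, a sketch rather than the exact rule from my settings:)

//translation[@type='unfinished']

So only the translation elements still flagged as unfinished are extracted, and everything else in the file is left alone.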

I could even create a custom preview to show me all the other segments and comments if I needed to provide some context to the translator while they worked on the translatable text.  So all good, and simple to achieve.  But what if the file needed to be reviewed, so you need to see source and target for example?  Here I’d have a problem because the custom XML filetype I created is monolingual.  So to solve that one I’d need to transform the file to XLIFF, or get a developer to create a custom filetype for *.TS files so they could be handled as bilingual XML files.  All possible!  But then it was suggested by a few translators in the SDL Community that Qt Linguist was the way to go as this should support an export to XLIFF… only problem was it’s not that easy to get hold of.  So let’s look at that problem!

QT Linguist

Evzen Polenka, in his inimitable style, advised that the main problem is to get hold of Qt Linguist because the Qt developers don’t provide separate Linguist builds.  So you either have to download and install the complete Qt framework, or google quite hard for Linguist builds created by other people. (I left out all the colourful parts… google Qt Linguist in the SDL Community if you want to enjoy the whole conversation!)

So a bit of googling and I discovered that the application can be found in the Google Code Archive.  But if you look there and navigate to the downloads you’ll see the latest version available is 4.6 and it’s dated Dec 4, 2009.  Quite old, and everyone thinks there’s a newer one.

So I emailed the people at Qt who first of all wanted to know my licence ID which I don’t have… guess they saw me as a commercial user.  So after explaining what I needed, just to be able to help translators handle *.TS files in their preferred translation tool I received some useful hints:

Qt comes with a localization tool, Qt Linguist, which has the best
support for translating Qt applications. The latest released
Qt Linguist version can be downloaded and installed with Qt 5.8:
https://www.qt.io/download/

The page has the following instructions that you might find useful:

The "native" tool for translating TS files is Qt Linguist. It is pretty
self-explanatory and comes with documentation. If you prefer to use another
tool (most probably because of better support for translation memory), you
might need to convert the TS files to and from some other format:
$ lconvert <file>.ts -o <file>.po
$ <translate/edit> <file>.po
$ lconvert -locations relative <file>.po -o <file>.ts
XLIFF might also work for your tool.
Note: Always use the latest stable linguist tools available. Also, 3rd party
tools like ts2po were known to cause trouble.

This seems to be quite helpful and makes me think there is a command line possibility for batch converting, which is probably attractive for localization engineers, but it also confirms the observations from Evzen that the only way to get the latest Qt Linguist version is to take the entire package.  So I followed the link, navigated to the downloads and chose the online installer, which was a 17Mb download, but this then runs through an installation process which takes up 20Gb of disk space based on me selecting the 5.8 version mentioned above.  Once this was complete I could open and run Qt Creator.  What I couldn’t do is easily see how to get at Qt Linguist, so I looked it up in the manual and found this:

Qt Linguist is a tool for adding translations to Qt applications. Once you 
have installed Qt, you can start Qt Linguist in the same way as any other 
application on the development host.

For someone not familiar with Qt this isn’t too helpful.  So I searched for linguist.exe in Windows, which is the program in the Google Code Archive, and found 5 instances of it in my new Qt folder, all of which started Qt Linguist version 5.8.  So that worked, and now I can run the latest version, but I needed a pretty hefty download and install just to get it.  But now I can open my source *.TS file in Qt Linguist.

I’m not sure there is a visible difference between the way in which I want to use this tool for conversion only in version 4.6 compared to version 5.8, but there may well be bug fixes and improvements in the latest build, so if you handle these files a lot it makes sense to take the latest one.  I just wish that Qt provided an installer for Qt Linguist as a standalone tool, like the one available in the Google Code Archive, because then it would be a lot easier for translators who really don’t need, or want, the other tools.

***Update***

Also adding to this post that, as mentioned in the comments below, Evzen had found a bug report I hadn’t read, in which there is a link to a separate github repository containing the installers for Qt Linguist itself.  Clearly much easier than the convoluted process I just went through, but still an unofficial solution.  They do seem to have recognised the need to support translators with this build, and the bug report is an enhancement request to provide the separate installers officially.  But the rest of the article is hopefully still useful, and it might be useful for the Qt guys to read this too in case anyone else asks them the same question I did, and in case they need more information to support the enhancement request.

***end***

Going back to the SDL Community I also read another good tip, this time from Christine Bruckner, who advised that she converts the *.TS files to *.PO as opposed to *.XLF because this way she can use embedded content rules to handle the embedded content.  Qt Linguist is capable of doing both so you can decide for yourself.  There are advantages and disadvantages to them all.  Using my simple two string test file as an example I made a few simple observations below.

Exporting to XLF

The first thing is that the languages are recognised if you use *.XLF.  They are not in *.PO, or in *.TS using a custom XML filetype.

The second thing I thought would be helpful is that the statuses of the segments can be mapped, and I could see this in Studio using the defaults.

So using the defaults, “Signed off” and “Draft” compare to “Accepted/Correct” and “Not accepted” in Qt Linguist.

I could change this and map something different, but it works for me.  However, one thing I did notice is that whilst Studio uses the XLIFF state attribute to determine the status, Qt uses it in the export file but ignores it in the import file, as it wants to see the optional “approved” attribute on the trans-unit.  So it expects to see something like this:

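(An illustrative reconstruction of the image from the original post; the German string here is just for the example:)

<trans-unit id="1" approved="yes">
  <source>Open file</source>
  <target state="translated">Datei öffnen</target>
</trans-unit>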

Studio doesn’t use this optional attribute, so the file will always come back into Qt Linguist with the “Not accepted” status and will have to be updated in there.  If anyone has found a workaround to this in Studio, other than running a regex search/replace on the final XLIFF, perhaps you could share it here.

The other useful feature is that the “translatorcomment” is also visible with the XLIFF filetype.

Exporting to PO

The obvious advantage here is the ability to handle embedded content.  I think it’s pretty common in *.TS files to have placeholders throughout the strings, and these can be handled quite easily in both the out of the box PO filetype in Studio and the PO filetype on the SDL AppStore (created by SuperText).  The Studio PO filetype will represent the strings as follows.

Interesting that they are given an AT status, although the segment translation status is the same as for XLIFF as well as the comment being shown in the comments view.  The AppStore PO does not extract the comments so it’s worth noting this, although I imagine the SuperText guys could enhance it if they see the need, but it also uses the AT status.  In truth this is probably a more accurate reflection of the translation origin seeing as it’s come from another tool with no match value provided.

The other difference is that the “approved” status used in Qt is supported much better through the use of the PO filetype, as this is reflected correctly in the target file it returns.

So for me, *.PO is the better bilingual filetype to use when working with these Qt files, because of the work that will be saved in not having to manually approve all the translations you are already happy with in Qt Linguist, and in being able to handle any embedded content.

Custom XML

I’m going to mention this one but in reality I think the best solution here is to ask a developer to create a bilingual filetype to support *.TS files.  The format is very simple and it’s probably not a difficult thing to do.  The benefit is that there would be no need to go through all this hassle of getting hold of Qt Linguist in the first place if you happen to be working for a client who doesn’t export the files for you as *.PO or *.XLF.  I think a variant of the existing PO filetypes would probably be a very good starting point as you’d have the framework already in place.

But as a monolingual filetype, if you are fortunate enough to have a file that is prepared in a way that supports you handling the *.TS files like this, you could also create a nice preview and then work as follows.

In this example I only extract the segments that need to be translated, using the same rules I mentioned at the start of the article, so you only see one segment in the Editor.  But then I created a custom preview using XSLT to display in real time the “source”, “translation” and “translatorcomment” for the whole file.  This could be a very nice solution, giving you the full context in one view that you don’t get from the other filetypes, especially if a bilingual XML filetype was created by a developer.  But even like this I think it works quite nicely, and you could do a better job of the preview to make it easier to read.
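For anyone who wants to try the same thing, the stylesheet behind a preview like this can be very small.  A minimal sketch (my own simplified version, assuming the *.TS structure shown earlier, not the exact stylesheet I used):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <table border="1">
          <tr><th>Source</th><th>Translation</th><th>Comment</th></tr>
          <!-- One row per message in the TS file -->
          <xsl:for-each select="//message">
            <tr>
              <td><xsl:value-of select="source"/></td>
              <td><xsl:value-of select="translation"/></td>
              <td><xsl:value-of select="translatorcomment"/></td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>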

Using SDL Passolo

I’m adding SDL Passolo in here after Hans Pich mentioned it in the comments, and after Daniel Brockmann suggested it was worth covering because of the improved features you can have with Passolo.  Now, Passolo out of the box won’t do a much better job than the solutions I have already covered, but there is a plugin available on the appstore for Passolo called SDL Passolo add-in for Qt® developed by Henk Boxma.  This is a paid plugin for the full version of Passolo, but if you are a translator receiving Translator Bundles for translation with the Free Edition of Passolo then you may come across this option for these types of files if your client is using Passolo for their Qt translation projects.  As a translator there is no additional cost for you, so you just need to open the bundle and work on the files; only the creator of the Passolo bundles incurs the cost of the plugin.

Henk describes the reasons you would use this in the manual and it does identify more complexity than I have dealt with in this article:

  1. Strings in TS files often do not have unique string identifiers. It is not possible to do a reliable alignment, because a different sorting order of strings in the translated file will result in misalignments.
  2. It is possible to define numerus forms in TS files, like for example singular, paucal and plural (see the sketch after this list).  The Passolo XML parser will not detect this and will simply concatenate all forms into one string.
  3. The translator may provide length variants for a translation. For example a short and long form. Qt® will select the translation that fits best, based on the available control size.
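To illustrate the second point, numerus forms in a *.TS file look something like this (a standard Qt construct, shown here with two forms; the strings are my own example):

<message numerus="yes">
    <source>%n file(s) copied</source>
    <translation>
        <numerusform>%n file copied</numerusform>
        <numerusform>%n files copied</numerusform>
    </translation>
</message>

A plain XML parser that doesn’t know about the numerusform elements will run these together as one string, which is exactly the problem the plugin addresses.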

So a few more options here that will not be handled at all with the three solutions I have discussed so far, and of course you also have the Passolo benefit of preview capabilities for the UI files.  I think anyone working on a large Qt project should probably consider the use of Passolo with this plugin because it may be the only way to really handle the files correctly, other than using Qt Linguist for the translation.  Specialist software translation isn’t something I’ve addressed in this blog so far, so perhaps it’s long overdue!  The workflow is described in the free manual provided by Henk.

Take a look at this site if you want to learn more about handling *.TS files in Passolo.

Conclusion

I hope this article is going to be useful for anyone handling Qt Linguist files and I’d welcome any feedback from experienced users who already handle them as it would be interesting to see how this could be done better.  In the meantime I hope this provides three ways to handle basic files coming from a Qt Project, one way to handle them in a more professional software localization tool, and an explanation of the elusive Qt Linguist… at least an explanation from someone who only spent an hour or so trying to find it and has no experience of how the application is used in practice.  In fact if I managed to get this far I hope it’s set a good example for others and they won’t be put off by the initial barriers posed by unfamiliarity.  There’s an answer for everything.



Iris Optical Character Recognition


I’m back on the topic of PDF support!  I have written about this a few times in the past with “I thought Studio could handle a PDF?” and “Handling PDFs… is there a best way?“, and this could give people the impression I’m a fan of translating PDF files.  But I’m not!  If I was asked to handle PDF files for translation I’d do everything I could to get hold of the original source file that was used to create the PDF because this is always going to be a better solution.  But the reality of life for many translators is that getting the original source file is not always an option.  I was fortunate enough to be able to attend the FIT Conference in Brisbane a few weeks ago and I was surprised at how many freelance translators and agencies I met dealt with large volumes of PDF files from all over the world, often coming from hospitals where the content was a mixture of typed and handwritten material, and almost always on a 24-hr turnaround.  The process of dealing with these files is really tricky and normally involves using Optical Character Recognition (OCR) software such as Abbyy Finereader to get the content into Microsoft Word and then a tidy up exercise in Word.  All of this takes so long it’s sometimes easier to just recreate the files in Word and translate them as you go!  Translate in Word…sacrilege to my ears!  But this is reality and looking at some of the examples of files I was given there are times when I think I’d even recommend working that way!

But there were files I saw that looked as though they should be possible to handle in a proper translation environment.  We tested a few and the results were more often than not pretty poor.  So even though we could open them up it was still better to take the DOCX that Studio creates when you open a PDF and then tidy up the Word file for translation.  At least this is some progress… now we’re able to handle the content in a translation environment and not have to recreate the entire file.  But it would be even better if the OCR software could make a better job of it.  And this is where I want to get to… better OCR!

SDL Trados Studio 2017 continues to provide the same PDF filetype as earlier versions of Studio, using technology from Solid Documents, and this does a fairly good job of extracting the translatable text with OCR for many files.  But it could use improvement.  SDL Trados Studio 2017 SR1 has introduced another option for OCR using software called ReadIris that is part of the Canon Group.

Out of the box, according to the documentation, Iris supports 134 languages for OCR which is pretty impressive.  They don’t quite match the languages supported by Studio however, but a rough count and compare suggests there are some 95 shared languages… and they even support Haitian Creole which Studio does not as we know 😉  Still impressive however and it easily beats the 14 languages supported by Solid Documents in Studio 2017 prior to the introduction of Iris.  Additionally this opens the possibilities for handling scanned PDF files in Asian languages, Arabic, Hebrew and many others that were previously difficult, if not impossible, to handle.

Using the new options

So let’s take a look at where you can find this new option and how you use it.  First of all you need to go to your options:

File -> Options -> File Types -> PDF

Then navigate down to “Converter“.  Down near the bottom you’ll see the “Recognize PDF text” group as shown below and the option to activate this new feature is at the end:

Check the box and you’ll be presented with this screen:

It’s an App!  You may be wondering why you need to do this and why it was not just integrated into Studio.  The reason is simple… not everyone will want this option, and the underlying software requires a 150Mb download which would have increased the size of the Studio installer to over half a gigabyte.  So it was made optional.  If you want it you click on the “Visit AppStore” link in the message above, or the one I just wrote, and download and install the plugin just as you would any plugin from the appstore.  If you don’t do this then Studio won’t be using the software.  There are no warnings, and the option remains checked, but you won’t be using it.  So when I open the Chinese PDF I just created, by copying some text as an image and saving it to a PDF, all I’ll get is this:

None of the text is extracted for translation at all.  But if I install the plugin and try again I see this:

Now we’re cooking!  It would be useful to get rid of the tags though, as these seem to be aesthetic only: just colours and font changes where the OCR picked up a few minor differences and then introduced tags to control them.  As these are formatting tags only I could just ignore them, or press Ctrl+Shift+H to hide them in the editor.  But if I want to remove them altogether I can do this with another app called Cleanup Tasks that I have written about before.  These three options do the job for this file:

Now I have this and can translate without any tags at all:

Nice… and if all of that sounds complicated it wasn’t really.  I created a short…ish video below putting this all together so you have an idea of how it works.

Approx. length : 16.26 mins

After all of that I don’t want you to get the impression I’m a converted believer in the possibilities of PDF translation… I’m not.  We’re unlikely to see the back of PDFs for translation any time soon, so I am happy to see the technology to support this workflow improving all the time.  I also don’t want to give the impression this is going to help with every PDF you ever see.  It won’t!  The problems of PDF quality don’t go away because of the way they’ve been created in the first place, so source is always best.  You’re also quite likely to find PDFs you can’t handle even with Iris, and you might even find that the more basic option without Iris does a better job of your PDF conversion.  So it’s horses for courses… you have the tools and can apply the most appropriate one for your job.

If you have any questions after reading this post or watching the video then I’d recommend you visit the SDL Community and ask in there… or just post into the comments below.


Double vision!!


There are well over 200 applications in the SDL AppStore and the vast majority are free.  I think many users only look at the free apps, and I couldn’t blame them for that as I sometimes do the same thing when it comes to mobile apps.  But every now and again I find something that I would have to pay for but it just looks too useful to ignore.  The same logic applies to the SDL AppStore and there are some developers creating some marvellous solutions that are not free.  So this is the first of a number of articles I’m planning to write about the paid applications, some of them costing only a few euros and others a little more. Are they worth the money?  I think the developers deserve to be paid for the effort they’ve gone to but I’ll let you be the judge of that and I’ll begin by explaining why this article is called double vision!!

From time to time I see translators asking how they can get target documents (the translated version) that are fully formatted but contain the source and the target text… so doubling up on the text that’s required.  I’ve seen all kinds of workarounds ranging from copy and paste to using an auto hotkey script that grabs the text from the source segment and pastes it into the target every time you confirm a translation. It’s a bit of an odd requirement but since we do see it, it’s good to know there is a way to handle it. But perhaps a better way to handle it now would be to use the “RyS Enhanced Target Document Generator” app from the SDL AppStore?

RyS Enhanced Target Document Generator

The solution provided by this app is a little similar to the auto hotkey approach except that there are two main differences:

  1. You can handle the entire file or project in one operation, and
  2. you have the ability to “pair” your work at segment or paragraph level

The application is priced at 380 RMB on the developer’s website (he’s based in China), and this equates to around €50.  That sounds quite a lot for an app, but if you do a lot of work with clients asking for this type of format then I imagine the time you save, as well as the reduced stress, would make it money well spent.  Having tried to help translators achieve a format like this in the past is enough for me to know that I’d pay the money, because even after managing it once there’s no shortcut the next time!  The additional effort may even be worth a better rate to cover your costs.

Now I can imagine that many of you are already asking yourselves how this would work with XML files, tables, Excel etc… in fact you may just be asking how it works with anything other than Microsoft Word?  So I did a few simple tests with some very simple files just to see what they’d look like.  But first let me explain how it works.

The application has been created using the Studio Batch Task API so when you install it you’ll see this new task “Generate a source-target-paired target document” added into your list of Studio Batch Tasks.

All you do is run the task, which can be run on a single file, any number of selected files, or the whole Project.  The task itself is very basic and brings you to one settings screen.  The licensing part is something you’ll see on all RyCAT applications, and this is quite interesting for a couple of reasons:

  1. I haven’t seen any other developer licence a plugin in this way
  2. I have to run Studio as an administrator or it won’t verify the licence key… pretty annoying but it may be because I installed Studio as an administrator in the first place

The rest of the screen is a pair of options.  You either choose to create your target translation in the native format with the source added at paragraph level, or at segment level:

What difference do these options make?  The image below shows the original source test file I created (very basic indeed): it contains a paragraph (with three sentences), a numbered list and a table, and then shows the effect on this file when you generate the target using both options.  I also added a little basic formatting so I had some tags in the source as well:

The results are probably not surprising but there were a few things to note, some that may require resolution in a future build and some that deliberately work this way:

  • the original SDLXLIFF files are backed up safely in the same folder as the target SDLXLIFF files so you can easily restore them if needed.  I think an undo/restore feature for large projects would be nice, perhaps another batch task
  • the sentence based pairing actually breaks up the paragraph into separate lines.  This isn’t what I expected to see, although it does make things clearer.  I expected to see the paragraph still being a paragraph with EN/ZH, EN/ZH, EN/ZH pairings in one line similar to the source.  The HTML file I tested later did this as expected
  • The copied source seems to have taken over the properties of the bold tags in the first segment

If I look at the paragraph in Studio it’s clear why:

First of all, none of the tags from the source are taken over to the target when the source is copied.  This is actually deliberate and makes sense, because in some cases you may not be able to save the target file due to tag errors.  Secondly, I could have handled the tags better and just moved the opening tag to only capture the appropriate Chinese (I’m assuming it’s the appropriate Chinese… all courtesy of Baidu, who I hope do a good MT for these simple strings, and some hopefully sensible logic on my part looking for consistency).  However, all in all I think it’s pretty good and I can see how anyone being asked for this sort of document would find this useful.

But what about other formats?  I tested DOCX, XLSX, IDML, XML and HTML.  Of these, the cleanest results with this very simple example, where I didn’t have to worry about page formatting or different objects containing translatable text or anything like that, were Word, InDesign and HTML.  The XML looked to be the most worrying if it needed to render this paired formatting in another application; but I did this one again after correcting the tags in Studio as follows:

So I moved the opening tag into the correct place, and when I inspected the target file the result was actually quite sensible, with just the extended length in each element due to the addition of the source text and no additional tags.  I doubt there are going to be too many requests for this with XML files, but it was good to see how it actually worked.
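To picture the result, the paired target XML ends up with elements along these lines (purely an illustration of the shape, with a hypothetical element name, not my actual test file):

<para>This is the first sentence. 这是第一句。</para>

So the source text simply sits alongside the translation inside the same element, which is why the only difference is the extended length.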

The format that demonstrated the most work you’d have to tackle was actually Excel, and this was because of cell sizes.  It’s also worth noting, as I have not mentioned it before, that if you choose the paragraph based pairing when you run the batch task then the entire source text is copied into the first target segment of each paragraph.  This looks a little odd, but makes sense when you think how the application works:

In Excel each cell is a paragraph, so all three source segments making up the first cell get copied into the first segment.  It does look odd but is correct when you see the results in Excel.  Here are all three examples for Excel so you have a better idea:

You’ll see what I mean about having to tidy up the document quite a bit to ensure it’s readable since the cells don’t resize to support the increased text.  In the case of the sentence based pairing a single line in the original turns into six lines with five of them hidden from view unless you expand the editing window as I have done here.  Nevertheless, the app has delivered exactly what’s intended, so hats off for the implementation of this sometimes requested review document.

The developer, RyCAT, actually has 12 apps on the appstore altogether, all of them paid, ranging in price from around 5 EUR upwards.  I’d encourage you to take a look as you might find there are some interesting things there for you too.  Review the Machine Translation apps as well if you use the other alternatives for Google or Microsoft Translator, as the developer has provided some interesting features to help you get more from the Machine Translation results:

Quite a prolific developer with some novel approaches to a number of well trodden processes… a great example of the sort of things that can be done using the Studio APIs.

Priorities… paths… filetypes….


At the beginning of each year we probably all review our priorities for the New Year ahead so we have a well balanced start… use that gym membership properly, study for a new language, get accredited in some new skill, stop eating chocolate… although that may be going just a bit too far, everything is fine with a little moderation!  I have to admit that moderating chocolate isn’t, and may never be, one of my strong points even though it’s on my list again this year!  But the idea of looking at our priorities and setting them up appropriately is a good one so I thought I’d start off 2018 with a short article explaining why this is even important when using SDL Trados Studio, particularly because I see new users struggling with, or just not being aware of, the concepts around the prioritisation of filetypes.  If you don’t understand them then you can find that code doesn’t get tagged correctly despite you setting it up, that non-translatable text keeps getting extracted for translation even though you’re sure you excluded it, or even that files are completely mishandled.

Filetype locations

So, what are the things you need to understand when working with filetypes, and I mean any filetype?  The first thing is where they are located.  There are essentially three places:

  1. your File -> Options -> Filetypes,
  2. your Project Settings -> Filetypes and
  3. your File -> Setup -> Project Templates -> “Template” -> Edit -> Filetypes.

They look the same, and probably originated from the same place, but they behave differently.  So first of all you should learn the difference between them.  I’d recommend you review this short article called “Tea and Settings” which is a brilliant explanation from Jerzy Czopik on the fundamental difference between the first two.  The short version is this:

  • All projects are created using filetypes in your File -> Options (unless you use Project Templates)
  • Once the Project has been created, any new files added to that Project will use the filetype settings in that Project, your Project Settings, and not File -> Options

So where do Project Templates come into it?  These are just a way to save different types of settings so you don’t have to keep editing your File -> Options.  Instead of editing them all the time you can save your settings with any name you like as a Project Template and then when you create your new Project you can just call up the template you wanted to use.  I have covered this topic before so perhaps review “Keep Calm and use your Project Templates…” for a bit more detail.  When it comes to filetypes Project Templates really come into their own because as we’re about to see Studio doesn’t always make things simple… so I’ll come back to this in a while.

Prioritising Filetypes

I’m guessing you already knew all about the things I have mentioned so far, didn’t you?  You might even know what I’m going to tell you next, but another guess would be that most of you reading this article won’t.  Using the correct location described above isn’t enough, because you also need to make sure that the filetype in that location is actually going to be used when you create your Project.  Confused?  I’m not surprised, and for the most part you might never have noticed this at all and never thought it to be a problem… in fact it probably wasn’t a problem for the majority.  But if you work with Office files, especially since SDL Trados Studio 2017, then this is a very important piece of information to understand.

  • The order of the filetypes in your settings matters because Studio always uses the first one it comes to

What do I mean by that?  In the screenshot below I have taken my defaults and highlighted two areas:

  1. the Microsoft Office Filetypes
  2. The “Move Up” and “Move Down” buttons

I chose these particular filetypes because for three Microsoft applications, Word, Excel and PowerPoint, we have a bewildering 13 different filetypes!  Now, there are good reasons for this technically, but it would be so much better if Studio didn’t care and just used the correct one in the same way Microsoft Office does, wouldn’t it?  For major differences it does, so it can tell the difference between a DOC and a DOCX, or an XLS and an XLSX for example.  But it can’t tell the difference between the two different DOCX filetypes or the three different XLSX filetypes.  So this is where you have to provide a little help if it’s important to you.  You can do this in two ways:

  1. Deactivate the filetypes you don’t wish to use, or
  2. Use the “Move Up” and “Move Down” buttons to put the one you want to use above the rest

I just used the words “So this is where you have to provide a little help if it’s important to you.”.  Why would it be important?  Let’s say you wanted to handle an Excel file that had been prepared as a bilingual Excel file (review “Bilingual Excel… and stuff!” for more information on this filetype).  This file is just an XLSX file, so you can make all the changes you like to the settings for the Bilingual Excel filetype, but with the default settings Studio won’t ever see them, because it will open the Excel file using the “Microsoft Excel 2007-2016” filetype as it’s the first one on the list.  Another example: let’s say you were preparing projects with DOCX files that would be handled by translators using SDL Trados Studio 2014, and it was important for them to be able to save the target files (assuming it was appropriate, as the newer filetype is much better of course!).  The DOCX file would always be prepared with the “Microsoft Word 2007-2016” filetype and the “Microsoft Word 2007-2013” would never be used at all… your Studio 2014 translators would be able to open the files to work on them, but they’d see error messages in Studio (just informational, but worrying nonetheless) and they would not be able to save the target files.

So to set up your options to deal with just these two scenarios you would either do this where you only activate the filetype you want (I highlighted the filetypes that could be used for DOCX and XLSX):

Or you can prioritise the filetypes to be used first like this for example:

Now they are at the top of their group and will always be used first.

That all seems simple enough, but it means you have to do this every time you want to achieve something different, and if you forget to change your settings back you’ll get unexpected results, and as we know from “Tea and Settings” it means you have to start over.  So this is where your Project Templates can come in.  If you always create Studio Projects and don’t use the Single Document workflow (see ““Open Document”… or did you mean “Create a single file Project”” for more details, it’s a little old and we have the drag and drop approach in Studio 2017 but the principles are sound I think) then you have a consistent approach for your work.  If you need to use the “Bilingual Excel” filetype then just create a Template that has your filetype settings configured to achieve that for example.  You can have as many Project Templates as you like so it makes sense to use them to save having to keep changing your settings.

There are, as always, reasons why this approach would not work for everyone:

  1. You want to use the single document approach because you like organising your work with the SDLXLIFF files always saved in the same folder as your clients’ folders, and
  2. Your project consists of both a bilingual Excel file and a monolingual Excel file

The latter scenario means you create your project with all the monolingual Excel files first (for example), then change the filetype settings in your Project Settings (as we now know) and add the bilingual files to your existing project afterwards; or you add the files in two goes when you create the Project: add the monolingual files first, and then, after adding the monolingual files, change your filetype settings here in the Project Wizard:

If you’re using the single document approach then you can still change the settings by using the “Advanced” button when you open the file and this will take you to your Options so you can change the filetype settings to suit:

On a final note it’s worth pointing out that whilst I only used DOCX and XLSX as examples, the same thing could apply to any of the filetypes in Studio, even if you just wanted different settings for different customers.  So the solutions are all there and you can work any way you prefer, but it’s important to understand how file selection works and which options in which location are being used, because then you’ll always understand the fundamental mechanisms that dictate where changes need to be made and why.  Once you understand this it’s not too tricky either!

 

Wot! No target!!


The origin of Chad (if you’re British) or Kilroy (if you’re American) seems to be largely supposition.  The most likely story I could find, or rather the one I like the most, is that it was created by the late cartoonist George Edward Chatterton ‘Chat’ in 1937 to advertise dance events at a local RAF (Royal Air Force) base.  After that, Chad is remembered for bringing attention to any shortages, or shortcomings, in wartime Britain with messages like Wot! No eggs!!, and Wot! No fags!!.  It’s not used a lot these days, but for those of us aware of the symbolism it’s probably a fitting exclamation when you can’t save your target file after completing a translation in Trados Studio!  At least that would be the polite exclamation, since this is one of the most frustrating scenarios you may come across!

At the start of this article I fully intended this to be a simple description of the problems around saving the target file, but like so many things I write it hasn’t turned out that way!  But I found it a useful exercise so I hope you will too.  So, let’s start simple despite that introduction because the reasons for this problem usually boil down to one or more of these three things:

  1. Not preparing the project so it’s suitable for sharing
  2. Corruption of a project file
  3. A problem with the source file or the Studio filetype

Not Preparing the Project so it’s Suitable for Sharing

I’m only going to address the most common issues that relate to how the project has been prepared.  This usually comes down to one or more of these things:

  1. not embedding the source file into the SDLXLIFF when it’s prepared
  2. not having the same filetype that was used to create the SDLXLIFF when it’s prepared

These two things can be explained further, so if you’re still interested in more detail keep reading!

1. not embedding the source file into the SDLXLIFF when it’s prepared

This could be something you are not even aware of, but by default Studio will try to embed the source file into the SDLXLIFF itself whenever you create a project.  It’s controlled with this option in your Filetype options for SDLXLIFF:

As long as the source file is no larger than 20MB it will be automatically embedded.  If the source file is larger than this, or this setting was changed to reduce the size of the files being sent out, then the source file will not be embedded.  The effect this has on the SDLXLIFF is exactly the same whether you create a full project in Studio or just use the “translate single file” approach.  In case you are under the illusion that you don’t work with projects, note that the “translate single file” approach still creates a project… it has less capability compared to a “standard Studio project” but it’s still a project.  I’ve written various articles on this topic in the past if you want to delve into it more.  But I don’t wish to digress… the main difference when the SDLXLIFF includes the embedded source file is that you can see it within the internal-file element as encoded text.  Open an SDLXLIFF with a text editor and you’ll see what I mean.  This one is based on a small XML file I prepared with the default settings shown above:
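
In heavily abbreviated form, the relevant fragment looks roughly like this (a hand-trimmed sketch rather than a complete SDLXLIFF, with the encoded content elided):

<reference>
  <internal-file form="base64">…the whole source file as base64 encoded text…</internal-file>
</reference>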

The SDLXLIFF for the same file prepared by setting the Maximum embedded file size to zero is shown below, and it displays references to the source file in the external-file element as opposed to the actual file itself:
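
Again as a hand-trimmed sketch, this variant just points at the file on disk instead of containing it (the path here is an illustrative placeholder):

<reference>
  <external-file href="file:///C:/…/sample.xml"/>
</reference>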

There are a few more small differences but the main point I wanted to demonstrate is this concept of the source file actually being inside the SDLXLIFF as opposed to just being referred to in the second example above.  Why is this important?  It’s important because if the SDLXLIFF files are shared on their own and the source file has not been embedded into the file then you will see something like this when you open the SDLXLIFF in Studio on another computer:

This message will appear whether you received these files as part of a project package or as single SDLXLIFF files.  It’s purely a question asking you if you would like to go and find the file that has been referenced in the external-file element.  If you also have the original source file then you can say “Yes” and that “might” (I’ll come back to this point) allow you to save the target file from your project.  If you say “No” then you are going to be seeing these sorts of things in Studio when you get around to saving the target:

Note that you have two messages in the QA window now: one relating to the lack of a dependency file when you initially opened the file, and the second relating to an inability to save the target file when you tried it, because of the first error.  Hopefully it is clear by now that all these errors are to be expected, because the project has not been created in a way that would allow Studio to work correctly, and you have not helped because you didn’t have the original source file available when needed.  These are not bugs at all, because in this scenario you are dealing with an incomplete Studio project.  It’s unfortunate that the error message looks like something that is a problem with Studio, because a less emotive experience could be enjoyed if the messages were more explanatory and didn’t include the visual stimulations they do!

I’d like to say that always using a Project Package to share work would provide a resolution to this situation of the source file not being embedded in the SDLXLIFF files, but it won’t.  The source files are not included with the project package at all, even if the files are not embedded.  The only way to address this when using project packages is to either add the original source files as reference files, or send them separately.  I think there should be an option to include them when the project package is being created, particularly as this can also affect the ability to preview the files even if you don’t need to save the target… but then preview is another story!

Exceptions

As always there are some exceptions around the embedding of files.  If the source files are text files (TXT) or PO files for example, then they are never embedded.  So, for these formats you won’t find the reference element referring to the source file at all, and for the most part you shouldn’t have problems saving target files.  However, the next section could influence that if the filetype that was used to prepare the file in the first place is not available.

2. not having the same filetype that was used to create the SDLXLIFF when it’s prepared

In the previous section I said that having the original source file “might” allow you to save the target file from your project.  The reason I said this is that if the files were prepared using the out of the box filetypes for Studio then you should be able to successfully save the target file.  But if they were prepared using a custom filetype, either by using the filetype features in Studio for this, or through the use of a plugin created using the API, then you still won’t be able to save the target unless you also have this filetype available to you in your project settings.  You will get a message something like this… which is at least a little less alarming than the message about the dependency files in the previous section:

This one can normally be resolved by using project packages because the custom filetype should be included in the project and therefore available for use.  However, this won’t work if the filetype is a custom plugin that you might find on the SDL AppStore for example, or if the filetype is not available at all in the version of Studio used to open the package.  For example, one of my test projects uses the following out of the box filetypes in Studio 2019:

If I create a project package and send this to a user of Studio 2014 then they will see something like this:

Note that the Excel file and the Word file are unidentified in Studio 2014.  This doesn’t mean I can’t translate the files and create a return package, because I can.  But it does mean that I won’t be able to save the target file, or preview the files, as would have been possible with a filetype available in that older version.  The same problem works the other way… if I create a project in Studio 2014 using filetypes that have been deprecated in Studio 2019 then I will have a similar situation.  In both scenarios, 2014 -> 2019 and vice versa, I’ll get the information message telling me Studio can’t save the target file and why.

The path to project success

I wrote the title to this section right at the beginning with an idea in my head to create a flowchart, or something like that, which would show you the best way to work and always achieve success.  But as I worked through all the different scenarios, and I’m pretty sure there will be others I missed, it’s clear that the only way to ensure you will avoid this all of the time depends on understanding the variables in your translation process.

  • Is everyone using the same version of Studio?
  • Are you working with custom filetypes?
  • Were your custom filetypes created using the API or using the UI in Studio (custom XML or regex-based text files for example)?
  • Do you have a Professional licence (to be able to create a Project Package) or a Freelance licence (only work with the SDLXLIFF files)?
  • Do you need the translator (or person you are collaborating with) to be able to create a target file or will you do this yourself?

Once you have identified the variables you will know the following:

  • what instructions to provide the recipient of your files
  • how to prepare the project in the first place
  • whether you need to provide additional resources in a project package
  • what to expect from the recipient of your files

In fact, not too many things.  However, you do need to understand them if you want to have a successful workflow, and I have a suspicion that the out of the box courses won’t prepare you for this.  It’s more likely to be an on-the-job experience!

Corruption of a Project File

The second one doesn’t happen very often and I’m not really sure what causes it.  When referring to a bilingual file in the project (an SDLXLIFF) I could hazard a guess that one reason is related to security software (anti-virus checks, clean-up tools etc.) running while Studio has a file open for translation, or perhaps something happened to the file when transferring from one machine to another, maybe even between a Mac and a Windows system… or with a dodgy connection between drives… or even something in an email transfer.  But I’m guessing.  These sorts of problems are not anything we can address easily (other than to recommend you zip up project files before sharing) as they occur outside of Studio and are related to factors beyond our control.

Another reason could be a bug within Studio affecting you in two ways.  The first is that a project where the file could be successfully saved at the start suddenly fails at some point during the project.  I’m not sure what causes this either… it might be due to problems associated with security software again that corrupts a bilingual file (an SDLXLIFF), or perhaps the SDLPROJ file which contains metadata helping to manage the project resources… but at least this one is easily resolved.  You can prepare the project again from the original source file (if you have it) and pretranslate the project from your TM (or use PerfectMatch… although I’ve found PerfectMatch can sometimes reproduce the same issue perfectly!!), then save the target file.

A problem with the source file or the Studio filetype

If you can’t save the target file immediately after creating the project, which is the second way a bug within Studio can affect you, then the problem is either with the source file or the Studio filetype, and in these cases you should not try to complete the translation until you have resolved the problem as it may be wasted effort.  If it’s an Office file (Word, Excel or PowerPoint) then you can try using a different version of the filetype, as there are several to choose from, and create the project again!  If the problem is with the source file then you can try to identify the problematic part of the file using the tried and tested “divide and conquer” methodology, which is a very quick way to find parts of the document you may be able to remove before translation.  This and a few more interesting methods for resolving filetype problems are detailed in this old, but very useful KB article.

Needless to say, if you have problems with the source filetype or the Studio filetypes you should report them through support or the SDL Community… this article refers.

The versatile regex based text filter in Trados Studio…


After attending the xl8cluj conference in Romania a few weeks ago, which was an excellent, and very technical, conference for translators, I thought it was about time I wrote an article around the things you can do with the Regular Expression Delimited Text filter, since it is so useful for solving all kinds of tasks related to text based files that don’t fit any of the out of the box formats available in the product.  Files such as software string files and CSV files are common examples of where understanding how to work with this customisable file type can yield many benefits.  So this article is food for thought and a few things that might be helpful to you in the future.  It’s also pretty long (I’m not kidding!), so maybe grab a cup of coffee before you start to go through it!

CSV FILES

But Studio can handle a CSV file out of the box!  Well that’s true, but only if you have one column containing the source and one containing the target.  If you have a monolingual file containing text for translation that just happens to be separated by commas then the out of the box file type isn’t helpful.  I actually discussed this at length with a few users in Cluj as there are obviously some good workarounds for this and I imagine this is what most people do already when handling a file like this:

  • convert to Excel
  • import to Excel

Both of these are good workarounds, but the first one is a very tedious process if you have hundreds or even thousands of these files to contend with, and the second one would fail if the CSV files were formatted differently because you couldn’t easily get the target CSV files back out later.

Now having said this, you only have to do a simple search for “batch convert csv to excel” in Google and you’ll get loads of free options to make this easy for you.  But if I do that I can’t show you some really useful features of the Regular Expression Delimited Text filter which could be useful for other tasks… so instead let’s pretend I didn’t say that!

Step 1: Create the filetype

To begin with I go to File -> Options -> File Types and click on New… then select the Regular Expression Delimited Text.

This opens up the File Type Information pane and I complete these four fields (not all mandatory but they are useful):

  1. File type name: this provides a unique name to the file type.
  2. File type identifier: (not mandatory) this allows me to be sure I have used the correct file type when preparing projects as it shows up in the Files View after preparing my project and also in the orange tab at the top of each open file in Studio when I use TagID mode.
  3. File dialog wildcard expression: you need to make sure this says *.csv if you want it to be used to open a CSV file.
  4. Description: (not mandatory) this is just useful, especially if you create a lot of custom file types or share them with others, as you can make a note of what the file type does.

Then you click on Finish and you should have your new file type ready to go… now that was easy!!

Step 2: Previewing your genius

I wanted to add this in because this feature in Studio is a fantastic time saver when you are working on your ingenious creation.  You just select one of your CSV files and click on Preview after each change to the settings in your file type:

  1. this is your new file type that should now be visible in your list.
  2. this is the Preview feature.  Just Browse… for your test file and then click on Preview.

When I do this I can see that I have opened up all the content of my file for translation, but it’s not segmented as I’d like on the comma:

So the next step has to be to create the rules I’ll need to segment the file.

Step 3: Segmenting the text

Normally, when you need to segment your text you’d think about creating a custom segmentation rule in your TM, since this is what drives your segmentation.  You could do this of course, but file types also assist in the segmentation of your file, and in this case I think it’s easier to manage it this way.  If nothing else it means you don’t have to use a different TM whenever you’re handling CSV files.

So, how do we go about this?  Well, the way I’m going to tackle it is by making the comma a non-translatable tag and setting it as external.  This is pretty simple and I just do this:

  1. Open the new file type you created and click on Inline tags.
  2. Add a new rule.
  3. Specify that the Rule Type is a Placeholder, and use a comma as the Opening rule.
  4. Click on Advanced…
  5. Then set the rule to Exclude

That was pretty simple too… now we can preview the test file again by clicking on Preview (see how fast this test is!):

In general that seems pretty good… until we scroll down a little, where I now see that segments 54 to 57 should actually be in one segment, and segments 59 and 60 should also be in one segment.  To understand why, and to illustrate the problem we have to solve, we need to look at the source file:

006,William-Adolphe Bouguereau,French,(1825-1905),"A Girl in Peasant Costume, Seated, Arms Folded, Holding a Ball of Wool and Knitting Needles in her Right Hand",1875,"1,305 €"

On inspecting the CSV file we can see that some values are enclosed in double quotes, and this is because the quotes tell a parser that can read CSV files not to separate on commas when they fall within the quotes.  So simply using a comma as the rule to segment on is not enough for my files.  I need to be a little smarter.  To do this I need to create a regular expression that will only find commas where I need to segment.  So, this is what I did:

[^"] Match anything apart from a quote
* Keep matching anything apart from a quote
" until you get to a quote
\B but only where the quote isn’t on a word boundary (commas and end of line are not recognised chars for a word boundary – not in \w)

This gives me this:

[^"]*"\B

Now, I want to find a comma that doesn’t fall within this search pattern.  So to do this I need to enclose this within what’s referred to as a negative lookahead:

(?![^"]*"\B)

A negative lookahead is just an assertion, it doesn’t actually match anything.  But if I add my comma because this is what I want to find, then I can now find commas but only when they’re not followed by what’s in the lookahead:

,(?![^"]*"\B)
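
If you’d like to sanity-check this pattern outside Studio before putting it into the filetype, a few lines of Python do the job (this is just my own quick test, nothing Studio-specific, and it assumes straight double quotes in the data):

import re

# One line from the sample CSV file used in this article
line = ('006,William-Adolphe Bouguereau,French,(1825-1905),'
        '"A Girl in Peasant Costume, Seated, Arms Folded, Holding a Ball of Wool '
        'and Knitting Needles in her Right Hand",1875,"1,305 €"')

# Split only on commas that are not inside a quoted field
for field in re.split(r',(?![^"]*"\B)', line):
    print(field)

The quoted artwork title and the "1,305 €" price each come out as a single field, while every other comma still acts as a separator.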

Apologies if that was a little hard to follow… it’s a good example of why it’s important to learn a little about regular expressions.  If this is all new to you I’d recommend you start now, as there are so many uses for them in a translation tool and you can get a lot of benefit from them.  But I digress… back to our file type.  I can now replace the comma I used previously with my new expression like this:

And this time when I preview the file I see something like this:

That’s much better… only spoiled by the double quotes that have been included as translatable text.  But that’s easily solved by adding one more rule with a single double quote as a non-translatable placeholder, and setting this as external so it’s removed from view:

And that’s it… for the sample files I used this does the job nicely and I can handle as many as I like without having to do any conversions at all.  If you want to work through this example, here’s a test file you can copy/paste to create a CSV like this one:

SKU,Name,Nationality,Lived,Artwork,Year,Est. Value
001,Leon Bakst,Russian,(1866-1924),Portrait of Virginia Zucchi,1917,"12,250 €"
002,Sir Max Beerbohm,British,(1872-1956),The Encaenia of 1908,1908,"17,100 €"
003,Ivan Yakovlevich Bilibin,Russian,(1876-1942),Design for the Costume of Babarikha (the Matchmaker) in Rimsky-Korsakov's Opera 'Tsar Sultan,1928,"12,000 €"
004,Richard Parkes Bonington,British,(1802-1828),Shipping Off the Kent Coast,1825,"7,500 €"
005,François Bonvin,French,(1817-1887),"A Seated Woman, Sewing by a Table",1848,"10,250 €"
006,William-Adolphe Bouguereau,French,(1825-1905),"A Girl in Peasant Costume, Seated, Arms Folded, Holding a Ball of Wool and Knitting Needles in her Right Hand",1875,"1,305 €"
007,Ford Madox Brown,British,(1821-1893),Study for a Greyhound,1850,"25,950 €"
008,Alexander Pavlovich Bryulov,Russian,(1798-1877),"Portrait of Marie-Amélie, Queen of the French",1860,"8,400 €"
009,Paul Cézanne,French,(1839-1906),"Studies of a Child's Head, a Woman's Head, a Spoon, and a Longcase Clock",1872,"32,350 €"
010,Jean-Baptiste Camille Corot,French,(1796-1875),Civita Castellana: A Woodland Stream in a Rocky Gully,1826,"12,750 €"

Now we can take a look at some software string files that are also not handled out of the box.

SOFTWARE STRING FILES

These are file types I see coming up all the time in the forums in some form or another.  Unfortunately they are often the most inconsistent files in terms of the syntax being used, but we can work around this easily enough using our rules.  So, what do these file types look like?  Most of the time they are key-value pair files, so I’ll use these as an example, and I’m pretty sure you’ll be able to adapt the rules to suit any variants of this on your own… but if you can’t you can always ask for help in the SDL Community where you’ll find plenty of help from the many smart users in there:

Apple define their strings files like this:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

These have three components to them:

  1. a comment enclosed with the /* and */ syntax.
  2. a key enclosed in double quotes preceding the equals sign
  3. a value enclosed in double quotes after the equals sign

The ideal way to handle these files is to use SDL Passolo where the file preparation is a breeze and you can export to SDLXLIFF to translate in Studio afterwards if you prefer.  Using the DSI Viewer from the SDL AppStore means you can see the comment and the key for each value being translated as you work… very neat and simple:

But… if you’re a Studio user without access to Passolo, and you’ve been asked to handle a file like this, which we see happening all the time, then here’s a solution using the Regular Expression Delimited Text file type.

Step 1: Create the filetype

This is exactly the same as we did before, except this time you probably have to use *.strings as the File dialog wildcard expression.

Step 2: Previewing your genius

We can see here in our preview pane that the entire contents of the file are being extracted and segmented on the basis of Studio default rules:

So if we want to be able to see all of this information then we need to try and do a couple of things:

  1. extract the comment and lock it so we can still see it, but ignore it during translation
  2. segment the key-value pair so they are on separate lines
  3. lock the key so it can be seen but ignored during translation

Step 3: Extract the comment

This is pretty straightforward: we just create an Inline tag rule using a Tag pair like this:
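
The settings aren’t reproduced here, but the opening and closing patterns would presumably be along these lines (the asterisk has to be escaped because it’s a regex metacharacter):

/\* Opening: Match a forward slash followed by an asterisk
\*/ Closing: Match an asterisk followed by a forward slash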

I removed some of the steps this time on the basis that you would have no problem doing this after following the more detailed steps for the CSV file type above.  Hopefully this also helps to show how simple this can be for all kinds of text based filetypes.

I could set this rule as Include under the Advanced… options, which gets me this when I preview:

You can see that the comments are visible, in their own segment even if there is a period at the end of the comment, and also protected so you won’t translate them.

Important note:

However, in practice I hit a small error when repeating this for the other rules, trying to lock the key in the same way I tackled the first rule above.  So instead I took a different approach and set all the rules as translatable (which will remove the locked status in the image above) and used the formatting feature to colour the text red… I’ll explain why I did this shortly.

Step 4: Segment the key-value pairs

To tackle this I created two new rules, this simple tag pair rule to extract the key string and also colour it red as I eventually did for the comment:

^" Opening: Match a double quote at the start of the line
"\s Closing: Match a double quote followed by a space

That should allow me to extract the key string in the highlighted text below:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

Then I created another tag pair rule to extract the value string which is actually the one I want to translate, but didn’t colour the text:

=\s" Match an equals sign followed by a space and a double quote
";$ Match a double quote followed by a semi-colon at the end of the line

That should allow me to extract the value string in the highlighted text below:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

This nicely previews like this:

You can see that the text I need to translate is black, and the comment and the key string are in red, but visible to me while translating.
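
As a quick sanity check of the two tag-pair patterns outside Studio, here’s a rough Python equivalent (my own sketch, assuming straight quotes; Studio’s own matching may differ in the details):

import re

line = '"Confirm Quit" = "Are you sure you want to quit?";'

# Key: between a double quote at line start and a quote followed by a space
key = re.search(r'^"(.*?)"\s', line).group(1)
# Value: between =\s" and "; at the end of the line
value = re.search(r'=\s"(.*?)";$', line).group(1)

print(key)    # Confirm Quit
print(value)  # Are you sure you want to quit?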

Step 5: Filter out and lock the non-translatable segments

Now, why did I colour them red apart from the obvious reason which is to be able to distinguish them from the translatable text?  Well, if I open one of these apple strings files in the Studio Editor I can now use the Community Advanced Display Filter to filter on the red coloured text, like this:

So now I’m only displaying the segments I don’t want to translate.  Next I just copy source to target, change the status to translated and lock them.  I can now clear the filter to see this:

Perfect… I now get these benefits:

  1. my analysis will only include the translatable text
  2. when I confirm a segment I will only ever move to the next segment for translation
  3. I can always read the comments
  4. I can always see the key string

… and I get one little annoyance!  The segment with text between tags (the comment) retains the red colour when I lock the segment whilst the other segment does not.  If I did this again I’d use grey as the colour as opposed to red because I find it distracting… but I’m leaving this here because you may also come across a similar problem as me.

WHAT ELSE?

Well, the apple strings file was just one typical example, so here are a few more (just the settings and what you should get) so you have some idea of how to use the rules for these sorts of files that we see quite often in the community forums.

Another way to handle our apple strings example

If you’re only interested in the translatable text and don’t want to see the comments or key strings at all then you can also handle this using the Document Structure node in the file type settings by telling it exactly what you want to extract in the first place:

".+=\s" Match a double quote, keep matching any character until it’s possible to match an equals sign followed by a space and a double quote
";$ Match a double quote followed by a semi-colon at the end of the line

That should allow me to extract the value string in the highlighted text below:

/* Question in confirmation panel for quitting. */
"Confirm Quit" = "Are you sure you want to quit?";

/* Message when user tries to close unsaved document */
"Close or Save" = "Save changes before closing?";

This nicely previews like this:

LNG (Language Resource Files) and PHP (Array Files)

These types of files are used by various software applications and I have seen them (rightly or wrongly) with different file extensions, so it’s important to note the extension when you create your file type as you’ll need to use it in the File dialog wildcard expression, which you’ll recall from step 1 in the CSV example.  Typical examples I’ve come across are things like *.lng, *.ini, *.php, *.txt.  The main thing is that the format of the text in the file could be something like this, where you’re interested in getting at the highlighted text only:

lng file example

[trPrint]
TR_About="&About..."
TR_FormCaption="Find Text..."
TR_SaveFilePositions="&Remember editing positions"

or something like this:

PHP array file

<?php
/* en.php - english language file */
$messages['hello'] = 'Hello';
$messages['signup'] = 'Sign up for free';
?>

All of these sorts of files follow the same basic principle (as far as we’re concerned for the file type creation) and can be handled easily using the Document Structure node in the file type settings, as we did earlier for the simplified apple strings file type.  For the language resource file, lng, I could use something like this:

.+=" Keep matching any character until it’s possible to match an equals sign followed by a double quote
"$ Match a double quote at the end of the line

Which should get me:

Not bad… but I could improve on this and also protect the accelerator keys you can see in the text (the & symbol), which will also help with QA by catching any of these important tags that go missing.  To do this I just add a simple placeholder rule in the Inline tags and make sure the Inline tag behaviour is set to Include:

Now I have this:

So that was simple enough… and what about the PHP array file?  A very similar task, and I could solve it with these opening and closing patterns in the Document structure node:

.+\s' Keep matching any character until it’s possible to match a space followed by a single quote
';$ Match a single quote followed by a semi-colon at the end of the line

Which should get me:

So very similar and very straightforward.
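
And if you want to check these last two opening/closing pairs outside Studio as well, the same quick Python approach works (again just my own sketch, using the sample lines from above):

import re

lng_line = 'TR_About="&About..."'
php_line = "$messages['hello'] = 'Hello';"

# lng: the text between .+=" and a double quote at the end of the line
print(re.search(r'.+="(.*)"$', lng_line).group(1))    # &About...

# php: the text between .+\s' and '; at the end of the line
print(re.search(r".+\s'(.*)';$", php_line).group(1))  # Hello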

Final Words

If you have a text based file type like these and are still having problems after reading this post, feel free to share a snippet; if I can do it I’ll add your example to the list.  This sort of thing comes up so often that I think the more examples we have the better.

I also had some thoughts around what’s lacking in the current features for handling these file types in Studio; if we don’t see them in the product in the future we might take a look at handling them through the SDL AppStore:

  • ability to define a pattern you can assign as a comment
  • ability to define a pattern you can assign as Document Structure Information
  • ability to define source and target patterns in case the file you have is multilingual/bilingual

If you have any other thoughts of your own feel free to add them and we can consider these as well.  In the meantime I’ve added these to the SDL Ideas site… so go and vote today!
